macmini loss (disaster recovery)

AI request prompt


Following https://docs.axelabs.ai/ops/runbook/macmini-loss, carry out the 1-day recovery for the loss of the customer macmini.

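Steps:
1. Confirm the incident: notification from the customer admin (theft / fire / disk failure), and determine when the loss occurred
2. Block access immediately: customer IT disables the Microsoft Entra ID app or rotates the secret (every employee's connector then gets 401)
3. Check backup state (freshest snapshot on the Ring peer / Cold SSD) → estimate the extent of data loss between the last backup and the time of loss, and confirm with the user
4. Obtain a new macmini and follow the short-term recovery (T+1h to T+1d) procedure on the page, waiting for the result of each step before moving to the next. Keep customer IT informed in both directions and share progress
5. After recovery completes: health verification, integrity check, reply to the customer, and a postmortem recorded in /ops/known-gaps (to prevent recurrence)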

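Your own AI session can be Claude Code, Cursor, ChatGPT desktop, Claude.app, or another tool.

The page body can be read directly by a person and also serves as a reference for the AI. After fetching this page, the AI works through the steps above with the user interactively, step by step.

This is the most severe scenario: the customer macmini is unusable because of theft, fire, or total disk failure.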

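Immediate (T+0 to T+1h)

1. Confirm the incident

Notification from the customer admin, for example:

"The macmini was damaged in an office fire"
"It won't boot after the relocation"
"Theft reported"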

2. Block user access

To prevent malicious use, disable the app on the Microsoft Entra ID side (or rotate the secret):


# Ask customer IT immediately:
"Azure portal → Microsoft Entra ID → App registrations → 'Frame MCP' → Authentication → 
 turn off 'Allow public client flows', or delete the secret under Certificates & secrets"

→ Every employee's connector starts receiving 401.

3. Check the current backup state


# Check the freshest backup on the Ring peer
ssh realchoice-macmini "
  restic -r /Users/realchoice/peer-backups/axe/restic-repo snapshots \
    --password-file <(security find-generic-password -w -s axe.backup.restic.local)
"
# Check the time of the most recent snapshot → estimate the loss window

# Also check the Cold SSD
# (the operator plugs in the SSD and checks)

Data-loss window = from the last backup time to the moment just before the incident.
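The loss-window estimate above can be sketched in Python. This is a minimal illustration, not part of the axe tooling: it assumes the snapshot list was obtained with restic's `--json` output (an array of objects with a `time` field), and the function names `parse_restic_time`, `latest_snapshot_time`, and `loss_window` are hypothetical.

```python
import json
import re
from datetime import datetime, timedelta, timezone


def parse_restic_time(value: str) -> datetime:
    """Parse an RFC 3339 timestamp as printed by restic."""
    def fix_fraction(match):
        # restic may print up to nine fractional digits; datetime keeps six.
        return "." + match.group(1)[:6].ljust(6, "0")

    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    normalized = re.sub(r"\.(\d+)", fix_fraction, value.replace("Z", "+00:00"))
    return datetime.fromisoformat(normalized)


def latest_snapshot_time(snapshots_json: str) -> datetime:
    """Return the time of the freshest snapshot in a `snapshots --json` dump."""
    snapshots = json.loads(snapshots_json)
    if not snapshots:
        raise ValueError("no snapshots found in the repository")
    return max(parse_restic_time(s["time"]) for s in snapshots)


def loss_window(snapshots_json: str, incident_time: datetime) -> timedelta:
    """Time between the last backup and the incident (timezone-aware input)."""
    latest = latest_snapshot_time(snapshots_json)
    if latest > incident_time:
        raise ValueError("latest snapshot is newer than the incident time")
    return incident_time - latest
```

Passing the incident time as a timezone-aware `datetime` keeps the subtraction correct even when the peer and the customer site report different offsets.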

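Short-term recovery (T+1h to T+1d)

4. Obtain a new macmini

The customer purchases a new macmini (M2/M3, 16GB+, 512GB+)
On arrival, it is either shipped to the operator right away or set up by customer IT

5. Basic macOS setup

Done by customer IT, or by the operator through remote support: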


# 1. Create the macOS user (e.g. axe)
# 2. Install Tailscale and register the operator key
# 3. Verify SSH access

# 4. Install Docker Desktop
# 5. Install the axe CLI
curl -sSL https://docs.axelabs.ai/install/axe-cli.sh | bash

6. Restore the backup


# From the operator machine
axe restore --customer <customer> --tier ring --from <peer> --as-of <latest>

# Or from the cold SSD
axe restore --customer <customer> --tier cold --as-of <latest>
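Choosing between the two restore commands above can be sketched as follows. This is an illustrative helper only: `pick_tier` and `build_restore_command` are hypothetical names, the flags are the ones shown in the commands above, and preferring the ring peer on a tie is an assumption rather than documented axe behavior.

```python
from datetime import datetime, timezone
from typing import List, Optional


def pick_tier(ring_latest: Optional[datetime], cold_latest: Optional[datetime]) -> str:
    """Return "ring" or "cold", whichever holds the fresher snapshot."""
    if ring_latest is None and cold_latest is None:
        raise ValueError("no usable snapshot on either tier")
    if cold_latest is None:
        return "ring"
    if ring_latest is None:
        return "cold"
    # Assumption: on a tie, prefer the ring peer (no physical SSD handling).
    return "ring" if ring_latest >= cold_latest else "cold"


def build_restore_command(customer: str, tier: str, as_of: str,
                          peer: Optional[str] = None) -> List[str]:
    """Build the argv for `axe restore` using only the flags shown above."""
    if tier not in ("ring", "cold"):
        raise ValueError(f"unknown tier: {tier!r}")
    cmd = ["axe", "restore", "--customer", customer, "--tier", tier]
    if tier == "ring":
        if not peer:
            raise ValueError("the ring tier needs a peer for --from")
        cmd += ["--from", peer]
    cmd += ["--as-of", as_of]
    return cmd
```

Returning an argv list instead of a shell string avoids quoting issues when the result is handed to `subprocess.run`.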

Restore targets:

frame-postgres data (Docker volume)
.local/files/ (platform data, source files)
~/.frame/ (PII salt)
Container images (rebuilt)

7. customers.yaml and cloudflared stay unchanged

No change is needed to customers.yaml on the operator machine: the customer's domain ({customer}.axelabs.ai) is reused as is.

The cloudflared ingress also stays as is: the origin is host.docker.internal:port, so a change in the customer macmini's IP does not matter (the Tailscale FQDN is not used).

8. Re-run onboard


axe onboard <customer> --apply
# Azure-side settings are left untouched (onboard performs no Azure CRUD, so re-running it as is is idempotent).
# If only a partial re-run is needed, `--skip-frame` (skips frame only) is available.

The Microsoft Entra ID apps stay as they are (same domain, same redirect_uri). Only frame / blueprint / vault get a fresh setup on the macmini.

9. Integrity check


ssh <customer>-macmini "docker exec frame-mcp-blue python -m frame.cli integrity-check --entity <each_entity>"

All entities pass → recovery is complete.
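Running the check once per entity and collecting the results could look like the sketch below. It reuses the command shown above; the helper names are hypothetical, and treating a zero exit code as "pass" is an assumption about `integrity-check` that should be confirmed against its actual behavior.

```python
import shlex
import subprocess
from typing import Callable, Dict, Iterable, List


def integrity_command(customer: str, entity: str) -> List[str]:
    """Build the ssh argv that runs integrity-check for one entity."""
    remote = (
        "docker exec frame-mcp-blue python -m frame.cli integrity-check --entity "
        + shlex.quote(entity)
    )
    return ["ssh", f"{customer}-macmini", remote]


def run_exit_code(cmd: List[str]) -> int:
    """Run a command and return its exit code without raising on failure."""
    return subprocess.run(cmd, check=False).returncode


def check_all(customer: str, entities: Iterable[str],
              run: Callable[[List[str]], int] = run_exit_code) -> Dict[str, bool]:
    """Map each entity to True when its check exited with 0 (assumed pass)."""
    return {e: run(integrity_command(customer, e)) == 0 for e in entities}


def recovery_complete(results: Dict[str, bool]) -> bool:
    """Recovery counts as complete only if at least one entity ran and all passed."""
    return bool(results) and all(results.values())
```

The `run` parameter is injectable so the loop can be exercised in a dry run without touching the customer machine.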

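10. Customer IT re-enables Microsoft Entra ID access

Set 'Allow public client flows' back to Yes, or issue a new secret (undoing the block from step 2).

When a new secret is issued:

Customer IT → provides the new secret VALUE
Operator → pushes it to the macmini Keychain
Containers are restarted
Employees are told about the new secret (it must be swapped on the claude.ai connector side)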

Loss analysis

After the restore, analyze what was lost:


# audit_log entries between the last backup and the incident
ssh <customer>-macmini "docker exec frame-postgres psql -U frame -d frame -c \"
  SELECT actor, op, table_name, ts
  FROM <entity>.audit_log
  WHERE ts > '<last_backup_ts>'
  ORDER BY ts;
\""

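The audit_log itself is also lost for the period after the last backup. The operator therefore:

Notifies the customer of the loss window and asks them to re-enter the work done during it
Has manual journal entries re-entered (by the accountant, from their own notes)
Has external evidence (bank statements, etc.) collected again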
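The nested quoting in the audit_log query above is easy to get wrong by hand. As a hedged sketch, the same command can be assembled with `shlex.quote`; the helper names `audit_sql` and `audit_command` are hypothetical, and restricting the entity to a plain identifier is an assumption about how entity schemas are named.

```python
import re
import shlex
from datetime import datetime
from typing import List


def audit_sql(entity: str, last_backup_ts: datetime) -> str:
    """Build the audit_log query for one entity schema."""
    # The entity becomes a schema name, so only plain identifiers are accepted.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", entity):
        raise ValueError(f"unsafe entity name: {entity!r}")
    # isoformat() yields only digits and punctuation, so it is safe to inline.
    ts = last_backup_ts.isoformat()
    return (
        "SELECT actor, op, table_name, ts "
        f"FROM {entity}.audit_log "
        f"WHERE ts > '{ts}' "
        "ORDER BY ts;"
    )


def audit_command(customer: str, entity: str, last_backup_ts: datetime) -> List[str]:
    """Build the ssh argv that runs the query inside the frame-postgres container."""
    sql = audit_sql(entity, last_backup_ts)
    remote = "docker exec frame-postgres psql -U frame -d frame -c " + shlex.quote(sql)
    return ["ssh", f"{customer}-macmini", remote]
```

The resulting list can be passed directly to `subprocess.run`, so only one layer of shell quoting (the remote one) remains.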

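Security review (right after the incident)

If a breach is suspected rather than theft or fire:

Review the Microsoft Entra ID audit log (suspicious IPs, suspicious time ranges)
Review the frame audit_log (suspicious actors)
Check the ring peer's backup as well to confirm normal activity just before the incident
If suspicion remains, file a formal incident report and bring in an external audit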

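Pitfalls

Pitfall	Result	How to avoid
Trusting the backup alone without blocking access on the Microsoft side	The thief keeps access	Call customer IT immediately
Not confirming backup freshness explicitly	The loss window is misestimated	Always state the `Time` column from `restic snapshots`
Skipping the integrity check after the restore	Latent corruption goes undetected	`integrity-check` is mandatory
Not estimating the loss window in detail	The customer does not know which work is missing	State the last backup ts clearly in the notification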

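What the quarterly drill means

The quarterly automated restore drill prepares for exactly this scenario. A passing drill verifies that the recovery procedure actually works.

A failed drill means the backup procedure is checked immediately. Repeated failures mean a review of the infrastructure as a whole.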