# Qwen-Image-Layered Cloud Migration (4/10): Deploying to Vertex AI Model Garden in Practice

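## Preparing Google Cloud

### 1. Create a GCP Project

In the [Google Cloud Console](https://console.cloud.google.com):

```
1. Create a project
   - Project name: poster-layer-decomposer
   - Project ID: poster-decomposer-12345
   - Link a billing account (required)

2. Set up billing alerts (important!)
   - Billing → Budgets & alerts
   - Monthly budget: $50
   - Email alert at 80%
   - Stop services at 100% (optional; a budget only sends notifications, so this needs separate automation)
```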

### 2. Enable the Vertex AI API

```bash
# Install the gcloud CLI (local machine)
curl https://sdk.cloud.google.com | bash
exec -l $SHELL

# Log in
gcloud auth login

# Set the project
gcloud config set project poster-decomposer-12345

# Enable the Vertex AI and Cloud Storage APIs
gcloud services enable aiplatform.googleapis.com
gcloud services enable storage.googleapis.com
```

**Or from the Console**:
```
APIs & Services → Enable APIs →
"Vertex AI API" 검색 → Enable
```

### 3. Create a Service Account

```bash
# Create the service account
gcloud iam service-accounts create vertex-ai-worker \
    --display-name="Vertex AI Worker"

# Grant permissions
gcloud projects add-iam-policy-binding poster-decomposer-12345 \
    --member="serviceAccount:vertex-ai-worker@poster-decomposer-12345.iam.gserviceaccount.com" \
    --role="roles/aiplatform.user"

# Download a JSON key
gcloud iam service-accounts keys create vertex-ai-key.json \
    --iam-account=vertex-ai-worker@poster-decomposer-12345.iam.gserviceaccount.com
```

**Securing the key file**:
```bash
chmod 600 vertex-ai-key.json
export GOOGLE_APPLICATION_CREDENTIALS="$(pwd)/vertex-ai-key.json"
```
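Before a script relies on the key, it can be worth confirming that the file really is readable only by its owner. The following is a minimal sketch using only the standard library; the helper name `key_file_is_private` is hypothetical, and it assumes a POSIX file system where permission bits are meaningful.

```python
import os
import stat


def key_file_is_private(path: str) -> bool:
    """Return True if neither group nor others have any permission on the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # Any group/other bit (read, write, or execute) means the key is exposed.
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


# Example: check the file that GOOGLE_APPLICATION_CREDENTIALS points to, if any.
key_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS", "vertex-ai-key.json")
if os.path.exists(key_path):
    print(key_path, "is private:", key_file_is_private(key_path))
```

A file with mode `600` passes this check, while the default `644` that many tools create does not.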

## Exploring Vertex AI Model Garden

### 1. Access from the Console

```
Google Cloud Console → Vertex AI → Model Garden
```

**Screen layout**:
- Google Models: Gemini, PaLM, etc.
- **Open Models on Hugging Face**: 🎯 go here
- Third-party Models: Anthropic Claude, etc.

### 2. Search for the Hugging Face Model

```
Open Models on Hugging Face → Show more (4000+ models)

Type in the search box: "Qwen-Image-Layered"
```

**Results**:
- ❌ A direct search may return nothing (the model is new)
- ✅ Searching for "Qwen" shows related models
- ⚠️ What if `Qwen-Image-Layered` is not in the list?

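### 3. When Qwen-Image-Layered Is Missing

**Alternative 1: Deploy it yourself (using Notebooks)**

Vertex AI can deploy **any model** from Hugging Face, but the Model Garden UI shows only popular models.

A new model needs a **manual deployment**:

```python
# Run in a Vertex AI Notebook
from google.cloud import aiplatform

# 1. Log in to Hugging Face (first step toward wrapping the model in a container)
from huggingface_hub import login
login(token="hf_xxxxx")  # placeholder; never commit a real token

# 2. Build the model into a container for Vertex AI
# (details in the next section)
```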

## Manual Deployment: Using a Hugging Face DLC

Google Cloud and Hugging Face provide **Deep Learning Containers (DLCs)**.

### 1. What Is a DLC?

A preconfigured Docker image:
- PyTorch, Transformers, and Diffusers preinstalled
- CUDA and cuDNN tuned
- Compatible with Vertex AI

**Available DLCs**:
- **TGI (Text Generation Inference)**: for LLMs
- **TEI (Text Embeddings Inference)**: for embeddings
- **PyTorch Inference DLC**: general purpose (🎯 the one we will use)

### 2. Deployment Script

```python
# deploy_qwen_to_vertex.py
from google.cloud import aiplatform

# Initialize the SDK
aiplatform.init(
    project="poster-decomposer-12345",
    location="us-central1"
)

# 1. Upload the model
model = aiplatform.Model.upload(
    display_name="qwen-image-layered",
    artifact_uri="gs://poster-decomposer-models/qwen",  # uploaded later
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/pytorch-gpu.1-13:latest",
    serving_container_command=[
        "python", "-m", "inference_server"
    ],
    serving_container_environment_variables={
        "MODEL_ID": "Qwen/Qwen-Image-Layered",
        "TASK": "image-layering"
    }
)

# 2. Create the endpoint
endpoint = aiplatform.Endpoint.create(
    display_name="qwen-image-layered-endpoint"
)

# 3. Deploy the model
model.deploy(
    endpoint=endpoint,
    deployed_model_display_name="qwen-v1",
    machine_type="n1-standard-4",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    # 🎯 Autoscaling. Standard endpoints may reject 0 and require at least 1
    # replica; check the current Vertex AI docs before relying on scale-to-zero.
    min_replica_count=0,
    max_replica_count=3
)

print(f"✅ Deployment complete: {endpoint.resource_name}")
```

### Problems

**This is far too complex**:
1. A custom `inference_server.py` has to be written
2. The model (15GB) has to be uploaded to Cloud Storage
3. The container has to be customized

→ **A simpler approach is needed**

## Alternative: Load Directly from the Hugging Face Hub

A Vertex AI PyTorch container can **download the model from Hugging Face at runtime**.

### Simplified Approach

```python
# custom_predictor.py (Vertex AI custom predictor)
from google.cloud.aiplatform.prediction.predictor import Predictor
from diffusers import QwenImageLayeredPipeline
import torch
from PIL import Image
import base64
import io

class QwenLayeredPredictor(Predictor):
    def __init__(self):
        return

    def load(self, artifacts_uri: str):
        """Load the model (once, when the endpoint starts)."""
        self.pipeline = QwenImageLayeredPipeline.from_pretrained(
            "Qwen/Qwen-Image-Layered",
            torch_dtype=torch.bfloat16
        )
        self.pipeline.to("cuda")
        print("✅ Model loaded")

    def predict(self, instances):
        """Handle an inference request."""
        # The default preprocess() passes the whole request body through,
        # so unwrap {"instances": [...]} if that is what arrived.
        if isinstance(instances, dict):
            instances = instances["instances"]
        results = []

        for instance in instances:
            # Decode the Base64 image (the model card converts inputs to RGBA)
            image_b64 = instance["image"]
            image_bytes = base64.b64decode(image_b64)
            image = Image.open(io.BytesIO(image_bytes)).convert("RGBA")

            # Run inference
            output = self.pipeline(
                image=image,
                layers=instance.get("layers", 5),
                resolution=instance.get("resolution", 1024)
            )
            # Assumption (following the model card): images[0] is the list of layer images
            layers = output.images[0]

            # Convert each layer to Base64
            layers_b64 = []
            for layer in layers:
                buffer = io.BytesIO()
                layer.save(buffer, format="PNG")
                layer_b64 = base64.b64encode(buffer.getvalue()).decode()
                layers_b64.append(layer_b64)

            results.append({"layers": layers_b64})

        # Vertex AI expects the response body wrapped in "predictions"
        return {"predictions": results}
```

### Packaging and Deployment

```bash
# 1. Project layout
vertex-qwen/
├── custom_predictor.py
├── requirements.txt
└── setup.py

# requirements.txt
# (QwenImageLayeredPipeline likely needs a newer diffusers than 0.30; check the model card)
transformers>=4.51.3
diffusers>=0.30.0
accelerate
torch>=2.0

# 2. Upload to Cloud Storage
gsutil mb gs://poster-decomposer-models
gsutil cp -r vertex-qwen/ gs://poster-decomposer-models/qwen/

# 3. Deploy
python deploy_qwen_to_vertex.py
```

### Expected Problems

**Model download time**:
- Qwen-Image-Layered: ~15GB
- Downloaded from Hugging Face when the first instance starts
- Estimated time: 5-10 minutes

→ **Cold starts become very long**
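To see where an estimate like 5-10 minutes comes from, the download time can be computed from the model size and a sustained throughput. This is a back-of-the-envelope sketch; the function name `download_minutes` and the 25 and 50 MB/s throughput figures are assumptions chosen for illustration, not measurements.

```python
def download_minutes(size_gb: float, throughput_mb_per_s: float) -> float:
    """Minutes needed to transfer size_gb at a sustained throughput (1 GB = 1000 MB)."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return size_gb * 1000 / throughput_mb_per_s / 60


# A 15 GB model at two assumed sustained throughputs.
for rate in (25, 50):
    print(f"{rate} MB/s -> {download_minutes(15, rate):.1f} min")
```

Under these assumptions, a 15GB download lands in the 5-10 minute range only if the instance sustains roughly 25-50 MB/s, and that is before the weights are loaded onto the GPU.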

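## A Practical Approach: Using a Prebuilt Image

### Change of Strategy

**Problems**:
1. The custom code is complex
2. The model download takes a long time
3. Debugging is hard

**Solution**:
- First build a **fully working Docker image** locally
- Deploy that image to Vertex AI
- Bake the model into the image (downloaded at build time)

### Building the Docker Image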

```dockerfile
# Dockerfile
FROM pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime

# Python dependencies (quoted so the shell does not treat ">" as a redirect)
RUN pip install --no-cache-dir \
    "transformers>=4.51.3" \
    "diffusers>=0.30.0" \
    accelerate \
    flask \
    gunicorn

# Download the model (at build time)
ENV HF_HOME=/models
RUN python -c "from diffusers import QwenImageLayeredPipeline; \
    QwenImageLayeredPipeline.from_pretrained('Qwen/Qwen-Image-Layered', cache_dir='/models')"

# Inference server
COPY server.py /app/
WORKDIR /app

EXPOSE 8080
CMD ["gunicorn", "--bind", "0.0.0.0:8080", "--timeout", "300", "server:app"]
```

```python
# server.py
from flask import Flask, request, jsonify
from diffusers import QwenImageLayeredPipeline
import torch
import base64
from PIL import Image
import io

app = Flask(__name__)

# Load the model (at server startup)
print("Loading model...")
pipeline = QwenImageLayeredPipeline.from_pretrained(
    "Qwen/Qwen-Image-Layered",
    cache_dir="/models",
    torch_dtype=torch.bfloat16
)
pipeline.to("cuda")
print("Model loaded!")

@app.route("/health", methods=["GET"])
def health():
    return jsonify({"status": "healthy"})

@app.route("/predict", methods=["POST"])
def predict():
    data = request.json
    instances = data.get("instances", [])

    predictions = []
    for instance in instances:
        # Decode the image (the model card converts inputs to RGBA)
        image_b64 = instance["image"]
        image_bytes = base64.b64decode(image_b64)
        image = Image.open(io.BytesIO(image_bytes)).convert("RGBA")

        # Run inference
        output = pipeline(
            image=image,
            layers=instance.get("layers", 5),
            resolution=instance.get("resolution", 1024)
        )
        # Assumption (following the model card): images[0] is the list of layer images
        layers = output.images[0]

        # Encode each layer as a Base64 PNG
        layers_b64 = []
        for layer in layers:
            buffer = io.BytesIO()
            layer.save(buffer, format="PNG")
            layer_b64 = base64.b64encode(buffer.getvalue()).decode()
            layers_b64.append(layer_b64)

        predictions.append({"layers": layers_b64})

    return jsonify({"predictions": predictions})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```
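The request and response shapes that `server.py` reads and writes can also be sketched from the client side, which makes the contract easy to test without a GPU. This is an illustrative sketch using only the standard library; `build_request` and `parse_response` are hypothetical helper names, not part of any SDK.

```python
import base64
import json


def build_request(image_bytes: bytes, layers: int = 5, resolution: int = 1024) -> str:
    """Serialize one image into the {"instances": [...]} JSON body the server reads."""
    instance = {
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "layers": layers,
        "resolution": resolution,
    }
    return json.dumps({"instances": [instance]})


def parse_response(body: str) -> list:
    """Decode {"predictions": [{"layers": [...]}]} into lists of raw PNG bytes."""
    predictions = json.loads(body)["predictions"]
    return [[base64.b64decode(layer) for layer in p["layers"]] for p in predictions]
```

Keeping both directions in one small module lets the same functions be reused for local `curl`-style tests against the container and for calls through the Vertex AI endpoint.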

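### Building and Pushing the Image

```bash
# 1. Build (slow, because it includes the model download)
docker build -t qwen-image-layered:v1 .

# 2. Push to Artifact Registry (Container Registry, gcr.io, has been retired in its favor)
#    "qwen-repo" is a placeholder repository name
gcloud artifacts repositories create qwen-repo \
    --repository-format=docker --location=us-central1
gcloud auth configure-docker us-central1-docker.pkg.dev
docker tag qwen-image-layered:v1 us-central1-docker.pkg.dev/poster-decomposer-12345/qwen-repo/qwen-image-layered:v1
docker push us-central1-docker.pkg.dev/poster-decomposer-12345/qwen-repo/qwen-image-layered:v1

# 3. Deploy to Vertex AI
gcloud ai endpoints create \
    --region=us-central1 \
    --display-name=qwen-layered

gcloud ai models upload \
    --region=us-central1 \
    --display-name=qwen-layered-v1 \
    --container-image-uri=us-central1-docker.pkg.dev/poster-decomposer-12345/qwen-repo/qwen-image-layered:v1 \
    --container-health-route=/health \
    --container-predict-route=/predict \
    --container-ports=8080

# ENDPOINT_ID and MODEL_ID come from the output of the two commands above.
# A min replica count of 0 may be rejected on standard endpoints; use 1 if so.
gcloud ai endpoints deploy-model ENDPOINT_ID \
    --region=us-central1 \
    --model=MODEL_ID \
    --display-name=qwen-v1 \
    --machine-type=n1-standard-4 \
    --accelerator=type=nvidia-tesla-t4,count=1 \
    --min-replica-count=0 \
    --max-replica-count=3
```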

## Testing the API Call

```python
# test_vertex_ai.py
from google.cloud import aiplatform
import base64

# Initialize the SDK
aiplatform.init(
    project="poster-decomposer-12345",
    location="us-central1"
)

# Get the endpoint
endpoint = aiplatform.Endpoint("projects/.../endpoints/...")

# Load the image
with open("test_poster.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

# Send the prediction request
response = endpoint.predict(
    instances=[{
        "image": image_b64,
        "layers": 5,
        "resolution": 1024
    }]
)

# Process the result
layers = response.predictions[0]["layers"]
for i, layer_b64 in enumerate(layers):
    with open(f"layer_{i}.png", "wb") as f:
        f.write(base64.b64decode(layer_b64))

print(f"✅ Saved {len(layers)} layers")
```
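Because the image travels as Base64 inside the JSON body, the request is about a third larger than the file on disk, and online prediction enforces a request size limit. The sketch below checks a file size before sending. The 1.5 MB limit and the 1 KB JSON overhead are assumptions to confirm against the current Vertex AI quota documentation, and the helper names are hypothetical.

```python
import base64

# Assumption: request size limit for online prediction; verify for your endpoint type.
ASSUMED_MAX_REQUEST_BYTES = 1_500_000


def base64_size(raw_size: int) -> int:
    """Length in bytes of the padded Base64 encoding of raw_size bytes."""
    return 4 * ((raw_size + 2) // 3)


def fits_in_request(raw_size: int,
                    limit: int = ASSUMED_MAX_REQUEST_BYTES,
                    overhead: int = 1024) -> bool:
    """True if the encoded image plus an assumed JSON overhead stays within the limit."""
    return base64_size(raw_size) + overhead <= limit


# The formula matches what the standard library actually produces.
print(base64_size(1000), len(base64.b64encode(b"x" * 1000)))
```

Under these assumptions a poster of about 1.1 MB is already close to the ceiling, so larger inputs would need to be resized or passed by reference (for example, as a Cloud Storage path) instead of inline.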

## Results and Lessons

### Success Metrics

```
Deployment time: ~15 min (including the Docker build)
First call (cold start): ~2 min
Subsequent calls (warm): ~30 s
Cost: $0.0075/min × 0.5 min = ~$0.004/request
```
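The per-request cost above is simple arithmetic, and writing it out makes it easy to compare against the $50 monthly budget set earlier. This sketch assumes that $0.0075 is a per-minute compute rate (as the formula implies) and that only active inference time is billed; idle replicas would add to the total. The function names are hypothetical.

```python
def cost_per_request(rate_per_minute: float, minutes_per_request: float) -> float:
    """Compute cost of one request from a per-minute rate and its duration."""
    return rate_per_minute * minutes_per_request


def requests_within_budget(budget: float, per_request: float) -> int:
    """Whole number of requests that fit in the budget, ignoring idle time."""
    return int(budget // per_request)


per_request = cost_per_request(0.0075, 0.5)
print(f"${per_request:.5f} per request")
print(requests_within_budget(50, per_request), "requests within a $50 budget")
```

The exact product is $0.00375, which the metrics round to ~$0.004.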

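### Lessons Learned

1. **Vertex AI is flexible but complex**
   - The Model Garden UI lists only popular models
   - Custom models have to be deployed manually

2. **The Docker image is the key**
   - Baking the model into the image shortens cold starts
   - The larger image (~20GB) is an accepted cost

3. **Trade-offs**
   - Deployment complexity ↑
   - Operating cost ↓

## Next Steps

In v5, the **API endpoint is integrated into the existing FastAPI backend**:
1. Remove the local GPU code
2. Replace it with a Vertex AI client
3. Improve asynchronous processing
4. Strengthen error handling

With the cloud infrastructure in place, we return to the application level.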

---

**Previous post**: [Hugging Face API Test (3/10)](./update-qwen-image-layered-project-v3.md)

**Next post**: [API Endpoint Redesign (5/10)](./update-qwen-image-layered-project-v5.md)

**References**:
- [Deploy from Hugging Face Hub to Vertex AI](https://huggingface.co/blog/alvarobartt/deploy-from-hub-to-vertex-ai)
- [Vertex AI Custom Prediction Routines](https://cloud.google.com/vertex-ai/docs/predictions/custom-prediction-routines)
- [Google Cloud Deep Learning Containers](https://cloud.google.com/deep-learning-containers/docs)