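# Qwen-Image-Layered Cloud Migration (6/10): Cost Optimization Strategies

## Current Cost Structure

Estimated costs for the system built in v5:

```
GPU: T4 (16GB)
Hourly rate: $0.45
Inference time: ~30 s

Cost per request:
$0.45 × (30/3600) = $0.00375 (about ₩5)

By monthly volume:
- 1,000 requests: $3.75
- 10,000 requests: $37.50
- 100,000 requests: $375
```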
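The arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch: the function names are illustrative, and the $0.45 hourly rate and 30-second inference time are the estimates from this post, not measured values.

```python
# Estimated figures from this post (assumptions, not measurements)
HOURLY_RATE_USD = 0.45      # T4 instance, per hour
INFERENCE_SECONDS = 30      # approximate time per request


def cost_per_request(hourly_rate: float = HOURLY_RATE_USD,
                     seconds: float = INFERENCE_SECONDS) -> float:
    """GPU cost of one request, billing only the seconds spent on inference."""
    return hourly_rate * seconds / 3600


def monthly_gpu_cost(requests_per_month: int) -> float:
    """Monthly GPU cost when only active inference time is billed."""
    return requests_per_month * cost_per_request()


if __name__ == "__main__":
    for volume in (1_000, 10_000, 100_000):
        print(f"{volume:>7,} requests: ${monthly_gpu_cost(volume):,.2f}")
```

Keeping the rate and inference time as named constants makes it easy to rerun the estimate once real numbers are available.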

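**Additional costs**:
- Network: ~$0.0002/request
- Storage: ~$0.30/month (model storage)

## Problem: The Cost of Idle Time

Current settings:
```python
min_replica_count=0  # shut down automatically when there are no requests
max_replica_count=3
```

**The cold start problem**:
```
First request → instance starts (2-3 min)
              → user waits 😞
```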

**Solution 1: Keep at least one instance running**

```python
min_replica_count=1  # Always On
```

**Cost increase**:
```
$0.45 × 24 hours × 30 days = $324/month

In exchange: no cold starts → better UX
```

→ **A trade-off between cost and UX**
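One way to make this trade-off concrete is to compute the monthly volume at which an always-on replica costs the same as per-request billing. The sketch below reuses this post's estimates and assumes scale-to-zero bills only the inference seconds (it ignores the idle time before an instance shuts down); the function names are illustrative.

```python
HOURLY_RATE_USD = 0.45   # estimated T4 rate used in this post
INFERENCE_SECONDS = 30   # estimated time per request


def always_on_monthly(days: int = 30) -> float:
    """Cost of keeping one replica running around the clock."""
    return HOURLY_RATE_USD * 24 * days


def break_even_requests(days: int = 30) -> int:
    """Monthly volume at which per-request billing equals one always-on replica."""
    per_request = HOURLY_RATE_USD * INFERENCE_SECONDS / 3600
    return round(always_on_monthly(days) / per_request)


if __name__ == "__main__":
    print(f"Always-on: ${always_on_monthly():.2f}/month")
    print(f"Break-even: {break_even_requests():,} requests/month")
```

The break-even point of 86,400 requests is one request every 30 seconds for the whole month, i.e. a fully utilized replica. Below that volume, the always-on replica is paying for UX rather than for throughput.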

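## Optimization Strategy 1: Smart Scaling

### Adjusting the Minimum Instance Count by Time of Day

```python
# scripts/adjust_scaling_by_hour.py
import time
from datetime import datetime

from google.cloud import aiplatform_v1
from google.protobuf import field_mask_pb2

# Placeholders: fill in for your project
REGION = "us-central1"
ENDPOINT_NAME = f"projects/PROJECT/locations/{REGION}/endpoints/ENDPOINT"
DEPLOYED_MODEL_ID = "DEPLOYED_MODEL_ID"

# Replica counts belong to the deployed model, so they are changed
# through the MutateDeployedModel call
client = aiplatform_v1.EndpointServiceClient(
    client_options={"api_endpoint": f"{REGION}-aiplatform.googleapis.com"}
)

def adjust_min_replicas():
    """Adjust the minimum replica count by time of day."""
    current_hour = datetime.now().hour  # server-local time

    # Business hours (09:00-18:00): at least 1; overnight: 0
    # (0 assumes the deployment supports scaling to zero)
    min_replicas = 1 if 9 <= current_hour < 18 else 0

    # Update the scaling settings
    operation = client.mutate_deployed_model(
        endpoint=ENDPOINT_NAME,
        deployed_model=aiplatform_v1.DeployedModel(
            id=DEPLOYED_MODEL_ID,
            dedicated_resources=aiplatform_v1.DedicatedResources(
                min_replica_count=min_replicas,
                max_replica_count=3,
            ),
        ),
        update_mask=field_mask_pb2.FieldMask(paths=[
            "dedicated_resources.min_replica_count",
            "dedicated_resources.max_replica_count",
        ]),
    )
    operation.result()  # wait for the update to finish

    print(f"✅ {current_hour}:00: min replicas = {min_replicas}")

if __name__ == "__main__":
    while True:
        adjust_min_replicas()
        time.sleep(3600)  # check once an hour
```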

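**Automating with cron**:

```bash
# Cloud Scheduler: call mutateDeployedModel at 09:00 (min 1) and 18:00 (min 0).
# DEPLOYED_MODEL_ID is a placeholder; schedules run in UTC unless --time-zone is set.
URI="https://us-central1-aiplatform.googleapis.com/v1/projects/PROJECT/locations/us-central1/endpoints/ENDPOINT:mutateDeployedModel"

create_job() {  # usage: create_job NAME HOUR MIN_REPLICAS
    gcloud scheduler jobs create http "$1" \
        --schedule="0 $2 * * *" \
        --uri="$URI" \
        --http-method=POST \
        --headers="Content-Type=application/json" \
        --message-body="{\"deployedModel\": {\"id\": \"DEPLOYED_MODEL_ID\", \"dedicatedResources\": {\"minReplicaCount\": $3}}, \"updateMask\": \"dedicatedResources.minReplicaCount\"}" \
        --oauth-service-account-email=vertex-ai-worker@PROJECT.iam.gserviceaccount.com
}
create_job scale-up-min-replicas 9 1
create_job scale-down-min-replicas 18 0
```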

**Cost savings**:
```
Before (Always On 24 hours/day): $324/month
Optimized (Always On 9 hours/day): $121.50/month

Savings: $202.50/month (about 62% lower)
```
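The savings figure depends only on how many hours per day the minimum replica is held. A small sketch, using this post's estimated rate and an illustrative function name, makes it easy to try other windows:

```python
HOURLY_RATE_USD = 0.45  # estimated T4 rate used in this post


def min_replica_cost(hours_per_day: float, days: int = 30, replicas: int = 1) -> float:
    """Monthly cost of holding `replicas` instances up for `hours_per_day` hours."""
    return HOURLY_RATE_USD * hours_per_day * days * replicas


always_on = min_replica_cost(24)
business_hours = min_replica_cost(9)
saved = always_on - business_hours
print(f"saved ${saved:.2f}/month ({saved / always_on:.1%})")
```

This counts only the guaranteed replica. Requests that arrive outside the window are still billed when an instance spins up to serve them.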

## Optimization Strategy 2: Predictive Warm-up

### Usage Pattern Analysis

```python
# scripts/analyze_usage_pattern.py
import redis
from datetime import datetime, timedelta
from collections import defaultdict

redis_client = redis.Redis(decode_responses=True)

def analyze_hourly_requests():
    """Analyze the request pattern by hour of day."""
    hourly_stats = defaultdict(int)

    # Data from the last 7 days
    for i in range(7):
        day = (datetime.now() - timedelta(days=i)).strftime('%Y-%m-%d')

        for hour in range(24):
            key = f"requests:{day}:{hour}"
            count = redis_client.get(key) or 0
            hourly_stats[hour] += int(count)

    # Average over the 7 days
    avg_by_hour = {h: count/7 for h, count in hourly_stats.items()}

    # Identify peak hours (average > 10 req/hr)
    peak_hours = [h for h, avg in avg_by_hour.items() if avg > 10]

    print("=== Average requests by hour ===")
    for hour in sorted(avg_by_hour.keys()):
        print(f"{hour:02d}:00: {avg_by_hour[hour]:.1f} req/hr")

    print(f"\nPeak hours: {peak_hours}")

    return peak_hours
```
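The analysis reads counters under `requests:{day}:{hour}`, which something has to write first. A minimal sketch of that writer side follows. `record_request` and the 8-day TTL are hypothetical choices, and `client` is assumed to be a `redis.Redis`-like object exposing `incr` and `expire`.

```python
from datetime import datetime
from typing import Optional


def request_counter_key(now: datetime) -> str:
    """Build the key in the same format the analysis script reads."""
    return f"requests:{now.strftime('%Y-%m-%d')}:{now.hour}"


def record_request(client, now: Optional[datetime] = None, ttl_days: int = 8) -> str:
    """Increment the hourly counter; `client` is a redis.Redis-like object."""
    key = request_counter_key(now or datetime.now())
    client.incr(key)                       # atomic increment, creates the key at 1
    client.expire(key, ttl_days * 86400)   # keep a little more than the 7-day window
    return key
```

Calling `record_request(redis_client)` once per incoming request would be enough to feed the analysis. The hour is deliberately not zero-padded so the key matches the reader.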

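### Predictive Warm-up

```python
# scripts/predictive_warmup.py
import time

import schedule
from google.cloud import aiplatform, aiplatform_v1
from google.protobuf import field_mask_pb2

# Placeholders: fill in for your project
REGION = "us-central1"
ENDPOINT_NAME = f"projects/PROJECT/locations/{REGION}/endpoints/ENDPOINT"
DEPLOYED_MODEL_ID = "DEPLOYED_MODEL_ID"

def warmup_endpoint():
    """Predictive warm-up: start an instance 10 minutes before a peak hour."""
    # Set the minimum replica count to 1 (replica counts belong to the
    # deployed model, so this goes through MutateDeployedModel)
    client = aiplatform_v1.EndpointServiceClient(
        client_options={"api_endpoint": f"{REGION}-aiplatform.googleapis.com"}
    )
    client.mutate_deployed_model(
        endpoint=ENDPOINT_NAME,
        deployed_model=aiplatform_v1.DeployedModel(
            id=DEPLOYED_MODEL_ID,
            dedicated_resources=aiplatform_v1.DedicatedResources(min_replica_count=1),
        ),
        update_mask=field_mask_pb2.FieldMask(
            paths=["dedicated_resources.min_replica_count"]
        ),
    ).result()

    # Send a dummy request to trigger model loading
    endpoint = aiplatform.Endpoint(ENDPOINT_NAME)
    try:
        endpoint.predict(instances=[{
            "image": "dummy_base64",
            "layers": 1,
            "resolution": 512
        }], timeout=60)
    except Exception:
        pass  # ignore errors: the goal is only to warm up the endpoint

    print("✅ Endpoint warmed up")

# Run 10 minutes before each peak hour
peak_hours = [9, 12, 14, 16]  # based on the analysis results

for hour in peak_hours:
    # e.g. 9:00 peak → 08:50 warm-up; schedule expects zero-padded HH:MM
    warmup_time = f"{(hour - 1) % 24:02d}:50"
    schedule.every().day.at(warmup_time).do(warmup_endpoint)

while True:
    schedule.run_pending()
    time.sleep(60)
```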

**Effect**:
```
Cold starts removed: the endpoint is always warm during peak hours
Cost minimized: instances shut down off-peak, once the minimum is set back to 0
```
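Building the warm-up time by string formatting is easy to get wrong for single-digit hours and around midnight. A small sketch that derives the times with `timedelta` instead (the helper name and the `lead_minutes` parameter are illustrative):

```python
from datetime import datetime, timedelta
from typing import Iterable, List


def warmup_times(peak_hours: Iterable[int], lead_minutes: int = 10) -> List[str]:
    """Return zero-padded HH:MM strings `lead_minutes` before each peak hour."""
    times = []
    for hour in peak_hours:
        # Anchor on an arbitrary date so timedelta handles the midnight wrap
        peak = datetime(2000, 1, 1, hour, 0)
        times.append((peak - timedelta(minutes=lead_minutes)).strftime("%H:%M"))
    return times


print(warmup_times([9, 12, 14, 16, 0]))
# → ['08:50', '11:50', '13:50', '15:50', '23:50']
```

Each returned string can be passed directly to `schedule.every().day.at(...)`.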

## Optimization Strategy 3: Result Caching

### Detecting Duplicate Requests

Different users may upload the same image. Because the cache key is a SHA-256 hash of the file bytes, only byte-identical uploads count as duplicates.

```python
# services/cache_service.py
import hashlib
import json
from typing import Optional

import redis

redis_client = redis.Redis()

def get_image_hash(image_path: str) -> str:
    """Compute the SHA-256 hash of the image file's bytes."""
    with open(image_path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def check_cache(image_hash: str, num_layers: int, resolution: int) -> Optional[dict]:
    """Look up a cached result; return None on a miss."""
    cache_key = f"cache:{image_hash}:{num_layers}:{resolution}"
    cached = redis_client.get(cache_key)

    if cached:
        return json.loads(cached)

    return None

def save_cache(image_hash: str, num_layers: int, resolution: int, result: dict):
    """Cache the result for 24 hours."""
    cache_key = f"cache:{image_hash}:{num_layers}:{resolution}"
    redis_client.setex(cache_key, 86400, json.dumps(result))
```

### Worker Integration

```python
# worker.py (modified)
async def process_job(job_id: str, job_data: dict):
    queue = JobQueue()
    settings = get_settings()

    # 1. Compute the image hash
    image_hash = get_image_hash(job_data["image_path"])

    # 2. Check the cache
    cached_result = check_cache(
        image_hash,
        job_data["num_layers"],
        job_data["resolution"]
    )

    if cached_result:
        # Cache hit: return immediately
        queue.update_job(
            job_id,
            status="completed",
            progress=100,
            message="Done! (cached)",
            layers=cached_result["layers"]
        )
        return

    # 3. Cache miss: call Vertex AI
    try:
        client = VertexAIClient()
        layers = await client.decompose_image_async(...)

        # Save the result and cache it
        # ...
        save_cache(
            image_hash,
            job_data["num_layers"],
            job_data["resolution"],
            {"layers": layer_info},
        )

    except Exception as e:
        ...
```

**Cost savings**:
```
Assuming a 20% cache hit rate:
10,000 requests/month → 2,000 served from cache

Vertex AI calls: 8,000
Cost: $0.00375 × 8,000 = $30

Savings: $7.50/month (20%)
```
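The 20% hit rate is an assumption, so it helps to see how the bill moves with other rates. A minimal sketch using this post's per-request estimate follows. The function name is illustrative, and the cost of running Redis is not included.

```python
COST_PER_REQUEST_USD = 0.00375  # estimate from earlier in this post


def monthly_cost_with_cache(requests: int, hit_rate: float) -> float:
    """GPU cost when a fraction `hit_rate` of requests never reaches Vertex AI."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    billable = requests * (1 - hit_rate)
    return billable * COST_PER_REQUEST_USD


for rate in (0.0, 0.2, 0.5):
    print(f"hit rate {rate:.0%}: ${monthly_cost_with_cache(10_000, rate):.2f}")
```

Savings scale linearly with the hit rate, so measuring the real rate (for example with a daily hit counter) is what decides whether the cache pays off.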

## Optimization Strategy 4: Spot Instances

Google Cloud offers **preemptible (Spot) GPUs** at a steep discount (about 70% off in this post's estimate).

**Caution**: they can be interrupted at any time.

### Configuring Spot Instances

```python
# At deploy time
model.deploy(
    endpoint=endpoint,
    machine_type="n1-standard-4",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    min_replica_count=0,
    max_replica_count=3,
    # Use Spot instances
    spot=True  # ~70% discount (estimate)
)
```

**Cost**:
```
Regular T4: $0.45/hr
Spot T4: $0.135/hr

1,000 requests/month:
- Regular: $3.75
- Spot: $1.13

Savings: $2.62/month (70%)
```

**Drawbacks**:
- Instances can be interrupted without warning
- A request interrupted mid-inference must be retried

### Handling Spot Preemption

```python
async def process_job_with_spot(job_id: str, job_data: dict):
    """Fall back to a regular instance when the Spot instance is preempted."""

    try:
        # Use the Spot endpoint
        result = await spot_client.decompose_image_async(...)
        return result

    except Exception as e:
        if "preempted" in str(e).lower():
            # Spot preempted → retry on a regular instance
            print("⚠️ Spot instance preempted, retrying on regular instance")
            result = await regular_client.decompose_image_async(...)
            return result

        raise
```
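The fallback pattern can be tried without any cloud resources. The sketch below is self-contained: `PreemptedError` is a hypothetical stand-in for however a preemption surfaces to the caller (the code above matches on the error message instead), and the two coroutines simulate the Spot and regular endpoints.

```python
import asyncio
from typing import Awaitable, Callable


class PreemptedError(Exception):
    """Hypothetical error type standing in for a Spot preemption failure."""


async def with_fallback(primary: Callable[[], Awaitable[str]],
                        fallback: Callable[[], Awaitable[str]],
                        retries: int = 1) -> str:
    """Try `primary` up to `retries` + 1 times, then switch to `fallback`."""
    for attempt in range(retries + 1):
        try:
            return await primary()
        except PreemptedError:
            await asyncio.sleep(0.01 * 2 ** attempt)  # brief backoff before the next try
    return await fallback()


async def demo() -> str:
    async def spot_call() -> str:
        raise PreemptedError("spot replica preempted")  # simulate a preemption

    async def regular_call() -> str:
        return "result from regular endpoint"

    return await with_fallback(spot_call, regular_call)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Retrying the Spot endpoint once before falling back keeps most traffic on the cheaper instances. Errors other than a preemption propagate unchanged, as in the handler above.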

## Optimization Strategy 5: Batch Processing

Grouping several requests into one call raises GPU utilization.

### Batch API

```python
# models/vertex_ai_client.py
import base64
import io
from typing import List

from PIL import Image

# Method on the client class (uses self.endpoint)
def decompose_images_batch(
    self,
    image_paths: List[str],
    num_layers: int = 5,
    resolution: int = 1024
) -> List[List[Image.Image]]:
    """Decompose several images in a single prediction request."""

    # Encode every image as Base64
    instances = []
    for path in image_paths:
        with open(path, "rb") as f:
            image_b64 = base64.b64encode(f.read()).decode()
            instances.append({
                "image": image_b64,
                "layers": num_layers,
                "resolution": resolution
            })

    # Send everything in one request
    response = self.endpoint.predict(instances=instances)

    # Parse the results: one list of layer images per input image
    all_layers = []
    for prediction in response.predictions:
        layers_b64 = prediction["layers"]
        layers = [
            Image.open(io.BytesIO(base64.b64decode(lb)))
            for lb in layers_b64
        ]
        all_layers.append(layers)

    return all_layers
```

**Cost effect**:
```
Individual processing:
- 3 requests × 30 s = 90 s of GPU time

Batch processing:
- 3 requests processed together = 45 s of GPU time (estimate)

Savings: 50%
```
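Prediction request payloads are size-limited, so Base64-encoded images cannot be batched without bound. A small sketch of splitting queued image paths into fixed-size batches before calling the batch method (the helper name and the batch size of 3 are illustrative):

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


paths = [f"img_{i}.png" for i in range(7)]
for batch in chunked(paths, batch_size=3):
    print(batch)
```

Each yielded list could be passed as `image_paths` to `decompose_images_batch`. The right batch size depends on image size and GPU memory, which is something to measure rather than assume.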

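## Cost Monitoring Dashboard

### Cloud Monitoring Alerts

```bash
# Alert when cost exceeds the threshold.
# The filter uses a hypothetical custom metric; replace it with the metric you export.
# Single quotes keep the shell from expanding $10.
gcloud alpha monitoring policies create \
    --notification-channels=CHANNEL_ID \
    --display-name="Vertex AI Cost Alert" \
    --combiner=OR \
    --condition-display-name='Daily cost > $10' \
    --condition-filter='metric.type="custom.googleapis.com/vertex_ai/daily_cost_usd" AND resource.type="global"' \
    --if="> 10" \
    --duration=300s
```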

### Grafana Dashboard

```python
# scripts/export_metrics_to_prometheus.py
import time
from datetime import datetime

import redis
from prometheus_client import Gauge, start_http_server

# Metric definitions
daily_cost = Gauge('vertex_ai_daily_cost_usd', 'Daily Vertex AI cost in USD')
request_count = Gauge('vertex_ai_request_count', 'Total requests today')
cache_hit_rate = Gauge('cache_hit_rate', 'Cache hit rate percentage')

redis_client = redis.Redis(decode_responses=True)

def update_metrics():
    """Refresh the exported metrics from Redis."""
    today = datetime.now().strftime('%Y-%m-%d')

    # Cost so far today
    cost = float(redis_client.get(f"cost:{today}") or 0)
    daily_cost.set(cost)

    # Request count
    count = int(redis_client.get(f"requests:{today}") or 0)
    request_count.set(count)

    # Cache hit rate (percent)
    hits = int(redis_client.get(f"cache_hits:{today}") or 0)
    hit_rate = (hits / count * 100) if count > 0 else 0
    cache_hit_rate.set(hit_rate)

if __name__ == "__main__":
    start_http_server(8001)  # Prometheus scrapes this port
    while True:
        update_metrics()
        time.sleep(60)
```

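## Final Cost Comparison

| Strategy | Monthly cost (at 10,000 requests) | Reduction |
|------|----------------------|--------|
| Baseline (Always On) | $324 | 0% |
| Time-based scaling | $121.50 | 62% |
| + Caching (20%) | $97.20 | 70% |
| + Spot Instances | $29.16 | 91% |
| + Batch processing | $14.58 | 95% |

Each row applies its assumed reduction to the row above, so the figures compound: $324 × 9/24 × 0.8 × 0.3 × 0.5 = $14.58.

**Conclusion**: **Under these assumptions, aggressive optimization can cut costs by about 95%.**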
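The table can be regenerated from its assumptions, which makes it easy to see how sensitive the final figure is to each one. In this sketch the step factors are the estimates used throughout this post, and the names are illustrative.

```python
BASELINE_USD = 324.00  # one always-on T4 replica for 30 days

# (label, fraction of the previous row's cost that remains); values assumed in this post
STEPS = [
    ("Time-based scaling", 9 / 24),   # replica held for 9 of 24 hours
    ("+ Caching (20%)", 0.80),        # 20% of requests served from cache
    ("+ Spot Instances", 0.30),       # 70% discount
    ("+ Batch processing", 0.50),     # 90 s → 45 s of GPU time
]


def cumulative_costs(baseline: float = BASELINE_USD):
    """Return (label, cost, total reduction) rows, compounding each step."""
    rows, cost = [("Baseline (Always On)", baseline, 0.0)], baseline
    for label, remaining in STEPS:
        cost *= remaining
        rows.append((label, cost, 1 - cost / baseline))
    return rows


for label, cost, reduction in cumulative_costs():
    print(f"{label:<22} ${cost:>7.2f}  {reduction:.1%}")
```

Because the steps multiply, an error in any single assumption carries through to every row below it. The batch-processing factor in particular is an estimate that the upcoming benchmarks should confirm.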

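## Next Steps

v7 covers **performance benchmarks**:
1. Local GPU vs Vertex AI speed comparison
2. Cold start vs warm start measurements
3. Concurrent request handling tests
4. Deriving the optimal configuration

With the optimization strategies laid out, the next step is to evaluate actual performance quantitatively.

---

**Previous post**: [API Endpoint Redesign (5/10)](./update-qwen-image-layered-project-v5.md)

**Next post**: [Performance Benchmarks (7/10)](./update-qwen-image-layered-project-v7.md)

**References**: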
- [Vertex AI Auto-scaling](https://cloud.google.com/vertex-ai/docs/predictions/autoscaling)
- [Spot VMs on Google Cloud](https://cloud.google.com/compute/docs/instances/spot)
- [Cloud Monitoring Alerts](https://cloud.google.com/monitoring/alerts)