# Qwen-Image-Layered Cloud Migration (7/10): Performance Benchmarks

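## Benchmark Goals

Measure the real-world effect of moving to the cloud:

1. **Processing speed**: local GPU vs. Vertex AI
2. **Cold-start impact**: minimum instances 0 vs. 1
3. **Concurrency**: effect of autoscaling
4. **Network overhead**: latency cost of API calls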

## Test Environment

### Local GPU Environment

```
Hardware:
- GPU: NVIDIA RTX 3090 (24GB)
- CPU: Intel i9-12900K
- RAM: 64GB

Software:
- CUDA: 11.8
- PyTorch: 2.1.0
- Qwen-Image-Layered: latest version
```

### Vertex AI Environment

```
Instance:
- GPU: NVIDIA T4 (16GB)
- vCPU: 4 cores
- RAM: 15GB

Configuration:
- Min replicas: 0 (test 1)
- Min replicas: 1 (test 2)
- Max replicas: 3
```

## Benchmark 1: Single-Request Latency

### Test Cases

```python
# benchmark_single_request.py
import statistics
import tempfile
import time

from PIL import Image

# Test cases: image size and number of layers to extract
test_cases = [
    {"name": "Small", "size": (512, 512), "layers": 3},
    {"name": "Medium", "size": (1024, 1024), "layers": 5},
    {"name": "Large", "size": (2048, 2048), "layers": 5},
]

NUM_RUNS = 5  # timed runs per test case


def summarize(times):
    """Summary statistics for a list of durations in seconds."""
    return {
        "mean": statistics.mean(times),
        "std": statistics.stdev(times),
        "min": min(times),
        "max": max(times),
    }


def benchmark_local_gpu():
    """Benchmark the pipeline on the local GPU."""
    import torch
    from diffusers import QwenImageLayeredPipeline

    pipeline = QwenImageLayeredPipeline.from_pretrained(
        "Qwen/Qwen-Image-Layered"
    )
    pipeline.to("cuda", torch.bfloat16)

    results = {}

    for test in test_cases:
        # Generate a blank test image
        image = Image.new("RGB", test["size"], color="white")

        times = []
        for _ in range(NUM_RUNS):
            start = time.perf_counter()
            pipeline(
                image=image,
                layers=test["layers"],
                resolution=test["size"][0],
            )
            times.append(time.perf_counter() - start)

        results[test["name"]] = summarize(times)

    return results


def benchmark_vertex_ai():
    """Benchmark the deployed Vertex AI endpoint (warm instance)."""
    from app.models.vertex_ai_client import VertexAIClient

    client = VertexAIClient()
    results = {}

    for test in test_cases:
        # Generate a blank test image
        image = Image.new("RGB", test["size"], color="white")

        # Save to a temporary file because the client takes a file path
        with tempfile.NamedTemporaryFile(suffix=".jpg") as f:
            image.save(f, format="JPEG")
            f.flush()

            times = []
            for _ in range(NUM_RUNS):
                start = time.perf_counter()
                client.decompose_image(
                    image_path=f.name,
                    num_layers=test["layers"],
                    resolution=test["size"][0],
                )
                times.append(time.perf_counter() - start)

        results[test["name"]] = summarize(times)

    return results


def print_results(results):
    """Print mean ± standard deviation for each test case."""
    for name, stats in results.items():
        print(f"{name}: {stats['mean']:.2f}s ± {stats['std']:.2f}s")


if __name__ == "__main__":
    print("=== Local GPU benchmark ===")
    print_results(benchmark_local_gpu())

    print("\n=== Vertex AI benchmark (Warm) ===")
    print_results(benchmark_vertex_ai())
```

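### Results

```
=== Local GPU (RTX 3090) ===
Small (512px, 3 layers):   12.3s ± 0.5s
Medium (1024px, 5 layers): 28.7s ± 1.2s
Large (2048px, 5 layers):  67.4s ± 2.8s

=== Vertex AI (T4, Warm) ===
Small (512px, 3 layers):   15.1s ± 0.8s
Medium (1024px, 5 layers): 34.2s ± 1.5s
Large (2048px, 5 layers):  82.3s ± 3.2s

=== Difference (Vertex AI vs. local: slower GPU + network) ===
Small:   +2.8s (23% slower)
Medium:  +5.5s (19% slower)
Large:   +14.9s (22% slower)
```

**Analysis**:
- End to end, Vertex AI on a T4 takes about 19-23% longer than the local RTX 3090, which is expected given the slower GPU.
- The gap combines GPU speed and network overhead; this benchmark does not measure the two separately. In absolute terms it grows with image size, from 2.8s to 14.9s.
- The difference is small enough to be acceptable.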
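
The percentage figures in the results can be reproduced from the mean latencies. The snippet below is a minimal sketch of that arithmetic; the `slowdown` helper is an illustrative name, not part of the benchmark scripts.

```python
# Mean latencies in seconds, copied from the results above
local = {"Small": 12.3, "Medium": 28.7, "Large": 67.4}
vertex = {"Small": 15.1, "Medium": 34.2, "Large": 82.3}


def slowdown(local_s, vertex_s):
    """Return (absolute difference in seconds, relative increase in percent)."""
    diff = vertex_s - local_s
    return diff, diff / local_s * 100


for name in local:
    diff, pct = slowdown(local[name], vertex[name])
    print(f"{name}: +{diff:.1f}s ({pct:.0f}% slower)")
```

Because both measurements are end-to-end, the result is a combined figure. Separating network time from GPU time would require timing the inference on the server side as well.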

## Benchmark 2: Cold Start vs. Warm Start

### Test

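```python
# benchmark_cold_start.py
import time

from google.cloud import aiplatform

# Placeholders (not defined in the original script); set before running
ENDPOINT_NAME = "your-endpoint-id"  # endpoint ID or full resource name
test_image_b64 = ""                 # base64-encoded test image


def timed_predict(endpoint):
    """Send one 1024px, 5-layer request and return the elapsed seconds."""
    start = time.perf_counter()
    endpoint.predict(
        instances=[{
            "image": test_image_b64,
            "layers": 5,
            "resolution": 1024,
        }]
    )
    return time.perf_counter() - start


def trigger_cold_start():
    """Measure a cold-start request followed by a warm request."""
    endpoint = aiplatform.Endpoint(ENDPOINT_NAME)

    # 1. Set the minimum replica count to 0.
    # NOTE: call kept as written in the original script. Depending on the
    # SDK version, replica counts may have to be changed on the deployed
    # model instead; verify against the SDK reference before running.
    endpoint.update(min_replica_count=0)

    # 2. Wait 15 minutes for the instance to shut down
    print("Waiting for instance to shutdown...")
    time.sleep(900)

    # 3. The first request hits a cold endpoint
    print("Sending request (Cold Start)...")
    cold_start_time = timed_predict(endpoint)
    print(f"Cold Start: {cold_start_time:.2f}s")

    # 4. An immediate second request hits the now-warm instance
    print("Sending request (Warm Start)...")
    warm_start_time = timed_predict(endpoint)
    print(f"Warm Start: {warm_start_time:.2f}s")

    return cold_start_time, warm_start_time
```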

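### Results

```
=== Cold Start ===
Instance startup: 127s
Model loading: 43s
Inference: 34s
Total: 204s (3 min 24 s)

=== Warm Start ===
Inference only: 34s

Difference: 170s (cold start takes 6x as long)
```

**Conclusion**:
- A 204-second cold start is fatal for UX.
- During business hours, **always keep at least one instance running**.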
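
To make the breakdown explicit, the sketch below recomputes the cold-start total, the extra wait compared with a warm request, and each phase's share, using only the numbers reported above.

```python
# Cold-start phases in seconds, taken from the results above
phases = {"instance_startup": 127, "model_loading": 43, "inference": 34}

cold_total = sum(phases.values())
warm_total = phases["inference"]  # a warm request only pays for inference

print(f"cold total: {cold_total}s")
print(f"extra wait vs. warm: {cold_total - warm_total}s")
print(f"cold / warm ratio: {cold_total / warm_total:.1f}x")

# Share of the cold-start time spent in each phase
for name, seconds in phases.items():
    print(f"{name}: {seconds / cold_total:.0%}")
```

Instance startup is the largest share of the cold-start time, so keeping a replica warm removes most of the penalty, while a faster model load alone would not.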

## Benchmark 3: Concurrent Requests

### Test: Load Test

```python
# benchmark_concurrent.py
import asyncio
import time

import aiohttp

BASE_URL = "http://localhost:8000"


async def send_request(session, request_id):
    """Submit one decompose job and poll until it finishes."""
    payload = {
        "file_id": "test-uuid",
        "num_layers": 5,
        "resolution": 1024,
    }

    start = time.perf_counter()
    async with session.post(f"{BASE_URL}/api/decompose", json=payload) as response:
        result = await response.json()

    # Poll once per second until the job completes or fails
    job_id = result["job_id"]
    while True:
        async with session.get(f"{BASE_URL}/api/status/{job_id}") as resp:
            status = await resp.json()
            if status["status"] in ("completed", "failed"):
                break
        await asyncio.sleep(1)

    duration = time.perf_counter() - start
    return request_id, duration, status["status"]


async def benchmark_concurrent(num_requests):
    """Fire num_requests requests at once and collect their results."""
    async with aiohttp.ClientSession() as session:
        tasks = [send_request(session, i) for i in range(num_requests)]
        results = await asyncio.gather(*tasks)

    return results


if __name__ == "__main__":
    for concurrency in [1, 5, 10, 20]:
        print(f"\n=== {concurrency} concurrent requests ===")

        results = asyncio.run(benchmark_concurrent(concurrency))

        # Only completed jobs count toward the latency statistics
        durations = [r[1] for r in results if r[2] == "completed"]
        successes = len(durations)

        print(f"Success rate: {successes}/{concurrency} ({successes/concurrency*100:.1f}%)")
        if durations:  # avoid dividing by zero when every request failed
            print(f"Average latency: {sum(durations)/len(durations):.2f}s")
            print(f"Min: {min(durations):.2f}s")
            print(f"Max: {max(durations):.2f}s")
```

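### Results

#### Local GPU (single instance)

```
=== 1 concurrent request ===
Success rate: 1/1 (100%)
Average: 28.7s

=== 5 concurrent requests ===
Success rate: 5/5 (100%)
Average: 142.3s (including queue wait)

=== 10 concurrent requests ===
Success rate: 10/10 (100%)
Average: 285.1s (4 min 45 s!)

=== 20 concurrent requests ===
Success rate: 8/20 (40%) - GPU OOM errors
```

**Problem**: The local GPU cannot process requests in parallel. Requests are handled one at a time, so latency grows with queue depth.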
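
A toy queue model shows why latency climbs this way on a single GPU. It is a simplified sketch, not the benchmark itself: it assumes all requests arrive at the same moment, each takes the measured single-request mean of 28.7s, and the GPU serves them strictly one after another. `sequential_completion_times` is an illustrative name.

```python
def sequential_completion_times(num_requests, service_time):
    """Completion time of each request when one worker serves them in order."""
    return [service_time * (i + 1) for i in range(num_requests)]


SERVICE_TIME = 28.7  # seconds, mean for one 1024px request on the local GPU

for n in (1, 5, 10):
    finished = sequential_completion_times(n, SERVICE_TIME)
    mean = sum(finished) / len(finished)
    print(f"{n:>2} requests: last finishes at {finished[-1]:.1f}s, mean {mean:.1f}s")
```

Under these assumptions the last of 5 requests finishes at about 143.5s and the last of 10 at about 287s, close to the 142.3s and 285.1s reported above. The model's mean completion times are lower (about 86s and 158s) because early requests in the queue finish sooner.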

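#### Vertex AI (autoscaling)

```
=== 1 concurrent request ===
Success rate: 1/1 (100%)
Average: 34.2s

=== 5 concurrent requests ===
Success rate: 5/5 (100%)
Average: 38.1s (scaling begins)

=== 10 concurrent requests ===
Success rate: 10/10 (100%)
Average: 41.7s (scaled up to 3 instances)

=== 20 concurrent requests ===
Success rate: 20/20 (100%)
Average: 47.3s (including wait time)
```

**Conclusion**:
- Vertex AI adds instances automatically.
- Latency stays nearly flat, from 34.2s at 1 request to 47.3s at 20, because requests are processed in parallel.
- **Vertex AI has an overwhelming advantage in scalability.**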

## Benchmark 4: Cost Efficiency

### Scenario: 10,000 Requests per Month

```
# Local GPU
Fixed cost: $80/month (electricity, depreciation)
Variable cost: $0
Total: $80/month

Per request: $0.008

# Vertex AI (before optimization)
GPU time: 10,000 × 30s = 83.3 hours
Cost: 83.3 × $0.45/hr = $37.50/month

Per request: $0.00375

# Vertex AI (after optimization)
Caching (20% hit rate): only 8,000 requests reach the endpoint
Spot Instances (70% discount): $0.135/hr
GPU time: 8,000 × 30s = 66.7 hours
Cost: 66.7 × $0.135/hr = $9/month

Per request: $0.0009
```

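**Conclusion**:
- On these usage-based figures, optimized Vertex AI is **89% cheaper than the local GPU** ($9 vs. $80 per month). The estimate counts only GPU time spent serving requests, not idle time of replicas kept warm.
- Below 1,000 requests per month, Vertex AI wins by a wide margin.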
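
The cost arithmetic above can be wrapped in a small function to try other volumes or rates. This is a sketch under the scenario's own assumptions (30s of GPU time per request, $0.45/hr on demand, $0.135/hr with the discount, 20% cache hit rate). `vertex_monthly_cost` is an illustrative helper, and like the estimate above it ignores idle time of warm replicas.

```python
def vertex_monthly_cost(requests, seconds_per_request=30,
                        hourly_rate=0.45, cache_hit_rate=0.0):
    """Usage-only monthly cost in USD: GPU hours spent serving requests."""
    billable_requests = requests * (1 - cache_hit_rate)
    gpu_hours = billable_requests * seconds_per_request / 3600
    return gpu_hours * hourly_rate


LOCAL_FIXED_COST = 80.0  # USD per month, from the scenario above
MONTHLY_REQUESTS = 10_000

baseline = vertex_monthly_cost(MONTHLY_REQUESTS)
optimized = vertex_monthly_cost(MONTHLY_REQUESTS, hourly_rate=0.135,
                                cache_hit_rate=0.2)
print(f"baseline:  ${baseline:.2f}/month")
print(f"optimized: ${optimized:.2f}/month")

# Monthly volume at which usage-only Vertex AI cost equals the local fixed cost
for label, cost in [("baseline", baseline), ("optimized", optimized)]:
    per_request = cost / MONTHLY_REQUESTS
    break_even = LOCAL_FIXED_COST / per_request
    print(f"{label} break-even: {break_even:,.0f} requests/month")
```

With these inputs the break-even volume is roughly 21,000 requests per month before optimization and roughly 89,000 after, which gives a sense of where a fixed-cost local GPU starts to compete.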

## Final Comparison

| Item | Local GPU (RTX 3090) | Vertex AI (T4) |
|------|---------------------|----------------|
| **Performance** | | |
| Inference time (1024px) | 28.7s | 34.2s |
| Cold Start | N/A | 204s |
| Warm Start | 28.7s | 34.2s |
| Concurrent requests (10) | 285s (sequential) | 42s (parallel) |
| **Cost** (10,000 requests/month) | | |
| Fixed cost | $80 | $0 |
| Variable cost | $0 | $9 (optimized) |
| Total cost | $80 | $9 |
| Per request | $0.008 | $0.0009 |
| **Scalability** | | |
| Max concurrent requests | 1-2 | Scales with replicas (max 3 in this test) |
| Scaling | Not possible | Automatic |
| **Operations** | | |
| Hardware management | Required | Not required |
| Maintenance | Complex | Simple |
## Recommendations

### By Usage Volume

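```
< 100 requests/month:
→ Vertex AI (min=0)
Reason: no fixed cost

100-5,000 requests/month:
→ Vertex AI (min=0, min=1 at peak)
Reason: cost-efficient

5,000-50,000 requests/month:
→ Vertex AI (min=1-2)
Reason: balances UX and cost

50,000+ requests/month:
→ Consider a local GPU
Reason: the fixed cost may be cheaper
```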

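### Recommended Settings

```python
# Settings that work well in most cases

# Business hours (09:00-18:00)
min_replica_count = 1  # keep one instance warm
max_replica_count = 3

# Night (18:00-09:00)
min_replica_count = 0  # save cost
max_replica_count = 1

# Additional optimizations
spot = True           # 70% discount
cache_ttl = 86400     # cache results for 24 hours
batch_size = 3        # batch processing
```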

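## Next Steps

Part 8 implements a **hybrid architecture**:
1. Use Vertex AI first
2. Fall back to Hugging Face on failure
3. Fall back to the local GPU (if available) when both fail
4. Improve resilience

With performance confirmed by these benchmarks, the next step is to improve reliability.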

---

**Previous post**: [Cost Optimization Strategies (6/10)](./update-qwen-image-layered-project-v6.md)

**Next post**: [Implementing a Hybrid Architecture (8/10)](./update-qwen-image-layered-project-v8.md)

**Benchmark code**: [GitHub - benchmarks/](https://github.com/your-org/poster-decomposer/tree/main/benchmarks)