# Streaming Avatar Dev Log - v4: Implementing the MuseTalk Lip-Sync Engine

## Overview

In this post we install MuseTalk, an open-source lip-sync model, and build a real-time lip-sync pipeline on top of it.

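## About MuseTalk

[MuseTalk](https://github.com/TMElyralab/MuseTalk) is a real-time lip-sync model released by Tencent. Its key features:

- **Real-time processing**: 30+ FPS on an RTX 3090
- **High quality**: based on latent-space inpainting
- **Flexibility**: supports a variety of faces and styles
- **License**: MIT (commercial use permitted)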

## 1. Environment Setup

### System Requirements

```
# Minimum specs
- NVIDIA GPU: RTX 3060 (12GB VRAM) or better
- CUDA: 11.8 or later
- Python: 3.10
- RAM: 16GB or more
```
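
Before installing anything, it can save time to compare the machine against the list above. The following is a minimal sketch and not part of MuseTalk: `check_environment` is a hypothetical helper, and PyTorch is imported lazily so the script still runs before PyTorch has been installed.

```python
import sys

def check_environment() -> dict:
    """Collect a few facts about the current machine (hypothetical helper)."""
    report = {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "python_ok": sys.version_info[:2] == (3, 10),  # version used in this post
        "cuda_available": False,
        "vram_gb": None,
    }
    try:
        import torch  # may not be installed yet
    except ImportError:
        return report

    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        report["cuda_available"] = True
        report["vram_gb"] = round(props.total_memory / 1024**3, 1)
    return report

if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

The reported VRAM is usually slightly below the marketed figure, so treat the number as a rough guide rather than a strict pass/fail check.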

### Installation

```bash
# 1. Clone the repository
git clone https://github.com/TMElyralab/MuseTalk.git
cd MuseTalk

# 2. Create a Conda environment
conda create -n musetalk python=3.10
conda activate musetalk

# 3. Install PyTorch (CUDA 11.8)
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 \
    --index-url https://download.pytorch.org/whl/cu118

# 4. Install dependencies
pip install -r requirements.txt

# 5. Download the models
python download_models.py
```

### Model Layout

```
models/
├── musetalk/
│   ├── musetalk.json          # config file
│   ├── pytorch_model.bin      # main model (~2GB)
│   └── audio_encoder.bin      # audio encoder
├── dwpose/
│   └── dw-ll_ucoco_384.onnx   # pose detection
├── face-parse-bisenet/
│   └── model.pth              # face parsing
└── sd-vae-ft-mse/
    └── diffusion_pytorch_model.bin  # VAE
```
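
A missing weight file tends to surface later as an obscure loading error, so it can help to verify the layout right after the download. The sketch below assumes exactly the tree shown above; `missing_model_files` is a hypothetical helper written for this post, not something MuseTalk ships.

```python
from pathlib import Path

# Paths taken from the directory tree above, relative to the models/ folder.
EXPECTED_FILES = [
    "musetalk/musetalk.json",
    "musetalk/pytorch_model.bin",
    "musetalk/audio_encoder.bin",
    "dwpose/dw-ll_ucoco_384.onnx",
    "face-parse-bisenet/model.pth",
    "sd-vae-ft-mse/diffusion_pytorch_model.bin",
]

def missing_model_files(root: str = "models") -> list[str]:
    """Return the expected files that are absent under `root`."""
    base = Path(root)
    return [rel for rel in EXPECTED_FILES if not (base / rel).is_file()]

if __name__ == "__main__":
    missing = missing_model_files()
    if missing:
        print("Missing model files:")
        for rel in missing:
            print(f"  models/{rel}")
    else:
        print("All model files found.")
```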

## 2. Basic Usage

### Single-Image Lip Sync

```python
# test_musetalk.py
from musetalk.inference import MuseTalkInference
import soundfile as sf

# Load the model
model = MuseTalkInference(
    musetalk_config="models/musetalk/musetalk.json",
    musetalk_model="models/musetalk/pytorch_model.bin",
    device="cuda"
)

# Load the avatar image
avatar = model.preprocess_avatar("assets/avatar.png")

# Load the audio
audio, sr = sf.read("assets/speech.wav")

# Generate the lip-sync video
frames = model.generate(
    avatar=avatar,
    audio=audio,
    sample_rate=sr
)

# Save
model.save_video(frames, "output/lipsync.mp4", fps=25)
```

### Example Output

```
Input:  [image] + [5 s of audio]
Output: [5 s video, 25 fps, 125 frames]
Processing time: ~3 s (on an RTX 4090)
```
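
The frame count in the example follows directly from the audio length and the frame rate. As a small sketch (the function name is made up for this post, and rounding up so that a trailing partial frame is still covered is an assumption, not documented MuseTalk behavior):

```python
def expected_frame_count(num_samples: int, sample_rate: int, fps: int = 25) -> int:
    """Frames needed to cover the audio, rounding a partial frame up."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-num_samples * fps // sample_rate)

if __name__ == "__main__":
    five_seconds = 5 * 16000  # 5 s of 16 kHz audio
    print(expected_frame_count(five_seconds, 16000))  # 125
```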

## 3. Building the Real-Time Pipeline

### Streaming Architecture

```
┌─────────────────────────────────────────────────────────────┐
│              Real-time Lip Sync Pipeline                     │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  [Audio Stream]                                              │
│       │                                                      │
│       ▼                                                      │
│  ┌─────────────────┐                                        │
│  │ Audio Chunker   │  Split into 100 ms chunks (16 kHz)     │
│  │ (1600 samples)  │                                        │
│  └────────┬────────┘                                        │
│           │                                                  │
│           ▼                                                  │
│  ┌─────────────────┐                                        │
│  │ Audio Encoder   │  HuBERT feature extraction             │
│  │ (GPU)           │  → 768-dim vectors                     │
│  └────────┬────────┘                                        │
│           │                                                  │
│           ▼                                                  │
│  ┌─────────────────┐                                        │
│  │ MuseTalk Core   │  Latent space lip generation           │
│  │ (GPU)           │  → Face with lip sync                  │
│  └────────┬────────┘                                        │
│           │                                                  │
│           ▼                                                  │
│  ┌─────────────────┐                                        │
│  │ VAE Decoder     │  Latent → RGB frames                   │
│  │ (GPU)           │  → 256x256 or 512x512                  │
│  └────────┬────────┘                                        │
│           │                                                  │
│           ▼                                                  │
│  [Video Frames] → WebRTC Stream                             │
│                                                              │
└─────────────────────────────────────────────────────────────┘
```
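
The first stage of the diagram, the audio chunker, can be sketched in a few lines. This assumes 16 kHz mono input, the default sample rate used by the engine code in this post, where 100 ms is 1600 samples (at 24 kHz the same 100 ms would be 2400 samples). `chunk_audio` is a hypothetical helper; it works on plain lists as well as NumPy arrays because it relies only on slicing.

```python
from typing import Iterator, Sequence

def chunk_audio(
    samples: Sequence[float],
    sample_rate: int = 16000,
    chunk_ms: int = 100,
) -> Iterator[Sequence[float]]:
    """Yield consecutive fixed-duration chunks; the last one may be shorter."""
    chunk_size = sample_rate * chunk_ms // 1000  # 1600 samples at 16 kHz
    if chunk_size <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]

if __name__ == "__main__":
    one_second = [0.0] * 16000
    sizes = [len(chunk) for chunk in chunk_audio(one_second)]
    print(len(sizes), sizes[0])  # 10 chunks of 1600 samples each
```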

### Real-Time Engine Class

```python
# src/lipsync/realtime_engine.py
import torch
import numpy as np
import asyncio
from collections import deque
from musetalk.models import MuseTalkModel, AudioEncoder

class RealtimeLipSyncEngine:
    def __init__(
        self,
        model_path: str = "models/musetalk",
        device: str = "cuda",
        buffer_size: int = 4  # number of recent frames to keep
    ):
        self.device = torch.device(device)
        self.buffer_size = buffer_size

        # Load the models
        self.audio_encoder = AudioEncoder(
            model_path=f"{model_path}/audio_encoder.bin"
        ).to(self.device)

        self.musetalk = MuseTalkModel(
            config_path=f"{model_path}/musetalk.json",
            model_path=f"{model_path}/pytorch_model.bin"
        ).to(self.device)

        # Frame buffer (keeps only the most recent frames)
        self.frame_buffer = deque(maxlen=buffer_size)

        # Avatar cache, keyed by image path
        self.avatar_cache = {}

    @torch.inference_mode()
    def preprocess_avatar(self, image_path: str) -> dict:
        """Preprocess an avatar image and cache the result."""
        if image_path in self.avatar_cache:
            return self.avatar_cache[image_path]

        # Load and preprocess the image
        avatar_data = self.musetalk.prepare_avatar(image_path)

        self.avatar_cache[image_path] = avatar_data
        return avatar_data

    async def process_audio_chunk(
        self,
        avatar_data: dict,
        audio_chunk: np.ndarray,
        sample_rate: int = 16000
    ) -> list[np.ndarray]:
        """Convert one audio chunk into lip-synced frames."""
        # Decorating an async function with @torch.inference_mode() only
        # covers creation of the coroutine, not its body, so the context
        # manager is entered inside the function instead.
        with torch.inference_mode():
            # 1. Audio -> feature vectors
            audio_tensor = torch.from_numpy(audio_chunk).float().to(self.device)
            audio_features = self.audio_encoder(audio_tensor, sample_rate)

            # 2. Features -> lip-sync frames
            frames = []
            chunk_duration = len(audio_chunk) / sample_rate
            num_frames = int(chunk_duration * 25)  # 25 FPS, truncated

            for i in range(num_frames):
                # Pick the feature closest to this frame's position in the
                # chunk (nearest index, no actual interpolation)
                t = i / num_frames
                feat_idx = int(t * len(audio_features))
                audio_feat = audio_features[feat_idx]

                # Generate the lip-sync frame
                frame = self.musetalk.generate_frame(
                    avatar=avatar_data,
                    audio_features=audio_feat
                )
                frames.append(frame)

        return frames

    async def stream_lipsync(
        self,
        avatar_path: str,
        audio_stream: asyncio.Queue
    ):
        """Stream lip-synced frames in real time (async generator)."""
        avatar_data = self.preprocess_avatar(avatar_path)

        while True:
            # Wait for the next audio chunk
            audio_chunk = await audio_stream.get()

            if audio_chunk is None:  # end-of-stream signal
                break

            # Generate frames
            frames = await self.process_audio_chunk(
                avatar_data, audio_chunk
            )

            # Record each frame in the buffer and hand it to the caller
            for frame in frames:
                self.frame_buffer.append(frame)
                yield frame
```
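
One detail worth noting in `process_audio_chunk`: `int(chunk_duration * 25)` truncates, so a 100 ms chunk always yields 2 frames even though it spans 2.5 frame periods. Over ten such chunks that is 20 frames for one second of audio instead of 25. One way to keep video and audio aligned is to carry the fractional remainder across chunks. The sketch below is an illustration under that assumption; `FrameAllocator` is a hypothetical helper, not part of MuseTalk or the engine above.

```python
class FrameAllocator:
    """Decide how many frames each audio chunk should produce.

    Tracks the total number of samples seen so that the cumulative frame
    count always equals floor(total_samples * fps / sample_rate),
    regardless of how the audio is split into chunks.
    """

    def __init__(self, sample_rate: int = 16000, fps: int = 25):
        self.sample_rate = sample_rate
        self.fps = fps
        self.total_samples = 0
        self.frames_emitted = 0

    def frames_for_chunk(self, num_samples: int) -> int:
        self.total_samples += num_samples
        # Integer arithmetic, so no floating-point drift on long streams.
        target = self.total_samples * self.fps // self.sample_rate
        count = target - self.frames_emitted
        self.frames_emitted = target
        return count

if __name__ == "__main__":
    allocator = FrameAllocator()
    counts = [allocator.frames_for_chunk(1600) for _ in range(10)]
    print(counts, sum(counts))  # alternates 2 and 3, 25 frames in total
```

In the engine, `num_frames = allocator.frames_for_chunk(len(audio_chunk))` would take the place of the truncating expression, with one allocator instance per stream.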

### Frame Buffering

```python
# src/lipsync/frame_buffer.py
import asyncio
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class TimedFrame:
    frame: np.ndarray
    timestamp_ms: float

class FrameBuffer:
    """Frame buffer that absorbs jitter."""

    def __init__(
        self,
        target_fps: int = 25,
        buffer_ms: int = 200  # 200 ms of buffering
    ):
        self.target_fps = target_fps
        self.frame_duration_ms = 1000 / target_fps
        self.buffer_ms = buffer_ms

        self.frames: asyncio.Queue[TimedFrame] = asyncio.Queue()
        self.last_frame: Optional[TimedFrame] = None
        self.running = False

    async def add_frame(self, frame: np.ndarray, timestamp_ms: float):
        """Add a frame to the buffer."""
        await self.frames.put(TimedFrame(frame, timestamp_ms))

    async def get_frame(self, current_time_ms: float) -> Optional[np.ndarray]:
        """Return the frame for the current time, or None if none exists yet."""

        # Check whether enough frames are buffered
        if self.frames.qsize() < (self.buffer_ms / self.frame_duration_ms):
            # Underrun: reuse the last frame
            if self.last_frame:
                return self.last_frame.frame
            return None

        # Find the frame that matches the current time
        while not self.frames.empty():
            frame = await self.frames.get()
            self.last_frame = frame

            # Return the first frame at or after the playout point
            # (current time minus the buffer delay); older ones are dropped
            if frame.timestamp_ms >= current_time_ms - self.buffer_ms:
                return frame.frame

        return self.last_frame.frame if self.last_frame else None

    async def smooth_playback(self):
        """Frame scheduler that paces output at the target frame rate."""
        loop = asyncio.get_running_loop()
        start_time = loop.time() * 1000
        frame_count = 0

        self.running = True
        while self.running:
            elapsed = loop.time() * 1000 - start_time

            expected_frame = int(elapsed / self.frame_duration_ms)

            if expected_frame > frame_count:
                frame = await self.get_frame(elapsed)
                if frame is not None:
                    yield frame
                    frame_count = expected_frame

            # Sleep until the next frame tick. Basing this on expected_frame
            # (not frame_count) avoids a zero-delay spin while the buffer
            # has nothing to deliver.
            next_frame_time = (expected_frame + 1) * self.frame_duration_ms
            sleep_time = max(0, (next_frame_time - elapsed) / 1000)
            await asyncio.sleep(sleep_time)
```
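
With the defaults above, a frame lasts `1000 / 25 = 40` ms, so `get_frame` releases new frames only while at least `200 / 40 = 5` frames are queued. The synchronous sketch below mirrors that threshold rule without asyncio or NumPy so the behavior is easy to trace. `SimpleJitterBuffer` is hypothetical, uses strings in place of image arrays, and leaves out the timestamp matching.

```python
from collections import deque
from typing import Optional

class SimpleJitterBuffer:
    def __init__(self, target_fps: int = 25, buffer_ms: int = 200):
        frame_duration_ms = 1000 / target_fps            # 40 ms at 25 FPS
        self.min_frames = buffer_ms / frame_duration_ms  # 5 frames
        self.queue: deque = deque()
        self.last_frame: Optional[str] = None

    def push(self, frame: str) -> None:
        self.queue.append(frame)

    def pop(self) -> Optional[str]:
        """Release a new frame only when enough frames are queued."""
        if len(self.queue) < self.min_frames:
            return self.last_frame  # underrun: repeat (None before the first frame)
        self.last_frame = self.queue.popleft()
        return self.last_frame

if __name__ == "__main__":
    buf = SimpleJitterBuffer()
    for i in range(4):
        buf.push(f"frame{i}")
    print(buf.pop())  # None: fewer than 5 queued, nothing released yet
    buf.push("frame4")
    print(buf.pop())  # frame0: threshold reached, oldest frame released
    print(buf.pop())  # frame0: back below the threshold, last frame repeats
```

The last line shows a consequence of this design: playback only advances while the queue stays at or above the threshold, so the producer has to keep pace with the consumer or the viewer sees repeated frames.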

## 4. Performance Optimization

### GPU Memory Optimization

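```python
# src/lipsync/optimizations.py
import torch

from src.lipsync.realtime_engine import RealtimeLipSyncEngine

class OptimizedLipSyncEngine(RealtimeLipSyncEngine):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        # Enable mixed precision
        self.use_amp = True

        # Compile the model. torch.compile targets the module's forward(),
        # so measure whether generate_frame actually gets faster.
        self.musetalk = torch.compile(
            self.musetalk,
            mode="reduce-overhead"  # tuned for low-latency inference
        )

    async def process_audio_chunk(self, avatar_data, audio_chunk, sample_rate=16000):
        """Run the base implementation under AMP (float16 autocast)."""
        # Context managers are entered inside the coroutine; a decorator on
        # an async function would not cover the awaited body.
        with torch.inference_mode(), torch.autocast(
            device_type=self.device.type,
            dtype=torch.float16,
            enabled=self.use_amp,
        ):
            return await super().process_audio_chunk(
                avatar_data, audio_chunk, sample_rate
            )
```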

### Batch Processing

```python
# src/lipsync/batch_processor.py
from typing import List

import numpy as np
import torch

from src.lipsync.realtime_engine import RealtimeLipSyncEngine

class BatchLipSyncProcessor:
    """Process several frames per batch to improve throughput."""

    def __init__(self, engine: RealtimeLipSyncEngine, batch_size: int = 4):
        self.engine = engine
        self.batch_size = batch_size

    @torch.inference_mode()
    def process_batch(
        self,
        avatar_data: dict,
        audio_features: torch.Tensor
    ) -> List[np.ndarray]:
        """Generate frames one batch at a time."""

        num_features = len(audio_features)
        frames = []

        for i in range(0, num_features, self.batch_size):
            # The final batch may be smaller than batch_size
            batch_features = audio_features[i:i + self.batch_size]

            # Run the batch through the model
            batch_frames = self.engine.musetalk.generate_batch(
                avatar=avatar_data,
                audio_features=batch_features
            )

            frames.extend(batch_frames)

        return frames
```
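
Batching trades latency for throughput, which matters in a streaming setting. Assuming one audio feature per video frame (an assumption made here for illustration), a batch cannot be filled until enough audio has arrived to cover all of its frames. The two helpers below are hypothetical and only do the arithmetic.

```python
def batch_fill_latency_ms(batch_size: int, fps: int = 25) -> float:
    """Audio duration needed to fill one batch, assuming one feature per frame."""
    return batch_size * 1000 / fps

def batch_sizes(num_features: int, batch_size: int) -> list[int]:
    """Sizes of the batches produced by slicing in steps of batch_size."""
    full, rest = divmod(num_features, batch_size)
    return [batch_size] * full + ([rest] if rest else [])

if __name__ == "__main__":
    print(batch_fill_latency_ms(4))  # 160.0 ms of audio per full batch
    print(batch_sizes(10, 4))        # [4, 4, 2]
```

With 100 ms audio chunks, a batch size of 4 therefore never fills from a single chunk, so larger batches mainly pay off for offline or longer-buffered generation.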

## 5. Testing

### Unit Tests

```python
# tests/test_lipsync.py
import pytest
import numpy as np
from src.lipsync.realtime_engine import RealtimeLipSyncEngine

@pytest.fixture
def engine():
    return RealtimeLipSyncEngine(device="cuda")

def test_avatar_preprocessing(engine):
    """Avatar preprocessing test."""
    avatar_data = engine.preprocess_avatar("test_assets/avatar.png")

    assert "face_latent" in avatar_data
    assert "mask" in avatar_data
    assert avatar_data["face_latent"].shape[-1] == 512

@pytest.mark.asyncio
async def test_realtime_processing(engine):
    """Real-time processing test."""
    avatar_data = engine.preprocess_avatar("test_assets/avatar.png")

    # 100 ms of audio (16 kHz)
    audio_chunk = np.random.randn(1600).astype(np.float32)

    frames = await engine.process_audio_chunk(
        avatar_data, audio_chunk, sample_rate=16000
    )

    # 100 ms = ~2.5 frames at 25 fps
    assert 2 <= len(frames) <= 3
    assert frames[0].shape == (256, 256, 3)
```

### Performance Benchmark

```python
# benchmarks/benchmark_lipsync.py
import asyncio
import time

import numpy as np
import torch
from src.lipsync.realtime_engine import RealtimeLipSyncEngine

async def benchmark():
    engine = RealtimeLipSyncEngine()
    avatar_data = engine.preprocess_avatar("test_assets/avatar.png")

    # One second of audio (16 kHz)
    audio = np.random.randn(16000).astype(np.float32)

    # Warm-up (process_audio_chunk is a coroutine, so it must be awaited)
    for _ in range(5):
        await engine.process_audio_chunk(avatar_data, audio[:1600])

    # Benchmark
    torch.cuda.synchronize()
    start = time.perf_counter()

    for _ in range(100):
        frames = await engine.process_audio_chunk(avatar_data, audio)

    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    print(f"Time for 100 runs: {elapsed:.2f}s")
    print(f"Average: {elapsed/100*1000:.1f}ms per second of audio")
    # 100 seconds of audio were processed in `elapsed` seconds
    print(f"Real-time factor: {100/elapsed:.1f}x")

if __name__ == "__main__":
    asyncio.run(benchmark())

# Expected results (RTX 4090):
# Time for 100 runs: 3.2s
# Average: 32ms per second of audio
# Real-time factor: 31x (31 times faster than real time)
```
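
The real-time factor quoted above is the duration of audio processed divided by the wall-clock time it took. A tiny sketch of that arithmetic, using the expected numbers from the benchmark as inputs (the function names are made up for this post):

```python
def real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """How many seconds of audio are processed per second of wall-clock time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds

def ms_per_audio_second(audio_seconds: float, processing_seconds: float) -> float:
    """Average processing time, in milliseconds, per second of audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds * 1000

if __name__ == "__main__":
    # 100 runs over 1 s of audio each, 3.2 s in total
    print(f"{real_time_factor(100, 3.2):.2f}x")        # 31.25x
    print(f"{ms_per_audio_second(100, 3.2):.1f} ms")   # 32.0 ms
```

Any value above 1x means the engine keeps up with live audio; the margin above 1x is what remains for the other pipeline stages such as encoding and streaming.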

## Next Steps (v5)

Next, we set up WebRTC and implement real-time video streaming.

---

## References

- [MuseTalk GitHub](https://github.com/TMElyralab/MuseTalk)
- [MuseTalk Paper](https://arxiv.org/abs/2401.03789)
- [PyTorch AMP Guide](https://pytorch.org/docs/stable/amp.html)

*This series consists of 10 posts in total.*