# v8: Data Preprocessing and Prompt Optimization

## Why Data Preprocessing Matters

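An AI model needs clean, structured data to produce accurate analysis. Preprocessing contributes four things:

1. **Noise removal**: drop data that carries no information
2. **Normalization**: convert values to a consistent format
3. **Aggregation**: group records into meaningful units
4. **Context reduction**: keep token usage under control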
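
The four steps above can be sketched on a handful of toy records. This is a minimal illustration, not the project's implementation: the `Txn` shape and the sample values are hypothetical, and expenses are assumed to arrive as negative amounts.

```typescript
// Hypothetical minimal transaction shape, for illustration only.
interface Txn {
  date: string; // 'yyyy-MM-dd'
  content: string;
  amount: number; // negative = expense in this sketch
  category: string;
}

const raw: Txn[] = [
  { date: '2025-11-01', content: '  Coffee   shop ', amount: -450, category: 'Food' },
  { date: '2025-11-01', content: '', amount: -100, category: 'Food' }, // noise: empty content
  { date: '2025-11-02', content: 'Train', amount: 0, category: 'Transport' }, // noise: zero amount
  { date: '2025-11-03', content: 'Lunch', amount: -900, category: 'Food' },
  { date: '2025-11-04', content: 'Train pass', amount: -5000, category: 'Transport' },
];

// 1. Noise removal: drop zero amounts and empty descriptions
const denoised = raw.filter((t) => t.amount !== 0 && t.content.trim() !== '');

// 2. Normalization: tidy whitespace, use absolute amounts
const normalized = denoised.map((t) => ({
  ...t,
  content: t.content.trim().replace(/\s+/g, ' '),
  amount: Math.abs(t.amount),
}));

// 3. Aggregation: total per category
const totals = new Map<string, number>();
normalized.forEach((t) => {
  totals.set(t.category, (totals.get(t.category) ?? 0) + t.amount);
});

// 4. Context reduction: one compact line instead of every record
const compact = Array.from(totals.entries())
  .sort((a, b) => b[1] - a[1])
  .map(([name, total]) => `${name}: ${total}`)
  .join(', ');

console.log(compact); // Transport: 5000, Food: 1350
```

Five raw records shrink to a single line that still carries the category totals, which is the kind of reduction the rest of this chapter aims for.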

## Advanced Data Preprocessing Implementation

### src/utils/data-preprocessor.ts

```typescript
import { MFTransaction } from '../types';

export class DataPreprocessor {
  /**
   * Clean transaction data
   */
  static clean(transactions: MFTransaction[]): MFTransaction[] {
    return transactions
      .filter((t) => t.amount !== 0) // drop zero-amount transactions (negative amounts are kept)
      .filter((t) => t.content && t.content.trim() !== '') // drop transactions with empty content
      .map((t) => ({
        ...t,
        content: this.cleanContent(t.content),
        amount: Math.abs(t.amount), // normalize to absolute value
      }));
  }

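  /**
   * Clean the transaction description
   */
  private static cleanContent(content: string): string {
    return content
      .trim()
      .replace(/\s+/g, ' ') // collapse runs of whitespace into one space
      .replace(/[^\w\s가-힣ぁ-んァ-ヶー一-龯]/g, '') // strip special characters (keep Latin word characters, Hangul, Japanese)
      .slice(0, 50); // cap at 50 characters
  }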

  /**
   * Summarize transactions to reduce token usage
   */
  static summarizeForAI(
    transactions: MFTransaction[],
    maxTransactions: number = 100
  ): {
    summary: string;
    fullData: MFTransaction[];
    sampledData: MFTransaction[];
  } {
    const cleaned = this.clean(transactions);

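    // Few transactions: use them all
    if (cleaned.length <= maxTransactions) {
      return {
        summary: `All ${cleaned.length} transactions`,
        fullData: cleaned,
        sampledData: cleaned,
      };
    }

    // Many transactions: sample
    // 1. The 50 most recent (sorted explicitly; assumes ISO-style date strings such as yyyy-MM-dd)
    const recent = [...cleaned]
      .sort((a, b) => b.date.localeCompare(a.date))
      .slice(0, 50);

    // 2. The 30 highest-value transactions
    const highValue = [...cleaned]
      .sort((a, b) => b.amount - a.amount)
      .slice(0, 30);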

    // 3. Representative transactions per category
    const byCategoryMap = new Map<string, MFTransaction[]>();
    cleaned.forEach((t) => {
      const category = t.category.name;
      if (!byCategoryMap.has(category)) {
        byCategoryMap.set(category, []);
      }
      byCategoryMap.get(category)!.push(t);
    });

    const categoryReps = Array.from(byCategoryMap.values())
      .map((txns) => txns[0]) // first transaction of each category
      .slice(0, 20);

    // Merge and de-duplicate by id
    const sampledMap = new Map<string, MFTransaction>();
    [...recent, ...highValue, ...categoryReps].forEach((t) => {
      sampledMap.set(t.id, t);
    });

    const sampledData = Array.from(sampledMap.values()).slice(
      0,
      maxTransactions
    );

    return {
      summary: `Sampled ${sampledData.length} representative transactions out of ${cleaned.length}`,
      fullData: cleaned,
      sampledData,
    };
  }

  /**
   * Build per-category statistics
   */
  static getCategoryStats(transactions: MFTransaction[]) {
    const categoryMap = new Map<
      string,
      { amount: number; count: number; transactions: MFTransaction[] }
    >();

    transactions.forEach((t) => {
      const category = t.category.name;
      if (!categoryMap.has(category)) {
        categoryMap.set(category, { amount: 0, count: 0, transactions: [] });
      }

      const stats = categoryMap.get(category)!;
      stats.amount += t.amount;
      stats.count += 1;
      stats.transactions.push(t);
    });

    return Array.from(categoryMap.entries())
      .map(([name, stats]) => ({
        name,
        amount: stats.amount,
        count: stats.count,
        avgAmount: stats.amount / stats.count,
        transactions: stats.transactions,
      }))
      .sort((a, b) => b.amount - a.amount);
  }

  /**
   * Analyze patterns over time (transaction count per date, busiest first)
   */
  static getTimePatterns(transactions: MFTransaction[]) {
    const dateMap = new Map<string, number>();

    transactions.forEach((t) => {
      const date = t.date;
      dateMap.set(date, (dateMap.get(date) || 0) + 1);
    });

    return Array.from(dateMap.entries())
      .map(([date, count]) => ({ date, count }))
      .sort((a, b) => b.count - a.count);
  }

  /**
   * Outlier detection (statistical method)
   */
  static detectOutliers(transactions: MFTransaction[]): MFTransaction[] {
    if (transactions.length < 10) return [];

    const amounts = transactions.map((t) => t.amount);
    const mean = amounts.reduce((a, b) => a + b, 0) / amounts.length;
    const variance =
      amounts.reduce((sum, val) => sum + Math.pow(val - mean, 2), 0) /
      amounts.length;
    const stdDev = Math.sqrt(variance);

    // Transactions more than 2 standard deviations away from the mean
    const threshold = 2;
    return transactions.filter(
      (t) => Math.abs(t.amount - mean) > threshold * stdDev
    );
  }
}
```
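
To see how the two-standard-deviation rule in `detectOutliers` behaves, here is a standalone sketch that applies the same arithmetic to plain numbers. The function name `zScoreOutliers` and the sample amounts are hypothetical and exist only for this illustration.

```typescript
// Same rule as detectOutliers, applied to bare amounts.
// Uses the population standard deviation (divide by n), as the class does.
function zScoreOutliers(amounts: number[], threshold: number = 2): number[] {
  if (amounts.length < 10) return []; // too few points for a meaningful estimate

  const mean = amounts.reduce((a, b) => a + b, 0) / amounts.length;
  const variance =
    amounts.reduce((sum, v) => sum + Math.pow(v - mean, 2), 0) / amounts.length;
  const stdDev = Math.sqrt(variance);

  return amounts.filter((v) => Math.abs(v - mean) > threshold * stdDev);
}

// Nine everyday purchases and one large one (made-up values).
const sample = [500, 600, 550, 700, 650, 480, 520, 610, 590, 30000];

// mean = 3520, stdDev is roughly 8827, so the cutoff is about 17654 from the mean.
console.log(zScoreOutliers(sample)); // [ 30000 ]
```

Note that the single large value pulls the mean up to 3520 and inflates the standard deviation, so this rule only flags values that are far out. A median-based rule would be less sensitive to that effect, at the cost of a little more code.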

## Optimized Prompt Builder

### src/utils/prompt.ts (updated)

```typescript
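import { DataPreprocessor } from './data-preprocessor';
// MFTransaction, TransactionAnalyzer, and Formatter are assumed to be
// imported already in this file from earlier versions.

export class PromptBuilder {
  // Existing methods such as buildSummaryPrompt stay unchanged.

  /**
   * Optimized financial summary prompt
   */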
  static buildOptimizedSummaryPrompt(
    transactions: MFTransaction[],
    fromDate: string,
    toDate: string
  ): string {
    // Preprocess the data
    const { summary, sampledData } = DataPreprocessor.summarizeForAI(
      transactions,
      100
    );

    const stats = TransactionAnalyzer.getBasicStats(transactions);
    const categoryStats = DataPreprocessor.getCategoryStats(
      transactions.filter((t) => !t.is_income)
    );

    const outliers = DataPreprocessor.detectOutliers(transactions);

    const prompt = `
あなたは経験豊富なファイナンシャルアドバイザーです。
以下の家計データを分析し、実用的なアドバイスを提供してください。

## 📅 分析期間
${fromDate} ~ ${toDate}

## 📊 データ概要
${summary}

## 💰 財務サマリー
- 総収入: ¥${stats.totalIncome.toLocaleString()}
- 総支出: ¥${stats.totalExpense.toLocaleString()}
- 純資産変動: ¥${stats.netChange.toLocaleString()}
- 取引件数: ${stats.transactionCount}件
- 平均支出額: ¥${Math.round(stats.avgExpense).toLocaleString()}

## 🏷️ カテゴリ別支出 (TOP 10)
${categoryStats.slice(0, 10).map((c, i) => {
  const percentage = stats.totalExpense > 0 ? ((c.amount / stats.totalExpense) * 100).toFixed(1) : '0.0';
  return `${i + 1}. ${c.name}: ¥${c.amount.toLocaleString()} (${percentage}%, ${c.count}件, 平均¥${Math.round(c.avgAmount).toLocaleString()})`;
}).join('\n')}

${
  outliers.length > 0
    ? `\n## ⚠️ 高額取引 (異常値)\n${outliers
        .slice(0, 5)
        .map((t) => `- ${t.date}: ${t.content} ¥${t.amount.toLocaleString()} (${t.category.name})`)
        .join('\n')}`
    : ''
}

## 📝 代表的な取引サンプル
${Formatter.transactionsToText(sampledData.slice(0, 20))}

## 🎯 分析依頼
以下の項目について、簡潔かつ具体的に分析してください:

1. **一言サマリー** (1-2文)
   - この期間の財務状況を端的に要約

2. **主要インサイト** (3-5項目)
   - 特筆すべきポイント
   - データから読み取れる傾向
   - 注意すべき点

3. **カテゴリ分析**
   - 最も大きな支出カテゴリについてのコメント
   - 支出の適切性評価

4. **実践的アドバイス** (3-5項目)
   - 具体的な節約方法
   - 支出最適化の提案
   - すぐに実行できるアクション

**重要**:
- 簡潔で読みやすい文章で書いてください
- 具体的な数値を示してください
- 実行可能なアドバイスを優先してください
- 絵文字を適度に使用してください
    `.trim();

    return prompt;
  }

  /**
   * Detailed analysis prompt optimized for the context window
   */
  static buildCompactDetailedPrompt(
    transactions: MFTransaction[],
    fromDate: string,
    toDate: string
  ): string {
    const stats = TransactionAnalyzer.getBasicStats(transactions);
    const categoryStats = DataPreprocessor.getCategoryStats(
      transactions.filter((t) => !t.is_income)
    );

    const timePatterns = DataPreprocessor.getTimePatterns(transactions);

    // Keep the prompt compact and statistics-focused
    const prompt = `
家計データ分析レポートを作成してください。

期間: ${fromDate} ~ ${toDate}

財務指標:
- 収入: ¥${stats.totalIncome.toLocaleString()}
- 支出: ¥${stats.totalExpense.toLocaleString()}
- 差額: ¥${stats.netChange.toLocaleString()}

カテゴリ別(TOP 5):
${categoryStats.slice(0, 5).map((c) => `${c.name}: ¥${c.amount.toLocaleString()}`).join(', ')}

最も取引が多かった日:
${timePatterns.slice(0, 3).map((p) => `${p.date}(${p.count}件)`).join(', ')}

レポート形式:
1. 概要(2-3文)
2. 主要発見(3項目)
3. 推奨事項(3項目)

簡潔に、ビジネスライクに書いてください。
    `.trim();

    return prompt;
  }
}
```

## Prompt A/B Test

### src/cli-test-prompts.ts

```typescript
import 'dotenv/config';
import { MoneyForwardClient } from './moneyforward/client';
import { VertexAIClient } from './vertexai/client';
import { PromptBuilder } from './utils';
import { format, subMonths } from 'date-fns';

async function main() {
  console.log('🧪 Prompt A/B test\n');

  const mfClient = new MoneyForwardClient();
  const aiClient = new VertexAIClient();

  const toDate = format(new Date(), 'yyyy-MM-dd');
  const fromDate = format(subMonths(new Date(), 1), 'yyyy-MM-dd');

  console.log(`📅 Period: ${fromDate} ~ ${toDate}\n`);

  const transactions = await mfClient.getAllTransactions(fromDate, toDate);

  if (transactions.length === 0) {
    console.log('No transactions found');
    return;
  }

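  // Test A: baseline prompt
  console.log('='.repeat(60));
  console.log('📋 Test A: baseline prompt');
  console.log('='.repeat(60) + '\n');

  const promptA = PromptBuilder.buildSummaryPrompt(
    transactions,
    fromDate,
    toDate
  );

  // Rough estimate only: ~4 characters per token is an English rule of thumb,
  // and Japanese text typically needs more tokens per character.
  console.log(`Estimated tokens: ${Math.ceil(promptA.length / 4)}\n`);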

  const responseA = await aiClient.generateText({
    prompt: promptA,
    temperature: 0.3,
  });

  console.log(responseA.text);
  console.log('\n');

  // Test B: optimized prompt
  console.log('='.repeat(60));
  console.log('📋 Test B: optimized prompt');
  console.log('='.repeat(60) + '\n');

  const promptB = PromptBuilder.buildOptimizedSummaryPrompt(
    transactions,
    fromDate,
    toDate
  );

  // Same length / 4 heuristic as Test A, so the two numbers are comparable
  console.log(`Estimated tokens: ${Math.ceil(promptB.length / 4)}\n`);

  const responseB = await aiClient.generateText({
    prompt: promptB,
    temperature: 0.3,
  });

  console.log(responseB.text);
  console.log('\n');

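  // Comparison
  console.log('='.repeat(60));
  console.log('📊 Comparison');
  console.log('='.repeat(60));
  console.log(`Prompt A length: ${promptA.length} chars`);
  console.log(`Prompt B length: ${promptB.length} chars`);
  console.log(`Difference: ${promptA.length - promptB.length} chars (${((1 - promptB.length / promptA.length) * 100).toFixed(1)}% reduction)`);
}

// Surface failures (network, auth) instead of leaving an unhandled rejection
main().catch((err) => {
  console.error(err);
  process.exit(1);
});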
```
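
The script estimates tokens as `length / 4`, which is a rule of thumb for English text. Because the prompts here are mostly Japanese, a script-aware estimate may track real usage more closely. The sketch below is only a heuristic under stated assumptions: the function name `estimateTokens` is hypothetical, and the weights (about one token per CJK or Hangul character, about four other characters per token) are guesses rather than the behavior of any particular tokenizer. For exact numbers, use the token counts reported by the model API.

```typescript
// Heuristic token estimate that treats CJK/Hangul characters separately.
// Assumed weights: 1 token per CJK or Hangul character, 4 other characters per token.
function estimateTokens(text: string): number {
  // Hiragana + Katakana (U+3040-U+30FF), CJK ideographs (U+4E00-U+9FFF),
  // Hangul syllables (U+AC00-U+D7AF)
  const cjkMatches = text.match(/[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]/g);
  const cjkCount = cjkMatches ? cjkMatches.length : 0;
  const otherCount = text.length - cjkCount;

  return cjkCount + Math.ceil(otherCount / 4);
}

console.log(estimateTokens('abcdefgh')); // 2
console.log(estimateTokens('家計データ')); // 5
console.log(estimateTokens('abcdefgh家計')); // 4
```

Whichever estimate is used, applying the same one to prompt A and prompt B keeps the comparison fair.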

## Caching Strategy

### src/utils/cache.ts

```typescript
import fs from 'fs/promises';
import path from 'path';
import crypto from 'crypto';

export class AnalysisCache {
  private cacheDir: string;

  constructor(cacheDir: string = '.cache') {
    this.cacheDir = cacheDir;
  }

  /**
   * Build a cache key (MD5 is used only as a fingerprint here, not for security)
   */
  private getCacheKey(data: unknown): string {
    const hash = crypto.createHash('md5');
    hash.update(JSON.stringify(data));
    return hash.digest('hex');
  }

  /**
   * Store a value in the cache (ttl in milliseconds, default 1 hour)
   */
  async set(key: string, value: unknown, ttl: number = 3600000): Promise<void> {
    await fs.mkdir(this.cacheDir, { recursive: true });

    const cacheData = {
      value,
      expiresAt: Date.now() + ttl,
    };

    const filePath = path.join(this.cacheDir, `${key}.json`);
    await fs.writeFile(filePath, JSON.stringify(cacheData));
  }

  /**
   * Look up a cached value
   */
  async get<T>(key: string): Promise<T | null> {
    try {
      const filePath = path.join(this.cacheDir, `${key}.json`);
      const data = await fs.readFile(filePath, 'utf-8');
      const cacheData = JSON.parse(data);

      // Check expiry
      if (cacheData.expiresAt < Date.now()) {
        await fs.unlink(filePath);
        return null;
      }

      return cacheData.value as T;
    } catch {
      return null;
    }
  }

  /**
   * Cache an analysis result
   */
  async cacheAnalysis(
    transactions: any[],
    fromDate: string,
    toDate: string,
    result: string
  ): Promise<void> {
    const key = this.getCacheKey({ transactions, fromDate, toDate });
    await this.set(`analysis-${key}`, result, 3600000); // 1 hour
  }

  /**
   * Look up a cached analysis result
   */
  async getCachedAnalysis(
    transactions: any[],
    fromDate: string,
    toDate: string
  ): Promise<string | null> {
    const key = this.getCacheKey({ transactions, fromDate, toDate });
    return await this.get<string>(`analysis-${key}`);
  }
}
```
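
One thing to watch with `getCacheKey`: `JSON.stringify` writes object keys in insertion order, so two transactions with identical fields built in a different key order produce different hashes and miss the cache. A minimal sketch of an order-independent serializer follows. The helper name `stableStringify` is hypothetical, and it assumes JSON-safe input (no `undefined`, functions, or circular references).

```typescript
// Serialize a JSON-safe value with object keys sorted, so that
// logically equal objects always produce the same string.
function stableStringify(value: unknown): string {
  if (value === null || typeof value !== 'object') {
    return JSON.stringify(value); // primitives: string, number, boolean, null
  }
  if (Array.isArray(value)) {
    // Array order is meaningful, so it is preserved.
    return `[${value.map((item) => stableStringify(item)).join(',')}]`;
  }
  const obj = value as Record<string, unknown>;
  const body = Object.keys(obj)
    .sort()
    .map((key) => `${JSON.stringify(key)}:${stableStringify(obj[key])}`)
    .join(',');
  return `{${body}}`;
}

const a = { id: 't1', amount: 450, category: { name: 'Food' } };
const b = { category: { name: 'Food' }, amount: 450, id: 't1' };

console.log(JSON.stringify(a) === JSON.stringify(b)); // false
console.log(stableStringify(a) === stableStringify(b)); // true
```

Passing the output of such a helper to `hash.update` instead of `JSON.stringify(data)` would make the cache key independent of key order.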

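## Checklist

Before wrapping up v8, confirm the following:

- [ ] `DataPreprocessor` class implemented
- [ ] Optimized prompt templates written
- [ ] Token usage reduction confirmed
- [ ] A/B test script run
- [ ] Caching strategy implemented
- [ ] Improved analysis quality confirmed

## Next Steps

v9 integrates the full workflow and completes the CLI interface.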

---

**Date written**: 2025-11-30
**Status**: ✅ Complete
**Next**: v9 - Integration and CLI Interface