Text-to-Image

Learning objective: Understand how models that generate images from text (Stable Diffusion, the DALL-E family) work, and how to write effective prompts.

What is Text-to-Image?

A Text-to-Image model is an AI system that generates an image from a natural-language description (the prompt). Representative examples: Stable Diffusion, DALL-E, Midjourney, Imagen.

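🌟 How it works in three steps

1. Text encoding

A text encoder such as CLIP converts the prompt into semantic vectors (embeddings).

2. Image generation

A diffusion model generates the image from noise a little at a time, conditioned on the text vectors.

3. Decoding

A VAE decoder maps the result from latent space back to pixel space to produce the final image.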

📐 Architecture (Stable Diffusion)

[ Prompt ]
       │
       ▼
[ CLIP text encoder ]   ──→ semantic vectors
       │
       ▼
[ U-Net ]   ←──── random noise
       │   (denoised gradually over 50 steps)
       ▼
[ Latent representation (64×64) ]
       │
       ▼
[ VAE decoder ]
       │
       ▼
[ Image (512×512) ]
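
The sizes in the diagram follow from the VAE's downsampling factor. Below is a small sketch of that arithmetic. It assumes an 8× reduction per side and 4 latent channels; the channel count is the Stable Diffusion v1 value and is not stated in the diagram. `latent_shape` is a hypothetical helper, not a library function.

```python
def latent_shape(height, width, factor=8, channels=4):
    """Return (channels, h, w) of the latent for an image of the given size.

    factor and channels are assumptions matching Stable Diffusion v1.
    """
    if height % factor or width % factor:
        raise ValueError("image size must be a multiple of the downsampling factor")
    return (channels, height // factor, width // factor)


c, h, w = latent_shape(512, 512)
pixel_values = 3 * 512 * 512          # values in an RGB image
latent_values = c * h * w             # values the U-Net actually works on

print((c, h, w))                      # (4, 64, 64)
print(pixel_values / latent_values)   # 48.0
```

Under these assumptions the diffusion model handles about 48 times fewer values than it would in pixel space, which is the point of working in latent space.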

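🔑 Key components

Component	Role	Examples
Text encoder	Turns meaning into numbers	CLIP, T5
Diffusion model	Generates the image from noise	DDPM, U-Net
Latent space	Keeps compute cost down	VAE compression (8× per side)
Conditioning	Steers generation with text	Cross-Attention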

💡 Prompt examples

Here are standard prompts in six categories. Copy them and try them in Stable Diffusion or a similar environment.

🌅 Landscape

A serene mountain landscape at sunset,
golden hour lighting, dramatic clouds,
photorealistic, 8k, highly detailed

👤 Portrait

Portrait of a young woman with long flowing hair,
soft natural lighting, studio portrait,
shallow depth of field, professional photography

🎨 Art

Abstract digital art with vibrant colors,
flowing shapes, modern composition,
inspired by Kandinsky and Mondrian

🐈 Animals

A cute fluffy kitten playing with yarn,
warm indoor lighting, cozy atmosphere,
photorealistic, adorable expression

🚀 Sci-fi

Futuristic cyberpunk cityscape at night,
neon lights, flying cars, rain-soaked streets,
cinematic atmosphere, Blade Runner style

🍳 Food

Gourmet sushi platter with fresh sashimi,
artistic presentation, soft studio lighting,
food photography, mouth-watering

📋 Basic prompt structure

[subject] + [details/attributes] + [style] + [quality] + [camera/lighting]

Example:
A majestic dragon       ← subject
with golden scales,     ← details
fantasy art style,      ← style
highly detailed, 8k,    ← quality
dramatic lighting       ← lighting
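
The structure above can also be assembled programmatically, which helps when you vary one component at a time. This is a minimal sketch; `build_prompt` and its parameter names are hypothetical, chosen to mirror the five slots.

```python
def build_prompt(subject, details=(), style=(), quality=(), lighting=()):
    """Join prompt components in the order:
    subject, details, style, quality, camera/lighting."""
    parts = [subject, *details, *style, *quality, *lighting]
    # Trim stray spaces and commas, then drop empty entries before joining
    cleaned = [p.strip(" ,") for p in parts]
    return ", ".join(p for p in cleaned if p)


prompt = build_prompt(
    "A majestic dragon",
    details=["with golden scales"],
    style=["fantasy art style"],
    quality=["highly detailed", "8k"],
    lighting=["dramatic lighting"],
)
print(prompt)
# A majestic dragon, with golden scales, fantasy art style, highly detailed, 8k, dramatic lighting
```

Keeping each slot as a separate list makes it easy to swap only the style or only the lighting while holding the subject fixed.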

🎨 What modifiers do

Category	Examples	Effect
Quality	highly detailed, masterpiece, 8k, ultra realistic	Raises the quality of fine detail
Style	oil painting, watercolor, anime, photorealistic	Changes the visual style
Lighting	golden hour, dramatic lighting, soft natural	Changes the mood
Camera	wide angle, macro, telephoto, 35mm	Changes the viewpoint
Emotion	melancholic, joyful, mysterious, energetic	Sets the emotional tone

🎭 Negative prompts

Write the elements you want to avoid in a separate field. This is a powerful technique that many models and tools support.

Positive: A beautiful landscape photo
Negative: blurry, low quality, watermark, text, oversaturated, distorted
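
In common implementations (for example the Stable Diffusion pipeline in diffusers), the negative prompt takes the place of the empty "unconditional" prompt in classifier-free guidance, so each denoising step is pushed away from it. Below is a toy sketch of that combination step. Plain lists stand in for the U-Net's noise predictions, and `guided_noise` and the numbers are illustrative assumptions.

```python
def guided_noise(noise_negative, noise_positive, guidance_scale=7.5):
    """Classifier-free guidance: start from the negative-prompt prediction
    and extrapolate toward (and past) the positive-prompt prediction."""
    return [
        n + guidance_scale * (p - n)
        for n, p in zip(noise_negative, noise_positive)
    ]


noise_pos = [1.0, 0.0, 1.0]   # prediction conditioned on the positive prompt
noise_neg = [0.0, 0.0, 2.0]   # prediction conditioned on the negative prompt

print(guided_noise(noise_neg, noise_pos, guidance_scale=2.0))  # [2.0, 0.0, 0.0]
```

With a guidance scale of 1 the result equals the positive prediction; larger values move further away from whatever the negative prompt describes.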

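🔀 Weighting

You can emphasize a keyword by attaching a weight, as in (keyword:1.3); values below 1 de-emphasize it. The notation differs by model and tool. The example here uses the parenthesis style of the AUTOMATIC1111 web UI.

A landscape with (dramatic lighting:1.4) and (vibrant colors:1.2),
(boring composition:0.7)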
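
To make the weight notation concrete, here is a minimal parser sketch for the flat `(text:weight)` form only. Real tools also handle nesting, escapes, and other bracket forms; `parse_weights` is a hypothetical helper and not part of any library.

```python
import re

# Matches "(some text:1.3)"; the weight is a plain decimal number
_WEIGHT = re.compile(r"\(([^():]+):([0-9]*\.?[0-9]+)\)")


def parse_weights(prompt):
    """Return (text, weight) pairs; unweighted text gets weight 1.0."""
    pieces, pos = [], 0
    for m in _WEIGHT.finditer(prompt):
        plain = prompt[pos:m.start()].strip(" ,\n")
        if plain:
            pieces.append((plain, 1.0))
        pieces.append((m.group(1).strip(), float(m.group(2))))
        pos = m.end()
    tail = prompt[pos:].strip(" ,\n")
    if tail:
        pieces.append((tail, 1.0))
    return pieces


print(parse_weights("A landscape with (dramatic lighting:1.4) and (vibrant colors:1.2)"))
# [('A landscape with', 1.0), ('dramatic lighting', 1.4), ('and', 1.0), ('vibrant colors', 1.2)]
```

A front end would typically use such weights to scale the corresponding token embeddings before they reach the U-Net; how exactly that scaling is applied varies by tool.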

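📚 Training Text-to-Image models

Text-to-Image models are trained on large numbers of (image, text caption) pairs. A representative dataset is LAION-5B (about 5.8 billion pairs).

Stage 1: Text-image semantic alignment (CLIP)

CLIP (Contrastive Language-Image Pre-training) maps images and their captions into a shared semantic space.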

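# CLIP contrastive learning (sketch; image_encoder, text_encoder, dataloader,
# and optimizer are assumed to be defined elsewhere)
import torch
import torch.nn.functional as F

temperature = 0.07  # CLIP learns this value; fixed here for simplicity

for images, texts in dataloader:
    image_emb = image_encoder(images)   # (N, D)
    text_emb  = text_encoder(texts)     # (N, D)

    # L2-normalize so the dot product is cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb  = F.normalize(text_emb, dim=-1)

    # Similarity matrix, sharpened by dividing by the temperature
    logits = image_emb @ text_emb.T / temperature  # (N, N)

    # The diagonal is the correct answer (pairs at the same index match)
    labels = torch.arange(logits.size(0), device=logits.device)
    loss = (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()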
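
To see what the symmetric loss rewards, the following sketch evaluates it by hand on a 2×2 cosine-similarity matrix using only the standard library. The matrices and the fixed temperature of 0.07 are illustrative assumptions; `clip_loss` is a hypothetical helper.

```python
import math


def cross_entropy(logits_row, target):
    """Cross-entropy of one row of logits against the target index."""
    m = max(logits_row)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits_row))
    return log_sum - logits_row[target]


def clip_loss(sim, temperature=0.07):
    """Symmetric contrastive loss for an N x N cosine-similarity matrix."""
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # transpose
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    text_to_image = sum(cross_entropy(logits_t[i], i) for i in range(n)) / n
    return (image_to_text + text_to_image) / 2


aligned = [[0.9, 0.1], [0.2, 0.8]]    # matching pairs score highest
shuffled = [[0.1, 0.9], [0.8, 0.2]]   # mismatched pairs score highest

print(clip_loss(aligned) < clip_loss(shuffled))  # True
```

The loss is near zero when the diagonal dominates and large when off-diagonal pairs score higher, which is what drives matching images and captions together.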

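Stage 2: Training the diffusion model (text-conditioned)

# Training loop for a text-conditioned diffusion model (sketch; assumes
# diffusers-style vae, unet, and scheduler objects, a CLIP text encoder,
# and a dataloader that yields images with tokenized captions)
import torch
import torch.nn.functional as F

T = 1000  # number of diffusion timesteps used in training

for images, caption_ids in dataloader:
    with torch.no_grad():
        # Text embeddings from the frozen CLIP text encoder
        text_emb = clip_text_encoder(caption_ids)[0]

        # Compress images into latent space with the (also frozen) VAE encoder
        z = vae.encode(images).latent_dist.sample()
        z = z * vae.config.scaling_factor

    # Random timestep and noise for each sample
    t = torch.randint(0, T, (z.size(0),), device=z.device)
    noise = torch.randn_like(z)
    z_noisy = scheduler.add_noise(z, noise, t)

    # U-Net predicts the noise, conditioned on the text
    pred_noise = unet(z_noisy, t, encoder_hidden_states=text_emb).sample

    loss = F.mse_loss(pred_noise, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()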
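
The `add_noise` step in the loop above blends the clean latent with Gaussian noise according to the timestep. Below is a standard-library sketch of that formula. It assumes the linear beta schedule from the DDPM paper with T = 1000; Stable Diffusion v1 is configured with a different schedule, so the numbers are illustrative only.

```python
import math
import random

T = 1000
# Linear beta schedule (assumption: DDPM defaults, 1e-4 to 0.02)
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

# alpha_bar[t] = product of (1 - beta_s) for s = 0..t
alpha_bar, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)


def add_noise(z, noise, t):
    """z_t = sqrt(alpha_bar_t) * z + sqrt(1 - alpha_bar_t) * noise"""
    a = math.sqrt(alpha_bar[t])
    b = math.sqrt(1.0 - alpha_bar[t])
    return [a * zi + b * ni for zi, ni in zip(z, noise)]


z = [1.0, -1.0, 0.5]                          # toy "latent"
noise = [random.gauss(0.0, 1.0) for _ in z]

print(add_noise(z, noise, 0))       # almost the clean latent
print(add_noise(z, noise, T - 1))   # almost pure noise
```

Because any timestep can be sampled in one shot this way, training can pick a random `t` per sample instead of simulating the whole noising chain.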

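Scale reference

Model	Parameters	Training data	Approx. training cost
Stable Diffusion v1.5	~860M (U-Net)	LAION 2B	~150K GPU-hours (A100)
DALL-E 2	~3.5B	Not disclosed	Not disclosed
Imagen	~3B	460M (internal)	Not disclosed

Note: Training from scratch is effectively out of reach for individuals. Fine-tuning a pretrained model (LoRA / DreamBooth) is the realistic route.

Try it yourself: The code above is a sketch that can be turned into runnable code with PyTorch and the diffusers library. runwayml/stable-diffusion-v1-5 on Hugging Face and the Stable Diffusion notebooks on Colab are convenient starting points.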
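
As a starting point for inference, here is a minimal sketch using the diffusers `StableDiffusionPipeline`. The model ID is the one named in the text and may have moved or require accepting a license on Hugging Face. `generate` and its default values (50 steps, guidance scale 7.5) are assumptions for illustration.

```python
def generate(prompt, negative_prompt=None, steps=50, guidance_scale=7.5,
             model_id="runwayml/stable-diffusion-v1-5"):
    """Generate one PIL image with the diffusers Stable Diffusion pipeline."""
    # Imported lazily so this file can be loaded without torch/diffusers installed
    import torch
    from diffusers import StableDiffusionPipeline

    # Use half precision on GPU; fall back to float32 on CPU (slow)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=dtype)
    pipe = pipe.to(device)

    result = pipe(
        prompt,
        negative_prompt=negative_prompt,
        num_inference_steps=steps,
        guidance_scale=guidance_scale,
    )
    return result.images[0]
```

Calling `generate("A serene mountain landscape at sunset", negative_prompt="blurry, low quality").save("landscape.png")` downloads the model weights on first use (several GB), so a GPU runtime such as Colab is the practical choice.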