OpenAI GPT OSS: A New Open-Source Model Family - Detailed Summary

"GPT OSS is a hugely anticipated open-weights release by OpenAI, designed for powerful reasoning, agentic tasks, and versatile developer use cases."

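Model Lineup and Technical Specifications

Two model sizes

gpt-oss-120b: 117B parameters (5.1B active parameters)
gpt-oss-20b: 21B parameters (3.6B active parameters)

Key technical features

Mixture-of-Experts (MoE) architecture
4-bit quantization scheme (MXFP4) for fast inference
Released under the Apache 2.0 license
The 120B model runs on a single H100 GPU (80GB)
The 20B model can run in 16GB of memory
Text-only reasoning models, with chain-of-thought and adjustable reasoning effort levels
Support for instruction following and tool use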
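The memory figures above can be sanity-checked with a back-of-the-envelope calculation. The sketch below assumes every parameter is stored in MXFP4, that is, 4-bit values plus one shared 8-bit scale per block of 32 values (about 4.25 bits per parameter). That is a simplifying assumption: a real checkpoint may keep some layers at higher precision, and the KV cache and activations need memory on top of the weights. The helper names are illustrative only.

```python
# Rough weight-memory estimate under the stated assumption that all
# parameters are stored as MXFP4 (4-bit values + one 8-bit scale per 32 values).
BITS_PER_PARAM = 4 + 8 / 32  # 4.25 bits per parameter

# Parameter counts taken from the list above.
MODELS = {
    "gpt-oss-120b": {"total": 117e9, "active": 5.1e9},
    "gpt-oss-20b": {"total": 21e9, "active": 3.6e9},
}


def weight_gb(num_params: float) -> float:
    """Approximate weight size in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BITS_PER_PARAM / 8 / 1e9


def active_fraction(total: float, active: float) -> float:
    """Share of the parameters that take part in processing one token."""
    return active / total


for name, params in MODELS.items():
    size = weight_gb(params["total"])
    share = active_fraction(params["total"], params["active"])
    print(f"{name}: ~{size:.0f} GB of weights, {share:.1%} of parameters active")
```

Under these assumptions the 120B weights come to roughly 62 GB and the 20B weights to roughly 11 GB, which is consistent with the 80GB and 16GB figures quoted above.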

Architecture details

"Token-choice MoE with SwiGLU activations. When calculating the MoE weights, a softmax is taken over selected experts (softmax-after-topk). Each attention layer uses RoPE with 128K context."

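Token-choice MoE with SwiGLU activations
Softmax taken over the selected experts only (softmax-after-topk)
RoPE in every attention layer, with a 128K context
Attention layers alternate between full context and a 128-token sliding window
A learned attention sink per head
Same tokenizer as GPT-4o
New tokens added for compatibility with the Responses API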
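To make "softmax-after-topk" concrete, here is a minimal pure-Python sketch of routing a single token. The router logits and the value of k are made-up illustration values, not the model's actual configuration, and `route_token` is a hypothetical helper rather than code from any library.

```python
import math


def route_token(router_logits, k):
    """Pick the k highest-scoring experts, then softmax over only those k.

    Returns (expert_indices, weights). Because the softmax is taken after
    the top-k selection, the weights always sum to 1.
    """
    # Indices of the k largest logits, highest first.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Numerically stable softmax restricted to the selected experts.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]


# Hypothetical router scores for five experts; keep the best two.
experts, weights = route_token([0.1, 2.0, -1.0, 1.0, 0.5], k=2)
print(experts, [round(w, 4) for w in weights])
```

The alternative ordering (softmax over all experts, then keep the top k) would leave the kept weights summing to less than 1 unless they were renormalized.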

API Access: Inference Providers

Python example (Cerebras provider)

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.environ["HF_TOKEN"],
)

completion = client.chat.completions.create(
    model="openai/gpt-oss-120b:cerebras",
    messages=[
        {
            "role": "user",
            "content": "How many rs are in the word 'strawberry'?",
        }
    ],
)
print(completion.choices[0].message)

Responses API example (Fireworks AI provider)

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.getenv("HF_TOKEN"),
)

response = client.responses.create(
    model="openai/gpt-oss-20b:fireworks-ai",
    input="How many rs are in the word 'strawberry'?",
)
print(response)
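In both examples the provider is chosen by appending `:<provider>` to the model id (`openai/gpt-oss-120b:cerebras`, `openai/gpt-oss-20b:fireworks-ai`). A small sketch of building and splitting such ids follows; the two helper functions are hypothetical conveniences, not part of the `openai` client or of any Hugging Face library.

```python
from typing import Optional, Tuple


def routed_model_id(repo_id: str, provider: str) -> str:
    """Build a 'repo:provider' model string like the ones used above."""
    return f"{repo_id}:{provider}"


def split_routed_model_id(model: str) -> Tuple[str, Optional[str]]:
    """Split 'repo:provider' back into its parts; provider is None if absent."""
    repo_id, sep, provider = model.partition(":")
    return repo_id, (provider if sep else None)


print(routed_model_id("openai/gpt-oss-120b", "cerebras"))
print(split_routed_model_id("openai/gpt-oss-20b:fireworks-ai"))
```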

Local Inference

Using the Transformers library

Basic setup

pip install --upgrade accelerate transformers kernels

Basic inference example (20B model)

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))

Installing PyTorch 2.8 and Triton 3.4 (optional)

# Optional: only if you need PyTorch 2.8
pip install torch==2.8.0 --index-url https://download.pytorch.org/whl/test/cu128

# Install the triton kernels for mxfp4 support
pip install git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels

Optimization options

Flash Attention 3 (for Hopper cards)

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
    # Flash Attention with Sinks
    attn_implementation="kernels-community/vllm-flash-attn3",
)

messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))

MegaBlocks MoE kernels (for other GPUs)

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
    # Optimize the MoE layers with the downloadable MegaBlocksMoEMLP kernel
    use_kernels=True,
)

messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))

GPU Compatibility and Recommended Optimizations

| GPU type | mxfp4 | Flash Attention 3 | MegaBlocks MoE |
|---------|-------|-------------------|----------------|
| Hopper GPUs (H100, H200) | ✅ | ✅ | ❌ |
| Blackwell GPUs (GB200, 50xx, RTX Pro 6000) | ✅ | ❌ | ❌ |
| Other CUDA GPUs | ❌ | ❌ | ✅ |
| AMD Instinct (MI3XX) | ❌ | ❌ | ✅ |
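The table can be turned into a small lookup that picks the extra `from_pretrained` keyword arguments shown in the earlier snippets. This is only a sketch: the family labels (`"hopper"`, `"blackwell"`, and so on) and the helper name are invented for illustration, and detecting the GPU family is left to the caller.

```python
# Extra from_pretrained kwargs per GPU family, following the table above.
# The family labels are hypothetical; they are not identifiers from any library.
EXTRA_KWARGS = {
    # Hopper: mxfp4 plus Flash Attention 3 (with attention sinks)
    "hopper": {"attn_implementation": "kernels-community/vllm-flash-attn3"},
    # Blackwell: mxfp4 only, so nothing extra is passed
    "blackwell": {},
    # Other CUDA GPUs and AMD Instinct: MegaBlocks MoE kernels
    "other_cuda": {"use_kernels": True},
    "amd_instinct": {"use_kernels": True},
}


def load_kwargs_for(gpu_family: str) -> dict:
    """Return keyword arguments for AutoModelForCausalLM.from_pretrained."""
    if gpu_family not in EXTRA_KWARGS:
        raise ValueError(f"unknown GPU family: {gpu_family!r}")
    base = {"device_map": "auto", "torch_dtype": "auto"}
    return {**base, **EXTRA_KWARGS[gpu_family]}


print(load_kwargs_for("hopper"))
```

The result would be passed as `AutoModelForCausalLM.from_pretrained(model_id, **load_kwargs_for(family))`.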

Multi-GPU inference example (120B model)

# Run with: torchrun --nproc_per_node=4 generate.py
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.distributed import DistributedConfig
import torch

model_path = "openai/gpt-oss-120b"
tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side="left")

device_map = {
    "tp_plan": "auto",  # Enable tensor parallelism
}

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    attn_implementation="kernels-community/vllm-flash-attn3",
    **device_map,
)

messages = [
    {"role": "user", "content": "Explain how expert parallelism works in large language models."}
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1000)

# Decode and print the final-channel text
response = tokenizer.decode(outputs[0])
print("Model response:", response.split("<|channel|>final<|message|>")[-1].strip())

Other Inference Options

Llama.cpp

# MacOS
brew install llama.cpp

# Windows
winget install llama.cpp

# Start the server
llama-server -hf ggml-org/gpt-oss-120b-GGUF -c 0 -fa --jinja --reasoning-format none
# Then open http://localhost:8080

vLLM

# Start the server (assumes two H100 GPUs)
vllm serve openai/gpt-oss-120b --tensor-parallel-size 2

# Direct use from Python
from vllm import LLM

llm = LLM("openai/gpt-oss-120b", tensor_parallel_size=2)
output = llm.generate("San Francisco is a")

transformers serve

transformers serve

Responses API example

curl -X POST http://localhost:8000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"input": [{"role": "system", "content": "hello"}], "temperature": 1.0, "stream": true, "model": "openai/gpt-oss-120b"}'

Chat Completions API example

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "system", "content": "hello"}], "temperature": 1.0, "max_tokens": 1000, "stream": true, "model": "openai/gpt-oss-120b"}'
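The same chat-completions request can be prepared from Python with only the standard library. The sketch below builds the request object and leaves the actual send commented out, since it needs a running `transformers serve` instance. The URL, port, and payload fields are copied from the curl command above; the helper name is hypothetical.

```python
import json
import urllib.request


def build_chat_request(messages,
                       model="openai/gpt-oss-120b",
                       base_url="http://localhost:8000/v1",
                       max_tokens=1000):
    """Build (but do not send) a POST request mirroring the curl call above."""
    payload = {
        "model": model,
        "messages": messages,
        "temperature": 1.0,
        "max_tokens": max_tokens,
        "stream": True,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request([{"role": "system", "content": "hello"}])
print(request.full_url)

# To send it (requires the local server to be running):
# with urllib.request.urlopen(request) as resp:
#     for raw_line in resp:          # streamed response, one chunk per line
#         print(raw_line.decode("utf-8").rstrip())
```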

Fine-Tuning

GPT OSS models are fully integrated with the TRL (Transformer Reinforcement Learning) library.

Available resources:

A LoRA example in the OpenAI cookbook (fine-tuning for multilingual reasoning)
A basic fine-tuning script

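Partnerships and Deployment

Azure AI Model Catalog

Through a collaboration between Hugging Face and Azure, the GPT OSS models are available in the Azure AI Model Catalog:

GPT OSS 20B
GPT OSS 120B
Managed online endpoints on enterprise-grade infrastructure
Autoscaling and monitoring

Dell Enterprise Hub

On-premises deployment on Dell platforms is supported:

Optimized containers
Native support for Dell hardware
Enterprise-grade security features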

Model Evaluation

Notes on evaluation

"GPT OSS models are reasoning models: they therefore require a very large generation size (maximum number of new tokens) for evaluations, as their generation will first contain reasoning, then the actual answer."

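Because these are reasoning models, evaluations need a large generation budget: the output contains the reasoning first and the actual answer only afterwards.

lighteval example

git clone https://github.com/huggingface/lighteval
cd lighteval
pip install -e .[dev]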

lighteval accelerate \
  "model_name=openai/gpt-oss-20b,max_length=16384,skip_special_tokens=False,generation_parameters={temperature:1,top_p:1,top_k:40,min_p:0,max_new_tokens:16384}" \
  "extended|ifeval|0|0,lighteval|aime25|0|0" \
  --save-details --output-dir "openai_scores" \
  --remove-reasoning-tags --reasoning-tags="[('<|channel|>analysis<|message|>','<|end|><|start|>assistant<|channel|>final<|message|>')]"

Benchmark results (20B model)

IFEval (strict prompt): 69.5 (+/-1.9)
AIME25 (pass@1): 63.3 (+/-8.9)
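The summary does not say how the ± values are computed. One plausible reading is a standard error of the mean over the benchmark items. The sketch below checks that reading for AIME25 under two assumptions that are not stated in the source: the benchmark has 30 problems, and 19 of them were solved (19/30 ≈ 63.3%).

```python
import math


def sample_stderr(successes: int, n: int) -> float:
    """Standard error of a mean of 0/1 scores, using the n-1 sample variance."""
    p = successes / n
    return math.sqrt(p * (1 - p) / (n - 1))


# Assumed, not taken from the source: 30 problems, 19 correct.
n_problems, n_correct = 30, 19
score = n_correct / n_problems
stderr = sample_stderr(n_correct, n_problems)
print(f"pass@1 = {score * 100:.1f} (+/-{stderr * 100:.1f})")
```

Under those assumptions the result matches the reported 63.3 (+/-8.9). The wide interval is a reminder that a 30-item benchmark cannot resolve differences of a few points.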

Chat Template and Special Features

Channels

GPT OSS uses "channels" in its output:

"Most of the time, you will see an 'analysis' channel that contains things that are not intended to be sent to the end-user, like chains of thought, and a 'final' channel containing messages that are actually intended to be displayed to the user."

Example output structure:

<|start|>assistant<|channel|>analysis<|message|>CHAIN_OF_THOUGHT<|end|><|start|>assistant<|channel|>final<|message|>ACTUAL_MESSAGE
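A decoded output in this layout can be separated into its channels with a small regular expression. The sketch below handles only the simple structure shown above (a channel name followed directly by `<|message|>`); other layouts, such as tool calls, are out of scope. `split_channels` is a hypothetical helper, not a Transformers API.

```python
import re

# One channel block: <|channel|>NAME<|message|>BODY, where BODY runs until
# the next <|end|> or <|start|> token, or the end of the text.
CHANNEL_RE = re.compile(
    r"<\|channel\|>(?P<channel>\w+)<\|message\|>"
    r"(?P<body>.*?)(?=<\|end\|>|<\|start\|>|\Z)",
    re.DOTALL,
)


def split_channels(text: str) -> dict:
    """Map each channel name to the list of message bodies found for it."""
    channels = {}
    for match in CHANNEL_RE.finditer(text):
        channels.setdefault(match.group("channel"), []).append(match.group("body"))
    return channels


sample = (
    "<|start|>assistant<|channel|>analysis<|message|>CHAIN_OF_THOUGHT<|end|>"
    "<|start|>assistant<|channel|>final<|message|>ACTUAL_MESSAGE"
)
parsed = split_channels(sample)
print(parsed["final"][-1])
```

Only the `final` text should be shown to end users; the `analysis` text is the chain of thought.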

Chat format for training

chat = [
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "Can you think about this one?"},
    {"role": "assistant", "thinking": "Thinking real hard...", "content": "Okay!"}
]

# add_generation_prompt=False is normally used only for training
inputs = tokenizer.apply_chat_template(chat, add_generation_prompt=False)

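System and developer messages

GPT OSS distinguishes between "system" and "developer" messages. A message with the "system" role in the chat is rendered as a developer message, while template arguments such as model_identity and reasoning_effort shape the actual system message: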

chat = [
    {"role": "system", "content": "This will actually become a developer message!"}
]

tokenizer.apply_chat_template(
    chat,
    model_identity="You are OpenAI GPT OSS.",
    reasoning_effort="high"  # default is "medium"; "high" and "low" are also accepted
)

Tool Use

Basic tool definition example

def get_current_weather(location: str):
    """
    Returns the current weather status at a given location as a string.
    
    Args:
        location: The location to get the weather for.
    """
    return "Terrestrial."  # placeholder result for the example

chat = [
    {"role": "user", "content": "What's the weather in Paris right now?"}
]

inputs = tokenizer.apply_chat_template(
    chat,
    tools=[get_current_weather],
    builtin_tools=["browser", "python"],
    add_generation_prompt=True,
    return_tensors="pt"
)

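Handling tool calls

# When the model calls a tool (its output ends with <|call|>),
# record the call in the chat. The name must match the tool defined above.
tool_call_message = {
    "role": "assistant",
    "tool_calls": [
        {
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "arguments": {"location": "Paris, France"}
            }
        }
    ]
}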

chat.append(tool_call_message)

tool_output = get_current_weather("Paris, France")

tool_result_message = {
    # GPT OSS calls only one tool at a time, so the tool message
    # needs no additional metadata
    "role": "tool",
    "content": tool_output
}

chat.append(tool_result_message)

# apply_chat_template() and generate() can now be run again
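The bookkeeping above can be wrapped in a small dispatcher that looks up the requested tool by name, runs it, and appends both messages to the chat. This is a sketch using the message shapes shown above; `run_tool_call` and the `TOOLS` registry are hypothetical names, and parsing the model's raw output into `tool_call_message` is assumed to have happened already.

```python
def get_current_weather(location: str) -> str:
    """Placeholder tool, as in the example above."""
    return "Terrestrial."


# Registry mapping tool names (as the model emits them) to Python callables.
TOOLS = {"get_current_weather": get_current_weather}


def run_tool_call(chat: list, tool_call_message: dict, tools: dict) -> list:
    """Execute the single tool call in the message and extend the chat."""
    call = tool_call_message["tool_calls"][0]["function"]
    if call["name"] not in tools:
        raise KeyError(f"model requested unknown tool: {call['name']!r}")
    result = tools[call["name"]](**call["arguments"])
    chat.append(tool_call_message)
    # One tool call at a time, so the tool message carries no extra metadata.
    chat.append({"role": "tool", "content": result})
    return chat


chat = [{"role": "user", "content": "What's the weather in Paris right now?"}]
tool_call_message = {
    "role": "assistant",
    "tool_calls": [{
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "arguments": {"location": "Paris, France"},
        },
    }],
}
chat = run_tool_call(chat, tool_call_message, TOOLS)
print(chat[-1])
```

Raising on an unknown tool name surfaces mismatches between the name the model emits and the functions actually registered.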

Significance and Outlook

"According to OpenAI, this release is a meaningful step in their commitment to the open-source ecosystem, in line with their stated mission to make the benefits of AI broadly accessible."

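The release is an important step in line with OpenAI's mission to make the benefits of AI broadly accessible. It serves the many use cases that require private or local deployment, and the models are expected to have a long-lasting impact on the community.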