Run Claude Code using NVIDIA's hosted inference API instead of direct Anthropic API access. This is useful for teams with NVIDIA API access who want to use Claude Code without individual Anthropic API keys.
```
┌─────────────┐      ┌─────────────┐      ┌──────────────────────┐
│ Claude Code │─────▶│   LiteLLM   │─────▶│ NVIDIA Inference API │
│ (Anthropic  │      │    Proxy    │      │   (Sonnet / Opus)    │
│   format)   │      │ (localhost) │      │                      │
└─────────────┘      └─────────────┘      └──────────────────────┘
```
- Claude Code expects Anthropic's API format (`/v1/messages`)
- NVIDIA's API uses an OpenAI-compatible format (`/chat/completions`)
- LiteLLM translates between the two formats
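The translation LiteLLM performs can be sketched roughly as follows. This is a simplified illustration, not LiteLLM's actual code; the real translation also handles streaming, tool use, and error shapes:

```python
# Simplified sketch of the Anthropic → OpenAI request translation that
# LiteLLM performs between Claude Code and NVIDIA's endpoint.

def anthropic_to_openai(payload: dict) -> dict:
    """Map an Anthropic /v1/messages body to an OpenAI /chat/completions body."""
    messages = list(payload.get("messages", []))
    # Anthropic carries the system prompt as a top-level field;
    # the OpenAI format expects it as the first chat message.
    if "system" in payload:
        messages = [{"role": "system", "content": payload["system"]}] + messages
    return {
        "model": payload["model"],
        "messages": messages,
        "max_tokens": payload.get("max_tokens", 1024),
    }

anthropic_request = {
    "model": "claude-sonnet-4-5",
    "system": "You are terse.",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Hello"}],
}
openai_request = anthropic_to_openai(anthropic_request)
```

Claude Code only ever speaks the Anthropic format; the proxy makes that invisible to the NVIDIA backend.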
- NVIDIA API key from inference.nvidia.com/key-management
- Python 3.10+
- Basic tools: `curl`, `python3` (pre-installed on most systems)
```bash
git clone https://github.com/dburkhardt/claude-code-nvidia-inference.git
cd claude-code-nvidia-inference
source scripts/setup_env.sh
```

The setup script will:
- Install `uv` (fast Python package manager)
- Install Claude Code CLI
- Prompt for your NVIDIA API key (saved for future sessions)
- Install and start the LiteLLM proxy
- Configure Claude Code to use the proxy
```bash
claude
```

That's it! Claude Code will route through the NVIDIA inference API.
| NVIDIA Model | Claude Code Usage |
|---|---|
| Sonnet 4.5 (default) | `claude` |
| Opus 4.5 | `claude --model claude-opus-4-5-20250929` |
After initial setup, just run:
```bash
cd claude-code-nvidia-inference
source scripts/setup_env.sh  # Loads saved API key, starts proxy if needed
claude
```

The script is idempotent: it detects if the proxy is already running.
Maps Claude Code model requests to NVIDIA's hosted Claude models:
- `claude-sonnet-4-5-*` → `aws/anthropic/bedrock-claude-sonnet-4-5-v1`
- `claude-opus-4-5-*` → `aws/anthropic/claude-opus-4-5`
- Haiku requests → Sonnet (Haiku is not available on NVIDIA)
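The mapping above behaves like a small pattern-matched routing table. The sketch below is an illustration only: the real mapping is declared in `litellm_config.yaml`, and the Haiku pattern here is an assumption:

```python
from fnmatch import fnmatch

# Illustrative routing table; the actual mapping lives in litellm_config.yaml.
# The "*haiku*" pattern is hypothetical — the repo may match Haiku differently.
ROUTES = [
    ("claude-sonnet-4-5-*", "aws/anthropic/bedrock-claude-sonnet-4-5-v1"),
    ("claude-opus-4-5-*", "aws/anthropic/claude-opus-4-5"),
    ("*haiku*", "aws/anthropic/bedrock-claude-sonnet-4-5-v1"),  # Haiku → Sonnet
]

def route(requested_model: str) -> str:
    """Return the NVIDIA-hosted model ID for a Claude Code model request."""
    for pattern, target in ROUTES:
        if fnmatch(requested_model, pattern):
            return target
    raise KeyError(f"no route for {requested_model}")
```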
```json
{
  "env": {
    "ANTHROPIC_API_KEY": "sk-litellm-local-dev",
    "ANTHROPIC_BASE_URL": "http://localhost:4000"
  }
}
```

NVIDIA's Bedrock-hosted Claude models have a smaller context window than the direct Anthropic API (~100K vs. 200K tokens). The `litellm_config.yaml` is configured with `max_input_tokens: 100000` to enable pre-call validation, allowing Claude Code to trigger context compaction before hitting the API limit.
This limit was determined empirically - the actual NVIDIA limit is approximately 111K tokens, so 100K provides a safety margin.
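For reference, the relevant part of `litellm_config.yaml` might look like the fragment below. This is a sketch, not the repo's actual file: the field names follow LiteLLM's config schema, and the exact model entry is assumed:

```yaml
# Sketch only — check the repo's litellm_config.yaml for the real entry.
model_list:
  - model_name: claude-sonnet-4-5
    litellm_params:
      model: aws/anthropic/bedrock-claude-sonnet-4-5-v1
    model_info:
      # Cap below NVIDIA's ~111K limit so pre-call validation fires first
      max_input_tokens: 100000
```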
- Programmatic mode recommended: use the `-p` flag for non-interactive usage
- Some features may vary: tool use, streaming, and advanced features route through LiteLLM
- Additional latency: the extra hop through the LiteLLM proxy adds ~50-100ms
Auth conflict: both a token (`ANTHROPIC_AUTH_TOKEN`) and an API key (`ANTHROPIC_API_KEY`) are set.
Fix: run `claude /logout` and `unset ANTHROPIC_AUTH_TOKEN`.
Possible causes:
- `NVIDIA_API_KEY` isn't set correctly: check with `echo $NVIDIA_API_KEY`
- LiteLLM proxy isn't running: check with `curl http://localhost:4000/health`
Ensure:
- The LiteLLM proxy is running (`curl http://localhost:4000/health`)
- You've logged out of any existing Claude authentication (`claude /logout`)
```bash
# Check what's using port 4000
lsof -i :4000

# Kill it if needed
lsof -ti :4000 | xargs kill -9

# Restart
source scripts/setup_env.sh --restart
```

Run the test script to verify everything works:

```bash
./scripts/test_nvidia_endpoint.sh
```

MIT License