The Smart Gateway
for LLM Infrastructure
Route requests based on intent, cost, and latency. Maximize resilience with automatic fallbacks and local embedding-based classification.
Semantic Routing
Run ONNX embeddings locally to classify intent. Route 'coding' requests to Claude 3.5 Sonnet and 'chit-chat' to GPT-4o-mini automatically.
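For example, intent groups could be mapped to models in the routing config. This is a sketch extending the groups syntax from the sample config further down this page; the 'chit-chat' group and the model field are illustrative assumptions, not confirmed Octo Router syntax:

    groups:
      - name: "coding"
        allow_providers: ["anthropic"]
        model: "claude-3-5-sonnet"   # assumed field for pinning a group to a model
      - name: "chit-chat"
        allow_providers: ["openai"]
        model: "gpt-4o-mini"         # cheaper model for casual conversation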
Cost Controls
Set daily budgets per provider. Automatically skip providers that exceed limits to prevent overspending and ensure budget predictability.
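A sketch of what per-provider daily budgets could look like in the YAML config; the budgets block and its key names are assumptions:

    budgets:
      openai:
        daily_limit_usd: 50    # assumed key: provider is skipped once $50/day is spent
      anthropic:
        daily_limit_usd: 100   # higher ceiling for the primary provider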
Docker Ready
Fully containerized architecture. Deploy instantly with Docker Compose and scale horizontally thanks to stateless Redis-backed tracking.
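A minimal Docker Compose sketch; the image name, listen port, and config path are assumptions:

    services:
      octo-router:
        image: octorouter/octo-router:latest   # assumed image name
        ports:
          - "8080:8080"                        # assumed default listen port
        volumes:
          - ./config.yaml:/app/config.yaml     # mount the routing config
        environment:
          - OPENAI_API_KEY=${OPENAI_API_KEY}
        depends_on:
          - redis
      redis:
        image: redis:7-alpine                  # backs the stateless usage tracking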
How Octo Router Works
Receive Request
Your app sends a standard OpenAI-compatible chat completion request (see the example after these steps).
Route & Optimize
Octo Router classifies intent, checks budgets, and selects the best provider.
Proxy Response
The request is forwarded, and the response is streamed back with usage stats.
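For example, a client sends the standard OpenAI chat completions payload to the gateway instead of directly to a provider; the model name and prompt here are illustrative:

    {
      "model": "gpt-4o-mini",
      "messages": [
        {"role": "user", "content": "Refactor this function to be thread-safe."}
      ],
      "stream": true
    }

Given the sample routing config below, a prompt like this would be classified into the "coding" group and forwarded to one of that group's allowed providers.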
Configuration as Code
Define your routing logic, budgets, and fallbacks in a simple, clear YAML file. No complex UIs or proprietary databases required.
- Git-ops friendly configuration
- Hot-reloading without downtime
- Environment variable substitution
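For instance, secrets can be pulled from the environment instead of being committed to the file. This is a sketch: the ${VAR} syntax and the providers block are assumptions, not confirmed Octo Router syntax:

    providers:
      openai:
        api_key: "${OPENAI_API_KEY}"   # assumed syntax: substituted at config load time

The routing policy itself lives in the same file: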
routing:
  strategy: "weighted"
  policies:
    semantic:
      enabled: true
      engine: "embedding"
      model_path: "assets/models/embedding.onnx"
      groups:
        - name: "coding"
          required_capability: "code-gen"
          allow_providers: ["openai", "anthropic"]
Built for Scalable AI Teams
SaaS Tiered Pricing
Route requests to different models based on user priority or tier. Enforce strict budget limits per provider to maintain healthy margins.
Internal Developer Platform
Give every engineer a unified API endpoint. Monitor usage by team and prevent accidental overspending.
High-Traffic Consumer App
Monitor performance metrics in real time. Use weighted routing to distribute load across multiple providers, ensuring high availability (see the sketch below).
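A sketch of weighted load distribution; only strategy: "weighted" appears in the sample config above, so the weights key is an assumption:

    routing:
      strategy: "weighted"
      weights:             # assumed key: relative share of traffic per provider
        openai: 70
        anthropic: 30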
Frequently Asked Questions
Does Octo Router add latency?
Very little. Octo Router is written in Go and uses a highly optimized routing pipeline, so the routing decision adds minimal overhead.
Can I use it with any LLM?
For now, it is compatible only with the OpenAI, Gemini, and Anthropic providers.
How does semantic routing work locally?
We embed the ONNX runtime directly in the binary. This allows us to run quantization-optimized models (like all-MiniLM-L6-v2) on the CPU in milliseconds.