Production-ready AI Gateway

The Smart Gateway
for LLM Infrastructure

Route requests based on intent, cost, and latency. Maximize resilience with automatic fallbacks and local embedding-based classification.

🧠

Semantic Routing

Run ONNX embeddings locally to classify intent. Route 'coding' to Claude 3.5 Sonnet and 'chit-chat' to GPT-4o-mini automatically.

💰

Cost Controls

Set daily budgets per provider. Providers that exceed their limits are skipped automatically, preventing overspending and keeping spend predictable (illustrated in the Configuration as Code section below).

🐳

Docker Ready

Fully containerized architecture. Deploy instantly with Docker Compose and scale horizontally: instances stay stateless because usage tracking lives in Redis.
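
A minimal Compose file is all it takes to bring up the router alongside Redis. The sketch below is illustrative only: the image name, port, and mount path are assumptions, not published values.

docker-compose.yaml
# Minimal sketch; image name, port, and paths are assumptions.
services:
  octo-router:
    image: octo-router:latest
    ports:
      - "8080:8080"
    volumes:
      - ./config.yaml:/app/config.yaml   # routing rules, budgets, fallbacks
    depends_on:
      - redis
  redis:
    image: redis:7-alpine   # shared usage state keeps router instances stateless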

How Octo Router Works

1

Receive Request

Your app sends a standard OpenAI-compatible chat completion request.

2

Route & Optimize

Octo Router classifies intent, checks budgets, and selects the best provider.

3

Proxy Response

The request is forwarded, and the response is streamed back with usage stats (see the client sketch below).
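
Because the gateway speaks the OpenAI wire format, an unmodified OpenAI client can walk through all three steps by pointing base_url at it. Here is a minimal Python sketch; the endpoint URL, API key handling, and the "auto" model name are assumptions about a typical deployment, not documented values.

client.py
from openai import OpenAI

# Point a standard OpenAI client at the gateway instead of api.openai.com.
# The URL assumes a local deployment on port 8080; adjust to your setup.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="your-gateway-key")

# Step 1: a standard OpenAI-compatible chat completion request. Step 2
# happens inside the router: classify intent, check budgets, pick a provider.
stream = client.chat.completions.create(
    model="auto",  # hypothetical placeholder; the router selects the real model
    messages=[{"role": "user", "content": "Write a binary search in Go."}],
    stream=True,
)

# Step 3: the chosen provider's response streams straight back, token by token.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)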

Configuration as Code

Define your routing logic, budgets, and fallbacks in a simple, clear YAML file. No complex UIs or proprietary databases required.

  • Git-ops friendly configuration
  • Hot-reloading without downtime
  • Environment variable substitution
config.yaml
routing:
  strategy: "weighted"
  policies:
    semantic:
      enabled: true
      engine: "embedding"
      model_path: "assets/models/embedding.onnx"
      groups:
        - name: "coding"
          required_capability: "code-gen"
          allow_providers: ["openai", "anthropic"]
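
Budgets and fallbacks live in the same file. As a rough sketch of what that could look like (the key names below are illustrative guesses, not the project's authoritative schema):

# Illustrative extension of config.yaml; key names are assumptions.
providers:
  openai:
    api_key: "${OPENAI_API_KEY}"    # environment variable substitution
    daily_budget_usd: 50            # provider is skipped once today's spend hits the cap
  anthropic:
    api_key: "${ANTHROPIC_API_KEY}"
    daily_budget_usd: 100

fallbacks:
  - primary: "anthropic"
    on: ["budget_exceeded", "timeout"]
    next: ["openai"]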

Built for Scalable AI Teams

SaaS Tiered Pricing

Route requests to different models based on user priority or tier. Enforce strict budget limits per provider to maintain healthy margins.

Internal Developer Platform

Give every engineer a unified API endpoint. Monitor usage by team and prevent accidental overspending.

High-Traffic Consumer App

Monitor performance metrics in real time. Use weighted routing to distribute load across multiple providers, ensuring high availability.

Frequently Asked Questions

Does Octo Router add latency?

Very little. The router is written in Go, and intent classification runs locally in milliseconds (see below), so no extra network calls are made before your request is proxied.

Can I use it with any LLM?

For now it's only compatible with the OpenAI, Gemini, and Anthropic providers.

How does semantic routing work locally?

We embed the ONNX runtime directly in the binary. This allows us to run quantization-optimized models (like all-MiniLM-L6-v2) on the CPU in milliseconds.
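
Conceptually, that classification step is a nearest-neighbor lookup in embedding space. The Python sketch below illustrates the idea using the same MiniLM model loaded through sentence-transformers; the router does the equivalent in Go against its embedded ONNX runtime, and the example prompts per group are made up for illustration.

classify.py
import numpy as np
from sentence_transformers import SentenceTransformer

# Same model family the router ships as ONNX; loaded here through
# sentence-transformers purely for illustration.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical example prompts describing each routing group.
GROUPS = {
    "coding": ["Write a Python function", "Fix this bug", "Refactor my code"],
    "chit-chat": ["How's your day going?", "Tell me a joke", "What's up?"],
}

def classify(prompt: str) -> str:
    """Return the group whose example centroid is closest by cosine similarity."""
    q = model.encode(prompt, normalize_embeddings=True)
    best, best_score = "", -1.0
    for name, examples in GROUPS.items():
        centroid = model.encode(examples, normalize_embeddings=True).mean(axis=0)
        centroid /= np.linalg.norm(centroid)   # renormalize the centroid
        score = float(np.dot(q, centroid))     # cosine similarity
        if score > best_score:
            best, best_score = name, score
    return best

print(classify("Implement quicksort in Go"))  # likely "coding"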