Products / For AI builders
InferX
AvailableOpenAI- and Anthropic-compatible inference on your GPUs — measured to the token.
InferX is an inference platform you run on your own Kubernetes. It speaks both OpenAI and Anthropic Messages APIs on the front, deploys vLLM, llama.cpp, or text-embeddings-inference from built-in presets — or any KServe ServingRuntime you bring — on the back, and adds the operational layer most gateways skip: one logical model backed by multiple deployments with load balancing and three-state circuit breakers, cost attribution per user and key, P50/P95/P99 latency per model, full KServe lifecycle management from templates, model downloads from Hugging Face and S3, and a playground with streaming, thinking-mode rendering, and MCP tool calling. Beyond chat it now serves embeddings and /v1/rerank, audio over /v1/audio/speech (TTS) and /v1/audio/transcriptions (ASR), vision-language models, and KServe v2 / Open Inference Protocol inference — and serving instances pause and resume to release idle GPUs. The roadmap adds policy-based routing and safety modes for control-loop workloads.
Specification
- Version
- v2.15 — generally available
- Protocols
- OpenAI (chat · embeddings · rerank · audio) · Anthropic Messages · KServe v2 · streaming SSE
- Runtimes
- vLLM · llama.cpp · text-embeddings-inference presets · any KServe ServingRuntime
- Hardware
- NVIDIA · AMD · Intel · Ascend · Cambricon — auto-detected
- Routing
- Multi-deployment per model · weighted balancing · circuit breakers
Proof, not promises
See it in one block.
No proprietary SDKs, no rewrites — InferX meets your tools where they already are.
from openai import OpenAI
client = OpenAI(
base_url="https://inferx.intra.example/api", # ← the change
api_key=os.environ["INFERX_API_KEY"],
)
# Anthropic SDKs and claude-code work the same way via /anthropic/v1
# every request lands in the dashboard: cost · P50/P95/P99 · errors▌ Drop-in for both OpenAI and Anthropic SDKs, streaming included — with per-key budgets, rate limits, and model allowlists.
Capabilities
What InferX gives you
OpenAI- and Anthropic-compatible — and multimodal
Drop-in /v1/chat/completions, /v1/embeddings, /v1/rerank, /v1/audio/speech and /v1/audio/transcriptions, plus /anthropic/v1/messages and KServe v2 /v2/models/:model/infer — all with streaming SSE. Point your existing SDK or claude-code at InferX by changing the base URL. Native in-process providers — no proxy hop in the request path.
Multi-vendor GPU and KServe-native
Auto-detected NVIDIA, AMD, Intel, Huawei Ascend, and Cambricon. Deploy InferenceServices from typed templates — vLLM presets (AWQ, BF16), GGUF via llama.cpp, text-embeddings-inference, and vision-language models, or any ServingRuntime you bring — then drill from service to pod to logs without leaving the UI. Pause an idle deployment to release its GPUs and resume it on demand.
Cost, latency, and errors per model
Every request is OTEL-instrumented. P50/P95/P99 latency, error rate, and token-level cost attributed per model, per user, and per API key — with budgets, rate limits, and automatic suspension at zero balance.
Built for agents
One model, many deployments: weighted load balancing with three-state circuit breakers and pre-first-token failover today. The playground speaks MCP and renders thinking modes. Roadmap: session affinity, policy routing, and verified / consensus / human-in-the-loop safety modes.
How it works
From model weights to a measured endpoint.
- Step 01
Deploy a model
Pick a runtime template — vLLM or GGUF presets, or your own ServingRuntime — point at a PVC of weights, hit deploy. Multi-vendor GPU auto-detected.
- Step 02
Get an endpoint
OpenAI- and Anthropic-compatible URLs, streaming SSE on both. API keys with rate limits, budgets, and per-model allowlists.
- Step 03
Watch cost and latency
Every request OTEL-instrumented. P50/P95/P99, error rate, and cost attributed per model and per key — straight from the dashboard.
Who it's for
Built for these teams
- Teams shipping LLM products on dedicated capacity
- Platform teams consolidating inference cost and access
- Builders of agentic systems with safety and audit needs
Pairs well with
Other builder products
ConsoleX
AvailableLog in, get a governed Kubernetes workspace. No kubectl, no tickets.
On first SSO login every user gets an isolated namespace with quotas, default-deny networking, storage, and a web terminal — provisioned automatically, reconciled continuously.
Learn moreDevSpace
AvailableJupyter or VS Code on a GPU in seconds. Idle environments shut themselves down.
Single-click Jupyter, Marimo, Streamlit, Gradio, and VS Code environments — GPU-ready, isolated per user behind a per-pod auth proxy, with SSH access and idle shutdown by default.
Learn moreTrainX
AvailableAdmins write the template. Users fill a form. Kubernetes runs the job.
Self-describing training templates render straight into UI forms — with live quota checks, streaming logs, parsed progress bars, and one-click TensorBoard.
Learn more