Products / For AI builders

InferX

Available

OpenAI- and Anthropic-compatible inference on your GPUs — measured to the token.

InferX is an inference platform you run on your own Kubernetes. It speaks both OpenAI and Anthropic Messages APIs on the front, deploys vLLM, llama.cpp, or text-embeddings-inference from built-in presets — or any KServe ServingRuntime you bring — on the back, and adds the operational layer most gateways skip: one logical model backed by multiple deployments with load balancing and three-state circuit breakers, cost attribution per user and key, P50/P95/P99 latency per model, full KServe lifecycle management from templates, model downloads from Hugging Face and S3, and a playground with streaming, thinking-mode rendering, and MCP tool calling. Beyond chat it now serves embeddings and /v1/rerank, audio over /v1/audio/speech (TTS) and /v1/audio/transcriptions (ASR), vision-language models, and KServe v2 / Open Inference Protocol inference — and serving instances pause and resume to release idle GPUs. The roadmap adds policy-based routing and safety modes for control-loop workloads.

All products

Specification

Version: v2.15 — generally available
Protocols: OpenAI (chat · embeddings · rerank · audio) · Anthropic Messages · KServe v2 · streaming SSE
Runtimes: vLLM · llama.cpp · text-embeddings-inference presets · any KServe ServingRuntime
Hardware: NVIDIA · AMD · Intel · Ascend · Cambricon — auto-detected
Routing: Multi-deployment per model · weighted balancing · circuit breakers

Proof, not promises

See it in one block.

No proprietary SDKs, no rewrites — InferX meets your tools where they already are.

change one URL, keep your code

from openai import OpenAI
client = OpenAI(
    base_url="https://inferx.intra.example/api",  # ← the change
    api_key=os.environ["INFERX_API_KEY"],
)
# Anthropic SDKs and claude-code work the same way via /anthropic/v1
# every request lands in the dashboard: cost · P50/P95/P99 · errors▌

Drop-in for both OpenAI and Anthropic SDKs, streaming included — with per-key budgets, rate limits, and model allowlists.

Capabilities

What InferX gives you

OpenAI- and Anthropic-compatible — and multimodal

Drop-in /v1/chat/completions, /v1/embeddings, /v1/rerank, /v1/audio/speech and /v1/audio/transcriptions, plus /anthropic/v1/messages and KServe v2 /v2/models/:model/infer — all with streaming SSE. Point your existing SDK or claude-code at InferX by changing the base URL. Native in-process providers — no proxy hop in the request path.

Multi-vendor GPU and KServe-native

Auto-detected NVIDIA, AMD, Intel, Huawei Ascend, and Cambricon. Deploy InferenceServices from typed templates — vLLM presets (AWQ, BF16), GGUF via llama.cpp, text-embeddings-inference, and vision-language models, or any ServingRuntime you bring — then drill from service to pod to logs without leaving the UI. Pause an idle deployment to release its GPUs and resume it on demand.

Cost, latency, and errors per model

Every request is OTEL-instrumented. P50/P95/P99 latency, error rate, and token-level cost attributed per model, per user, and per API key — with budgets, rate limits, and automatic suspension at zero balance.

Built for agents

One model, many deployments: weighted load balancing with three-state circuit breakers and pre-first-token failover today. The playground speaks MCP and renders thinking modes. Roadmap: session affinity, policy routing, and verified / consensus / human-in-the-loop safety modes.

How it works

From model weights to a measured endpoint.

Step 01

Deploy a model

Pick a runtime template — vLLM or GGUF presets, or your own ServingRuntime — point at a PVC of weights, hit deploy. Multi-vendor GPU auto-detected.
Step 02

Get an endpoint

OpenAI- and Anthropic-compatible URLs, streaming SSE on both. API keys with rate limits, budgets, and per-model allowlists.
Step 03

Watch cost and latency

Every request OTEL-instrumented. P50/P95/P99, error rate, and cost attributed per model and per key — straight from the dashboard.

Who it's for

Built for these teams

Teams shipping LLM products on dedicated capacity
Platform teams consolidating inference cost and access
Builders of agentic systems with safety and audit needs

Pairs well with

Other builder products

ConsoleX

Available

On first SSO login every user gets an isolated namespace with quotas, default-deny networking, storage, and a web terminal — provisioned automatically, reconciled continuously.

Learn more

DevSpace

Available

Jupyter or VS Code on a GPU in seconds. Idle environments shut themselves down.

Single-click Jupyter, Marimo, Streamlit, Gradio, and VS Code environments — GPU-ready, isolated per user behind a per-pod auth proxy, with SSH access and idle shutdown by default.

Learn more

TrainX

Available

Admins write the template. Users fill a form. Kubernetes runs the job.

Self-describing training templates render straight into UI forms — with live quota checks, streaming logs, parsed progress bars, and one-click TensorBoard.

Learn more

InferX

See it in one block.

What InferX gives you

OpenAI- and Anthropic-compatible — and multimodal

Multi-vendor GPU and KServe-native

Cost, latency, and errors per model

Built for agents

From model weights to a measured endpoint.

Deploy a model

Get an endpoint

Watch cost and latency

Built for these teams

Other builder products

ConsoleX

DevSpace

TrainX