Skip to content
TAIP

Products / For AI builders

InferX

Available

OpenAI- and Anthropic-compatible inference on your GPUs — measured to the token.

InferX is an inference platform you run on your own Kubernetes. It speaks both OpenAI and Anthropic Messages APIs on the front, deploys vLLM, llama.cpp, or text-embeddings-inference from built-in presets — or any KServe ServingRuntime you bring — on the back, and adds the operational layer most gateways skip: one logical model backed by multiple deployments with load balancing and three-state circuit breakers, cost attribution per user and key, P50/P95/P99 latency per model, full KServe lifecycle management from templates, model downloads from Hugging Face and S3, and a playground with streaming, thinking-mode rendering, and MCP tool calling. Beyond chat it now serves embeddings and /v1/rerank, audio over /v1/audio/speech (TTS) and /v1/audio/transcriptions (ASR), vision-language models, and KServe v2 / Open Inference Protocol inference — and serving instances pause and resume to release idle GPUs. The roadmap adds policy-based routing and safety modes for control-loop workloads.

Specification

Version
v2.15 — generally available
Protocols
OpenAI (chat · embeddings · rerank · audio) · Anthropic Messages · KServe v2 · streaming SSE
Runtimes
vLLM · llama.cpp · text-embeddings-inference presets · any KServe ServingRuntime
Hardware
NVIDIA · AMD · Intel · Ascend · Cambricon — auto-detected
Routing
Multi-deployment per model · weighted balancing · circuit breakers

Proof, not promises

See it in one block.

No proprietary SDKs, no rewrites — InferX meets your tools where they already are.

change one URL, keep your code
from openai import OpenAI
client = OpenAI(
    base_url="https://inferx.intra.example/api",  # ← the change
    api_key=os.environ["INFERX_API_KEY"],
)
# Anthropic SDKs and claude-code work the same way via /anthropic/v1
# every request lands in the dashboard: cost · P50/P95/P99 · errors

Drop-in for both OpenAI and Anthropic SDKs, streaming included — with per-key budgets, rate limits, and model allowlists.

Capabilities

What InferX gives you

01

OpenAI- and Anthropic-compatible — and multimodal

Drop-in /v1/chat/completions, /v1/embeddings, /v1/rerank, /v1/audio/speech and /v1/audio/transcriptions, plus /anthropic/v1/messages and KServe v2 /v2/models/:model/infer — all with streaming SSE. Point your existing SDK or claude-code at InferX by changing the base URL. Native in-process providers — no proxy hop in the request path.

02

Multi-vendor GPU and KServe-native

Auto-detected NVIDIA, AMD, Intel, Huawei Ascend, and Cambricon. Deploy InferenceServices from typed templates — vLLM presets (AWQ, BF16), GGUF via llama.cpp, text-embeddings-inference, and vision-language models, or any ServingRuntime you bring — then drill from service to pod to logs without leaving the UI. Pause an idle deployment to release its GPUs and resume it on demand.

03

Cost, latency, and errors per model

Every request is OTEL-instrumented. P50/P95/P99 latency, error rate, and token-level cost attributed per model, per user, and per API key — with budgets, rate limits, and automatic suspension at zero balance.

04

Built for agents

One model, many deployments: weighted load balancing with three-state circuit breakers and pre-first-token failover today. The playground speaks MCP and renders thinking modes. Roadmap: session affinity, policy routing, and verified / consensus / human-in-the-loop safety modes.

How it works

From model weights to a measured endpoint.

  1. Step 01

    Deploy a model

    Pick a runtime template — vLLM or GGUF presets, or your own ServingRuntime — point at a PVC of weights, hit deploy. Multi-vendor GPU auto-detected.

  2. Step 02

    Get an endpoint

    OpenAI- and Anthropic-compatible URLs, streaming SSE on both. API keys with rate limits, budgets, and per-model allowlists.

  3. Step 03

    Watch cost and latency

    Every request OTEL-instrumented. P50/P95/P99, error rate, and cost attributed per model and per key — straight from the dashboard.

Who it's for

Built for these teams

  • Teams shipping LLM products on dedicated capacity
  • Platform teams consolidating inference cost and access
  • Builders of agentic systems with safety and audit needs