
InferX

Available

Self-hosted Model-as-a-Service for the agentic era

InferX is an LLM inference platform you run on your own Kubernetes. It speaks both OpenAI and Anthropic Messages APIs on the front, deploys vLLM, TensorRT-LLM, SGLang, or llama.cpp on the back, and adds the operational layer most gateways skip: cost attribution per user and key, latency and error tracking per model, KServe deployment management, model downloads from Hugging Face and S3, and a built-in playground for testing before integration. The roadmap brings policy-based intelligent routing and safety modes for control-loop and safety-critical workloads.

Protocols: OpenAI · Anthropic · streaming SSE
Runtimes: vLLM · TRT-LLM · SGLang · llama.cpp
Hardware: NVIDIA · AMD · Intel · Ascend · Cambricon

Capabilities

What InferX gives you

01

OpenAI- and Anthropic-compatible

Drop-in /v1/chat/completions, /v1/embeddings, and /anthropic/v1/messages. Streaming SSE. Point your existing SDK or `claude-code` at InferX by changing the base URL.
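A minimal sketch of the drop-in claim, using the official OpenAI Python SDK; the gateway URL, API key, and model name below are placeholders, not InferX defaults:

```python
# Point the official OpenAI SDK at InferX by changing the base URL.
# URL, key, and model name are placeholders for your own deployment.
from openai import OpenAI

client = OpenAI(
    base_url="https://inferx.example.com/v1",  # your InferX gateway
    api_key="ifx-...",                         # an InferX-issued API key
)

# Streaming chat completion over SSE, same as the hosted OpenAI API.
stream = client.chat.completions.create(
    model="qwen2.5-72b-instruct",              # whichever model you deployed
    messages=[{"role": "user", "content": "Summarize our deploy runbook."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```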

02

Multi-vendor GPU and KServe-native

Auto-detects NVIDIA, AMD, Intel, Huawei Ascend, and Cambricon accelerators. Deploy InferenceServices from typed templates with built-in vLLM, TRT-LLM, AWQ, BF16, and GGUF presets.
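InferX creates and manages these InferenceServices for you; as a rough illustration of what a vLLM template expands to, here is a hand-built sketch using the Kubernetes Python client, with placeholder names, image tag, and resource sizes:

```python
# Illustrative only: roughly the KServe InferenceService a vLLM template expands to,
# created here by hand. Namespace, PVC name, image tag, and GPU count are placeholders.
from kubernetes import client, config

config.load_kube_config()

isvc = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "qwen-vllm", "namespace": "inferx"},
    "spec": {
        "predictor": {
            "containers": [{
                "name": "kserve-container",
                "image": "vllm/vllm-openai:v0.6.3",
                "args": ["--model", "/mnt/models",
                         "--served-model-name", "qwen2.5-72b-instruct"],
                "resources": {"limits": {"nvidia.com/gpu": "1"}},
                "volumeMounts": [{"name": "weights", "mountPath": "/mnt/models"}],
            }],
            "volumes": [{"name": "weights",
                         "persistentVolumeClaim": {"claimName": "qwen-weights"}}],
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kserve.io", version="v1beta1",
    namespace="inferx", plural="inferenceservices", body=isvc,
)
```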

03

Cost, latency, and errors per model

Every request is OTEL-instrumented. P50/P95/P99 latency, error rate, and cost are attributed per model and per API key — answerable from the admin dashboard, not a Grafana hunt.
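The attribution itself happens server-side; for intuition only, the per-request arithmetic looks roughly like this, with made-up per-token prices:

```python
# Illustrative arithmetic only: InferX attributes cost per model and per API key
# on the server. Prices here are invented; real ones come from your configuration.
PRICE_PER_1K = {  # USD per 1K tokens: (prompt, completion)
    "qwen2.5-72b-instruct": (0.0006, 0.0018),
}

def request_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    p_in, p_out = PRICE_PER_1K[model]
    return prompt_tokens / 1000 * p_in + completion_tokens / 1000 * p_out

# Token counts come back in the usage field of each chat completions response.
print(request_cost("qwen2.5-72b-instruct", 1200, 350))  # ≈ 0.00135 USD
```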

04

Built for agents

The roadmap brings policy-based routing, session affinity, content-aware deployment selection, and safety modes — verified, consensus, and human-in-the-loop — for control-loop and audit-grade workloads.

How it works

From model weights to a measured endpoint.

  1. Step 01

    Deploy a model

    Pick a runtime template — vLLM, TRT-LLM, GGUF — point at a PVC of weights, hit deploy. Multi-vendor GPU auto-detected.

  2. Step 02

    Get an endpoint

    OpenAI- and Anthropic-compatible URLs, streaming SSE on both. API keys with rate limits, budgets, and per-model whitelists. See the Anthropic sketch after these steps.

  3. Step 03

    Watch cost and latency

    Every request OTEL-instrumented. P50/P95/P99, error rate, and cost attributed per model and per key — straight from the dashboard.
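For the Anthropic path in Step 02, a minimal sketch with the official anthropic Python SDK, again with a placeholder gateway URL, key, and model name:

```python
# Point the anthropic SDK at InferX's /anthropic prefix; the SDK appends /v1/messages.
# URL, key, and model name are placeholders for your own deployment.
import anthropic

client = anthropic.Anthropic(
    base_url="https://inferx.example.com/anthropic",  # InferX Messages endpoint
    api_key="ifx-...",                                # an InferX-issued API key
)

message = client.messages.create(
    model="qwen2.5-72b-instruct",
    max_tokens=256,
    messages=[{"role": "user", "content": "Draft a rollback checklist."}],
)
print(message.content[0].text)
```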

Who it's for

Built for these teams

  • Teams shipping LLM products on dedicated capacity
  • Platform teams consolidating inference cost and access
  • Builders of agentic systems with safety and audit needs