
InferX

Available

Self-hosted Model-as-a-Service for the agentic era

InferX is an LLM inference platform you run on your own Kubernetes. It speaks both OpenAI and Anthropic Messages APIs on the front, deploys vLLM, TensorRT-LLM, SGLang, or llama.cpp on the back, and adds the operational layer most gateways skip: cost attribution per user and key, latency and error tracking per model, KServe deployment management, model downloads from Hugging Face and S3, and a built-in playground for testing before integration. The roadmap brings policy-based intelligent routing and safety modes for control-loop and safety-critical workloads.

Protocols: OpenAI · Anthropic · streaming SSE
Runtimes: vLLM · TRT-LLM · SGLang · llama.cpp
Hardware: NVIDIA · AMD · Intel · Ascend · Cambricon

Capabilities

What InferX gives you

01

OpenAI- and Anthropic-compatible

Drop-in /v1/chat/completions, /v1/embeddings, and /anthropic/v1/messages. Streaming SSE. Point your existing SDK or `claude-code` at InferX by changing the base URL.
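A minimal sketch of the drop-in claim, using the official OpenAI Python SDK; the gateway URL, API key, and model name below are placeholders, not InferX defaults:

```python
# Point the official OpenAI SDK at InferX by changing the base URL.
# URL, key, and model name are placeholders for your own deployment.
from openai import OpenAI

client = OpenAI(
    base_url="https://inferx.example.com/v1",  # your InferX gateway
    api_key="ifx-...",                         # an InferX-issued API key
)

# Streaming chat completion over SSE, same as the hosted OpenAI API.
stream = client.chat.completions.create(
    model="qwen2.5-72b-instruct",              # whichever model you deployed
    messages=[{"role": "user", "content": "Summarize our deploy runbook."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```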

02

Multi-vendor GPU and KServe-native

Auto-detects NVIDIA, AMD, Intel, Huawei Ascend, and Cambricon accelerators. Deploy InferenceServices from typed templates with built-in vLLM, TRT-LLM, AWQ, BF16, and GGUF presets.
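InferX creates and manages these InferenceServices for you; as a rough illustration of what a vLLM template expands to, here is a hand-built sketch using the Kubernetes Python client, with placeholder names, image tag, and resource sizes:

```python
# Illustrative only: roughly the KServe InferenceService a vLLM template expands to,
# created here by hand. Namespace, PVC name, image tag, and GPU count are placeholders.
from kubernetes import client, config

config.load_kube_config()

isvc = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "qwen-vllm", "namespace": "inferx"},
    "spec": {
        "predictor": {
            "containers": [{
                "name": "kserve-container",
                "image": "vllm/vllm-openai:v0.6.3",
                "args": ["--model", "/mnt/models",
                         "--served-model-name", "qwen2.5-72b-instruct"],
                "resources": {"limits": {"nvidia.com/gpu": "1"}},
                "volumeMounts": [{"name": "weights", "mountPath": "/mnt/models"}],
            }],
            "volumes": [{"name": "weights",
                         "persistentVolumeClaim": {"claimName": "qwen-weights"}}],
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kserve.io", version="v1beta1",
    namespace="inferx", plural="inferenceservices", body=isvc,
)
```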

03

Cost, latency, and errors per model

Every request is OTEL-instrumented. P50/P95/P99 latency, error rate, and cost are attributed per model and per API key — answerable from the admin dashboard, not a Grafana hunt.
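The attribution itself happens server-side; for intuition only, the per-request arithmetic looks roughly like this, with made-up per-token prices:

```python
# Illustrative arithmetic only: InferX attributes cost per model and per API key
# on the server. Prices here are invented; real ones come from your configuration.
PRICE_PER_1K = {  # USD per 1K tokens: (prompt, completion)
    "qwen2.5-72b-instruct": (0.0006, 0.0018),
}

def request_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    p_in, p_out = PRICE_PER_1K[model]
    return prompt_tokens / 1000 * p_in + completion_tokens / 1000 * p_out

# Token counts come back in the usage field of each chat completions response.
print(request_cost("qwen2.5-72b-instruct", 1200, 350))  # ≈ 0.00135 USD
```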

04

Built for agents

The roadmap brings policy-based routing, session affinity, content-aware deployment selection, and safety modes — verified, consensus, and human-in-the-loop — for control-loop and audit-grade workloads.

How it works

From model weights to a measured endpoint.

  1. Step 01

    Deploy a model

    Pick a runtime template — vLLM, TRT-LLM, GGUF — point at a PVC of weights, hit deploy. Multi-vendor GPU auto-detected.

  2. Step 02

    Get an endpoint

    OpenAI- and Anthropic-compatible URLs, streaming SSE on both. API keys with rate limits, budgets, and per-model whitelists. See the Anthropic sketch after these steps.

  3. Step 03

    Watch cost and latency

    Every request OTEL-instrumented. P50/P95/P99, error rate, and cost attributed per model and per key — straight from the dashboard.
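For the Anthropic path in Step 02, a minimal sketch with the official anthropic Python SDK, again with a placeholder gateway URL, key, and model name:

```python
# Point the anthropic SDK at InferX's /anthropic prefix; the SDK appends /v1/messages.
# URL, key, and model name are placeholders for your own deployment.
import anthropic

client = anthropic.Anthropic(
    base_url="https://inferx.example.com/anthropic",  # InferX Messages endpoint
    api_key="ifx-...",                                # an InferX-issued API key
)

message = client.messages.create(
    model="qwen2.5-72b-instruct",
    max_tokens=256,
    messages=[{"role": "user", "content": "Draft a rollback checklist."}],
)
print(message.content[0].text)
```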

Who it's for

Built for these teams

  • Teams shipping LLM products on dedicated capacity
  • Platform teams consolidating inference cost and access
  • Builders of agentic systems with safety and audit needs