TAIP Admin

Available

The whole AI cluster on one screen. kubectl optional.

TAIP Admin is a web-based administration console purpose-built for AI infrastructure. One Go binary serves both the API and the single-page app (SPA). It auto-detects metrics-server, Kueue, KServe, Training Operator, cert-manager, Gateway API, DRA, and VPA: each integration lights up when its API is present in the cluster and disappears cleanly when it is not. It connects to Prometheus, Alertmanager, and Grafana with one URL each and degrades gracefully when they are absent. Resource accounting works in three tiers: requests and capacity from the Kubernetes API alone, live usage with metrics-server, and up to 30 days of history with Prometheus. It is engineered, not just assembled: with metadata-only reads, listing 617 secrets takes 0.5 seconds instead of 62, and secret values never reach the browser.
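As an illustration of what a metadata-only read can look like, here is a minimal Go sketch. It is not TAIP Admin's actual code. It relies on the Kubernetes API server's support for returning `PartialObjectMetadataList` when asked through the `Accept` header, so the response carries names, labels, and timestamps but no secret data. The function name `newSecretListRequest` and the base URL are assumptions made for the example, and authentication is left out.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// metadataOnlyAccept asks the Kubernetes API server to return only object
// metadata (a PartialObjectMetadataList) instead of full objects.
const metadataOnlyAccept = "application/json;as=PartialObjectMetadataList;g=meta.k8s.io;v=v1"

// newSecretListRequest builds a GET request that lists the secrets in one
// namespace with metadata only, so secret values are not part of the response.
// apiServer is a hypothetical base URL; credentials are intentionally omitted.
func newSecretListRequest(apiServer, namespace string) (*http.Request, error) {
	u := fmt.Sprintf("%s/api/v1/namespaces/%s/secrets", apiServer, url.PathEscape(namespace))
	req, err := http.NewRequest(http.MethodGet, u, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", metadataOnlyAccept)
	return req, nil
}

func main() {
	req, err := newSecretListRequest("https://kubernetes.default.svc", "default")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path)
	fmt.Println("Accept:", req.Header.Get("Accept"))
}
```

Because the payload omits the `data` field entirely, the response is much smaller than a full list, which is one plausible source of the kind of speedup quoted above.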

Specification

Version
v1.6.8 — generally available
Footprint
Single Go binary · one Helm release · amd64 + arm64
Auto-detected
metrics-server · Kueue · KServe · Training Operator · cert-manager · Gateway API · DRA · VPA
Connects to
Prometheus · Alertmanager · Grafana — one URL each, optional
GPU telemetry
Per-accelerator (NVIDIA DCGM · Ascend NPU): utilization, memory, temperature, power
Roles
OIDC · admin / viewer split · secrets never sent to browser
Languages
English · 简体中文

Proof, not promises

See it in one block.

No proprietary SDKs, no rewrites — TAIP Admin meets your tools where they already are.

integrations light up by themselves
$ helm install taip-admin taip/taip-admin
detected   metrics-server ✓  kueue ✓  kserve ✓  cert-manager ✓
detected   gateway-api ✓  dra ✓  vpa ✓  training-operator ✓
connected  prometheus ✓  alertmanager ✓  grafana ✓   # one URL each
# remove a component and its pages vanish cleanly
# 617 secrets listed in 0.5s (metadata-only) · values never leave the cluster

One binary, one Helm release. The console grows and shrinks with your stack, and the same build works across Kueue versions without a rebuild.
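The detection shown in the terminal block can be pictured as a lookup from the API groups a cluster serves to the integrations the console enables. The Go sketch below is an assumption about the general approach, not the product's implementation. The group names are the ones these projects commonly register; `detect` and `integrationGroups` are hypothetical names, and the list of served groups would normally come from the API server's discovery endpoints rather than a literal slice.

```go
package main

import (
	"fmt"
	"sort"
)

// integrationGroups maps an integration to the API group whose presence
// suggests it is installed. This is an illustrative approximation: for
// example, kubeflow.org is shared by several Kubeflow components, so a real
// check would also look at the resource kinds served under the group.
var integrationGroups = map[string]string{
	"metrics-server":    "metrics.k8s.io",
	"kueue":             "kueue.x-k8s.io",
	"kserve":            "serving.kserve.io",
	"training-operator": "kubeflow.org",
	"cert-manager":      "cert-manager.io",
	"gateway-api":       "gateway.networking.k8s.io",
	"dra":               "resource.k8s.io",
	"vpa":               "autoscaling.k8s.io",
}

// detect returns the sorted names of integrations whose API group appears in
// the list of groups served by the cluster.
func detect(servedGroups []string) []string {
	present := make(map[string]bool, len(servedGroups))
	for _, g := range servedGroups {
		present[g] = true
	}
	enabled := []string{}
	for name, group := range integrationGroups {
		if present[group] {
			enabled = append(enabled, name)
		}
	}
	sort.Strings(enabled)
	return enabled
}

func main() {
	// A hypothetical cluster that serves Kueue and metrics-server only.
	served := []string{"apps", "batch", "kueue.x-k8s.io", "metrics.k8s.io"}
	fmt.Println("detected:", detect(served))
}
```

Re-running the lookup whenever discovery data changes is what would let a page appear after a component is installed and vanish after it is removed.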

Capabilities

What TAIP Admin gives you

01

GPU and AI workloads, first-class

Extended resources, DRA device browsing, and per-accelerator telemetry for NVIDIA DCGM and Ascend NPU: utilization, memory, temperature, and power. A cluster GPU heatmap shows idle versus active capacity with owner attribution, alongside topology and MIG information. The console also manages KServe InferenceServices and ServingRuntimes, and handles Kueue queue management via API discovery, so one binary works across Kueue versions.

02

Three-tier resource accounting

Requests, limits, and capacity from the Kubernetes API alone; live CPU and memory when metrics-server is present; 1-hour to 30-day history when Prometheus is configured. The same UI scales with your stack and stays fast: 617 secrets in 0.5s, not 62s.
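One way to model the three tiers is as a small capability struct that the UI consults before rendering a panel. The Go sketch below is illustrative only. `Capabilities` and `capabilitiesFor` are hypothetical names, and treating "Prometheus is configured" as "a non-empty URL was supplied" is an assumption.

```go
package main

import "fmt"

// Capabilities records which accounting data sources are usable.
type Capabilities struct {
	RequestsAndCapacity bool // always available from the Kubernetes API
	LiveUsage           bool // requires metrics-server
	History             bool // requires a configured Prometheus URL
}

// capabilitiesFor derives the available tiers from what was detected and
// configured. The base tier never depends on an optional component.
func capabilitiesFor(metricsServerDetected bool, prometheusURL string) Capabilities {
	return Capabilities{
		RequestsAndCapacity: true,
		LiveUsage:           metricsServerDetected,
		History:             prometheusURL != "",
	}
}

func main() {
	fmt.Printf("bare cluster:    %+v\n", capabilitiesFor(false, ""))
	fmt.Printf("metrics-server:  %+v\n", capabilitiesFor(true, ""))
	fmt.Printf("with Prometheus: %+v\n", capabilitiesFor(true, "http://prometheus.example:9090"))
}
```

Keeping the flags independent, rather than a single ordered level, avoids assuming that Prometheus is only ever present alongside metrics-server.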

03

Alerts, silences, and Grafana deep-links

Severity-coded alert tables with one-click silence creation pre-filled from the alert's matchers. Live alert badge in the sidebar. Open-in-Grafana buttons that carry cluster, node, namespace, and pod context with them. Configure outbound receivers — email/SMTP or a CloudSentry webhook — from the console.
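To make the pre-filled silence concrete, here is a rough Go sketch that turns an alert's labels into the JSON body accepted by Alertmanager's v2 silences endpoint (`POST /api/v2/silences`). The field names follow that API. Everything else is an assumption for illustration: the `newSilence` helper, the two-hour default, and the sample labels are not taken from TAIP Admin.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"time"
)

// defaultSilenceDuration is a hypothetical default for the pre-filled form.
const defaultSilenceDuration = 2 * time.Hour

// Matcher mirrors a matcher object in the Alertmanager v2 API.
type Matcher struct {
	Name    string `json:"name"`
	Value   string `json:"value"`
	IsRegex bool   `json:"isRegex"`
	IsEqual bool   `json:"isEqual"`
}

// Silence mirrors the request body for creating a silence.
type Silence struct {
	Matchers  []Matcher `json:"matchers"`
	StartsAt  time.Time `json:"startsAt"`
	EndsAt    time.Time `json:"endsAt"`
	CreatedBy string    `json:"createdBy"`
	Comment   string    `json:"comment"`
}

// newSilence builds one exact-match matcher per alert label, in sorted label
// order so the output is deterministic.
func newSilence(labels map[string]string, startsAt time.Time, d time.Duration, createdBy, comment string) Silence {
	names := make([]string, 0, len(labels))
	for name := range labels {
		names = append(names, name)
	}
	sort.Strings(names)

	matchers := make([]Matcher, 0, len(names))
	for _, name := range names {
		matchers = append(matchers, Matcher{Name: name, Value: labels[name], IsEqual: true})
	}
	return Silence{
		Matchers:  matchers,
		StartsAt:  startsAt,
		EndsAt:    startsAt.Add(d),
		CreatedBy: createdBy,
		Comment:   comment,
	}
}

func main() {
	// Sample labels are made up for the example.
	labels := map[string]string{"alertname": "GPUIdle", "namespace": "ml-team", "severity": "warning"}
	s := newSilence(labels, time.Now().UTC(), defaultSilenceDuration, "admin@example.com", "investigating")
	body, err := json.MarshalIndent(s, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

A console would typically show this as an editable form, so the operator can drop overly specific matchers (a pod name, for instance) before submitting.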

04

War Room for incidents

A full-screen NOC dashboard with auto-refresh, live event SSE feed, node grid with per-node mini gauges, and resource panels — built for wall displays and on-call shifts. Node cordon and drain with real-time eviction progress.
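For readers unfamiliar with server-sent events (SSE), the sketch below shows the wire format such a live feed uses: a `text/event-stream` response made of `event:` and `data:` lines, with a blank line ending each frame. It is a minimal Go illustration using only the standard library. The handler, the event name, and the JSON payload are all hypothetical, and a real feed would keep the connection open and write frames as cluster events arrive.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// writeEvent writes one SSE frame and flushes it so the client sees it
// immediately. data must be a single line; multi-line payloads need one
// "data:" line per line of content.
func writeEvent(w http.ResponseWriter, event, data string) {
	fmt.Fprintf(w, "event: %s\ndata: %s\n\n", event, data)
	if f, ok := w.(http.Flusher); ok {
		f.Flush()
	}
}

// eventsHandler sends a single sample frame. The payload is made up.
func eventsHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/event-stream")
	w.Header().Set("Cache-Control", "no-cache")
	writeEvent(w, "node", `{"name":"gpu-node-1","status":"cordoned"}`)
}

// recordStream runs the handler against an in-memory recorder and returns
// the response content type and body, so the sketch runs without a network.
func recordStream() (contentType, body string) {
	rec := httptest.NewRecorder()
	eventsHandler(rec, httptest.NewRequest(http.MethodGet, "/events", nil))
	return rec.Header().Get("Content-Type"), rec.Body.String()
}

func main() {
	contentType, body := recordStream()
	fmt.Println("Content-Type:", contentType)
	fmt.Print(body)
}
```

In a browser, an `EventSource` pointed at such an endpoint reconnects on its own after a dropped connection, which suits an unattended wall display.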

05

Audit trail, idle reclaim, and an app catalog

A structured audit log of every mutating admin action, with optional persistent history you can query in-app. Idle-resource detection flags idle GPUs, inactive notebooks, and stale jobs to reclaim. An OCI/Helm app catalog browses charts from your registry and installs them through a guided wizard — and a bilingual user guide ships inside the binary.

06

Identity, topology, and queue analytics

Manage platform users and groups, sessions, and MFA devices from the console. Visualize cluster hierarchy and intra-node NUMA topology. Read Kueue queue analytics (wait time, depth, fairness, and preemptions), curate the taip-portal app catalog, and broadcast announcements.

How it works

Install it, then operate the cluster.

  1. Step 01

    Point it at a cluster

    One Go binary, one Helm release, one optional CRD. OIDC for SSO. The whole console is a single process.

  2. Step 02

    Integrations light up automatically

    metrics-server, Kueue, KServe, Training Operator, cert-manager, Gateway API, DRA, and VPA are auto-detected from the cluster. Prometheus, Alertmanager, and Grafana connect with one URL each.

  3. Step 03

    Operate and respond

    Severity-coded alerts, one-click silences, War Room dashboard, live event SSE, node cordon and drain — without a kubectl tab open.

Who it's for

Built for these teams

  • Platform engineers running shared AI clusters
  • On-call responders investigating incidents
  • Auditors and read-only viewers (admin/viewer roles built in)