Slurm on TAIP
Available
Your Slurm scripts, unchanged: running on Kubernetes with one login, one home, one quota.
Slurm on TAIP runs genuine, multi-tenant Slurm, not an emulation, on your Kubernetes clusters, built on SchedMD's upstream Slinky slurm-operator. Real slurmctld, slurmd, slurmdbd, and slurmrestd daemons run the cluster, so the behavior you depend on (MPI, accounting semantics, partitions, QOS, array jobs) is exactly Slurm's.
It is integrated with the rest of TAIP: SSH login authenticates against your platform identity (LDAP via SSSD), your /home is created on shared cluster storage at first login, and a small leader-elected controller (account-sync) reconciles each user's ConsoleX workspace quota into Slurm accounts and strictly enforced GrpTRES limits. GPUs are first-class with hard cgroup isolation, and an Open OnDemand web portal adds browser-based Files, Jobs, Shell, and JupyterLab for users who never want to touch SSH. The deployment runs Slurm 25.11 with auth/slurm, so there are no MUNGE daemons to operate; upstream images are mirrored as-is, never forked.
Specification
- Status: v0.4.4, in production on Kubernetes
- Slurm: Genuine Slurm 25.11 · Slinky slurm-operator 1.1.1 · auth/slurm (no MUNGE)
- Access: SSH (LDAP/SSSD) · JWT slurmrestd · Open OnDemand web portal
- Identity: Platform LDAP directory · POSIX uid/gid · pam_mkhomedir on first login
- Storage: Shared cluster storage (/home + /scratch), same files as the rest of TAIP
- Tenancy: ConsoleX quota → Slurm accounts + GrpTRES (CPU/mem/GPU), strictly enforced
- GPU: gres/gpu + a gpu partition · hard cgroup ConstrainDevices isolation
Proof, not promises
See it in one block.
No proprietary SDKs, no rewrites — Slurm on TAIP meets your tools where they already are.
$ ssh you@taip-login # your platform password — same as everywhere
$ ls ~ # /home already mounted on cluster storage
data/ train.slurm
$ sbatch train.slurm # unchanged — 0 lines edited
Submitted batch job 4127
$ squeue --me # running on a real GPU node
JOBID PARTITION NAME  ST TIME NODES
4127  gpu       train R  0:08 1
Genuine Slurm 25.11: sbatch / srun / salloc / sacct, partitions, QOS, and array jobs all behave exactly as Slurm behaves. It runs inside the Kubernetes platform, not in a parallel universe next to it.
Capabilities
What Slurm on TAIP gives you
Your scripts run unchanged
Real slurmctld, slurmd, slurmdbd, and slurmrestd (upstream Slurm 25.11) schedule, run, and account for your jobs, not a Kubernetes scheduler in a Slurm costume. sbatch, srun, salloc, sacct, partitions, QOS, and array jobs behave exactly the way Slurm behaves. Zero lines changed.
One identity, you're already in
SSH in with your existing platform credentials — login nodes run SSSD bound to the platform LDAP directory. POSIX uid/gid come from your directory and pam_mkhomedir creates your home on first login. No separate Slurm user database, no second password.
One data plane
/home/<user> and a shared /scratch live on the same cluster storage the rest of the platform uses. The file your Slurm job writes is the file your notebook reads — no copy-in, no copy-out, no data silo.
One quota, kept in sync
account-sync, a small leader-elected controller, reads each user's workspace quota from ConsoleX and reconciles it into a Slurm account, association, and GrpTRES limit (CPU / memory / gres/gpu). Strictly enforced — over-limit jobs are rejected, not best-effort throttled.
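As a rough illustration of what that reconciliation amounts to, the sketch below turns a quota into the kind of sacctmgr command an administrator could run by hand. The page does not say how account-sync talks to Slurm, so the helper name `grptres_cmd`, the account name `ws-alice`, and the quota values are all hypothetical. The helper only prints the command; it does not change anything.

```shell
#!/usr/bin/env bash
# Hypothetical helper: format a workspace quota as a sacctmgr command line.
# It prints the command instead of running it (running needs Slurm admin rights).
grptres_cmd() {
  local account="$1" cpus="$2" mem_gb="$3" gpus="$4"
  # sacctmgr counts the mem TRES in megabytes, so convert from GB.
  local mem_mb=$((mem_gb * 1024))
  printf 'sacctmgr -i modify account where name=%s set GrpTRES=cpu=%s,mem=%s,gres/gpu=%s\n' \
    "$account" "$cpus" "$mem_mb" "$gpus"
}

# Example quota (made up): 16 CPUs, 64 GB of memory, 2 GPUs.
grptres_cmd ws-alice 16 64 2
```

With a limit like this in place, a job that would push the account past its GrpTRES total does not start, which is the enforcement described above.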
GPUs, scheduled and isolated
Ask for the GPU partition and a card (-p gpu --gres=gpu:1) and Slurm schedules it. Hard cgroup ConstrainDevices isolation means a job can't even open a GPU it wasn't allocated — GPU capacity is a governed, scheduled resource, not a free-for-all.
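A minimal batch script using those two flags might look like the sketch below. The partition name and the `--gres` value come from this page; the job name, the time limit, and the `report_gpus` helper are illustrative assumptions. Slurm normally exports `CUDA_VISIBLE_DEVICES` for gres/gpu allocations, so the script simply reports what it was given.

```shell
#!/bin/bash
#SBATCH --job-name=gpu-check   # assumed name, pick your own
#SBATCH --partition=gpu        # partition name taken from this page
#SBATCH --gres=gpu:1           # one GPU
#SBATCH --time=00:05:00        # assumed limit; adjust to your site's policy

# Report which GPUs the job was handed. Outside a Slurm job both
# variables are unset, so the fallbacks are printed instead.
report_gpus() {
  echo "job ${SLURM_JOB_ID:-none} visible GPUs: ${CUDA_VISIBLE_DEVICES:-none}"
}

report_gpus
```

Submit it with `sbatch` like any other script; the `#SBATCH` lines are plain comments to bash, so the same file also runs harmlessly outside the scheduler.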
Web portal — no SSH required
An Open OnDemand portal adds browser-based Files, Jobs, and Shell, plus a JupyterLab interactive app that launches as a Slurm job in your own conda/venv with optional GPUs. OIDC single sign-on with your platform identity.
Automate over REST
Beyond SSH, a JWT-authenticated slurmrestd (including the /slurmdb accounting endpoints) lets you submit and track jobs from CI, pipelines, or your own tooling.
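For tooling, a small wrapper around curl is often enough. The sketch below is not TAIP's client: the host name is a placeholder, `v0.0.44` is an assumed API version (check which versions your slurmrestd actually serves), and the helper names `slurm_url` and `slurm_get` are made up for illustration. It passes the JWT in slurmrestd's `X-SLURM-USER-TOKEN` header.

```shell
#!/usr/bin/env bash
# Assumptions: placeholder host and a guessed API version; override both.
SLURMRESTD_URL="${SLURMRESTD_URL:-https://slurmrestd.example.invalid}"
SLURM_API_VERSION="${SLURM_API_VERSION:-v0.0.44}"

# Build a versioned endpoint URL, e.g. slurm_url slurm job/4127
slurm_url() {
  local plugin="$1" path="$2"   # plugin is "slurm" or "slurmdb"
  echo "${SLURMRESTD_URL}/${plugin}/${SLURM_API_VERSION}/${path}"
}

# GET an endpoint, sending the JWT from the SLURM_JWT environment variable.
slurm_get() {
  curl --silent --fail \
    --header "X-SLURM-USER-TOKEN: ${SLURM_JWT:?set SLURM_JWT to a valid token}" \
    "$(slurm_url "$1" "$2")"
}

slurm_url slurm job/4127      # live job state
slurm_url slurmdb job/4127    # accounting record
# With a token exported as SLURM_JWT: slurm_get slurm job/4127
```

Keeping the version in one variable makes it easy to move a pipeline forward when the cluster's Slurm release, and with it the supported API versions, changes.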
Upstream, not a fork
Built on SchedMD's vendor-supported Slinky operator; the official slurmctld / slurmd / slurmrestd images are mirrored as-is, and only the login image is layered FROM upstream to add SSSD + cluster storage. You get SchedMD's genuine releases, fast — not a fork that lags. Kubernetes-native lifecycle: Helm install, operator reconciliation, Secrets-managed keys.
How it works
From platform login to a running job, no migration.
- Step 01
Log in with your platform identity
SSH in (or open the web portal) with the credentials you already use. SSSD authenticates against the platform LDAP directory; your /home is provisioned on first login on shared cluster storage.
- Step 02
Submit the script you already have
sbatch train.slurm — byte for byte, no rewrite. Real Slurm 25.11 schedules it onto the dedicated node pool, with partitions, QOS, and array jobs working as they always have.
- Step 03
Get GPUs and stay within quota
Request the gpu partition and a card; Slurm schedules and hard-isolates it. Your ConsoleX workspace quota is your GrpTRES limit, reconciled automatically and strictly enforced.
- Step 04
Track it your way
Watch jobs with squeue/sacct, from the Open OnDemand portal, or over the JWT slurmrestd API — all backed by one shared accounting database.
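The command-line path through these steps can be sketched as a small helper. The function names and the sacct field list are illustrative choices, and `train.slurm` is simply the script name used on this page; only `job_id_from` runs when the block is executed off-cluster.

```shell
#!/usr/bin/env bash
# Extract the numeric ID from sbatch's "Submitted batch job <id>" line.
job_id_from() {
  local line="$1"
  echo "${line##* }"   # keep everything after the last space
}

# Submit a script, then show its queue entry and its accounting record.
submit_and_track() {
  local out id
  out="$(sbatch "$1")" || return 1
  id="$(job_id_from "$out")"
  squeue --me --jobs "$id"                      # live state while queued or running
  sacct -j "$id" --format=JobID,State,Elapsed   # record from the accounting database
}

# On a login node: submit_and_track train.slurm
job_id_from "Submitted batch job 4127"
```

`sbatch --parsable` prints the job ID directly and avoids the parsing step, which is usually the better choice in scripts.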
Who it's for
Built for these teams
- Research and ML teams with years of Slurm scripts and muscle memory
- Platform owners standardizing on Kubernetes who still need to serve HPC users
- Organizations replacing a separate bare-metal Slurm cluster with one governed platform
- Teams that want batch scheduling and quotas without operating a second identity and storage plane
Pairs well with
Other builder products
ConsoleX
Available
Log in, get a governed Kubernetes workspace. No kubectl, no tickets.
On first SSO login every user gets an isolated namespace with quotas, default-deny networking, storage, and a web terminal, all provisioned automatically and reconciled continuously.
DevSpace
Available
Jupyter or VS Code on a GPU in seconds. Idle environments shut themselves down.
Single-click Jupyter, Marimo, Streamlit, Gradio, and VS Code environments: GPU-ready, isolated per user behind a per-pod auth proxy, with SSH access and idle shutdown by default.
TrainX
Available
Admins write the template. Users fill a form. Kubernetes runs the job.
Self-describing training templates render straight into UI forms, with live quota checks, streaming logs, parsed progress bars, and one-click TensorBoard.