Slurm on TAIP

Available

Your Slurm scripts, unchanged — running on Kubernetes, with one login, one home, one quota.

Slurm on TAIP runs genuine, multi-tenant Slurm — not an emulation — on your Kubernetes clusters, built on SchedMD's upstream Slinky slurm-operator. Real slurmctld / slurmd / slurmdbd / slurmrestd schedule jobs, so the behavior you depend on (MPI, accounting semantics, partitions, QOS, array jobs) is exactly Slurm's. It fuses into the rest of TAIP: SSH login authenticates against your platform identity (LDAP via SSSD), your /home is provisioned on first login on shared cluster storage, and a small leader-elected controller (account-sync) reconciles each user's ConsoleX workspace quota into Slurm accounts and GrpTRES limits — strictly enforced. GPUs are first-class with hard cgroup isolation, and an Open OnDemand web portal adds browser-based Files, Jobs, Shell, and JupyterLab for users who never want to touch SSH. Slurm 25.11 uses auth/slurm, so there are no MUNGE daemons to operate; upstream images are mirrored as-is, never forked.

Specification

Status: v0.4.4 — in production on Kubernetes
Slurm: Genuine Slurm 25.11 · Slinky slurm-operator 1.1.1 · auth/slurm (no MUNGE)
Access: SSH (LDAP/SSSD) · JWT slurmrestd · Open OnDemand web portal
Identity: Platform LDAP directory · POSIX uid/gid · pam_mkhomedir on first login
Storage: Shared cluster storage — /home + /scratch, same files as the rest of TAIP
Tenancy: ConsoleX quota → Slurm accounts + GrpTRES (CPU/mem/GPU), strictly enforced
GPU: gres/gpu + a gpu partition · hard cgroup ConstrainDevices isolation

Proof, not promises

See it in one block.

No proprietary SDKs, no rewrites — Slurm on TAIP meets your tools where they already are.

the script you already have, byte for byte

$ ssh you@taip-login            # your platform password — same as everywhere
$ ls ~                          # /home already mounted on cluster storage
data/  train.slurm
$ sbatch train.slurm            # unchanged — 0 lines edited
Submitted batch job 4127
$ squeue --me                   # running on a real GPU node
JOBID  PARTITION  NAME   ST  TIME  NODES
 4127        gpu  train    R  0:08      1

Genuine Slurm 25.11 — sbatch / srun / salloc / sacct, partitions, QOS, array jobs all behave exactly as Slurm behaves. It runs inside the Kubernetes platform, not in a parallel universe next to it.
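
For teams that drive submission from tooling rather than a terminal, the transcript above can be wrapped in a few lines of Python. This is a minimal sketch, not part of the product: `parse_job_id` and `submit` are hypothetical helper names, and the only assumptions are that `sbatch` is on the PATH of the login node and prints its usual `Submitted batch job <id>` line.

```python
import re
import subprocess

# sbatch prints "Submitted batch job <id>" on success, as in the transcript above.
_SUBMITTED = re.compile(r"Submitted batch job (\d+)")


def parse_job_id(sbatch_stdout: str) -> int:
    """Extract the numeric job ID from sbatch's stdout."""
    match = _SUBMITTED.search(sbatch_stdout)
    if match is None:
        raise ValueError(f"unexpected sbatch output: {sbatch_stdout!r}")
    return int(match.group(1))


def submit(script_path: str) -> int:
    """Run `sbatch <script>` on the unmodified script and return the job ID.

    Hypothetical helper: it must run where the Slurm client tools are installed.
    """
    result = subprocess.run(
        ["sbatch", script_path],
        check=True,           # raise CalledProcessError if sbatch fails
        capture_output=True,
        text=True,
    )
    return parse_job_id(result.stdout)
```

On a login node, `submit("train.slurm")` would hand the script to Slurm exactly as the `sbatch` line in the transcript does; the script itself is never rewritten.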

Capabilities

What Slurm on TAIP gives you

Your scripts run unchanged

Real slurmctld / slurmd / slurmdbd / slurmrestd (upstream Slurm 25.11) schedule your jobs — not a Kubernetes scheduler in a Slurm costume. sbatch, srun, salloc, sacct, partitions, QOS, and array jobs behave exactly the way Slurm behaves. Zero lines changed.

One identity, you're already in

SSH in with your existing platform credentials — login nodes run SSSD bound to the platform LDAP directory. POSIX uid/gid come from your directory and pam_mkhomedir creates your home on first login. No separate Slurm user database, no second password.

One data plane

/home/<user> and a shared /scratch live on the same cluster storage the rest of the platform uses. The file your Slurm job writes is the file your notebook reads — no copy-in, no copy-out, no data silo.

One quota, kept in sync

account-sync, a small leader-elected controller, reads each user's workspace quota from ConsoleX and reconciles it into a Slurm account, association, and GrpTRES limit (CPU / memory / gres/gpu). Strictly enforced — over-limit jobs are rejected, not best-effort throttled.
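
To make the reconciliation idea concrete, here is a rough sketch of how a workspace quota could be turned into a GrpTRES value and a set of `sacctmgr` commands. It is an illustration under stated assumptions, not account-sync's actual implementation: the `WorkspaceQuota` fields, the `plan_sync` helper, and the choice of `sacctmgr` as the mechanism are all hypothetical. Memory is expressed in MB.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class WorkspaceQuota:
    """Hypothetical shape of a ConsoleX workspace quota (field names are assumptions)."""
    cpus: int
    memory_mb: int
    gpus: int


def format_grptres(quota: WorkspaceQuota) -> str:
    """Render the quota as a Slurm GrpTRES value (CPU, memory in MB, gres/gpu)."""
    return f"cpu={quota.cpus},mem={quota.memory_mb},gres/gpu={quota.gpus}"


def plan_sync(account: str, desired: WorkspaceQuota,
              current_grptres: Optional[str]) -> List[List[str]]:
    """Return the sacctmgr commands needed to converge; empty if already in sync.

    `current_grptres` is None when the account does not exist yet. The comparison
    is a plain string match, so it assumes the same field order on both sides.
    """
    wanted = format_grptres(desired)
    if current_grptres == wanted:
        return []
    commands: List[List[str]] = []
    if current_grptres is None:
        # Account is missing: create it before setting limits.
        commands.append(["sacctmgr", "-i", "add", "account", f"name={account}"])
    commands.append([
        "sacctmgr", "-i", "modify", "account", "where", f"name={account}",
        "set", f"GrpTRES={wanted}",
    ])
    return commands
```

Returning a plan instead of executing it keeps the sketch idempotent and easy to inspect: running it twice against an already converged account yields no commands.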

GPUs, scheduled and isolated

Ask for the GPU partition and a card (-p gpu --gres=gpu:1) and Slurm schedules it. Hard cgroup ConstrainDevices isolation means a job can't even open a GPU it wasn't allocated — GPU capacity is a governed, scheduled resource, not a free-for-all.
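
From inside a job, a quick way to see which GPUs were handed to you is to read the environment Slurm prepares. The sketch below assumes NVIDIA GPUs and that Slurm's gres/gpu support exports `CUDA_VISIBLE_DEVICES` for the job; `allocated_gpus` is a hypothetical helper name. The variable is only informational: the isolation itself comes from the cgroup device constraint, not from this value.

```python
import os
from typing import List, Mapping, Optional


def allocated_gpus(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the GPU identifiers listed in CUDA_VISIBLE_DEVICES.

    Pass a mapping to inspect a captured environment; by default the current
    process environment is used. An unset or empty variable yields an empty list.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    return [item.strip() for item in raw.split(",") if item.strip()]


if __name__ == "__main__":
    gpus = allocated_gpus()
    print(f"{len(gpus)} GPU(s) visible to this job: {gpus}")
```

Run inside a job submitted with `-p gpu --gres=gpu:1`, this would be expected to report a single entry.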

Web portal — no SSH required

An Open OnDemand portal adds browser-based Files, Jobs, and Shell, plus a JupyterLab interactive app that launches as a Slurm job in your own conda/venv with optional GPUs. OIDC single sign-on with your platform identity.

Automate over REST

Beyond SSH, a JWT-authenticated slurmrestd (including the /slurmdb accounting endpoints) lets you submit and track jobs from CI, pipelines, or your own tooling.
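
As an illustration of what a CI step might send, the sketch below builds (but does not send) a job submission request using only the Python standard library. Several details are assumptions to verify against the OpenAPI description your slurmrestd serves: the host name is hypothetical, the `v0.0.44` version segment is a guess for this Slurm release, and the payload fields follow the general shape of recent slurmrestd versions. `build_submit_request` is a hypothetical helper name.

```python
import json
import urllib.request

# Assumptions: both values are illustrative and deployment-specific.
DEFAULT_BASE_URL = "https://slurmrestd.example.internal"  # hypothetical host
DEFAULT_API_VERSION = "v0.0.44"  # assumed; check your slurmrestd's OpenAPI spec


def build_submit_request(token: str, script: str, *, name: str, partition: str,
                         cwd: str, base_url: str = DEFAULT_BASE_URL,
                         api_version: str = DEFAULT_API_VERSION
                         ) -> urllib.request.Request:
    """Build a POST request that submits a batch script to slurmrestd."""
    payload = {
        "job": {
            "script": script,
            "name": name,
            "partition": partition,
            "current_working_directory": cwd,
            # Assumed minimal environment; adjust to what your jobs need.
            "environment": ["PATH=/usr/bin:/bin"],
        }
    }
    return urllib.request.Request(
        url=f"{base_url}/slurm/{api_version}/job/submit",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-SLURM-USER-TOKEN": token,  # the JWT issued for your user
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the submission; keeping construction separate makes the request easy to inspect or log in a pipeline before anything is sent.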

Upstream, not a fork

Built on SchedMD's vendor-supported Slinky operator; the official slurmctld / slurmd / slurmrestd images are mirrored as-is, and only the login image is layered FROM upstream to add SSSD + cluster storage. Because nothing is forked, you run SchedMD's genuine releases rather than a patched copy that has to catch up with upstream. Kubernetes-native lifecycle: Helm install, operator reconciliation, Secrets-managed keys.

How it works

From platform login to a running job, no migration.

Step 01

Log in with your platform identity

SSH in (or open the web portal) with the credentials you already use. SSSD authenticates against the platform LDAP directory; your /home is provisioned on first login on shared cluster storage.

Step 02

Submit the script you already have

sbatch train.slurm — byte for byte, no rewrite. Real Slurm 25.11 schedules it onto the dedicated node pool, with partitions, QOS, and array jobs working as they always have.

Step 03

Get GPUs and stay within quota

Request the gpu partition and a card; Slurm schedules and hard-isolates it. Your ConsoleX workspace quota is your GrpTRES limit, reconciled automatically and strictly enforced.

Step 04

Track it your way

Watch jobs with squeue/sacct, from the Open OnDemand portal, or over the JWT slurmrestd API — all backed by one shared accounting database.
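
If you want accounting data in a script rather than on screen, `sacct` can emit pipe-delimited output that is simple to parse. The following is a small sketch: `sacct_args` and `parse_sacct` are hypothetical helper names, and the chosen columns (JobID, State, Elapsed) are just an example selection.

```python
from typing import Dict, List


def sacct_args(job_id: int) -> List[str]:
    """Build a sacct command line that prints '|'-delimited rows with a header."""
    return [
        "sacct",
        "-j", str(job_id),
        "--format=JobID,State,Elapsed",
        "--parsable2",  # '|' separated, no trailing delimiter
    ]


def parse_sacct(output: str) -> List[Dict[str, str]]:
    """Turn --parsable2 output into one dict per row, keyed by the header names."""
    lines = [line for line in output.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split("|")
    return [dict(zip(header, line.split("|"))) for line in lines[1:]]
```

Feeding the output of `subprocess.run(sacct_args(4127), capture_output=True, text=True).stdout` into `parse_sacct` would give one record per job step, ready for a dashboard or a CI gate.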

Who it's for

Built for these teams

Research and ML teams with years of Slurm scripts and muscle memory
Platform owners standardizing on Kubernetes who still need to serve HPC users
Organizations replacing a separate bare-metal Slurm cluster with one governed platform
Teams that want batch scheduling and quotas without operating a second identity and storage plane

Pairs well with

Other builder products

ConsoleX

Available

On first SSO login every user gets an isolated namespace with quotas, default-deny networking, storage, and a web terminal — provisioned automatically, reconciled continuously.

DevSpace

Available

Jupyter or VS Code on a GPU in seconds. Idle environments shut themselves down.

Single-click Jupyter, Marimo, Streamlit, Gradio, and VS Code environments — GPU-ready, isolated per user behind a per-pod auth proxy, with SSH access and idle shutdown by default.

TrainX

Available

Admins write the template. Users fill a form. Kubernetes runs the job.

Self-describing training templates render straight into UI forms — with live quota checks, streaming logs, parsed progress bars, and one-click TensorBoard.
