TAIP

Slurm on TAIP

Available

Your Slurm scripts, unchanged — running on Kubernetes, with one login, one home, one quota.

Slurm on TAIP runs genuine, multi-tenant Slurm (not an emulation) on your Kubernetes clusters, built on SchedMD's upstream Slinky slurm-operator. Real slurmctld / slurmd / slurmdbd / slurmrestd schedule jobs, so the behavior you depend on (MPI, accounting semantics, partitions, QOS, array jobs) is exactly Slurm's.

It fuses into the rest of TAIP: SSH login authenticates against your platform identity (LDAP via SSSD), your /home is provisioned on first login on shared cluster storage, and a small leader-elected controller (account-sync) reconciles each user's ConsoleX workspace quota into Slurm accounts and strictly enforced GrpTRES limits.

GPUs are first-class with hard cgroup isolation, and an Open OnDemand web portal adds browser-based Files, Jobs, Shell, and JupyterLab for users who never want to touch SSH. Slurm 25.11 uses auth/slurm, so there are no MUNGE daemons to operate; upstream images are mirrored as-is, never forked.

Specification

Status: v0.4.4, in production on Kubernetes
Slurm: Genuine Slurm 25.11 · Slinky slurm-operator 1.1.1 · auth/slurm (no MUNGE)
Access: SSH (LDAP/SSSD) · JWT slurmrestd · Open OnDemand web portal
Identity: Platform LDAP directory · POSIX uid/gid · pam_mkhomedir on first login
Storage: Shared cluster storage (/home + /scratch), same files as the rest of TAIP
Tenancy: ConsoleX quota → Slurm accounts + GrpTRES (CPU/mem/GPU), strictly enforced
GPU: gres/gpu + a gpu partition · hard cgroup ConstrainDevices isolation

Proof, not promises

See it in one block.

No proprietary SDKs, no rewrites — Slurm on TAIP meets your tools where they already are.

the script you already have, byte for byte
$ ssh you@taip-login            # your platform password — same as everywhere
$ ls ~                          # /home already mounted on cluster storage
data/  train.slurm
$ sbatch train.slurm            # unchanged — 0 lines edited
Submitted batch job 4127
$ squeue --me                   # running on a real GPU node
JOBID  PARTITION  NAME   ST  TIME  NODES
 4127        gpu  train   R  0:08      1

Genuine Slurm 25.11 — sbatch / srun / salloc / sacct, partitions, QOS, array jobs all behave exactly as Slurm behaves. It runs inside the Kubernetes platform, not in a parallel universe next to it.

Capabilities

What Slurm on TAIP gives you

01

Your scripts run unchanged

Real slurmctld / slurmd / slurmdbd / slurmrestd (upstream Slurm 25.11) schedule your jobs — not a Kubernetes scheduler in a Slurm costume. sbatch, srun, salloc, sacct, partitions, QOS, and array jobs behave exactly the way Slurm behaves. Zero lines changed.
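As a rough illustration of what "unchanged" means for array jobs, the sketch below writes a small array script and submits it only if sbatch is on the PATH. The file name sweep.slurm, the shard wording, and the resource values are hypothetical; the #SBATCH directives, the %A/%a output pattern, and SLURM_ARRAY_TASK_ID are standard Slurm.

```shell
# Write a hypothetical 4-task array job; file name and resources are illustrative.
cat > sweep.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=sweep
#SBATCH --array=0-3
#SBATCH --cpus-per-task=1
#SBATCH --output=sweep_%A_%a.out

# Slurm sets SLURM_ARRAY_TASK_ID per array task; default to 0 for local dry runs.
task_id="${SLURM_ARRAY_TASK_ID:-0}"
echo "processing shard ${task_id}"
EOF

# Dry-run one task locally to check the script logic without a scheduler.
SLURM_ARRAY_TASK_ID=2 bash sweep.slurm

# Submit only where Slurm is actually available (for example on a login node).
if command -v sbatch >/dev/null 2>&1; then
  sbatch sweep.slurm
fi
```

Nothing in the script is specific to Kubernetes or TAIP, which is the point: the same file should work on any Slurm cluster.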

02

One identity, you're already in

SSH in with your existing platform credentials — login nodes run SSSD bound to the platform LDAP directory. POSIX uid/gid come from your directory and pam_mkhomedir creates your home on first login. No separate Slurm user database, no second password.

03

One data plane

/home/<user> and a shared /scratch live on the same cluster storage the rest of the platform uses. The file your Slurm job writes is the file your notebook reads — no copy-in, no copy-out, no data silo.

04

One quota, kept in sync

account-sync, a small leader-elected controller, reads each user's workspace quota from ConsoleX and reconciles it into a Slurm account, association, and GrpTRES limit (CPU / memory / gres/gpu). Strictly enforced — over-limit jobs are rejected, not best-effort throttled.
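One way to see the limits that account-sync reconciled is to read your own association from Slurm accounting. The sketch below is a minimal example under assumptions: the helper name grptres_value is hypothetical, and the sample string "cpu=32,mem=128G,gres/gpu=2" is made up to show the shape of a GrpTRES value, not a real quota. The sacctmgr query uses standard options and runs only where sacctmgr is installed.

```shell
# Hypothetical helper: print the value of one TRES key from a GrpTRES string
# such as "cpu=32,mem=128G,gres/gpu=2". Prints nothing if the key is absent.
grptres_value() {
  printf '%s\n' "$1" | tr ',' '\n' | awk -F= -v key="$2" '$1 == key { print $2 }'
}

# Illustrative value only, not a real quota.
example="cpu=32,mem=128G,gres/gpu=2"
echo "GPU limit in example: $(grptres_value "$example" gres/gpu)"

# On a login node, read your own association limits from Slurm accounting.
if command -v sacctmgr >/dev/null 2>&1; then
  sacctmgr -n -P show assoc where user="$(id -un)" format=Account,User,GrpTRES
fi
```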

05

GPUs, scheduled and isolated

Ask for the GPU partition and a card (-p gpu --gres=gpu:1) and Slurm schedules it. Hard cgroup ConstrainDevices isolation means a job can't even open a GPU it wasn't allocated — GPU capacity is a governed, scheduled resource, not a free-for-all.
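The request above can also be written as directives in a batch script. The following is a minimal sketch, not a reference job: the file name gpu-check.slurm and the time limit are hypothetical, and it assumes NVIDIA GPUs with nvidia-smi available inside the job environment.

```shell
# Write a hypothetical one-GPU job; the file name and time limit are illustrative.
cat > gpu-check.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=gpu-check
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=00:05:00

# Assumes NVIDIA GPUs: with device isolation on, only the allocated GPU
# should be listed here.
nvidia-smi -L
EOF

# Submit only where Slurm is available (for example on a login node).
if command -v sbatch >/dev/null 2>&1; then
  sbatch gpu-check.slurm
fi
```

If device isolation works as described, the job output should list one GPU even on a node that has several.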

06

Web portal — no SSH required

An Open OnDemand portal adds browser-based Files, Jobs, and Shell, plus a JupyterLab interactive app that launches as a Slurm job in your own conda/venv with optional GPUs. OIDC single sign-on with your platform identity.

07

Automate over REST

Beyond SSH, a JWT-authenticated slurmrestd (including the /slurmdb accounting endpoints) lets you submit and track jobs from CI, pipelines, or your own tooling.
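A submission over REST might look roughly like the sketch below. Several parts are assumptions rather than documented TAIP values: the SLURMRESTD_URL endpoint is a placeholder you would supply, the API version v0.0.44 is a guess for Slurm 25.11 that should be checked against what your slurmrestd serves, and the payload field names follow the upstream slurmrestd job-submit schema as best understood here. How TAIP issues tokens is not stated in this page; on stock Slurm, scontrol token prints a SLURM_JWT=... line.

```shell
# Assumptions: endpoint and API version are placeholders, not documented TAIP values.
SLURMRESTD_URL="${SLURMRESTD_URL:-}"      # e.g. https://<your-slurmrestd-host>
API_VERSION="${API_VERSION:-v0.0.44}"     # assumed; verify against your slurmrestd

# Build a minimal job submission payload (job name and paths are illustrative).
build_payload() {
  cat <<'EOF'
{
  "job": {
    "name": "rest-demo",
    "script": "#!/bin/bash\nsrun hostname\n",
    "current_working_directory": "/scratch",
    "environment": ["PATH=/usr/local/bin:/usr/bin:/bin"]
  }
}
EOF
}

# Submit only when an endpoint and a token are both configured.
if [ -n "$SLURMRESTD_URL" ] && [ -n "${SLURM_JWT:-}" ]; then
  build_payload | curl -sS -X POST \
    -H "X-SLURM-USER-TOKEN: ${SLURM_JWT}" \
    -H "Content-Type: application/json" \
    --data-binary @- \
    "${SLURMRESTD_URL}/slurm/${API_VERSION}/job/submit"
fi
```

Keeping the payload in its own function makes it easy to inspect or validate the JSON in CI before any request is sent.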

08

Upstream, not a fork

Built on SchedMD's vendor-supported Slinky operator; the official slurmctld / slurmd / slurmrestd images are mirrored as-is, and only the login image is layered FROM upstream to add SSSD + cluster storage. You get SchedMD's genuine releases, fast — not a fork that lags. Kubernetes-native lifecycle: Helm install, operator reconciliation, Secrets-managed keys.

How it works

From platform login to a running job, no migration.

  1. Step 01

    Log in with your platform identity

    SSH in (or open the web portal) with the credentials you already use. SSSD authenticates against the platform LDAP directory; your /home is provisioned on first login on shared cluster storage.

  2. Step 02

    Submit the script you already have

    sbatch train.slurm — byte for byte, no rewrite. Real Slurm 25.11 schedules it onto the dedicated node pool, with partitions, QOS, and array jobs working as they always have.

  3. Step 03

    Get GPUs and stay within quota

    Request the gpu partition and a card; Slurm schedules and hard-isolates it. Your ConsoleX workspace quota is your GrpTRES limit, reconciled automatically and strictly enforced.

  4. Step 04

    Track it your way

    Watch jobs with squeue/sacct, from the Open OnDemand portal, or over the JWT slurmrestd API — all backed by one shared accounting database.
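For scripted tracking from the command line, a small polling loop over sacct is often enough. The sketch below is one possible approach, with hypothetical helper names (is_terminal_state, wait_for_job) and a deliberately simplified list of terminal states; the sacct flags used (-j, -X, -n, -P, --format=State) are standard.

```shell
# Return success if a Slurm job state string is terminal.
# Simplified list; states that may lead to a requeue are not handled specially.
is_terminal_state() {
  case "$1" in
    COMPLETED|FAILED|TIMEOUT|OUT_OF_MEMORY|NODE_FAIL|BOOT_FAIL|DEADLINE|CANCELLED*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: poll accounting until the job reaches a terminal state,
# then print that state. An empty state (job not yet recorded) keeps polling.
wait_for_job() {
  job_id="$1"
  while :; do
    state="$(sacct -j "$job_id" -X -n -P --format=State | head -n 1)"
    if is_terminal_state "$state"; then
      echo "$state"
      return 0
    fi
    sleep "${POLL_SECONDS:-15}"
  done
}

# Example (on a login node): wait_for_job 4127
```

The CANCELLED* pattern is there because sacct can report cancelled jobs as "CANCELLED by <uid>".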

Who it's for

Built for these teams

  • Research and ML teams with years of Slurm scripts and muscle memory
  • Platform owners standardizing on Kubernetes who still need to serve HPC users
  • Organizations replacing a separate bare-metal Slurm cluster with one governed platform
  • Teams that want batch scheduling and quotas without operating a second identity and storage plane