
GrokX

Available

Ground your agents in your documents — scanned PDFs included, every answer cited to the page.

GrokX is the knowledge component of TAIP, the third of the trio: InferX serves models, AgentX runs agents, GrokX serves knowledge. It turns a document corpus, including scanned PDFs, into something agents can ground their answers on, with page-level citations.

Ingestion runs once: born-digital pages are read from their text layer and scanned pages are OCR'd. The extracted text is chunked (by paragraph, heading, or sentence) and kept fresh with SHA-256 manifests, so re-ingest skips unchanged docs. Retrieval is hybrid by default: sparse keyword vectors and dense embeddings are combined with rank fusion in the vector DB, then reordered by a cross-encoder reranker. A text-mirror mode (a markdown tree you mount and grep) is available for smaller corpora.

Everything is multi-tenant: a web console with OIDC SSO, scoped personal access tokens, and per-KB RBAC governs many isolated knowledge bases, each its own collection in the vector DB. When GrokX is registered on an AgentX Agent under the alias kb, mcp__kb__search(query, kb) returns matching passages with their source and page, so the model cites 'page 12 of report.pdf' instead of guessing. Embeddings and reranking call your own InferX endpoints. GrokX ships as a Helm-packaged TAIP app (MCP server, console, vector DB, and an indexer) and runs end-to-end in production today.


Specification

Status: v0.7.0 — shipped, running in production
Ingest: Born-digital text + OCR for scans · PDF · Word · HTML · Markdown · text
Retrieval: Hybrid (sparse keyword + dense) with rank fusion · cross-encoder rerank · page citations
Embeddings: Pluggable model via InferX — OpenAI-compatible /embeddings
Store: Pluggable vector DB (named dense + sparse vectors) · local store for dev
Access: Web console · OIDC SSO · scoped PATs · per-KB RBAC · audit trail
Serving: MCP server (streamable HTTP) · Helm: server + console + vector DB + indexer

Proof, not promises

See it in one block.

No proprietary SDKs, no rewrites — GrokX meets your tools where they already are.

a corpus becomes a tool your agents can call

$ grokx push ./corpus --kb research   # upload + OCR scans + index, resumable
ingested 142 docs · 38 OCR'd · 1,907 pages → indexed 9,841 chunks
$ grokx serve                          # MCP server (streamable HTTP) on :8080
serving 6 knowledge bases
# an AgentX Agent calls the tool — hybrid + reranked, with a citation:
mcp__kb__search("Q3 revenue", kb="research") → "…revenue was $4.2M…"  [report.pdf p.12]

OCR and embedding run once at ingest, not per query. Sparse keyword and dense vectors are fused and reranked at search time, and every passage keeps its source and page so the answer can be cited.

Capabilities

What GrokX gives you

Ingestion that reads scanned PDFs

Walk a corpus and extract every page: born-digital text straight from the text layer, image-only scans via OCR. PDF, Word, HTML, Markdown, and plain text all ingest. Running grep over raw PDF bytes is useless, and text models can't see page images, so extraction is mandatory. GrokX does it once and degrades to ocr-skipped rather than failing.
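The per-page decision described above can be sketched roughly as follows. This is an illustration, not GrokX's extractor: the Page and Extracted types, the extract_page function, and the "text-layer" and "ocr" status names are hypothetical. Only the ocr-skipped status comes from the text above, and the conditions that trigger it here (no OCR engine configured, or the engine raising an error) are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Page:
    number: int
    text_layer: str          # empty for image-only scans
    image: Optional[bytes]   # rendered page image, if one is available


@dataclass
class Extracted:
    page: int
    text: str
    status: str  # "text-layer", "ocr" (both hypothetical names) or "ocr-skipped"


def extract_page(page: Page,
                 ocr: Optional[Callable[[bytes], str]] = None) -> Extracted:
    """Prefer the text layer, fall back to OCR, and never fail the ingest."""
    if page.text_layer.strip():
        return Extracted(page.number, page.text_layer, "text-layer")
    # Assumption: with no OCR engine or no image, the page is marked and kept.
    if ocr is None or page.image is None:
        return Extracted(page.number, "", "ocr-skipped")
    try:
        return Extracted(page.number, ocr(page.image), "ocr")
    except Exception:
        # Assumption: an OCR error degrades the page instead of aborting the run.
        return Extracted(page.number, "", "ocr-skipped")
```

A page with a text layer never reaches the OCR callable, which is what keeps born-digital documents cheap to ingest.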

Hybrid retrieval, reranked

Sparse keyword vectors and dense embeddings are stored together in the vector DB and combined with rank fusion, then a cross-encoder reranker reorders the top results. Lexical recall and semantic recall in one query — or mount the markdown text mirror and grep it for smaller corpora.
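To make "rank fusion" concrete, here is a minimal sketch of reciprocal rank fusion (RRF), one widely used fusion method. The page does not say which fusion algorithm GrokX or its vector DB applies, so treat RRF, the rrf function name, and the k=60 constant as assumptions chosen for illustration. In GrokX the fusion happens inside the vector DB, not in application code like this.

```python
from collections import defaultdict


def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk ids into one list.

    Each list contributes 1 / (k + rank) per chunk, so a chunk that ranks
    well in both the sparse and the dense list rises to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda c: (-scores[c], c))


sparse_hits = ["c1", "c2", "c3"]   # keyword ranking (hypothetical ids)
dense_hits = ["c3", "c1", "c4"]    # embedding ranking (hypothetical ids)
fused = rrf([sparse_hits, dense_hits])
print(fused)  # ['c1', 'c3', 'c2', 'c4']
```

The fused list would then be handed to the cross-encoder reranker, which rescores only the top candidates.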

Page-level citations

Every chunk keeps its source and page, so an agent can answer 'per page 12 of report.pdf' instead of producing an unverifiable claim. Provenance is preserved from ingest through retrieval and rerank.
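One way to picture preserved provenance is a passage record whose source and page travel with the text through every reordering step. The Passage type and the rerank and cite helpers below are hypothetical names for illustration; the citation format mirrors the demo output shown earlier on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    text: str
    source: str   # originating file name
    page: int     # page number within the source


def rerank(passages: list[Passage], scores: list[float]) -> list[Passage]:
    """Reorder passages by score; the records themselves are never rebuilt."""
    order = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    return [passages[i] for i in order]


def cite(p: Passage) -> str:
    """Render the citation the agent can attach to its answer."""
    return f"[{p.source} p.{p.page}]"


hits = [
    Passage("…guidance for next year…", "outlook.pdf", 3),
    Passage("…revenue was $4.2M…", "report.pdf", 12),
]
best = rerank(hits, scores=[0.21, 0.93])[0]
print(best.text, cite(best))  # …revenue was $4.2M… [report.pdf p.12]
```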

A tool AgentX can call

grokx serve exposes an MCP server. Registered on an AgentX Agent under the alias kb, it becomes mcp__kb__search(query, kb, k, source?, page?) — plus list_knowledge_bases, list_sources, and get_document. The model decides when to search and gets passages back with citations. The vector store lives in GrokX, never inside the agent sandbox.
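Under MCP, a tool invocation is a JSON-RPC 2.0 request with the method tools/call. The sketch below builds such a request for the search tool. The argument names come from the signature above, but the server-side tool name "search", the default k=5, and the build_search_call helper are assumptions. In practice AgentX issues this call on the model's behalf.

```python
import json
from typing import Any, Optional


def build_search_call(query: str, kb: str, k: int = 5,
                      source: Optional[str] = None,
                      page: Optional[int] = None,
                      request_id: int = 1) -> dict[str, Any]:
    """Build an MCP tools/call request; optional filters are omitted if unset."""
    arguments: dict[str, Any] = {"query": query, "kb": kb, "k": k}
    if source is not None:
        arguments["source"] = source
    if page is not None:
        arguments["page"] = page
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        # Assumption: the tool is named "search" on the server; the mcp__kb__
        # prefix comes from the alias under which it is registered.
        "params": {"name": "search", "arguments": arguments},
    }


request = build_search_call("Q3 revenue", kb="research", source="report.pdf")
print(json.dumps(request, indent=2, ensure_ascii=False))
```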

Many knowledge bases, governed

A web console with OIDC SSO manages multiple isolated knowledge bases — each its own collection in the vector DB. Per-KB RBAC (viewer / editor / owner), ACL sharing to users and groups, scoped personal access tokens, and an append-only audit trail. The KB is the unit of access control.
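A per-KB role check can be sketched as below. The three role names come from the text above; everything else is an assumption made for illustration, including the idea that roles form a strict hierarchy, the action names, and the mapping of actions to minimum roles.

```python
from enum import IntEnum


class Role(IntEnum):
    VIEWER = 1
    EDITOR = 2
    OWNER = 3


# Hypothetical mapping of actions to the minimum role they require.
REQUIRED = {"search": Role.VIEWER, "ingest": Role.EDITOR, "share": Role.OWNER}


def allowed(acls: dict[str, dict[str, Role]], kb: str, user: str,
            action: str) -> bool:
    """Default-deny: a user with no entry on this KB can do nothing."""
    role = acls.get(kb, {}).get(user)
    return role is not None and role >= REQUIRED[action]


# One ACL per knowledge base: the KB is the unit of access control.
acls = {
    "research": {"ana": Role.OWNER, "ben": Role.VIEWER},
    "legal": {"ana": Role.VIEWER},
}
print(allowed(acls, "research", "ben", "search"))  # True
print(allowed(acls, "research", "ben", "ingest"))  # False
print(allowed(acls, "legal", "ben", "search"))     # False
```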

Four ways to ingest, kept fresh

Upload through the web console (resumable), grokx push / sync from the CLI, mount a WebDAV folder, or wire a scheduled git connector. SHA-256 manifests track every source so re-ingest and re-index skip unchanged docs and prune deletions — expensive OCR and embedding work is never repeated.
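The manifest idea can be sketched with the standard library: hash every file, compare against the previous manifest, and only reprocess what changed. GrokX's actual manifest format is not described here, so the path-to-digest dictionary and the function names below are assumptions.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Hash a file in 1 MiB blocks so large PDFs are not loaded whole."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {p.relative_to(root).as_posix(): sha256_file(p)
            for p in sorted(root.rglob("*")) if p.is_file()}


def diff_manifests(old: dict[str, str], new: dict[str, str]
                   ) -> tuple[list[str], list[str], list[str]]:
    """Return (to_process, skipped, pruned) relative to the last ingest."""
    to_process = sorted(k for k, v in new.items() if old.get(k) != v)
    skipped = sorted(k for k, v in new.items() if old.get(k) == v)
    pruned = sorted(k for k in old if k not in new)
    return to_process, skipped, pruned
```

Only the paths in to_process would go back through OCR and embedding; paths in pruned would have their chunks removed from the index.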

Embeddings and rerank on your InferX

Embeddings and reranking call your own OpenAI-compatible InferX endpoints — no third-party embedding API, no data leaving the perimeter. A dependency-free local store and hash embedder cover dev with no infra.
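For intuition about what a dependency-free hash embedder can look like, here is a sketch of the classic feature-hashing approach: each token is hashed into one of a fixed number of buckets. How GrokX's own hash embedder works is not described on this page, so the function names, the 256-dimension default, and the signed-bucket scheme are assumptions. Vectors like these capture token overlap only and are meant for wiring up a dev environment, not for retrieval quality.

```python
import hashlib
import math
import re


def hash_embed(text: str, dim: int = 256) -> list[float]:
    """Deterministic bag-of-words vector built with the hashing trick."""
    vec = [0.0] * dim
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim   # bucket for the token
        sign = 1.0 if digest[4] % 2 == 0 else -1.0        # spreads collisions
        vec[index] += sign
    norm = math.sqrt(sum(x * x for x in vec))
    # L2-normalize so a dot product equals cosine similarity.
    return [x / norm for x in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Dot product of two already-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))
```

Because the output depends only on the text, the same document always embeds to the same vector, which makes local tests repeatable without any model endpoint.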

How it works

From a pile of PDFs to a cited answer.

Step 01

Ingest and OCR the corpus

grokx push (or the web console, WebDAV, or a git connector) extracts born-digital text and OCRs scanned pages, chunks them, and indexes them once, incrementally, with provenance preserved.

Step 02

Index into hybrid search

Chunks are embedded and stored alongside sparse keyword vectors in a per-KB collection in the vector DB, ready for fused, reranked retrieval. Alternatively, the extracted text can be mounted as a markdown mirror to grep.

Step 03

Serve it to agents over MCP

grokx serve exposes the MCP server, which is registered on an AgentX Agent under the alias kb. The model calls mcp__kb__search when it needs evidence and gets back passages with source and page.

Step 04

Agents answer with citations

Responses are grounded in your documents and anchored to the exact page — verifiable, not guessed.

Who it's for

Built for these teams

Teams building agents that must answer from private documents
Anyone with a corpus of scanned PDFs that lexical search can't read
AI app teams that need grounded, citable answers — not hallucinations
Platform teams standing up a shared, governed, multi-tenant knowledge index

Pairs well with

Other builder products

ConsoleX

Available

On first SSO login every user gets an isolated namespace with quotas, default-deny networking, storage, and a web terminal — provisioned automatically, reconciled continuously.


DevSpace

Available

Jupyter or VS Code on a GPU in seconds. Idle environments shut themselves down.

Single-click Jupyter, Marimo, Streamlit, Gradio, and VS Code environments — GPU-ready, isolated per user behind a per-pod auth proxy, with SSH access and idle shutdown by default.


TrainX

Available

Admins write the template. Users fill a form. Kubernetes runs the job.

Self-describing training templates render straight into UI forms — with live quota checks, streaming logs, parsed progress bars, and one-click TensorBoard.
