TAIP

GrokX

Available

Ground your agents in your documents — scanned PDFs included, every answer cited to the page.

GrokX is the knowledge component of TAIP, the third of the trio: InferX serves models, AgentX runs agents, GrokX serves knowledge. It turns a document corpus, including scanned PDFs, into something agents can ground their answers on, with page-level citations.

Ingestion runs once: born-digital pages are read from their text layer and scanned pages are OCR'd. The extracted text is then chunked (by paragraph, heading, or sentence) and kept fresh with SHA-256 manifests, so re-ingest skips unchanged docs. Retrieval is hybrid by default: sparse keyword vectors and dense embeddings are combined with rank fusion in the vector DB, then reordered by a cross-encoder reranker. A text-mirror mode (a markdown tree you mount and grep) is available for smaller corpora.

Everything is multi-tenant: a web console with OIDC SSO, scoped personal access tokens, and per-KB RBAC governs many isolated knowledge bases, each its own collection in the vector DB. Registered on an AgentX Agent under the alias kb, mcp__kb__search(query, kb) returns matching passages with their source and page, so the model cites 'page 12 of report.pdf' instead of guessing. Embeddings and reranking call your own InferX endpoints. GrokX ships as a Helm-packaged TAIP app (MCP server, console, vector DB, and an indexer) and runs end-to-end in production today.

Specification

Status
v0.7.0 — shipped, running in production
Ingest
Born-digital text + OCR for scans · PDF · Word · HTML · Markdown · text
Retrieval
Hybrid (sparse keyword + dense) with rank fusion · cross-encoder rerank · page citations
Embeddings
Pluggable model via InferX — OpenAI-compatible /embeddings
Store
Pluggable vector DB (named dense + sparse vectors) · local store for dev
Access
Web console · OIDC SSO · scoped PATs · per-KB RBAC · audit trail
Serving
MCP server (streamable HTTP) · Helm: server + console + vector DB + indexer

Proof, not promises

See it in one block.

No proprietary SDKs, no rewrites — GrokX meets your tools where they already are.

a corpus becomes a tool your agents can call
$ grokx push ./corpus --kb research   # upload + OCR scans + index, resumable
ingested 142 docs · 38 OCR'd · 1,907 pages → indexed 9,841 chunks
$ grokx serve                          # MCP server (streamable HTTP) on :8080
serving 6 knowledge bases
# an AgentX Agent calls the tool — hybrid + reranked, with a citation:
mcp__kb__search("Q3 revenue", kb="research") → "…revenue was $4.2M…"  [report.pdf p.12]

OCR and embedding run once at ingest, not per query. Sparse keyword and dense vectors are fused and reranked at search time, and every passage keeps its source and page so the answer can be cited.

Capabilities

What GrokX gives you

01

Ingestion that reads scanned PDFs

Walk a corpus and extract every page: born-digital text straight from the text layer, image-only scans via OCR. PDF, Word, HTML, Markdown, and plain text all ingest. Running grep over raw PDF bytes is useless, and text models can't see page images, so extraction is mandatory. GrokX does it once, degrading to ocr-skipped rather than failing.

02

Hybrid retrieval, reranked

Sparse keyword vectors and dense embeddings are stored together in the vector DB and combined with rank fusion, then a cross-encoder reranker reorders the top results. Lexical recall and semantic recall in one query — or mount the markdown text mirror and grep it for smaller corpora.
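
The page does not say which rank-fusion formula the vector DB applies. As a minimal sketch, assuming reciprocal rank fusion (RRF), a common choice, with the conventional constant k=60, fusing one sparse and one dense ranking might look like this. The function name rrf_fuse and the chunk ids are hypothetical, and the cross-encoder rerank step is not shown.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of chunk ids with reciprocal rank fusion (assumed method)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every id it contains.
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


sparse_hits = ["c7", "c2", "c9"]  # hypothetical keyword ranking
dense_hits = ["c2", "c4", "c7"]   # hypothetical embedding ranking
fused = rrf_fuse([sparse_hits, dense_hits])
print(fused)
```

Chunks that appear in both lists rise to the top, which is the "lexical recall and semantic recall in one query" effect described above.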

03

Page-level citations

Every chunk keeps its source and page, so an agent can answer 'per page 12 of report.pdf' instead of producing an unverifiable claim. Provenance is preserved from ingest through retrieval and rerank.
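
One way to picture provenance surviving ingest is a chunk record that never travels without its source and page. The sketch below is illustrative only: the Chunk type, chunk_by_paragraph, and cite are hypothetical names, and paragraph splitting on blank lines stands in for whatever chunker GrokX actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str  # file the passage came from
    page: int    # 1-based page number


def chunk_by_paragraph(pages, source):
    """Split each page's text on blank lines, tagging every chunk with its page."""
    chunks = []
    for page_number, page_text in enumerate(pages, start=1):
        for paragraph in page_text.split("\n\n"):
            paragraph = paragraph.strip()
            if paragraph:
                chunks.append(Chunk(paragraph, source, page_number))
    return chunks


def cite(chunk):
    """Render a citation in the same shape as the example output on this page."""
    return f"[{chunk.source} p.{chunk.page}]"


# Placeholder page texts, not real document content.
pages = ["First paragraph.\n\nSecond paragraph.", "Text on the next page."]
chunks = chunk_by_paragraph(pages, "report.pdf")
print(cite(chunks[-1]))
```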

04

A tool AgentX can call

grokx serve exposes an MCP server. Registered on an AgentX Agent under the alias kb, it becomes mcp__kb__search(query, kb, k, source?, page?) — plus list_knowledge_bases, list_sources, and get_document. The model decides when to search and gets passages back with citations. The vector store lives in GrokX, never inside the agent sandbox.
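
For orientation, an MCP tool invocation is a JSON-RPC 2.0 request with the method tools/call. The sketch below builds such a request for the search tool. It assumes the server-side tool name is search (with mcp__kb__ being the prefix added for the kb alias) and a default of k=5; neither is confirmed by this page, and build_search_call is a hypothetical helper.

```python
import json


def build_search_call(query, kb, k=5, source=None, page=None, request_id=1):
    """Build a JSON-RPC 2.0 tools/call request for the (assumed) search tool."""
    arguments = {"query": query, "kb": kb, "k": k}
    # source and page are optional filters; include them only when given.
    if source is not None:
        arguments["source"] = source
    if page is not None:
        arguments["page"] = page
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "search", "arguments": arguments},
    }


request = build_search_call("Q3 revenue", kb="research", source="report.pdf")
print(json.dumps(request, indent=2))
```

In practice an AgentX Agent issues this call on the model's behalf; the payload is shown only to make the parameters concrete.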

05

Many knowledge bases, governed

A web console with OIDC SSO manages multiple isolated knowledge bases — each its own collection in the vector DB. Per-KB RBAC (viewer / editor / owner), ACL sharing to users and groups, scoped personal access tokens, and an append-only audit trail. The KB is the unit of access control.
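
To illustrate the KB as the unit of access control, here is a minimal sketch of a per-KB role check. The three role names come from the text above; the ordering between them, the action names, and the minimum role per action are assumptions made for the example, as are the user names.

```python
from enum import IntEnum


class Role(IntEnum):
    VIEWER = 1
    EDITOR = 2
    OWNER = 3


# Assumed minimum role per action; not taken from GrokX.
REQUIRED = {"search": Role.VIEWER, "ingest": Role.EDITOR, "share": Role.OWNER}


def is_allowed(acl, user, kb, action):
    """Allow an action only if the user holds a sufficient role on that KB."""
    role = acl.get((user, kb))  # no entry means no access to this KB at all
    return role is not None and role >= REQUIRED[action]


# Hypothetical grants: roles are attached to (user, kb) pairs, never globally.
acl = {("ana", "research"): Role.EDITOR, ("ben", "research"): Role.VIEWER}
print(is_allowed(acl, "ana", "research", "ingest"))
print(is_allowed(acl, "ana", "finance", "search"))
```

Keying grants on the (user, kb) pair is what keeps knowledge bases isolated: a role on one KB says nothing about any other.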

06

Four ways to ingest, kept fresh

Upload through the web console (resumable), grokx push / sync from the CLI, mount a WebDAV folder, or wire a scheduled git connector. SHA-256 manifests track every source so re-ingest and re-index skip unchanged docs and prune deletions — expensive OCR and embedding work is never repeated.
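
The skip-and-prune behavior can be sketched as a comparison between a stored manifest and the current corpus. This is an illustration of the idea, not GrokX's manifest format: plan_sync, the {path: digest} layout, and the file names are hypothetical.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def plan_sync(manifest, corpus):
    """Compare a stored manifest {path: digest} with the current {path: bytes}."""
    current = {path: sha256_of(data) for path, data in corpus.items()}
    # New or changed sources need OCR and embedding again.
    to_ingest = sorted(p for p, d in current.items() if manifest.get(p) != d)
    # Sources that vanished from the corpus are pruned from the index.
    to_prune = sorted(p for p in manifest if p not in current)
    # Unchanged sources are skipped, so the expensive work is not repeated.
    skipped = sorted(p for p, d in current.items() if manifest.get(p) == d)
    return to_ingest, to_prune, skipped, current


old = {
    "a.pdf": sha256_of(b"v1"),
    "b.pdf": sha256_of(b"v1"),
    "gone.pdf": sha256_of(b"x"),
}
corpus = {"a.pdf": b"v1", "b.pdf": b"v2", "new.md": b"hello"}
to_ingest, to_prune, skipped, new_manifest = plan_sync(old, corpus)
print(to_ingest, to_prune, skipped)
```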

07

Embeddings and rerank on your InferX

Embeddings and reranking call your own OpenAI-compatible InferX endpoints — no third-party embedding API, no data leaving the perimeter. A dependency-free local store and hash embedder cover dev with no infra.
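
The page does not describe how the hash embedder works. One plausible reading is a feature-hashing embedder: tokens are hashed into a fixed number of buckets, so no model or network call is needed. The sketch below follows that assumption; hash_embed, the dimension of 64, and whitespace tokenization are all hypothetical choices for illustration.

```python
import hashlib
import math


def hash_embed(text, dim=64):
    """Map text to a unit-length vector by hashing tokens into signed buckets."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        # A hashed sign reduces the bias introduced by bucket collisions.
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[bucket] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a, b):
    """Cosine similarity for vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


a = hash_embed("quarterly revenue report")
b = hash_embed("quarterly revenue report")
print(len(a), cosine(a, b))
```

Such vectors are deterministic and cheap, which suits local development, but they capture token overlap rather than meaning, so production retrieval would still rely on a real embedding model behind the /embeddings endpoint.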

How it works

From a pile of PDFs to a cited answer.

  1. Step 01

    Ingest and OCR the corpus

    grokx push (or the web console, WebDAV, or a git connector) extracts born-digital text and OCRs scanned pages, chunks them, and indexes — once, incrementally, with provenance preserved.

  2. Step 02

    Index into hybrid search

    Chunks are embedded and stored alongside sparse keyword vectors in a per-KB collection in the vector DB — ready for fused, reranked retrieval, or mounted as a markdown mirror to grep.

  3. Step 03

    Serve it to agents over MCP

    grokx serve exposes the MCP server. Once it is registered on an AgentX Agent under the alias kb, the model calls mcp__kb__search when it needs evidence and gets back passages with source and page.

  4. Step 04

    Agents answer with citations

    Responses are grounded in your documents and anchored to the exact page — verifiable, not guessed.

Who it's for

Built for these teams

  • Teams building agents that must answer from private documents
  • Anyone with a corpus of scanned PDFs that lexical search can't read
  • AI app teams that need grounded, citable answers — not hallucinations
  • Platform teams standing up a shared, governed, multi-tenant knowledge index