docstruct

Offline, rule-based document parser for AI pipelines. Extracts structured data (tables, key-value pairs, paragraphs, lists) from PDFs, DOCX, XLSX, HTML, images, and plain text — deterministically, with zero API calls.

Why

LLMs hallucinate table cells, skip rows, and cost money per call. docstruct gives you the same structured output every time, runs locally, and processes thousands of documents in minutes.

Built for products that need to parse documents routinely without calling an AI model.

Install

npm install docstruct

Requires Node.js >= 18.

Quick Start

import fs from 'node:fs'
import { parseDoc } from 'docstruct'

// From a file path
const result = await parseDoc('statement.pdf')

// From a buffer (type is required)
const buffer = fs.readFileSync('report.pdf')
const bufferResult = await parseDoc(buffer, { type: 'pdf' })

API

`parseDoc(input, options?)`

| Parameter | Type | Description |
| --- | --- | --- |
| `input` | `string \| Buffer` | File path or buffer. When passing a buffer, `type` is required. |
| `options.type` | `SourceType` | File type: `'pdf'`, `'docx'`, `'xlsx'`, `'html'`, `'image'`, `'text'` |
| `options.extract` | `ExtractTarget[]` | What to extract: `'tables'`, `'keyValues'`, `'paragraphs'`, `'lists'`. Defaults to all. |
| `options.output` | `OutputFormat` | Output format: `'json'` (default), `'csv'`, `'markdown'` |

Response

interface ParseResult {
  tables: Table[]
  keyValues: Record<string, string>
  paragraphs: string[]
  lists: string[][]
  warnings: string[]
  metadata: Metadata
}

interface Table {
  columns: string[]  // Header row
  rows: string[][]   // Data rows
}

interface Metadata {
  sourceType: 'pdf' | 'image' | 'text' | 'docx' | 'xlsx' | 'html'
  pages: number
  extractionMethod: 'text' | 'ocr' | 'direct'
}
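
Because `columns` and `rows` are parallel arrays, it is often convenient to turn a table into an array of keyed records before further processing. The following is a minimal sketch that assumes only the `Table` shape shown above; `tableToRecords` is a hypothetical helper written for illustration, not something docstruct is known to export.

```typescript
// Local copy of the Table shape from this README, so the sketch is self-contained.
interface Table {
  columns: string[]
  rows: string[][]
}

// Hypothetical helper (not part of docstruct): zip each data row with the header row.
function tableToRecords(table: Table): Record<string, string>[] {
  return table.rows.map((row) => {
    const record: Record<string, string> = {}
    table.columns.forEach((column, i) => {
      // Missing trailing cells become empty strings rather than undefined.
      record[column] = row[i] ?? ''
    })
    return record
  })
}

const sample: Table = {
  columns: ['AMOUNT', 'FEES'],
  rows: [['151', '1.13'], ['20']],
}
const records = tableToRecords(sample)
console.log(records)
```

Keying by header name keeps downstream code readable (`record['AMOUNT']` instead of `row[5]`), at the cost of silently overwriting cells if two columns share a header.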

Supported Formats

| Format | Extensions | Parser |
| --- | --- | --- |
| PDF | `.pdf` | pdfjs-dist (text + line extraction) |
| Word | `.docx` | mammoth |
| Excel | `.xlsx` | xlsx |
| HTML | `.html`, `.htm` | cheerio |
| Images | `.jpg`, `.png` | tesseract.js (OCR) |
| Plain text | `.txt` | Built-in |

Table Extraction

docstruct uses multiple strategies to find tables, applied in this order:

1. Pipe-delimited: `| col1 | col2 |` markdown-style tables.
2. Tab-separated: TSV-formatted rows.
3. Lattice (PDF only): extracts ruled lines from the PDF's drawing operators and builds grids from line intersections. Handles merged cells, multi-table pages, and double-drawn borders.
4. Header-anchored spatial: detects header rows with 4+ columns spread across the page and uses the header positions as column anchors. Handles dense tables like bank statements where data elements have varying x-positions.
5. Spatial fallback: clusters text elements by x/y position to infer table structure.

Each strategy claims the text elements it matches, so later strategies only process unclaimed elements.
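
The claiming behaviour can be sketched as a loop that narrows the pool of elements after every strategy. Everything in the block below (`TextElement`, `Strategy`, `runStrategies`, `delimitedStrategy`) is a hypothetical name chosen for illustration; it shows the general technique under simplified assumptions and is not docstruct's internal code.

```typescript
// All names here are hypothetical and exist only for this sketch.
interface TextElement {
  id: number
  text: string
}

interface StrategyResult {
  tables: string[][][] // each table is a list of rows, each row a list of cells
  claimed: Set<number> // ids of the elements this strategy consumed
}

type Strategy = (elements: TextElement[]) => StrategyResult

// Run strategies in priority order; each one only sees what earlier ones left unclaimed.
function runStrategies(elements: TextElement[], strategies: Strategy[]): string[][][] {
  let remaining = elements
  const tables: string[][][] = []
  for (const strategy of strategies) {
    const result = strategy(remaining)
    tables.push(...result.tables)
    remaining = remaining.filter((el) => !result.claimed.has(el.id))
  }
  return tables
}

// Toy strategy: every line containing the delimiter becomes one row of a single table.
// Empty cells are dropped, which a real implementation would likely not do.
function delimitedStrategy(delimiter: string): Strategy {
  return (elements) => {
    const matches = elements.filter((el) => el.text.includes(delimiter))
    const rows = matches.map((el) =>
      el.text
        .split(delimiter)
        .map((cell) => cell.trim())
        .filter((cell) => cell.length > 0),
    )
    return {
      tables: rows.length > 0 ? [rows] : [],
      claimed: new Set(matches.map((el) => el.id)),
    }
  }
}

const elements: TextElement[] = [
  { id: 1, text: '| a | b |' },
  { id: 2, text: 'x\ty' },
  { id: 3, text: 'plain paragraph' },
]
const found = runStrategies(elements, [delimitedStrategy('|'), delimitedStrategy('\t')])
console.log(JSON.stringify(found))
```

Ordering matters in this design: cheap, unambiguous strategies run first, so the more heuristic spatial strategies never get a chance to misinterpret text that already has explicit delimiters.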

Example: Bank Statement

const result = await parseDoc('momo-statement.pdf', {
  extract: ['tables'],
  type: 'pdf'
})

console.log(result.tables[0].columns)
// ["TRANSACTION DATE", "FROM ACCT", "FROM NAME", "FROM NO.", "TRANS. TYPE",
//  "AMOUNT", "FEES", "E-LEVY", "BAL BEFORE", "BAL AFTER", "TO NO.",
//  "TO NAME", "TO ACCT", "F_ID", "REF", "OVA"]

console.log(result.tables[0].rows[0])
// ["12-Dec-2024 09:03:10 PM", "53907613", "Derrick Tsorme", "233547759141",
//  "TRANSFER", "151", "1.13", "1.51", "1222.35", "1220.84", ...]

Output Formats

JSON (default)

Returns a ParseResult object.

CSV

const csv = await parseDoc('data.pdf', { output: 'csv' })
// Returns CSV string of the first table
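
For readers who need CSV for tables other than the first, one option is to request JSON and serialize each `Table` themselves. The sketch below uses RFC 4180 style quoting; `escapeCsvCell` and `tableToCsv` are hypothetical helpers, and docstruct's own CSV formatter may quote cells or end lines differently.

```typescript
// Local copy of the Table shape from this README, so the sketch is self-contained.
interface Table {
  columns: string[]
  rows: string[][]
}

// Quote a cell only if it contains a comma, a double quote, or a line break,
// doubling any embedded double quotes (RFC 4180 style).
function escapeCsvCell(cell: string): string {
  return /[",\r\n]/.test(cell) ? `"${cell.replace(/"/g, '""')}"` : cell
}

// Hypothetical helper (not part of docstruct): header row first, then data rows.
function tableToCsv(table: Table): string {
  return [table.columns, ...table.rows]
    .map((row) => row.map(escapeCsvCell).join(','))
    .join('\r\n')
}

const csvText = tableToCsv({
  columns: ['TO NAME', 'AMOUNT'],
  rows: [
    ['Doe, Jane', '151'],
    ['He said "hi"', '20'],
  ],
})
console.log(csvText)
```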

Markdown

const md = await parseDoc('data.pdf', { output: 'markdown' })
// Returns markdown-formatted tables and content

Architecture

Input (file/buffer)
  → Parser (pdf, docx, xlsx, html, image, text)
    → DocumentIR (normalized text elements + line segments)
      → Extractors (tables, key-values, paragraphs, lists)
        → Formatters (json, csv, markdown)
          → Output

All parsers produce a common intermediate representation (DocumentIR) with normalized coordinates (0-1 range, top-left origin). Extractors work on this IR regardless of source format.
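
As an illustration of what normalization involves, PDF user space measures positions in points from the bottom-left corner of the page, so a parser has to scale by the page size and flip the y-axis to reach a 0-1, top-left-origin system. The sketch below assumes an unrotated page whose origin is at its lower-left corner; `Box` and `normalizeBox` are hypothetical names, and the real DocumentIR types may look different.

```typescript
// Hypothetical type for illustration; docstruct's real DocumentIR may differ.
interface Box {
  x: number
  y: number
  width: number
  height: number
}

// Convert a box in PDF user space (points, bottom-left origin, y pointing up)
// to normalized coordinates (0-1 range, top-left origin, y pointing down).
function normalizeBox(box: Box, pageWidth: number, pageHeight: number): Box {
  return {
    x: box.x / pageWidth,
    // The box's top edge in PDF space is y + height; flip it to measure from the top.
    y: 1 - (box.y + box.height) / pageHeight,
    width: box.width / pageWidth,
    height: box.height / pageHeight,
  }
}

// A 100x50 pt box whose lower-left corner sits at (50, 700) on a 500x1000 pt page.
const normalized = normalizeBox({ x: 50, y: 700, width: 100, height: 50 }, 500, 1000)
console.log(normalized)
```

Working in normalized units means the same column-alignment thresholds can apply to an A4 PDF, a letter-size scan, or an OCR'd photo without per-format tuning.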

Development

npm install
npm test          # Run tests
npm run build     # Build to dist/
npm run test:watch # Watch mode

Web Companion

A browser-based viewer for testing document parsing:

cd web
npm install
npm run dev
# Open http://localhost:3000

License

MIT
