Document Processing

DokuBrain processes documents through a pipeline: upload → parse → classify → extract → embed → summarize. Every step after upload runs automatically.

Supported file types

Format	Extensions	Max size
PDF	`.pdf`	50 MB
Word	`.docx`	50 MB
HTML	`.html`, `.htm`	10 MB
Plain text	`.txt`	10 MB
Email	`.eml`	25 MB
Images	`.png`, `.jpg`, `.jpeg`, `.webp`	20 MB
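
Checking a file against the table above before uploading avoids sending a file the service would not accept. The following is a minimal sketch, not part of DokuBrain: `max_size_mb` and `check_upload` are hypothetical helper names, and the sketch assumes 1 MB means 1,048,576 bytes, which the table does not specify.

```shell
# max_size_mb FILE: print the documented size limit (in MB) for the file's
# extension, or return 1 if the extension is not in the supported list.
max_size_mb() {
  local ext
  ext="${1##*.}"
  ext="$(printf '%s' "$ext" | tr '[:upper:]' '[:lower:]')"
  case "$ext" in
    pdf|docx)          echo 50 ;;
    html|htm|txt)      echo 10 ;;
    eml)               echo 25 ;;
    png|jpg|jpeg|webp) echo 20 ;;
    *)                 return 1 ;;
  esac
}

# check_upload FILE: succeed only if FILE has a supported extension, exists,
# and is within the limit. Assumption: 1 MB = 1024 * 1024 bytes.
check_upload() {
  local file="$1" limit_mb size_bytes
  limit_mb="$(max_size_mb "$file")" || {
    echo "unsupported file type: $file" >&2
    return 1
  }
  [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }
  size_bytes=$(( $(wc -c < "$file") ))
  [ "$size_bytes" -le $((limit_mb * 1024 * 1024)) ] || {
    echo "$file exceeds the ${limit_mb} MB limit" >&2
    return 1
  }
}
```

Calling `check_upload contract.pdf` before the upload request lets a script fail early with a local error message.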

Uploading a document

Send the file as multipart form data to the ingestion endpoint:

Upload a document

curl -X POST https://api.dokubrain.com/api/v1/ingestion \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@contract.pdf"

Optionally, assign the document to a project at upload time:

Upload to a project

curl -X POST https://api.dokubrain.com/api/v1/ingestion \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@contract.pdf" \
  -F "project_id=proj_abc123"

Processing pipeline

When you upload a document, DokuBrain automatically:

Parses — extracts raw text and structure from the file
Classifies — identifies the document type (invoice, contract, receipt, etc.)
Extracts — pulls structured fields based on the detected type
Embeds — generates vector embeddings for semantic search
Summarizes — creates an AI-powered summary

You can check the processing status by fetching the document:

Check document status

curl https://api.dokubrain.com/api/v1/documents/doc_abc123 \
  -H "Authorization: Bearer YOUR_API_KEY"
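
A script that needs the finished result will usually poll this endpoint until processing ends. The sketch below shows one way to do that, under assumptions this page does not confirm: the response is JSON with a top-level `status` field, and its terminal values are `completed` and `failed`. Those names, the `DOKUBRAIN_API_KEY` environment variable, and both function names are hypothetical; check a real response before relying on them. The sketch requires `jq`.

```shell
# is_terminal_status STATUS: succeed when polling should stop.
# "completed" and "failed" are assumed values, not documented ones.
is_terminal_status() {
  case "$1" in
    completed|failed) return 0 ;;
    *)                return 1 ;;
  esac
}

# wait_for_document DOC_ID [MAX_ATTEMPTS] [DELAY_SECONDS]
# Polls the document endpoint, prints the final status, and succeeds only
# if that status is "completed". Gives up after MAX_ATTEMPTS polls.
wait_for_document() {
  local doc_id="$1" max_attempts="${2:-30}" delay="${3:-2}"
  local attempt=1 status
  while [ "$attempt" -le "$max_attempts" ]; do
    # Assumed response shape: { "status": "..." }
    status="$(curl -fsS "https://api.dokubrain.com/api/v1/documents/${doc_id}" \
      -H "Authorization: Bearer ${DOKUBRAIN_API_KEY}" | jq -r '.status // empty')"
    if is_terminal_status "$status"; then
      echo "$status"
      [ "$status" = "completed" ]
      return
    fi
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  echo "timed out waiting for ${doc_id}" >&2
  return 1
}
```

For example, `wait_for_document doc_abc123 60 5` would poll up to 60 times, five seconds apart.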

Document classification

DokuBrain automatically classifies each document into one of 16+ types, including:

Invoice, Receipt, Purchase Order
Contract, Agreement, NDA
Bank Statement, Financial Report
Tax Return, W-2, 1099
Employment Letter, Pay Stub
Insurance Policy, Claim
Government ID, Passport
Utility Bill, Proof of Address

Field extraction

Extract structured fields using built-in or custom templates:

Extract with a built-in template

curl -X POST https://api.dokubrain.com/api/v1/documents/doc_abc123/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "templateId": "invoice" }'

Built-in extraction templates

Template	Fields extracted
`invoice`	Vendor, invoice number, date, amounts, line items, tax
`contract`	Parties, effective date, termination date, key clauses
`receipt`	Merchant, date, total, payment method, items
`bank_statement`	Account holder, period, opening/closing balance, transactions
`tax_return`	Filer, tax year, income, deductions, tax owed
`employment_letter`	Employee, employer, position, salary, start date

You can also create custom templates for your specific document types.
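
Only six of the document types listed under Document classification have a built-in template with a matching name. The sketch below makes that correspondence explicit so a script can decide whether a built-in template applies or a custom one is needed. The function name `template_for_type` is hypothetical, and the sketch assumes the type is available as the human-readable label shown in the classification list; the API may return it in a different form.

```shell
# template_for_type LABEL: print the built-in template ID whose name matches
# the classification label, or return 1 if no built-in template matches.
# Assumption: labels are spelled as in the classification list on this page.
template_for_type() {
  case "$1" in
    "Invoice")           echo invoice ;;
    "Contract")          echo contract ;;
    "Receipt")           echo receipt ;;
    "Bank Statement")    echo bank_statement ;;
    "Tax Return")        echo tax_return ;;
    "Employment Letter") echo employment_letter ;;
    *)                   return 1 ;;
  esac
}

# Usage: build the request body for the extract endpoint shown above.
#   template_id="$(template_for_type "Invoice")" &&
#     body="{ \"templateId\": \"${template_id}\" }"
```

Types without a match here, such as Passport or Utility Bill, would need a custom template.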

Document comparison

Compare two documents to find differences at the chunk level:

Compare two documents

curl -X POST https://api.dokubrain.com/api/v1/documents/compare \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "sourceDocumentId": "doc_abc123",
    "targetDocumentId": "doc_def456"
  }'

Summarization

Generate summaries in multiple formats:

Summarize a document

curl -X POST https://api.dokubrain.com/api/v1/documents/doc_abc123/summarize \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "format": "bullet_points",
    "maxLength": 500
  }'

Supported formats: `paragraph`, `bullet_points`, `executive_summary`, `key_findings`.
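
A script that takes the format from user input can validate it locally before calling the endpoint. This is a minimal sketch; `summarize_body` is a hypothetical helper, and because this page does not state the unit or allowed range of `maxLength`, the sketch only checks that it is a plain non-negative integer.

```shell
# summarize_body FORMAT MAX_LENGTH: print a JSON request body for the
# summarize endpoint, or return 1 if FORMAT is not one of the four listed
# formats or MAX_LENGTH is not a non-negative integer without leading zeros.
summarize_body() {
  local format="$1" max_length="$2"
  case "$format" in
    paragraph|bullet_points|executive_summary|key_findings) ;;
    *) echo "unsupported summary format: $format" >&2; return 1 ;;
  esac
  case "$max_length" in
    ''|*[!0-9]*|0?*) echo "maxLength must be a non-negative integer" >&2; return 1 ;;
  esac
  printf '{"format":"%s","maxLength":%s}\n' "$format" "$max_length"
}

# Usage with the request shown above:
#   curl -X POST https://api.dokubrain.com/api/v1/documents/doc_abc123/summarize \
#     -H "Authorization: Bearer YOUR_API_KEY" \
#     -H "Content-Type: application/json" \
#     -d "$(summarize_body bullet_points 500)"
```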
