Document Processing

Upload, classify, extract, and summarize documents with DokuBrain.

DokuBrain processes documents through a pipeline: upload → parse → classify → extract → embed → summarize. Every step after the upload runs automatically.

Supported file types

| Format | Extensions | Max size |
| --- | --- | --- |
| PDF | .pdf | 50 MB |
| Word | .docx | 50 MB |
| HTML | .html, .htm | 10 MB |
| Plain text | .txt | 10 MB |
| Email | .eml | 25 MB |
| Images | .png, .jpg, .jpeg, .webp | 20 MB |
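
The limits in the table above can be checked locally before a file is sent. The following is a minimal sketch, not part of DokuBrain: the function names `max_upload_mb` and `can_upload` are hypothetical, matching extensions case-insensitively is an assumption, and so is treating 1 MB as 1,048,576 bytes.

```shell
# Hypothetical helper: print the size limit (in MB) for a file's extension,
# using the values from the supported file types table.
# Returns 1 if the extension is not listed.
max_upload_mb() (
  name=${1##*/}                       # strip any directory part
  # ASSUMPTION: extensions are matched case-insensitively.
  ext=$(printf '%s' "${name##*.}" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    pdf|docx)          echo 50 ;;
    html|htm|txt)      echo 10 ;;
    eml)               echo 25 ;;
    png|jpg|jpeg|webp) echo 20 ;;
    *)                 return 1 ;;
  esac
)

# Hypothetical helper: succeed only if the file type is supported
# and the file is within the size limit for that type.
can_upload() (
  file=$1
  limit_mb=$(max_upload_mb "$file") || {
    echo "unsupported file type: ${file}" >&2
    return 1
  }
  [ -f "$file" ] || { echo "not a file: ${file}" >&2; return 1; }
  size=$(( $(wc -c < "$file") ))      # size in bytes
  # ASSUMPTION: 1 MB = 1,048,576 bytes; the API may count 1,000,000.
  if [ "$size" -gt $((limit_mb * 1024 * 1024)) ]; then
    echo "${file} exceeds the ${limit_mb} MB limit" >&2
    return 1
  fi
)
```

Running `can_upload contract.pdf && echo ok` before an upload avoids sending a request that the size or type limits would likely reject.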

Uploading a document

Upload a document via the ingestion endpoint:

Upload a document
curl -X POST https://api.dokubrain.com/api/v1/ingestion \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@contract.pdf"

Optionally assign to a project at upload time:

Upload to a project
curl -X POST https://api.dokubrain.com/api/v1/ingestion \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@contract.pdf" \
  -F "project_id=proj_abc123"
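
The two upload calls above can be wrapped in one shell function for scripting. This is a sketch under assumptions: the function name `dokubrain_upload` and the `DOKUBRAIN_API_KEY` environment variable are hypothetical conveniences, not part of the API. Only the endpoint and the `file` and `project_id` form fields come from the examples above.

```shell
# Hypothetical helper: dokubrain_upload FILE [PROJECT_ID]
# Reads the API key from DOKUBRAIN_API_KEY (an assumed variable name).
dokubrain_upload() (
  file=${1:-}
  project_id=${2:-}
  if [ -z "$file" ] || [ -z "${DOKUBRAIN_API_KEY:-}" ]; then
    echo "usage: DOKUBRAIN_API_KEY=... dokubrain_upload FILE [PROJECT_ID]" >&2
    return 1
  fi
  # Build the curl argument list; --fail makes curl exit non-zero on HTTP errors.
  set -- --fail --silent --show-error \
    -X POST "https://api.dokubrain.com/api/v1/ingestion" \
    -H "Authorization: Bearer ${DOKUBRAIN_API_KEY}" \
    -F "file=@${file}"
  # Attach the project only when one was given.
  if [ -n "$project_id" ]; then
    set -- "$@" -F "project_id=${project_id}"
  fi
  curl "$@"
)
```

With the key exported, `dokubrain_upload contract.pdf proj_abc123` sends the same request as the second example, and omitting the second argument sends the first.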

Processing pipeline

When you upload a document, DokuBrain automatically:

  1. Parses — extracts raw text and structure from the file
  2. Classifies — identifies the document type (invoice, contract, receipt, etc.)
  3. Extracts — pulls structured fields based on the detected type
  4. Embeds — generates vector embeddings for semantic search
  5. Summarizes — creates an AI-powered summary

You can check the processing status by fetching the document:

Check document status
curl https://api.dokubrain.com/api/v1/documents/doc_abc123 \
  -H "Authorization: Bearer YOUR_API_KEY"
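
Because processing is asynchronous, a script usually has to poll this endpoint until the pipeline finishes. The sketch below shows one way to do that. It rests on assumptions the guide does not confirm: that the response is JSON with a top-level `status` string, and that the values `completed` and `failed` mark the end states. Check a real response and adjust the names. The function name `wait_for_document` and the `DOKUBRAIN_API_KEY` variable are also hypothetical.

```shell
# Hypothetical helper: wait_for_document DOC_ID [MAX_ATTEMPTS] [DELAY_SECONDS]
# Exit codes: 0 = finished, 1 = request or processing failure, 2 = timed out.
wait_for_document() (
  doc_id=$1
  max_attempts=${2:-30}
  delay=${3:-2}
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    body=$(curl --fail --silent --show-error \
      "https://api.dokubrain.com/api/v1/documents/${doc_id}" \
      -H "Authorization: Bearer ${DOKUBRAIN_API_KEY:-}") || return 1
    # ASSUMPTION: the response carries a "status" string field.
    # sed keeps this dependency-free; a JSON parser such as jq is more robust.
    status=$(printf '%s\n' "$body" |
      sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' |
      head -n 1)
    case "$status" in
      completed)                      # ASSUMED terminal value
        printf '%s\n' "$body"
        return 0 ;;
      failed)                         # ASSUMED terminal value
        echo "processing failed for ${doc_id}" >&2
        return 1 ;;
    esac
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  echo "timed out waiting for ${doc_id}" >&2
  return 2
)
```

A fixed delay keeps the sketch short. For large batches, a growing delay between attempts would put less load on the API.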

Document classification

DokuBrain automatically classifies each document as one of 16+ types, including:

  • Invoice, Receipt, Purchase Order
  • Contract, Agreement, NDA
  • Bank Statement, Financial Report
  • Tax Return, W-2, 1099
  • Employment Letter, Pay Stub
  • Insurance Policy, Claim
  • Government ID, Passport
  • Utility Bill, Proof of Address

Field extraction

Extract structured fields using built-in or custom templates:

Extract with a built-in template
curl -X POST https://api.dokubrain.com/api/v1/documents/doc_abc123/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "templateId": "invoice" }'

Built-in extraction templates

| Template | Fields extracted |
| --- | --- |
| invoice | Vendor, invoice number, date, amounts, line items, tax |
| contract | Parties, effective date, termination date, key clauses |
| receipt | Merchant, date, total, payment method, items |
| bank_statement | Account holder, period, opening/closing balance, transactions |
| tax_return | Filer, tax year, income, deductions, tax owed |
| employment_letter | Employee, employer, position, salary, start date |

You can also create custom templates for your specific document types.

Document comparison

Compare two documents to find differences at the chunk level:

Compare two documents
curl -X POST https://api.dokubrain.com/api/v1/documents/compare \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "sourceDocumentId": "doc_abc123",
    "targetDocumentId": "doc_def456"
  }'

Summarization

Generate summaries in multiple formats:

Summarize a document
curl -X POST https://api.dokubrain.com/api/v1/documents/doc_abc123/summarize \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "format": "bullet_points",
    "maxLength": 500
  }'

Supported formats: paragraph, bullet_points, executive_summary, key_findings.
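
When the format comes from user input or configuration, it can be validated locally before the request is sent. The following sketch builds the JSON body for the summarize call. The function name `summarize_body` is hypothetical, and the fallback of 500 for `maxLength` is copied from the example above rather than from a documented API default.

```shell
# Hypothetical helper: summarize_body FORMAT [MAX_LENGTH]
# Prints a JSON request body, or returns 1 on invalid input.
summarize_body() (
  format=${1:-}
  max_length=${2:-500}   # ASSUMPTION: 500 mirrors the example, not an API default
  # Accept only the four formats listed in this guide.
  case "$format" in
    paragraph|bullet_points|executive_summary|key_findings) ;;
    *) echo "unsupported format: ${format}" >&2; return 1 ;;
  esac
  # Require a positive integer without leading zeros so the JSON stays valid.
  case "$max_length" in
    ''|*[!0-9]*|0*)
      echo "maxLength must be a positive integer" >&2
      return 1 ;;
  esac
  printf '{ "format": "%s", "maxLength": %s }\n' "$format" "$max_length"
)

# Usage: pass the generated body to the request shown above.
#   curl -X POST https://api.dokubrain.com/api/v1/documents/doc_abc123/summarize \
#     -H "Authorization: Bearer YOUR_API_KEY" \
#     -H "Content-Type: application/json" \
#     -d "$(summarize_body key_findings 300)"
```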
