Document Processing
Upload, classify, extract, and summarize documents with DokuBrain.
DokuBrain processes each document through a pipeline: upload → parse → classify → extract → embed → summarize. Every step after the upload runs automatically.
Supported file types
| Format | Extensions | Max size |
|---|---|---|
| PDF | .pdf | 50 MB |
| Word | .docx | 50 MB |
| HTML | .html, .htm | 10 MB |
| Plain text | .txt | 10 MB |
| Email | .eml | 25 MB |
| Images | .png, .jpg, .jpeg, .webp | 20 MB |
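These limits can be checked on the client before an upload is attempted. The following is a minimal sketch: the helper name `check_upload` is hypothetical, and it assumes 1 MB means 1,048,576 bytes, which this page does not specify.

```python
from pathlib import Path

MB = 1024 * 1024  # assumption: the docs do not say whether 1 MB is 10^6 or 2^20 bytes

# Extension -> maximum size in bytes, taken from the table above.
MAX_BYTES = {
    ".pdf": 50 * MB,
    ".docx": 50 * MB,
    ".html": 10 * MB,
    ".htm": 10 * MB,
    ".txt": 10 * MB,
    ".eml": 25 * MB,
    ".png": 20 * MB,
    ".jpg": 20 * MB,
    ".jpeg": 20 * MB,
    ".webp": 20 * MB,
}


def check_upload(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the file type is unsupported or the file is too large."""
    ext = Path(filename).suffix.lower()
    limit = MAX_BYTES.get(ext)
    if limit is None:
        raise ValueError(f"unsupported file type: {ext or filename!r}")
    if size_bytes > limit:
        raise ValueError(f"{filename} is {size_bytes} bytes; the limit for {ext} is {limit}")
```

Calling `check_upload("scan.jpeg", 5 * MB)` returns silently, while an unknown extension or an oversized file raises `ValueError` before any request is made.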
Uploading a document
Upload a document via the ingestion endpoint:
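A minimal sketch of such a request, using only the Python standard library. The host, the `/v1/documents` path, the bearer-token header, and the `file` field name are all assumptions made for illustration; check the API reference for the real values.

```python
import mimetypes
import urllib.request
import uuid

BASE_URL = "https://api.dokubrain.example"  # placeholder host, not the real one
UPLOAD_PATH = "/v1/documents"               # hypothetical ingestion path


def build_upload_request(filename: str, content: bytes, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST carrying one file."""
    boundary = uuid.uuid4().hex
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    # Simplification: the filename is assumed to contain no quotes or newlines.
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        f"Content-Type: {content_type}\r\n\r\n".encode(),
        content,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        BASE_URL + UPLOAD_PATH,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; building the request separately keeps the sketch easy to inspect without network access.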
Optionally, assign the document to a project at upload time:
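One plausible way to do this is an extra form field sent alongside the file. The sketch below encodes such a field as a multipart part; the field name `project_id` and the ID value are assumptions, not confirmed parameter names.

```python
def form_field_part(boundary: str, name: str, value: str) -> bytes:
    """Encode one plain-text multipart/form-data part."""
    return (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
        f"{value}\r\n"
    ).encode("utf-8")


# Hypothetical field name and project ID, for illustration only.
part = form_field_part("boundary123", "project_id", "proj_abc")
```

In a complete request, this part would sit in the same multipart body as the file part and share its boundary string.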
Processing pipeline
When you upload a document, DokuBrain automatically:
- Parses — extracts raw text and structure from the file
- Classifies — identifies the document type (invoice, contract, receipt, etc.)
- Extracts — pulls structured fields based on the detected type
- Embeds — generates vector embeddings for semantic search
- Summarizes — creates an AI-powered summary
You can check the processing status by fetching the document:
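Because processing is asynchronous, a client typically polls until the document is ready. The following is a sketch of that loop under stated assumptions: the `status` key and its values (`completed`, `failed`) are hypothetical, and `fetch_document` stands in for whatever call retrieves the document.

```python
import time
from typing import Callable


def wait_until_processed(
    fetch_document: Callable[[str], dict],
    document_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 120.0,
) -> dict:
    """Poll until the document reaches a terminal state or the timeout passes.

    `fetch_document` is any callable that returns the document as a dict,
    for example a thin wrapper around a GET request. The "status" key and
    the values checked below are assumptions made for this sketch.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        doc = fetch_document(document_id)
        if doc.get("status") in ("completed", "failed"):
            return doc
        if time.monotonic() >= deadline:
            raise TimeoutError(f"document {document_id} still processing after {timeout_s}s")
        time.sleep(interval_s)
```

Injecting the fetch function keeps the loop independent of any particular HTTP client and makes it easy to exercise with a stub.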
Document classification
DokuBrain automatically classifies documents into 16+ types:
- Invoice, Receipt, Purchase Order
- Contract, Agreement, NDA
- Bank Statement, Financial Report
- Tax Return, W-2, 1099
- Employment Letter, Pay Stub
- Insurance Policy, Claim
- Government ID, Passport
- Utility Bill, Proof of Address
Field extraction
Extract structured fields using built-in or custom templates:
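As an illustration, an extraction request might name the template in a JSON body. This is a sketch only: the host, the `/extract` path, the `template` key, and the bearer-token header are assumptions rather than documented API details.

```python
import json
import urllib.request

BASE_URL = "https://api.dokubrain.example"  # placeholder host, not the real one


def build_extract_request(document_id: str, template: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST asking for extraction with one template."""
    payload = json.dumps({"template": template}).encode("utf-8")  # hypothetical body shape
    return urllib.request.Request(
        f"{BASE_URL}/v1/documents/{document_id}/extract",  # hypothetical path
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )
```

For example, `build_extract_request("doc_123", "invoice", api_key)` would target the built-in invoice template.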
Built-in extraction templates
| Template | Fields extracted |
|---|---|
| invoice | Vendor, invoice number, date, amounts, line items, tax |
| contract | Parties, effective date, termination date, key clauses |
| receipt | Merchant, date, total, payment method, items |
| bank_statement | Account holder, period, opening/closing balance, transactions |
| tax_return | Filer, tax year, income, deductions, tax owed |
| employment_letter | Employee, employer, position, salary, start date |
You can also create custom templates for your specific document types.
Document comparison
Compare two documents to find differences at the chunk level:
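To make "chunk level" concrete, the sketch below diffs two lists of text chunks locally with Python's `difflib`. It only illustrates the idea; how DokuBrain chunks documents and computes differences is not described on this page, and the sample chunks are invented.

```python
import difflib


def diff_chunks(old: list[str], new: list[str]) -> list[tuple[str, list[str], list[str]]]:
    """Return (tag, old_chunks, new_chunks) for every region that changed."""
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    return [
        (tag, old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only replaced, deleted, or inserted chunks
    ]


# Invented sample chunks from two versions of a contract.
old = ["Term: 12 months", "Fee: $100", "Governing law: NY"]
new = ["Term: 24 months", "Fee: $100", "Governing law: NY", "Auto-renewal applies"]
changes = diff_chunks(old, new)
# changes == [
#     ("replace", ["Term: 12 months"], ["Term: 24 months"]),
#     ("insert", [], ["Auto-renewal applies"]),
# ]
```

Unchanged chunks are skipped, so the result lists only what was replaced, removed, or added between the two versions.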
Summarization
Generate summaries in multiple formats:
Supported formats: paragraph, bullet_points, executive_summary, key_findings.
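A client can validate the requested format before sending anything. In this sketch the four format names come from the list above, while the `format` key in the request body is an assumption.

```python
import json

# The four formats listed on this page.
SUMMARY_FORMATS = ("paragraph", "bullet_points", "executive_summary", "key_findings")


def summary_payload(summary_format: str = "paragraph") -> bytes:
    """Return a JSON request body asking for a summary in the given format."""
    if summary_format not in SUMMARY_FORMATS:
        raise ValueError(
            f"unsupported format {summary_format!r}; expected one of {SUMMARY_FORMATS}"
        )
    # Hypothetical body shape: the real parameter name may differ.
    return json.dumps({"format": summary_format}).encode("utf-8")
```

Rejecting unknown formats locally gives a clearer error than waiting for the server to refuse the request.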