Initial Situation and Context
PDF is a presentation format. Hierarchies, table references, and context are lost unless preprocessing preserves the structure. Reliable extraction therefore requires a preliminary stage that systematically converts layout information into structured data.
Parsing into a structured target format (e.g. Markdown with tables) is the necessary first stage; only upon this foundation can fields be extracted reliably, reproducibly, and with measurable quality.
Table of Contents
- Parsing first: Strategy, technology, risks
- Vision models: Scans, complex layouts, diagrams
- Structural information: Tables as Markdown
- Pros & Cons: Pros/Cons per tool
- Decision guide: Use case matrix, comparison table
- Conclusion: Recommendations by use case
1. Parsing first
Copy & paste typically leads to a loss of structure. Robust parsing into Markdown preserves headings, lists, and tables, representing the document as a data structure – the foundation for reliable extraction.
Convert PDFs into Markdown and only then extract targeted fields (e.g. invoice_number, total_amount, iban). This minimises error rates because structure and context are preserved.
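The second stage can be sketched as follows. This is a minimal illustration, assuming the parser has already produced Markdown; the sample document, the regular expressions, and the helper `extract_fields` are hypothetical and would need tuning to real document wording.

```python
import re
from typing import Optional

# Hypothetical patterns for the three example fields; not taken from any tool.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s+(?:No\.?|Number)\s*[:#]?\s*([A-Z0-9-]+)", re.I
    ),
    # Relies on the total sitting in a Markdown table row labelled "Total".
    "total_amount": re.compile(r"\|\s*Total\s*\|\s*([\d.,]+)\s*\|", re.I),
    # Country code, two check digits, then groups of up to four characters.
    "iban": re.compile(
        r"\b([A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?)\b"
    ),
}

# Made-up sample of what a parser might emit for a simple invoice.
SAMPLE_MARKDOWN = """# Invoice

Invoice Number: INV-2024-001

| Item | Amount |
|---|---|
| Consulting | 1,200.00 |
| Total | 1,200.00 |

IBAN: DE89 3704 0044 0532 0130 00
"""


def extract_fields(markdown: str) -> dict[str, Optional[str]]:
    """Return the first match per field, or None when a field is absent."""
    fields: dict[str, Optional[str]] = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(markdown)
        if match is None:
            fields[name] = None
        elif name == "iban":
            # Normalise the IBAN by dropping the grouping spaces.
            fields[name] = match.group(1).replace(" ", "")
        else:
            fields[name] = match.group(1).strip()
    return fields


print(extract_fields(SAMPLE_MARKDOWN))
```

Because the total is anchored to a table row rather than to free text, the pattern only works when the parser has preserved the table, which is the point of parsing first.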
2. Vision Models for Scans and Complex Layouts
With scans, handwritten notes, and multi-column layouts, text-based parsers reach their limits. Vision models take spatial structure into account (columns, image-text relationships, diagrams) and increase robustness.
| Category | Text-based Parsers | Vision Models |
|---|---|---|
| Multi-column Reports | unreliable | robust |
| Scans / OCR | limited | required |
| Diagrams/Graphics | ignored | context-aware |
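One way to act on this comparison is to route each page to a text-based parser or a vision model. The sketch below is an assumption, not a method described in this document: the `PageInfo` fields, the `choose_parser` helper, and the 50-character threshold are all hypothetical, and the page statistics would have to come from a prior inspection step.

```python
from dataclasses import dataclass


@dataclass
class PageInfo:
    """Hypothetical per-page statistics gathered before parsing."""
    extracted_chars: int  # characters found in the embedded text layer
    column_count: int     # estimated number of text columns
    has_figures: bool     # whether diagrams or images were detected


def choose_parser(page: PageInfo, min_chars: int = 50) -> str:
    """Return "vision" or "text" following the comparison table above."""
    if page.extracted_chars < min_chars:
        # Little or no text layer: probably a scan, so OCR/vision is required.
        return "vision"
    if page.column_count > 1 or page.has_figures:
        # Multi-column layouts and diagrams are where text parsers struggle.
        return "vision"
    return "text"


print(choose_parser(PageInfo(extracted_chars=0, column_count=1, has_figures=False)))
print(choose_parser(PageInfo(extracted_chars=2400, column_count=1, has_figures=False)))
```

Routing per page rather than per document keeps simple pages on the cheaper text path; whether that trade-off pays off depends on the document mix.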
3. Structural Information is Decisive
Numbers are difficult to evaluate without row/column references. Modern parsers convert tables directly into Markdown – machine-readable, versionable, and unambiguous for subsequent steps.
Tables as Markdown enable precise, repeatable analyses, from financial reports to academic studies. Automated trend analyses become reliable only once this row/column structure is preserved.
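To illustrate why row/column references matter, the following sketch turns a Markdown table into a list of records keyed by column header. The table values are invented for the example, and `parse_markdown_table` is a hypothetical helper that does not handle escaped pipes or multi-line cells.

```python
def parse_markdown_table(markdown: str) -> list[dict[str, str]]:
    """Parse the first pipe table in `markdown` into one dict per data row."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the separator row (cells made only of dashes and colons).
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


# Invented figures, for illustration only.
TABLE = """
| Quarter | Revenue | Costs |
|---|---|---|
| Q1 | 120 | 80 |
| Q2 | 150 | 90 |
"""

records = parse_markdown_table(TABLE)
revenue_change = int(records[1]["Revenue"]) - int(records[0]["Revenue"])
print(records)
print(revenue_change)  # 30
```

Each number now carries its row ("Q2") and column ("Revenue") with it, so a downstream trend calculation no longer has to guess which value belongs where.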
Feature Comparison
Table 1: General Features and Deployment
| Tool | Primary Use Case | Deployment Model | Pricing | Known Integrations |
|---|---|---|---|---|
| LlamaParse | Parsing complex PDFs for RAG pipelines | Cloud API | Tiered ($3–$45 / 1,000 pages) | LlamaIndex, n8n, OpenAI |
| Unstructured.io | Document parsing for LLM applications | Open Source & Cloud API | Tiered (Advanced: $20 / 1,000 p.) | LangChain |
| Vectorize.io | RAG-as-a-Service platform | Cloud platform | Cost-effective ($0–$15 / 1,000 p. in pipeline) | Google Drive, S3 |
| Docling | Local, privacy-compliant document parsing | Open Source (local) | Free (Open Source) | LangChain, LlamaIndex |
| MarkItDown | Fast Office→Markdown conversion | Open Source (local) | Free (Open Source) | CLI, Python API |
| Stirling PDF | Comprehensive PDF editing (self-hosted) | Open Source (local) | Free (Open Source) | Docker |
| Unstract | No-code automation of document workflows | Open Source & Cloud platform | Trial phase, then hosted | Various LLMs & Vector DBs |
Table 2: Technical Extraction Capabilities
| Tool | Output Formats | Table & Diagram Recognition | Handling of Scans/OCR | Multilingualism |
|---|---|---|---|---|
| LlamaParse | Markdown, Text, JSON | Excellent (Diagrams → Tables possible) | Yes (OCR option) | Arabic (in test): Fair |
| Unstructured.io | Markdown etc. | Moderate (Layout often lost) | Yes (higher pricing tiers) | Arabic (in test): Poor |
| Vectorize.io | Markdown etc. | Excellent (Vision model "Iris") | Excellent (including skewed scans) | Very good (50+ languages); Arabic (in test): Good |
| Docling | Markdown, JSON (Docling object) | Excellent (TableFormer) | Very good (Layout Analysis) | Unknown |
| MarkItDown | Markdown | Converts Excel tables cleanly | Not the focus | Unknown |
| Stirling PDF | PDF, Text (OCR) | No (no layout extraction) | Yes (OCR layer) | Multilingual OCR |
| Unstract | Text, structured JSON | Dependent on extractor | Yes (e.g. via LLMWhisperer) | Dependent on LLM |
Stress Test: Performance Comparison in Five Disciplines
Discipline 1: Multi-column Layouts
- Unstructured: Excellent — correct separation and reading order
- Vectorize: Good — robust results
- LlamaParse: Fair — columns mixed up, risk of unusable RAG data
Discipline 2: Complex Layouts with Images
- Vectorize: Excellent — clean segmentation into Markdown
- LlamaParse: Good — solid separation
- Unstructured: Poor — contents mixed up
Discipline 3: Scanned and Skewed Documents
- Vectorize: Excellent — very robust OCR/normalisation
- LlamaParse: Good — minor recognition errors (e.g. date)
- Unstructured: Poor — no usable output
Discipline 4: Financial Reports with Many Tables
- LlamaParse: Excellent — very good table structure
- Vectorize: Excellent — clean, machine-readable tables
- Unstructured: Fair — text without structural reference
Discipline 5: Non-English Documents (Arabic)
- Vectorize: Good — correct words and reading direction (RTL)
- LlamaParse: Fair — words ok, reading direction inverted
- Unstructured: Poor — insufficient results
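The five disciplines can be condensed into a rough per-tool average. The ratings below are copied from the stress test above; the numeric scale (Poor = 1 to Excellent = 4) is an assumption made for this sketch, and averaging ordinal ratings is only a coarse summary, not a benchmark result.

```python
from statistics import mean

# Assumed numeric scale; the stress test itself only gives verbal ratings.
SCALE = {"Poor": 1, "Fair": 2, "Good": 3, "Excellent": 4}

# Ratings as listed in the five disciplines.
RATINGS = {
    "Multi-column layouts": {
        "Unstructured": "Excellent", "Vectorize": "Good", "LlamaParse": "Fair"},
    "Complex layouts with images": {
        "Vectorize": "Excellent", "LlamaParse": "Good", "Unstructured": "Poor"},
    "Scanned and skewed documents": {
        "Vectorize": "Excellent", "LlamaParse": "Good", "Unstructured": "Poor"},
    "Financial reports with many tables": {
        "LlamaParse": "Excellent", "Vectorize": "Excellent", "Unstructured": "Fair"},
    "Non-English documents (Arabic)": {
        "Vectorize": "Good", "LlamaParse": "Fair", "Unstructured": "Poor"},
}


def average_scores(ratings: dict, scale: dict) -> dict[str, float]:
    """Average the numeric rating of each tool across all disciplines."""
    per_tool: dict[str, list[int]] = {}
    for by_tool in ratings.values():
        for tool, rating in by_tool.items():
            per_tool.setdefault(tool, []).append(scale[rating])
    return {tool: round(mean(scores), 2) for tool, scores in per_tool.items()}


scores = average_scores(RATINGS, SCALE)
for tool, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{tool}: {score}")
# Vectorize: 3.6
# LlamaParse: 2.8
# Unstructured: 1.8
```

Under this assumed scale the averages hide the outliers: Unstructured leads on multi-column layouts, and LlamaParse ties for first on table-heavy reports, so the per-discipline view remains the better basis for a decision.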
Pros and Cons at a Glance
| Tool | Main Advantages | Main Disadvantages | Ideal Scenario |
|---|---|---|---|
| LlamaParse | Very good tables/diagrams; API integration | Premium pricing; weaker with multi-columns; moderate with RTL | Table-heavy sources (e.g. financial reports) in RAG |
| Unstructured.io | Open Source; good for simple texts; LangChain ecosystem | Weak with complex layouts/scans; limited multilingualism | Simple, text-based PDFs; open-source flexibility |
| Vectorize.io | Consistently strong; vision model; good for scans/multilingualism; cost-effective | Only as part of the platform; no standalone API | Demanding end-to-end RAG pipelines |
| Docling | Local; privacy-friendly; excellent structure/tables; no GPU needed | Requires Python and infrastructure; no no-code UI | Customisable, sovereign parsing pipelines on-prem |
| MarkItDown | Very fast; lightweight; many Office formats | No advanced layout parsing; no OCR | Fast Markdown conversion of standardised documents |
| Stirling PDF | Broad feature set; self-hostable; free | No specialised RAG parsing engine | General PDF tasks with data sovereignty |
| Unstract | No-code; orchestration; LLMChallenge | Quality dependent on backend extractor | Teams with little coding capacity building ETL/workflows |
Decision Guide by Use Case
The selection depends on the underlying conditions (data situation, data privacy, precision, operations):
| Use Case | Recommendation | Data Sovereignty | Effort | Precision |
|---|---|---|---|---|
| Ad-hoc PDF tasks (on-prem) | Stirling PDF | high | low | medium |
| In-house parsing pipeline | Docling | high | medium | high |
| API for complex documents | Vectorize Iris | medium | low | very high |
| No-code workflows (business) | Unstract | medium | low | high |
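The matrix above can also be queried programmatically. The sketch below encodes the four rows as data and filters them by constraints; the `shortlist` helper and the ordering of the levels (low < medium < high < very high) are assumptions made for illustration.

```python
# Rows copied from the decision guide above.
DECISION_GUIDE = [
    {"use_case": "Ad-hoc PDF tasks (on-prem)", "tool": "Stirling PDF",
     "sovereignty": "high", "effort": "low", "precision": "medium"},
    {"use_case": "In-house parsing pipeline", "tool": "Docling",
     "sovereignty": "high", "effort": "medium", "precision": "high"},
    {"use_case": "API for complex documents", "tool": "Vectorize Iris",
     "sovereignty": "medium", "effort": "low", "precision": "very high"},
    {"use_case": "No-code workflows (business)", "tool": "Unstract",
     "sovereignty": "medium", "effort": "low", "precision": "high"},
]

# Assumed ordering of the verbal levels used in the table.
LEVELS = {"low": 1, "medium": 2, "high": 3, "very high": 4}


def shortlist(min_sovereignty: str = "low",
              max_effort: str = "high",
              min_precision: str = "low") -> list[str]:
    """Return the tools whose row satisfies all three constraints."""
    return [
        row["tool"]
        for row in DECISION_GUIDE
        if LEVELS[row["sovereignty"]] >= LEVELS[min_sovereignty]
        and LEVELS[row["effort"]] <= LEVELS[max_effort]
        and LEVELS[row["precision"]] >= LEVELS[min_precision]
    ]


print(shortlist(min_sovereignty="high"))
# ['Stirling PDF', 'Docling']
print(shortlist(max_effort="low", min_precision="high"))
# ['Vectorize Iris', 'Unstract']
```

Keeping the matrix as data makes it easy to add rows or criteria as requirements change, without rewriting the selection logic.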
Decision Tree: Pipeline at a Glance
Parsing-first pipeline: robust, reproducible, scalable
Conclusion
- Complex RAG pipelines: Vectorize (consistently strong); Alternative: LlamaParse (for tables)
- Open Source for developers: Docling (local, structure-preserving, highly integrable)
- Fast Markdown conversion: MarkItDown
- Swiss Army knife on-prem: Stirling PDF
- No-code workflows/ETL: Unstract
The quality of parsing determines the success rate of the pipeline. Decide between managed services (convenience/performance) and open-source sovereignty (control/effort) in accordance with your technical and regulatory requirements.