Advanced PDF to JSON Conversion: Techniques and Automation
Once you've mastered basic PDF to JSON conversion, it's time to learn advanced techniques for complex cases and large-scale automation.
Complex PDFs: Advanced Strategies
Nested and Hierarchical Tables
Some PDFs contain tables within tables or hierarchical structures:
{
  "table": {
    "headers": ["Category", "Items"],
    "rows": [
      {
        "category": "Q1 Sales",
        "subtable": {
          "items": [
            {"product": "A", "amount": 1000},
            {"product": "B", "amount": 2000}
          ]
        }
      }
    ]
  }
}
Solution: Extract recursively, so that a cell which itself contains a table is parsed as a nested subtable instead of being flattened into plain text.
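Once the nested shape above has been extracted, downstream code often needs flat records again. The following is a minimal sketch of a recursive walk, assuming the `subtable` and `items` keys from the example; the function name `flatten_rows` is hypothetical and not part of any converter API.

```python
from typing import Any, Dict, Iterator, List, Optional


def flatten_rows(
    rows: List[Dict[str, Any]],
    parent: Optional[Dict[str, Any]] = None,
) -> Iterator[Dict[str, Any]]:
    """Yield one flat record per leaf row, carrying parent fields down."""
    parent = parent or {}
    for row in rows:
        # Fields of this row, excluding the nested table itself
        fields = {**parent, **{k: v for k, v in row.items() if k != "subtable"}}
        subtable = row.get("subtable")
        if isinstance(subtable, dict) and subtable.get("items"):
            # Recurse: items of a subtable may contain further subtables
            yield from flatten_rows(subtable["items"], fields)
        else:
            yield fields


doc = {
    "table": {
        "headers": ["Category", "Items"],
        "rows": [
            {
                "category": "Q1 Sales",
                "subtable": {
                    "items": [
                        {"product": "A", "amount": 1000},
                        {"product": "B", "amount": 2000},
                    ]
                },
            }
        ],
    }
}

flat = list(flatten_rows(doc["table"]["rows"]))
```

Each leaf row inherits the fields of its ancestors, so `flat` holds one record per product with the `category` attached.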
OCR and Scanned Text
Scanned PDFs (images) require OCR:
- Quality: Scan at 300 DPI or higher for better recognition
- Language: Specify expected languages
- Post-processing: Fix common OCR errors
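// Fix common OCR errors
const fixOCR = (text) => {
  return text
    // Only touch tokens that already contain a digit, so words like "Oil" are left alone
    .replace(/\b[\dOIl]*\d[\dOIl]*\b/g, (token) =>
      token.replace(/O/g, '0')      // Letter O misread for zero
           .replace(/[Il]/g, '1'))  // Letters I and l misread for one
    .replace(/\s+/g, ' ');          // Collapse runs of whitespace
};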
Regular Expressions for Extraction
Extract specific fields using patterns:
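// Match email addresses in the extracted text
const emails = /[\w.-]+@[\w.-]+\.\w+/g;
// Match phone numbers (loose: any run of 10+ digits, spaces, dashes, or parentheses)
const phones = /\+?[\d\s\-()]{10,}/g;
// Match dates (DD/MM/YYYY); the groups capture day, month, and year
const dates = /(\d{2})\/(\d{2})\/(\d{4})/g;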
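To show how such patterns are applied to extracted text, here is a small sketch using Python's `re` module with equivalents of the email and date patterns above. The function name `extract_fields` and the sample string are illustrative assumptions.

```python
import re

# Python equivalents of the email and DD/MM/YYYY patterns
EMAIL_RE = re.compile(r"[\w.-]+@[\w.-]+\.\w+")
DATE_RE = re.compile(r"\b(\d{2})/(\d{2})/(\d{4})\b")


def extract_fields(text: str) -> dict:
    """Return emails and ISO-formatted dates found in extracted PDF text."""
    emails = EMAIL_RE.findall(text)
    # Reorder the DD/MM/YYYY capture groups into YYYY-MM-DD
    dates = [f"{year}-{month}-{day}" for day, month, year in DATE_RE.findall(text)]
    return {"emails": emails, "dates": dates}


sample = "Contact: jane.doe@example.com, invoice dated 05/03/2024."
result = extract_fields(sample)
```

The date pattern only checks the shape of the string, so a value such as 99/99/2024 would still match; validate ranges separately if that matters for your data.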
Large-Scale Automation
Process Hundreds of PDFs
For production workflows:
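#!/bin/bash
# Convert every PDF in the current folder to JSON
for pdf in *.pdf; do
  [ -e "$pdf" ] || continue  # Skip the literal "*.pdf" when no files match
  echo "Processing $pdf..."
  # --fail returns a non-zero status on HTTP errors instead of saving the error body
  curl --fail --silent --show-error -F "file=@$pdf" \
    https://files-to.com/api/pdf/to-json \
    -o "${pdf%.pdf}.json" || echo "Failed: $pdf" >&2
done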
Python Integration
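import json
import os

import requests

API_URL = 'https://files-to.com/api/pdf/to-json'

def batch_convert_pdfs(folder_path):
    for pdf_file in sorted(os.listdir(folder_path)):
        if not pdf_file.lower().endswith('.pdf'):
            continue
        pdf_path = os.path.join(folder_path, pdf_file)
        with open(pdf_path, 'rb') as f:
            response = requests.post(API_URL, files={'file': f}, timeout=60)
        # Stop on HTTP errors instead of trying to parse an error page as JSON
        response.raise_for_status()
        # Save JSON next to the source PDF (report.pdf -> report.json)
        json_path = os.path.splitext(pdf_path)[0] + '.json'
        with open(json_path, 'w', encoding='utf-8') as out:
            json.dump(response.json(), out, indent=2)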
Data Validation
Verify the quality of extracted JSON:
const validateExtraction = (json) => {
  const issues = [];
  // Check required fields; return early because the loop below needs pages
  if (!json || !json.document || !Array.isArray(json.document.pages)) {
    return ['Invalid basic structure'];
  }
  // Validate data types
  json.document.pages.forEach((page, i) => {
    if (typeof page.page_number !== 'number') {
      issues.push(`Page ${i + 1}: page_number is not a number`);
    }
  });
  return issues.length > 0 ? issues : 'OK';
};
Advanced Use Cases
Multi-page Forms
Extract data from forms spanning multiple pages:
{
  "form": {
    "applicant": {
      "name": "...",
      "address": "...",
      "phone": "..."
    },
    "pages": [
      { "page": 1, "section": "Personal Information" },
      { "page": 2, "section": "Work Experience" },
      { "page": 3, "section": "References" }
    ]
  }
}
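A structure like the one above can be assembled by merging per-page extraction results. The sketch below is one possible approach, not a prescribed method: the per-page input shape (a `fields` dictionary on each page) and the name `merge_form_pages` are assumptions made for illustration.

```python
from typing import Any, Dict, List


def merge_form_pages(pages: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Combine per-page extraction results into a single form record."""
    form: Dict[str, Any] = {"applicant": {}, "pages": []}
    # Sort by page number so the output order does not depend on input order
    for page in sorted(pages, key=lambda p: p["page"]):
        form["pages"].append({"page": page["page"], "section": page["section"]})
        for name, value in page.get("fields", {}).items():
            # Keep the first non-empty value; later pages often repeat header fields
            if value and not form["applicant"].get(name):
                form["applicant"][name] = value
    return {"form": form}


# Hypothetical per-page results, deliberately out of order
extracted = [
    {"page": 2, "section": "Work Experience", "fields": {"name": "Jane Doe"}},
    {"page": 1, "section": "Personal Information",
     "fields": {"name": "Jane Doe", "phone": ""}},
    {"page": 3, "section": "References", "fields": {"phone": "555 0100"}},
]
merged = merge_form_pages(extracted)
```

Empty values are skipped, so a field left blank on one page can still be filled from a later page.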
Multi-language PDFs
Detect and label sections in different languages:
{
  "document": {
    "language": "mixed",
    "sections": [
      { "text": "...", "language": "en" },
      { "text": "...", "language": "es" }
    ]
  }
}
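To illustrate how sections might be labeled, the sketch below uses a deliberately naive stopword count. This is a stand-in technique chosen to keep the example self-contained; a production pipeline would more likely rely on a dedicated language-detection library. The word lists and the names `guess_language` and `label_sections` are assumptions.

```python
import re
from typing import Any, Dict, List

# Tiny illustrative stopword lists, far too small for real-world use
STOPWORDS = {
    "en": {"the", "and", "of", "is", "to", "in"},
    "es": {"el", "la", "y", "de", "es", "en", "los"},
}


def guess_language(text: str) -> str:
    """Pick the language whose stopwords occur most often, or 'unknown'."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {lang: sum(w in stop for w in words) for lang, stop in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def label_sections(sections: List[str]) -> Dict[str, Any]:
    """Label each section and mark the document 'mixed' if labels differ."""
    labeled = [{"text": s, "language": guess_language(s)} for s in sections]
    langs = {s["language"] for s in labeled}
    overall = "mixed" if len(langs) > 1 else next(iter(langs), "unknown")
    return {"document": {"language": overall, "sections": labeled}}


labeled_doc = label_sections([
    "The total of the invoice is due in May",
    "El total de la factura es de 100 euros",
])
```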
Performance Optimization
- Process in parallel: Convert multiple PDFs simultaneously
- Cache results: Don't reconvert identical PDFs
- Compress JSON: Use gzip for transfers
- Clean data: Remove unnecessary fields
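The first three points above can be combined in a few lines. The following is a rough sketch under stated assumptions: `convert` stands for whatever function performs the actual conversion (for example, an HTTP call), and `convert_many` and `compress_json` are hypothetical helper names. Files are deduplicated by SHA-256 digest before any work is submitted, so identical PDFs are converted only once.

```python
import gzip
import hashlib
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional


def convert_many(
    files: List[bytes],
    convert: Callable[[bytes], dict],
    cache: Optional[Dict[str, dict]] = None,
    workers: int = 4,
) -> List[dict]:
    """Convert files in a thread pool, skipping content already in the cache."""
    cache = {} if cache is None else cache
    digests = [hashlib.sha256(b).hexdigest() for b in files]
    # One entry per distinct, not-yet-cached file content
    pending = {d: b for d, b in zip(digests, files) if d not in cache}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        cache.update(zip(pending, pool.map(convert, pending.values())))
    # Results follow the input order, duplicates included
    return [cache[d] for d in digests]


def compress_json(data: dict) -> bytes:
    """Serialize compactly and gzip the result for transfer."""
    return gzip.compress(json.dumps(data, separators=(",", ":")).encode("utf-8"))


# Stand-in converter that records how often it is called
calls = []


def fake_convert(pdf_bytes: bytes) -> dict:
    calls.append(pdf_bytes)
    return {"size": len(pdf_bytes)}


results = convert_many([b"aaa", b"bb", b"aaa"], fake_convert)
payload = compress_json(results[0])
```

Passing the same `cache` dictionary to later calls lets results persist across batches; threads suit this workload because the conversion is typically bound by network I/O rather than CPU.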
Next Steps
- Learn about Real-World Use Cases
- Solve Common Errors
- Back to Basic Guide