
Advanced PDF to JSON Conversion: Techniques and Automation

Once you've mastered basic PDF to JSON conversion, it's time to learn advanced techniques for complex cases and large-scale automation.

Complex PDFs: Advanced Strategies

Nested and Hierarchical Tables

Some PDFs contain tables within tables or hierarchical structures:

{
  "table": {
    "headers": ["Category", "Items"],
    "rows": [
      {
        "category": "Q1 Sales",
        "subtable": {
          "items": [
            {"product": "A", "amount": 1000},
            {"product": "B", "amount": 2000}
          ]
        }
      }
    ]
  }
}

Solution: Extract recursively, so that a cell containing a table is parsed as a nested structure instead of being flattened into plain text.
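
Once the nested JSON exists, downstream code usually has to walk it recursively as well. The following is a minimal sketch, assuming the shape shown above (rows that may carry a "subtable" with an "items" list); the function name flatten_rows is hypothetical and not part of any converter API.

```python
def flatten_rows(rows, path=()):
    """Yield (path, row) pairs for leaf rows, descending into any "subtable"."""
    for row in rows:
        subtable = row.get("subtable")
        if subtable is None:
            # Leaf row: nothing nested below it
            yield path, row
            continue
        # Use the parent's category as one segment of the path
        label = row.get("category", "")
        yield from flatten_rows(subtable.get("items", []), path + (label,))


table = {
    "table": {
        "headers": ["Category", "Items"],
        "rows": [
            {
                "category": "Q1 Sales",
                "subtable": {
                    "items": [
                        {"product": "A", "amount": 1000},
                        {"product": "B", "amount": 2000},
                    ]
                },
            }
        ],
    }
}

for path, row in flatten_rows(table["table"]["rows"]):
    print(" / ".join(path), row)
```

Because the function calls itself on every subtable, it handles any nesting depth, not just the two levels in the example.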

OCR and Scanned Text

Scanned PDFs (images) require OCR:

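  • Quality: Scan at 300 DPI or higher for better recognition
  • Language: Specify the expected languages
  • Post-processing: Fix common OCR errors

// Fix common OCR errors without corrupting ordinary words
const fixOCR = (text) => {
  return text
    // In tokens made only of digits and look-alike letters (with at least
    // one real digit), read O/o as 0 and I/l as 1, e.g. "1O0" -> "100"
    .replace(/\b[\dOoIl]*\d[\dOoIl]*\b/g, (token) =>
      token.replace(/[Oo]/g, '0').replace(/[Il]/g, '1'))
    .replace(/\s+/g, ' '); // Collapse repeated whitespace
};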

Regular Expressions for Extraction

Extract specific fields using patterns:

// Extract email from PDF
const emails = /[\w\.-]+@[\w\.-]+\.\w+/g;

// Extract phone numbers (must start and end with a digit, so runs of
// spaces or dashes alone do not match)
const phones = /\+?\(?\d[\d\s\-()]{8,}\d/g;

// Extract dates (DD/MM/YYYY)
const dates = /(\d{2})\/(\d{2})\/(\d{4})/g;
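
For pipelines written in Python, the same patterns can be applied to the extracted text. This is a rough sketch: extract_fields is a hypothetical helper, and converting DD/MM/YYYY dates to ISO format is an added assumption about what the output should look like.

```python
import re

# Python equivalents of the email and date patterns above
EMAIL_RE = re.compile(r"[\w.-]+@[\w.-]+\.\w+")
DATE_RE = re.compile(r"(\d{2})/(\d{2})/(\d{4})")


def extract_fields(text):
    """Return the emails and dates found in a block of extracted text."""
    emails = EMAIL_RE.findall(text)
    # findall returns (day, month, year) tuples; reorder them as YYYY-MM-DD
    dates = [f"{year}-{month}-{day}" for day, month, year in DATE_RE.findall(text)]
    return {"emails": emails, "dates": dates}


sample = "Contact ana@example.com before 05/03/2024."
result = extract_fields(sample)
print(result)
```

Compiling the patterns once at module level avoids recompiling them for every page.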

Large-Scale Automation

Process Hundreds of PDFs

For production workflows:

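#!/bin/bash
# Process every PDF in the current folder
shopt -s nullglob  # Skip the loop when no PDFs match
for pdf in *.pdf; do
  echo "Processing $pdf..."
  # --fail makes curl return a non-zero exit code on HTTP errors
  if ! curl -sS --fail -X POST -F "file=@$pdf" \
    https://files-to.com/api/pdf/to-json \
    -o "${pdf%.pdf}.json"; then
    echo "Failed: $pdf" >&2
  fi
done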

Python Integration

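import json
import os

import requests

def batch_convert_pdfs(folder_path):
    for pdf_file in sorted(os.listdir(folder_path)):
        if not pdf_file.lower().endswith('.pdf'):
            continue
        pdf_path = os.path.join(folder_path, pdf_file)
        with open(pdf_path, 'rb') as f:
            response = requests.post(
                'https://files-to.com/api/pdf/to-json',
                files={'file': f},
                timeout=60,
            )
        # Stop on HTTP errors instead of trying to parse an error response
        response.raise_for_status()

        # Save JSON next to the source PDF (report.pdf -> report.json)
        json_path = os.path.splitext(pdf_path)[0] + '.json'
        with open(json_path, 'w', encoding='utf-8') as out:
            json.dump(response.json(), out, indent=2)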

Data Validation

Verify the quality of extracted JSON:

const validateExtraction = (json) => {
  const issues = [];

  // Check required fields; return early because the checks below read pages
  if (!json || !json.document || !Array.isArray(json.document.pages)) {
    return ['Invalid basic structure'];
  }

  // Validate data types
  json.document.pages.forEach((page, i) => {
    if (typeof page.page_number !== 'number') {
      issues.push(`Page ${i}: page_number is not a number`);
    }
  });

  return issues.length > 0 ? issues : 'OK';
};

Advanced Use Cases

Multi-page Forms

Extract data from forms spanning multiple pages:

{
  "form": {
    "applicant": {
      "name": "...",
      "address": "...",
      "phone": "..."
    },
    "pages": [
      { "page": 1, "section": "Personal Information" },
      { "page": 2, "section": "Work Experience" },
      { "page": 3, "section": "References" }
    ]
  }
}
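
One way to arrive at a structure like this is to merge per-page results into a single record. The sketch below assumes each page was extracted separately into a dict with "page", "section" and "fields" keys; that input shape and the name merge_form_pages are illustrative assumptions, not the converter's actual output.

```python
def merge_form_pages(page_results):
    """Combine per-page extraction results into one form record."""
    form = {"applicant": {}, "pages": []}
    # Process pages in order, regardless of the order they arrived in
    for result in sorted(page_results, key=lambda r: r["page"]):
        form["pages"].append({"page": result["page"], "section": result["section"]})
        for key, value in result.get("fields", {}).items():
            # Keep the first non-empty value seen for each field
            if value and not form["applicant"].get(key):
                form["applicant"][key] = value
    return {"form": form}


pages = [
    {"page": 2, "section": "Work Experience", "fields": {"phone": "555-0100"}},
    {"page": 1, "section": "Personal Information", "fields": {"name": "Ana", "phone": ""}},
]
merged = merge_form_pages(pages)
print(merged)
```

Keeping the first non-empty value is only one possible policy; a form that repeats a field with conflicting values may need to flag the conflict instead.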

Multi-language PDFs

Detect and label sections in different languages:

{
  "document": {
    "language": "mixed",
    "sections": [
      { "text": "...", "language": "en" },
      { "text": "...", "language": "es" }
    ]
  }
}
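
Assuming each section has already been labeled by a language detector of your choice, the top-level "language" value can be derived from the section labels. This is a small sketch; build_document is a hypothetical name, and the "unknown" fallback for empty input is an assumption.

```python
def build_document(sections):
    """Wrap labeled sections and derive the document-level language."""
    languages = {section["language"] for section in sections}
    if len(languages) > 1:
        overall = "mixed"
    elif languages:
        overall = next(iter(languages))
    else:
        overall = "unknown"  # Assumption: no sections means no known language
    return {"document": {"language": overall, "sections": sections}}


doc = build_document([
    {"text": "...", "language": "en"},
    {"text": "...", "language": "es"},
])
print(doc["document"]["language"])
```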

Performance Optimization

  • Process in parallel - Multiple PDFs simultaneously
  • Cache results - Don't reconvert identical PDFs
  • Compress JSON - Use gzip for transfers
  • Clean data - Remove unnecessary fields
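
The first three points can be combined in a few lines. The following is a sketch under stated assumptions: convert stands for whatever function sends one PDF's bytes to the converter and returns the parsed JSON (it is a hypothetical parameter, not a real API), and identical PDFs are detected by their SHA-256 digest.

```python
import gzip
import hashlib
import json
from concurrent.futures import ThreadPoolExecutor


def convert_all(pdf_blobs, convert, max_workers=4):
    """Convert {name: pdf_bytes} in parallel, converting identical files once."""
    digests = {name: hashlib.sha256(data).hexdigest() for name, data in pdf_blobs.items()}

    # Cache key is the content digest, so duplicate PDFs share one conversion
    unique = {}
    for name, digest in digests.items():
        unique.setdefault(digest, pdf_blobs[name])

    # Threads suit this workload because each conversion waits on the network
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = dict(zip(unique, pool.map(convert, unique.values())))

    return {name: results[digest] for name, digest in digests.items()}


def compress_json(obj):
    """Serialize compactly and gzip the result for transfer."""
    return gzip.compress(json.dumps(obj, separators=(",", ":")).encode("utf-8"))
```

A call such as convert_all({"a.pdf": data_a, "b.pdf": data_b}, convert=my_converter) returns one result per file name, and compress_json(result) yields bytes ready to send with a gzip content encoding.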

Next Steps