Common PDF to JSON Errors: Solutions

Converting PDFs to JSON often produces garbled text, scrambled tables, or output that fails to parse. Here are the most common errors and how to fix them.

Error 1: Incorrectly Extracted Text

Symptom: Text has weird characters, missing spaces, or incomplete words.

Original PDF: "Invoice #2024-001"
Extracted JSON: "Invoice #2024-00l" (final 1 read as lowercase l)

Causes:

  • PDF is a scanned image, so the text comes from OCR
  • PDF font with special characters
  • Low scan quality

Solution:

  1. Verify the PDF is text-based, not an image
  2. If it is scanned, use a tool with OCR support
  3. Increase scan resolution (300+ DPI)
  4. Validate afterwards with regex
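
Step 4 can be sketched as a pattern check on fields whose format is known in advance. The example below assumes a hypothetical invoice number format of #YYYY-NNN, matching the sample above; the function name and the pattern are illustrative, so adjust them to your own documents.

```javascript
// Hypothetical format: "#" + 4-digit year + "-" + 3-digit sequence
const INVOICE_NUMBER = /#\d{4}-\d{3}\b/;

// True if the text contains an invoice number in the expected format
const hasValidInvoiceNumber = (text) => INVOICE_NUMBER.test(text);

console.log(hasValidInvoiceNumber('Invoice #2024-001')); // true
console.log(hasValidInvoiceNumber('Invoice #2024-00l')); // false
```

A failed check does not tell you what the right value is, but it flags the record for manual review before it reaches downstream systems.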

Error 2: Misaligned Tables

Symptom: Table columns appear in wrong order or data is mixed up.

// Expected:
"table": {
  "headers": ["Name", "Age"],
  "rows": [["John", 30]]
}

// Obtained:
"table": {
  "headers": ["Name", "Age"],
  "rows": [["30", "John"]] // Columns reversed
}

Causes:

  • PDF with irregular columns
  • Variable whitespace
  • Text with line breaks inside cells

Solution:

  1. Inspect the PDF manually
  2. Specify expected column order
  3. Use regex to identify each cell's type and reorder the columns
  4. Consider converting PDF to Excel first
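
As one possible sketch of steps 2 and 3, each expected column can be described by a pattern, and every cell in a row assigned to the first free column whose pattern it matches. The columnPatterns array and the reorderRow helper are hypothetical names, and the approach only works when the column types are distinguishable (for example, a name and a number).

```javascript
// Hypothetical description of the expected column order: Name, Age
const columnPatterns = [
  /^[A-Za-z][A-Za-z .'-]*$/, // Name: letters, spaces, and basic punctuation
  /^\d{1,3}$/                // Age: one to three digits
];

// Place each cell in the first free column whose pattern it matches.
// Returns null if any cell cannot be placed, so the row can be reviewed manually.
const reorderRow = (row, patterns) => {
  const result = new Array(patterns.length).fill(undefined);
  for (const cell of row) {
    const text = String(cell).trim();
    const index = patterns.findIndex(
      (pattern, i) => result[i] === undefined && pattern.test(text)
    );
    if (index === -1) return null;
    result[index] = text;
  }
  return result.includes(undefined) ? null : result;
};

console.log(reorderRow(['30', 'John'], columnPatterns)); // [ 'John', '30' ]
```

Cells come back as strings; converting the age to a number is a separate step.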

Error 3: Invalid JSON

Symptom: The JSON fails to parse and throws a syntax error.

// Invalid: unescaped quotes
{
  "text": "He said: "Hello""
}

// Valid:
{
  "text": "He said: \"Hello\""
}

Cause: Special characters not escaped.

Solution:

// Validate at jsonlint.com, or in code:
const validateJSON = (str) => {
  try {
    JSON.parse(str);
    return true;
  } catch (e) {
    console.log('Invalid JSON:', e.message);
    return false;
  }
};
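
Validation detects the problem, but if you assemble the JSON yourself, the escaping error can usually be avoided at the source. A minimal sketch, assuming the extracted text is available as a plain string: build an object and let JSON.stringify escape it, rather than concatenating strings by hand.

```javascript
const extracted = 'He said: "Hello"';

// Hand-built string: the inner quotes are not escaped, so parsing fails
const unsafe = '{ "text": "' + extracted + '" }';

// JSON.stringify escapes quotes, backslashes, and control characters
const safe = JSON.stringify({ text: extracted });

console.log(safe); // {"text":"He said: \"Hello\""}
console.log(JSON.parse(safe).text === extracted); // true
```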

Error 4: Missing Fields

Symptom: JSON doesn't include data that's clearly in the PDF.

Causes:

  • Text in light colors (extracted as background)
  • Data in images or charts
  • Rotated or unusually formatted text

Solution:

  1. Check if text is in an image
  2. Use OCR if necessary
  3. Extract critical fields manually
  4. Verify the PDF text is readable (it can be selected and copied)
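
To catch missing fields early instead of discovering them downstream, the output can be compared against a list of fields you expect. This sketch assumes the converter returns a fields object (the same shape used in the cleanup example later in this post); the field names and the findMissingFields helper are hypothetical.

```javascript
// Hypothetical list of fields this document type must contain
const REQUIRED_FIELDS = ['invoiceNumber', 'date', 'total'];

// Return the names of required fields that are absent, null, or empty
const findMissingFields = (json, required = REQUIRED_FIELDS) =>
  required.filter((name) => {
    const value = (json.fields || {})[name];
    return value === undefined || value === null || value === '';
  });

const result = { fields: { invoiceNumber: '#2024-001', total: '' } };
console.log(findMissingFields(result)); // [ 'date', 'total' ]
```

An empty array means every required field is present; anything else is a list of fields to extract manually.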

Error 5: Incorrect Numbers

Symptom: Numbers are extracted as text or have incorrect decimals.

PDF: 1.234,56 (European format)
JSON: "1.234,56" (text instead of number)

Solution:

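const parseNumber = (str) => {
  const s = str.trim();
  const lastComma = s.lastIndexOf(',');
  const lastDot = s.lastIndexOf('.');
  // Treat whichever separator appears last as the decimal separator.
  // A single separator is ambiguous ("1,234" or "1.234" may be thousands),
  // so pass the locale explicitly if your documents contain such values.
  if (lastComma > lastDot) {
    // European format: 1.234,56 -> 1234.56
    return parseFloat(s.replace(/\./g, '').replace(',', '.'));
  }
  // US format: 1,234.56 -> 1234.56
  return parseFloat(s.replace(/,/g, ''));
};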

Error 6: Incorrect Date Format

Symptom: Dates aren't recognized or are in the wrong format.

Solution:

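const parseDate = (str) => {
  // Try multiple formats; `order` lists the capture groups for [year, month, day]
  const formats = [
    { re: /(\d{2})\/(\d{2})\/(\d{4})/, order: [3, 2, 1] }, // DD/MM/YYYY
    { re: /(\d{4})-(\d{2})-(\d{2})/, order: [1, 2, 3] },   // YYYY-MM-DD
    { re: /(\d{2})-(\d{2})-(\d{4})/, order: [3, 2, 1] }    // DD-MM-YYYY
  ];

  for (const { re, order } of formats) {
    const match = str.match(re);
    if (!match) continue;
    const [y, m, d] = order.map((i) => Number(match[i]));
    // Build the date from its parts (UTC midnight); new Date("25/12/2024") is not reliable
    const date = new Date(Date.UTC(y, m - 1, d));
    // Reject overflowed dates such as 31/02/2024
    if (date.getUTCMonth() === m - 1 && date.getUTCDate() === d) return date;
  }
  return null;
};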

Error 7: Character Encoding

Symptom: Accents and special characters appear as ???? or weird symbols.

Cause: The text is read or written with an encoding other than UTF-8.

Solution:

// Normalize to NFKC and strip ASCII control characters.
// Note: this also removes tabs and newlines.
const sanitize = (str) => {
  return str.normalize('NFKC').replace(/[^\x20-\x7E\u0080-\uFFFF]/g, '');
};
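
Sanitizing cannot recover characters that were already decoded with the wrong encoding, so it can help to check the raw bytes before they become a string. The sketch below assumes you have access to the extracted bytes as a Uint8Array; decodeUtf8Strict and hasReplacementChars are hypothetical helper names. It uses the standard TextDecoder with fatal: true, which throws on invalid UTF-8 instead of silently substituting U+FFFD.

```javascript
// Decode bytes as UTF-8, returning null if the input is not valid UTF-8
const decodeUtf8Strict = (bytes) => {
  try {
    return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
  } catch (e) {
    return null; // Likely a different encoding, e.g. Latin-1
  }
};

// Detect text that was already decoded lossily (contains U+FFFD)
const hasReplacementChars = (str) => str.includes('\uFFFD');

const valid = new Uint8Array([0x63, 0x61, 0x66, 0xc3, 0xa9]); // "café" in UTF-8
const invalid = new Uint8Array([0x63, 0x61, 0x66, 0xe9]);     // "café" in Latin-1

console.log(decodeUtf8Strict(valid));   // café
console.log(decodeUtf8Strict(invalid)); // null
```

A null result is a signal to find out which encoding the source really uses before converting, rather than stripping characters afterwards.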

Validation Checklist

Before using JSON in production:

  • [ ] Is the JSON valid? (jsonlint.com)
  • [ ] Are all numbers of type number?
  • [ ] Do dates have a consistent format?
  • [ ] Are all special characters escaped?
  • [ ] Do tables have headers and rows?
  • [ ] Are critical fields present?
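
Some of these checks can be automated. The sketch below covers the table and number items only, and assumes the table shape shown under Error 2 (headers plus rows); checkTable is a hypothetical helper, not part of any converter.

```javascript
// Return a list of problems found in a { headers, rows } table
const checkTable = (table) => {
  const problems = [];
  if (!table || !Array.isArray(table.headers) || !Array.isArray(table.rows)) {
    return ['table is missing headers or rows'];
  }
  table.rows.forEach((row, i) => {
    if (!Array.isArray(row) || row.length !== table.headers.length) {
      problems.push(`row ${i} does not have ${table.headers.length} cells`);
      return;
    }
    row.forEach((cell, j) => {
      // A string made only of digits and separators probably should be a number
      if (typeof cell === 'string' && /^-?\d[\d.,]*$/.test(cell.trim())) {
        problems.push(`row ${i}, column "${table.headers[j]}" holds a numeric string`);
      }
    });
  });
  return problems;
};

console.log(checkTable({ headers: ['Name', 'Age'], rows: [['John', '30'], ['Ann']] }));
```

For this input the function reports two problems: the age in row 0 is a numeric string, and row 1 is short by one cell. An empty array means the table passed both checks.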

Post-Conversion Optimization

const cleanupJSON = (json) => {
  return {
    ...json,
    // Remove empty fields
    fields: Object.fromEntries(
      Object.entries(json.fields || {}).filter(([, v]) => v !== null && v !== '')
    )
  };
};

Next Steps