Common PDF to JSON Errors and Their Solutions
When you convert PDFs to JSON, problems can arise. Here are the most common errors and how to solve them.
Error 1: Incorrectly Extracted Text
Symptom: Extracted text contains wrong characters, missing spaces, or truncated words.
Original PDF: "Invoice #2024-001"
Extracted JSON: "Invoice #2024-00l" (the digit 1 read as a lowercase l)
Causes:
- PDF is a scanned image, so the text comes from OCR
- PDF font with special characters
- Low scan quality
Solution:
- Verify the PDF is text-based, not an image
- If scanned, use tools with OCR
- Increase scan resolution (300+ DPI)
- Validate afterwards with regex
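The regex validation step can be sketched as follows. This is a minimal example, assuming invoice numbers follow the `#YYYY-NNN` shape from the sample above; `INVOICE_PATTERN` and `isValidInvoiceNumber` are illustrative names, not part of any library.

```javascript
// Hypothetical pattern: "Invoice #", a 4-digit year, a dash, then 3 digits.
const INVOICE_PATTERN = /^Invoice #\d{4}-\d{3}$/;

// Returns true when the extracted text matches the expected shape.
const isValidInvoiceNumber = (text) => INVOICE_PATTERN.test(text.trim());

console.log(isValidInvoiceNumber('Invoice #2024-001')); // true
console.log(isValidInvoiceNumber('Invoice #2024-00l')); // false (letter l instead of digit 1)
```

A failed match does not say what the correct value is, only that the field needs a second look.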
Error 2: Misaligned Tables
Symptom: Table columns appear in the wrong order or data is mixed up.
// Expected:
"table": {
  "headers": ["Name", "Age"],
  "rows": [["John", 30]]
}
// Obtained:
"table": {
  "headers": ["Name", "Age"],
  "rows": [["30", "John"]] // Columns reversed
}
Causes:
- PDF with irregular columns
- Variable whitespace
- Text with line breaks inside cells
Solution:
- Inspect the PDF manually
- Specify expected column order
- Use regex checks on each cell to detect and reorder misplaced values
- Consider converting PDF to Excel first
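One way to combine "specify expected column order" with regex checks is sketched below. It assumes each column can be recognized by a simple pattern; the `columns` definitions and `reorderRow` are hypothetical and would need adapting to real data.

```javascript
// Hypothetical column definitions: each column has a name and a test
// that says whether a raw cell plausibly belongs to it.
const columns = [
  { name: 'Name', matches: (cell) => /^[A-Za-z][A-Za-z .'-]*$/.test(cell) },
  { name: 'Age', matches: (cell) => /^\d{1,3}$/.test(cell) },
];

// Places each cell in the first free column whose test it passes.
// Returns null when a cell fits no remaining column or a column stays empty,
// so the row can be reviewed by hand.
const reorderRow = (row, cols) => {
  const result = new Array(cols.length).fill(undefined);
  for (const cell of row) {
    const value = String(cell).trim();
    const index = cols.findIndex((col, i) => result[i] === undefined && col.matches(value));
    if (index === -1) return null;
    result[index] = value;
  }
  return result.includes(undefined) ? null : result;
};

console.log(reorderRow(['30', 'John'], columns)); // ['John', '30']
```

The matching is greedy, so it only works when the column patterns do not overlap; two columns that both accept plain digits, for example, cannot be told apart this way.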
Error 3: Invalid JSON
Symptom: Parsing the JSON fails with a syntax error.
// Invalid: unescaped quotes
{
  "text": "He said: "Hello""
}
// Valid:
{
  "text": "He said: \"Hello\""
}
Cause: Special characters not escaped.
Solution:
// Validate at jsonlint.com
const validateJSON = (str) => {
  try {
    JSON.parse(str);
    return true;
  } catch (e) {
    console.log('Invalid JSON:', e.message);
    return false;
  }
};
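Validation catches the problem after the fact. If you assemble the JSON yourself, a common way to avoid unescaped characters in the first place is to build an object and serialize it with `JSON.stringify` rather than concatenating strings. A minimal sketch (the `extractedText` value is just the sample from above):

```javascript
// Raw text as it might come out of a PDF extractor (contains double quotes).
const extractedText = 'He said: "Hello"';

// JSON.stringify escapes quotes, backslashes, and control characters.
const json = JSON.stringify({ text: extractedText });

console.log(json); // {"text":"He said: \"Hello\""}
console.log(JSON.parse(json).text === extractedText); // true
```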
Error 4: Missing Fields
Symptom: JSON doesn't include data that's clearly in the PDF.
Causes:
- Text in light colors (treated as background and skipped)
- Data in images or charts
- Rotated or unusually formatted text
Solution:
- Check if text is in an image
- Use OCR if necessary
- Extract critical fields manually
- Verify the PDF is readable
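Missing fields are easier to catch automatically than by eye. The sketch below assumes a flat result object and a hypothetical list of required keys (`invoiceNumber`, `date`, `total`); substitute the fields your own documents need.

```javascript
// Hypothetical list of fields the downstream code cannot work without.
const REQUIRED_FIELDS = ['invoiceNumber', 'date', 'total'];

// Returns the names of required fields that are absent, null, or blank.
const findMissingFields = (data, required = REQUIRED_FIELDS) =>
  required.filter((key) => {
    const value = data[key];
    return value === undefined || value === null || String(value).trim() === '';
  });

const extracted = { invoiceNumber: '2024-001', date: '', total: 1234.56 };
console.log(findMissingFields(extracted)); // ['date']
```

Any field the function reports is a candidate for OCR or manual extraction.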
Error 5: Incorrect Numbers
Symptom: Numbers are extracted as text or have incorrect decimals.
PDF: 1.234,56 (European format)
JSON: "1.234,56" (text instead of number)
Solution:
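const parseNumber = (str) => {
  const s = str.trim();
  const lastComma = s.lastIndexOf(',');
  const lastDot = s.lastIndexOf('.');
  // Whichever separator appears last is treated as the decimal separator.
  // Ambiguous input such as "1,234" is read as a decimal (1.234).
  if (lastComma > lastDot) {
    // European format: 1.234,56 -> 1234.56
    return parseFloat(s.replace(/\./g, '').replace(',', '.'));
  }
  // US format or plain number: 1,234.56 -> 1234.56
  return parseFloat(s.replace(/,/g, ''));
};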
Error 6: Incorrect Date Format
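Symptom: Dates aren't recognized or are in the wrong format.
Solution:
const parseDate = (str) => {
  // Try multiple formats; "order" lists the capture groups as [year, month, day]
  const formats = [
    { re: /(\d{2})\/(\d{2})\/(\d{4})/, order: [3, 2, 1] }, // DD/MM/YYYY
    { re: /(\d{4})-(\d{2})-(\d{2})/, order: [1, 2, 3] }, // YYYY-MM-DD
    { re: /(\d{2})-(\d{2})-(\d{4})/, order: [3, 2, 1] } // DD-MM-YYYY
  ];
  for (const { re, order } of formats) {
    const match = str.match(re);
    if (!match) continue;
    const [year, month, day] = order.map((i) => Number(match[i]));
    // Build the date from its parts; new Date('25/12/2024') is not parsed reliably
    const date = new Date(Date.UTC(year, month - 1, day));
    // Reject impossible dates such as 31/02/2024, which Date would roll over
    if (date.getUTCMonth() === month - 1 && date.getUTCDate() === day) return date;
  }
  return null;
};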
Error 7: Character Encoding
Symptom: Accents and special characters appear as ???? or weird symbols.
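Cause: The extracted bytes were decoded with an encoding other than the one they were written in (for example, UTF-8 read as Latin-1).
Solution:
// Normalize Unicode and strip ASCII control characters (including tabs and newlines).
// This cleans up the string; it cannot repair text already decoded with the wrong encoding.
const sanitize = (str) => {
  return str.normalize('NFKC').replace(/[^\x20-\x7E\u0080-\uFFFF]/g, '');
};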
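If you have access to the raw bytes, it can help to detect an encoding problem before it reaches the JSON. The sketch below uses the standard `TextDecoder` API with `fatal: true`, which throws instead of silently substituting characters; `decodeUtf8Strict` and `hasReplacementChars` are illustrative helper names.

```javascript
// Decodes raw bytes as UTF-8 and throws if the bytes are not valid UTF-8,
// instead of silently inserting replacement characters.
const decodeUtf8Strict = (bytes) => new TextDecoder('utf-8', { fatal: true }).decode(bytes);

// Returns true when a string already contains U+FFFD, the replacement character
// that lenient decoders emit for bytes they could not interpret.
const hasReplacementChars = (str) => str.includes('\uFFFD');

const validBytes = new Uint8Array([0x63, 0x61, 0x66, 0xc3, 0xa9]); // "café" in UTF-8
console.log(decodeUtf8Strict(validBytes)); // café

const latin1Bytes = new Uint8Array([0x63, 0x61, 0x66, 0xe9]); // "café" in Latin-1
try {
  decodeUtf8Strict(latin1Bytes);
} catch (e) {
  console.log('Not valid UTF-8:', e.message);
}
```

When strict decoding fails, retrying with another encoding label (for example `new TextDecoder('windows-1252')`) is a reasonable next step, though which encoding is correct depends on how the PDF was produced.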
Validation Checklist
Before using JSON in production:
- [ ] Is the JSON valid? (jsonlint.com)
- [ ] Are all numbers type number?
- [ ] Do dates have consistent format?
- [ ] Are there unescaped special characters?
- [ ] Do tables have headers and rows?
- [ ] Are critical fields present?
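Several of these checks can be automated. The following is a rough sketch rather than a complete validator: the field names (`total`, `date`, `table`) and the choice of `YYYY-MM-DD` as the consistent date format are assumptions made for illustration.

```javascript
// Hypothetical target format for dates.
const ISO_DATE = /^\d{4}-\d{2}-\d{2}$/;

// Runs a few checklist items against a JSON string and returns a list of problems.
const checkExtraction = (jsonString) => {
  let data;
  try {
    data = JSON.parse(jsonString);
  } catch (e) {
    return [`Invalid JSON: ${e.message}`];
  }
  if (data === null || typeof data !== 'object') {
    return ['Top-level value is not an object'];
  }
  const problems = [];
  if (typeof data.total !== 'number' || Number.isNaN(data.total)) {
    problems.push('total is not a number');
  }
  if (typeof data.date !== 'string' || !ISO_DATE.test(data.date)) {
    problems.push('date is not in YYYY-MM-DD format');
  }
  const table = data.table;
  if (!table || !Array.isArray(table.headers) || !Array.isArray(table.rows)) {
    problems.push('table is missing headers or rows');
  } else if (table.rows.some((row) => !Array.isArray(row) || row.length !== table.headers.length)) {
    problems.push('a table row does not match the header count');
  }
  return problems;
};

const sample = '{"total":"1.234,56","date":"2024-01-15","table":{"headers":["Name","Age"],"rows":[["John",30]]}}';
console.log(checkExtraction(sample)); // ['total is not a number']
```

An empty array means the checked items passed; it does not prove the extracted values themselves are correct.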
Post-Conversion Optimization
const cleanupJSON = (json) => {
  return {
    ...json,
    // Remove empty fields
    fields: Object.fromEntries(
      Object.entries(json.fields || {}).filter(([, v]) => v !== null && v !== '')
    )
  };
};
Next Steps
- Read Basic Guide
- Learn Advanced Techniques
- Discover Real Use Cases