During my time at KbLab, I wrote code to transform raw PDF scans into structured, analysis-ready datasets, which I've compiled into a package for anyone to use. Although designed for researchers working with print-first documents, it can also be adapted as a pre-processing step for curating a knowledge base, for instance for graph or RAG use cases.
Architecture
The workflow consists of four modules:
- Document ingestion - handles raw PDF scans and digitized documents from various sources.
- OCR processing - uses Tesseract for text extraction and segmentation.
- Quality validation - applies per-word confidence scores and OCR quality metrics to filter out poor extractions and flag problematic sections of the page.
- Dataset output - segments the OCR output into structured datasets using the Python libraries lxml and NLTK.
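The quality-validation step can be sketched as follows. This is a minimal illustration, not the package's actual code: the function name and threshold are my own, and the word list stands in for the per-word confidence values (0-100) that Tesseract reports, e.g. via pytesseract's `image_to_data`.

```python
def filter_by_confidence(words, threshold=60):
    """Split OCR words into accepted and flagged lists by confidence.

    `words` is a list of dicts shaped like Tesseract's per-word output:
    {"text": ..., "conf": ...} with conf in the range 0-100.
    """
    accepted = [w for w in words if w["conf"] >= threshold]
    flagged = [w for w in words if w["conf"] < threshold]
    return accepted, flagged


# Hypothetical OCR output for a scanned page (stand-in data).
ocr_words = [
    {"text": "Stockholm", "conf": 96},
    {"text": "l8Z3", "conf": 41},   # likely a misread of "1823"
    {"text": "census", "conf": 88},
]

accepted, flagged = filter_by_confidence(ocr_words)
```

Flagged words are kept alongside the output rather than silently dropped, so problematic page regions can be reviewed or re-OCRed later.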
Notes
This project was created for historical document curation but can be adapted to other document-processing workflows, such as pre-processing for RAG or graph-related applications.
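As a rough sketch of that adaptation, validated page text could be split into overlapping chunks before embedding for RAG. The function below is a generic illustration with parameters of my choosing, not part of the package:

```python
def chunk_text(text, max_chars=200, overlap=20):
    """Split text into fixed-size chunks with a small overlap,
    so sentences cut at a boundary still appear intact in one chunk."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Step back by `overlap` characters to share context across chunks.
        start = end - overlap
    return chunks
```

In practice one would chunk on sentence boundaries (e.g. with NLTK's sentence tokenizer, which the pipeline already depends on) rather than raw character offsets.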