Very cool, thanks for sharing. I'm guessing it doesn't do OCR yet? FWIW, you may be interested in these similar projects, which are popular in the journalism community though they don't provide the same high-level interface or data-inference, just the PDF-to-delimited text processing:
OCR is left out as a possible future extension, which is why I got interested in this comparison. Thanks, I didn't know about pdfplumber! Its use of additional markup from pdfminer, such as vertical lines, is very interesting. Razr uses the poppler tools for text-only conversion and then automatically extracts column names and types from that text.
Like pdfplumber and unlike Tabula, the goal was to extract tables from a swath of documents without user intervention. Additionally, no knowledge about the location of tables in the document is required. A fully automated workflow would curl -X POST localhost/analyze/... and filter the JSON down to the type or types of tables needed (via context lines, data types, column headers).
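To make the filtering step a bit more concrete, here is a rough sketch in Python. The JSON shape is purely an assumption for illustration: I'm pretending the response is a list of tables, each with hypothetical "headers", "types" and "context" keys, which may not match what the service actually returns. The helper name filter_tables is made up as well, and the HTTP call itself is left out.

```python
from typing import Any, Dict, Iterable, List, Optional

# Hypothetical shape of one table in the analyze response (an assumption,
# not the service's documented schema):
#   {"headers": [...], "types": [...], "context": [...]}
Table = Dict[str, Any]


def filter_tables(
    tables: Iterable[Table],
    headers: Iterable[str] = (),
    types: Iterable[str] = (),
    context: Optional[str] = None,
) -> List[Table]:
    """Keep tables that contain all wanted headers and types and,
    if given, mention `context` in one of their context lines."""
    wanted_headers = {h.lower() for h in headers}
    wanted_types = set(types)
    matches = []
    for table in tables:
        table_headers = {h.lower() for h in table.get("headers", [])}
        table_types = set(table.get("types", []))
        context_text = " ".join(table.get("context", [])).lower()
        if not wanted_headers <= table_headers:
            continue  # a required column header is missing
        if not wanted_types <= table_types:
            continue  # a required data type is missing
        if context is not None and context.lower() not in context_text:
            continue  # context lines do not mention the search term
        matches.append(table)
    return matches


# Made-up sample standing in for the parsed JSON response.
sample = [
    {"headers": ["Date", "Amount"], "types": ["date", "float"],
     "context": ["Quarterly revenue"]},
    {"headers": ["Name", "Phone"], "types": ["string", "string"],
     "context": ["Contact list"]},
]

revenue_tables = filter_tables(sample, headers=["amount"], types=["float"])
contact_tables = filter_tables(sample, context="contact")
```

In a real pipeline the sample list would be replaced by the parsed body of the POST response, and the key names adjusted to whatever the JSON actually contains.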
- http://tabula.technology/ (Java)
- https://github.com/jsvine/pdfplumber (pure Python as well)