Tabula – Extract tables from PDF files (github.com/tabulapdf)
148 points by pabs3 on June 7, 2021 | 22 comments


We tried this and a few other tools, but we ended up with PDF2XL[1] (which works on everything, not just tables).

It's pretty ugly and not cheap, but the data extraction is absolutely magical.

I very rarely feel joy and excitement when using a tool, especially a PDF-related tool, but it saved our dev team at least 100 hours when we first used it. We have it as an automated part of one of our client flows and they happily pay us way more than they should.

1. https://pdf2xl.com/


I used Tabula quite a bit at the startup I used to work at (> 3 yrs ago). We were curating and organizing genetic testing information and much of the data was sent to us as PDFs.

It didn't work every time, but when it did, it was awesome!


Excel has some good (and scriptable) capabilities for this now.

https://techcommunity.microsoft.com/t5/excel-blog/announcing...


I've done similar extraction with pdfminer[1]. It lets you walk through a PDF document with various callbacks for different objects. I found that with tables that were all from a consistent source it was pretty easy to come up with a pattern that would match the tables.

1: https://pypi.org/project/pdfminer/
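
As a rough sketch of the pattern-matching idea above: once text lines have been pulled out of a PDF (by pdfminer or any other extractor), a regular expression can pick out table rows from a consistent layout. The row layout here (a date, a description, an amount) and the names `ROW_PATTERN` and `parse_rows` are hypothetical, and the sketch does not call pdfminer itself.

```python
import re

# Hypothetical row layout: ISO date, free-text description, signed decimal amount.
ROW_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<description>.+?)\s+(?P<amount>-?\d+\.\d{2})$"
)


def parse_rows(lines):
    """Return one dict per line that matches ROW_PATTERN; skip everything else."""
    rows = []
    for line in lines:
        match = ROW_PATTERN.match(line.strip())
        if match:
            row = match.groupdict()
            row["amount"] = float(row["amount"])
            rows.append(row)
    return rows


# Assumed sample input: text lines as an extractor might return them.
sample_lines = [
    "Statement of account",
    "2021-06-01  Coffee shop        -3.50",
    "2021-06-02  Salary           1500.00",
    "Page 1 of 1",
]
rows = parse_rows(sample_lines)
```

Lines that do not fit the pattern (titles, page footers) are simply dropped, which is why this only works well when every document comes from the same source.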


Camelot and its Excalibur are great too! We used it to convert different bank statements.


I guess these are the two projects you are talking about:

https://github.com/camelot-dev/camelot

https://github.com/camelot-dev/excalibur


This is such a shitty problem. I tried doing some stuff with ghostscript etc.


It's been a while since I've tried to run Java on Windows. Is there a way to install a JRE without any fuss, that won't phone home or nag me for updates?


As close as you'll get

https://adoptopenjdk.net/


You could use WSL and just run Java on the Linux side.


You may also find useful: https://github.com/adworse/iguvium


This is neat. Over at Docsumo [0] we use a combination of ML, NLP, and computer vision algorithms to detect the tables:

[0] - https://docsumo.com/free-tools/extract-tables-from-pdf-image...


Hey I've used this at work! Love it. Pdfplumber is very useful too. Although these tools are only useful when the text in a PDF is "selectable". If someone scans a document to PDF, there may be no text in the file for these tools to extract.
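
A minimal sketch of how one might check for this up front, assuming the per-page text has already been extracted by some tool: pages that yield almost no characters are probably scanned images and need OCR first. The function name `pages_without_text` and the 20-character threshold are arbitrary choices for illustration.

```python
def pages_without_text(page_texts, min_chars=20):
    """Return 1-based numbers of pages with fewer than min_chars non-whitespace characters."""
    flagged = []
    for number, text in enumerate(page_texts, start=1):
        # Drop all whitespace so blank lines and padding do not count as content.
        visible = "".join((text or "").split())
        if len(visible) < min_chars:
            flagged.append(number)
    return flagged


# Assumed input: one extracted string per page; the last two pages yielded nothing useful.
page_texts = ["Quarterly report\nRevenue 100 200 300", "", "  \n "]
flagged = pages_without_text(page_texts)
```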


When I first saw the name I thought that this Tabula [1] had somehow come back from the dead and pivoted!

[1] https://en.m.wikipedia.org/wiki/Tabula_(company)


Or “Taboola”, the pushers of mobile ad fraud and scareware.


I used tabula (and later camelot) to extract some register maps from data sheets to automate making some basic drivers!

What an awesome tool. I had to post-process the output, but both were consistent enough in how they were wrong that it was still faster than extracting the register maps by hand.


The most robust tool I found for this is https://pdf2spreadsheet.com/


Thanks! I learnt about Tabula from an ex-colleague, who was able to use it successfully to extract data from reports.


Some of the text extracted by tabula contains unexpected whitespace:

"Jan uary"

Is there any reliable way to deal with this?
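
One possible workaround, not a Tabula feature: when the affected values come from a known vocabulary (month names, column labels), adjacent fragments can be rejoined whenever their concatenation is a known word. The vocabulary and the function name `merge_split_words` below are assumptions for illustration, and the sketch only repairs words split into exactly two pieces.

```python
MONTHS = {
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
}
# Lower-cased vocabulary so the lookup is case-insensitive.
VOCAB = {month.lower() for month in MONTHS}


def merge_split_words(text, vocabulary=VOCAB):
    """Join neighbouring tokens whose concatenation is in the vocabulary.

    Runs of whitespace in the input are collapsed to single spaces.
    """
    tokens = text.split()
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i] + tokens[i + 1]).lower() in vocabulary:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2  # both fragments consumed
        else:
            merged.append(tokens[i])
            i += 1
    return " ".join(merged)


fixed = merge_split_words("12 Jan uary 2021")
```

For free-form text without a fixed vocabulary this approach does not apply, since there is no reliable way to tell a stray space from a real word boundary.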


It’s a nice tool. Found it by chance a couple of days ago. It did save me a lot of typing.


Only on PDFs with textual data, not scanned PDFs


Tabula works.

So does printing the document to a file and Perl.



