We tried this and a few other tools, but we ended up with PDF2XL[1] (which works on everything, not just tables).
It's pretty ugly and not cheap, but the data extraction is absolutely magical.
I very rarely feel joy and excitement when using a tool, especially a PDF-related tool, but it saved our dev team at least 100 hours when we first used it. We have it as an automated part of one of our client flows and they happily pay us way more than they should.
I used Tabula quite a bit at the startup I used to work at (> 3 yrs ago). We were curating and organizing genetic testing information, and much of the data was sent to us as PDFs.
It didn't work every time, but when it did, it was awesome!
I've done similar extraction with pdfminer[1]. It lets you walk through a PDF document with various callbacks for different objects. I found that with tables that were all from a consistent source it was pretty easy to come up with a pattern that would match the tables.
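A rough sketch of the "pattern that matches the tables" idea, assuming pdfminer.six is installed. Note that it uses the high-level `extract_text` helper rather than the callback-style walk described above, and the row layout (a numeric ID, a name, a decimal amount) is purely hypothetical; a real source would need its own pattern.

```python
import re

# Hypothetical row layout: a numeric ID, a free-text name, then a decimal amount.
ROW_PATTERN = re.compile(
    r"^(?P<id>\d+)\s+(?P<name>.+?)\s+(?P<amount>\d+\.\d{2})$"
)


def parse_rows(lines):
    """Return one dict per line that matches the table-row pattern."""
    rows = []
    for line in lines:
        match = ROW_PATTERN.match(line.strip())
        if match:
            rows.append(match.groupdict())
    return rows


def extract_lines(pdf_path):
    """Read the text layer of a PDF with pdfminer.six, one string per line."""
    # Imported lazily so parse_rows can be used without pdfminer installed.
    from pdfminer.high_level import extract_text

    return extract_text(pdf_path).splitlines()
```

With a real file you would call `parse_rows(extract_lines("report.pdf"))`, where `report.pdf` is a placeholder name. Lines that don't fit the pattern (headers, totals, page footers) are simply skipped, which is why a consistent source makes this approach workable.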
It's been a while since I've tried to run Java on Windows. Is there a way to install a JRE without any fuss, that won't phone home or nag me for updates?
Hey, I've used this at work! Love it. pdfplumber is very useful too. These tools are only useful when the text in a PDF is "selectable", though. If someone scans a document to PDF, there may be no text in the file for these tools to extract.
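One way to check up front whether a PDF has selectable text, sketched with pdfplumber. The helper names and the `min_chars` threshold are my own assumptions, not anything from the library; tune the threshold to your documents.

```python
def has_text_layer(page_texts, min_chars=20):
    """True if the page texts hold at least min_chars non-whitespace characters.

    min_chars is an arbitrary, hypothetical cutoff to ignore stray characters.
    """
    # extract_text() can yield None or an empty string for image-only pages.
    total = sum(len("".join((text or "").split())) for text in page_texts)
    return total >= min_chars


def page_texts_from_pdf(pdf_path):
    """Collect the extracted text of every page using pdfplumber."""
    # Imported lazily so has_text_layer works without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return [page.extract_text() for page in pdf.pages]
```

If `has_text_layer(page_texts_from_pdf("scan.pdf"))` comes back false (again, `scan.pdf` is a placeholder), the file is probably a scan and would need OCR before any of these tools can pull tables out of it.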
I used Tabula (and later Camelot) to extract some register maps from data sheets to automate making some basic drivers!
What an awesome tool. I had to post-process the output, but both were consistent enough in how they were wrong that it was still faster to extract the register maps this way.
1. https://pdf2xl.com/