Tabula – Extract tables from PDF files

smt88 · on June 7, 2021

We tried this and a few other tools, but we ended up with PDF2XL[1] (which works on everything, not just tables).

It's pretty ugly and not cheap, but the data extraction is absolutely magical.

I very rarely feel joy and excitement when using a tool, especially a PDF-related tool, but it saved our dev team at least 100 hours when we first used it. We have it as an automated part of one of our client flows and they happily pay us way more than they should.

1. https://pdf2xl.com/

tayloramurphy · on June 7, 2021

I used Tabula quite a bit at the startup I used to work at (> 3 yrs ago). We were curating and organizing genetic testing information and much of the data was sent to us as PDFs.

It didn't work every time, but when it did, it was awesome!

phonon · on June 8, 2021

Excel has some good (and scriptable) capabilities for this now.

https://techcommunity.microsoft.com/t5/excel-blog/announcing...

aidenn0 · on June 8, 2021

I've done similar extraction with pdfminer[1]. It lets you walk through a PDF document with various callbacks for different objects. I found that with tables that were all from a consistent source it was pretty easy to come up with a pattern that would match the tables.

1: https://pypi.org/project/pdfminer/
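
A minimal sketch of the pattern-matching step, assuming the text lines have already been pulled out of the PDF (for example with pdfminer) and that every table row looks like "date, description, amount". The row layout, the regex, and the name `match_table_rows` are hypothetical and would need adjusting to the actual source:

```python
import re

# Hypothetical row layout: an ISO date, a free-text description, an amount,
# e.g. "2021-06-07  Coffee shop  -4.50". Adjust the pattern to your source.
ROW_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})\s+"
    r"(?P<description>.+?)\s+"
    r"(?P<amount>-?\d+\.\d{2})$"
)


def match_table_rows(lines):
    """Return one dict per line that matches ROW_PATTERN; skip everything else."""
    rows = []
    for line in lines:
        match = ROW_PATTERN.match(line.strip())
        if match:
            rows.append(match.groupdict())
    return rows
```

Lines that do not fit the pattern (headers, page footers) are simply dropped, which is why this only works well when all documents come from a consistent source.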

gitowiec · on June 7, 2021

Camelot and its Excalibur are great too! We used them to convert different bank statements.

pabs3 · on June 8, 2021

I guess these are the two projects you are talking about:

https://github.com/camelot-dev/camelot https://github.com/camelot-dev/excalibur

excitednumber · on June 8, 2021

This is such a shitty problem. I tried doing some stuff with ghostscript etc.

gary_0 · on June 8, 2021

It's been a while since I've tried to run Java on Windows. Is there a way to install a JRE without any fuss, that won't phone home or nag me for updates?

scoopertrooper · on June 8, 2021

As close as you'll get

https://adoptopenjdk.net/

pabs3 · on June 8, 2021

You could use WSL and just run Java on the Linux side.

petalmind · on June 7, 2021

You may also find useful: https://github.com/adworse/iguvium

nishparadox · on June 8, 2021

This is neat. Over at Docsumo [0] we use a combination of ML, NLP, and computer vision algorithms to detect the tables:

[0] - https://docsumo.com/free-tools/extract-tables-from-pdf-image...

_8j7i · on June 8, 2021

Hey, I've used this at work! Love it. Pdfplumber is very useful too. These tools are only useful when the text in a PDF is "selectable", though. If someone scans a document to PDF, there may be no text in the file for these tools to extract.

tn1 · on June 8, 2021

When I first saw the name I thought that this Tabula [1] had somehow come back from the dead and pivoted!

[1] https://en.m.wikipedia.org/wiki/Tabula_(company)

sloshnmosh · on June 8, 2021

Or “Taboola”, the pushers of mobile ad fraud and scareware.

cjwoodall · on June 8, 2021

I used tabula (and later camelot) to extract some register maps from data sheets to automate making some basic drivers!

What an awesome tool. I had to post-process, but both were consistent enough in how they were wrong that it was still faster to extract the register maps.

nside · on June 8, 2021

The most robust tool I found for this is https://pdf2spreadsheet.com/

mohanmca · on June 8, 2021

Thanks! I learnt about Tabula from an ex-colleague who was able to use it successfully to extract data from reports.

kbouck · on June 8, 2021

Some of the text extracted by tabula contains unexpected whitespace:

"Jan uary"

Is there any reliable way to deal with this?
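
One possible workaround is a post-processing step rather than a Tabula setting. It is only a sketch and assumes you know in advance which words the column should contain (month names here). The `MONTHS` list and the function `rejoin_known_words` are hypothetical helpers, not part of Tabula:

```python
import re

# Assumed vocabulary: the values we expect to find in the extracted column.
MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]


def rejoin_known_words(text, vocabulary):
    """Remove whitespace that splits a known word, e.g. "Jan uary" -> "January"."""
    for word in vocabulary:
        # Allow optional whitespace between each pair of letters in the word.
        pattern = r"\b" + r"\s*".join(re.escape(ch) for ch in word) + r"\b"
        text = re.sub(pattern, word, text)
    return text
```

This is not a general fix: it cannot repair words outside the vocabulary, and it could wrongly merge two genuine tokens that happen to spell a known word.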

geonic · on June 7, 2021

It’s a nice tool. Found it by chance a couple of days ago. It did save me a lot of typing.

villgax · on June 8, 2021

Only on PDFs with textual data, not scanned PDFs

LVDOVICVS · on June 7, 2021

Tabula works.

So does printing the document to a file and Perl.
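
A rough sketch of that text-and-script approach, written in Python instead of Perl. It assumes the document has already been converted to plain text with its layout preserved, so that columns are separated by two or more spaces. The function name `split_columns` is hypothetical:

```python
import re


def split_columns(text):
    """Split each non-empty line into cells on runs of two or more whitespace characters."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between rows
        rows.append(re.split(r"\s{2,}", line))
    return rows
```

Single spaces inside a cell (such as "Widget A") are kept, but a cell that contains a double space, or two columns separated by only one space, will be split incorrectly.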