Apache Tika – a content analysis toolkit

geerlingguy · on Sept 14, 2020

Tika has been around for ages, and I remember many of the early versions (probably up to 1.2 or 1.3) would completely explode if you threw in a PDF with some UTF-8 characters, or Word documents with many foreign/non-ASCII languages.

Thankfully nowadays the problem I run into the most is memory exhaustion when some client uploads a 500+ MB PDF and expects my cheap Solr SaaS (https://hostedapachesolr.com) to handle extraction for these giant files!

arafalov · on Sept 14, 2020

Yeah, we are seeing the same problem in Solr User Group discussions. That Tika integration will need to change to something else (with Standalone Tika). Of course, Drupal's Solr module already allows to provide separate Tika, but I guess there is no equivalent Tika SAAS service, so they just point it all at your service.

zkanda · on Sept 14, 2020

We had a problem where we need to index and make searchable a hundred of thousands of government pdf files, some are as old as 15 years ago.

Tried a bunch libraries and settled with Tika. Although we were a PHP/Node shop, nothing could be compared to the ease of using Tika for this exact purpose.

GEBBL · on Sept 14, 2020

A timely link given there is a current discussion on the trouble with extracting meaningful text from pdfs in another thread on the front page! I look forward to reading the feedback on actual use of this

mumblemumble · on Sept 14, 2020

Tika's PDF text extraction is fine if you're just trying to get searchable text. Which is what it's made for: Slurping doucments into Lucene. Fulltext search typically isn't terribly sensitive to getting the order of words right, and is even less sensitive to getting the formatting right.

If you're trying to get something fit for consumption by human (including via a screen reader) or an NLP pipeline, though, all the problems discussed in that FilingDB article still apply.

tonitosou · on Sept 14, 2020

aspose can convert pdf to html

chaps · on Sept 14, 2020

Firefox's PDF reader does the same. A few years back, I wrote wrote a pdf to csv converter with selenium -- it worked surprisingly well! Though, after I finished I found tabula and the code became immediately useless, hah.

rovr138 · on Sept 14, 2020

When that happens to me, it's due to a keyword I hadn't thought of looking for. Then, while building the project, it came to me.

Ocelot20 · on Sept 14, 2020

I used Tika to build a search engine prototype, and it was fantastic for getting us up and running quickly.

It's a really easy to use generic parser for a bunch of document types. The downside of being so generic and easy to use is that you end up lacking document-specific context that could be useful. For example: Do you consider the header/footer text to be important, or just noise (Page 1, Page 2, etc.)? Is the text contained in the Table of Contents or section headers important, or just the actual content? You won't find any ways to tweak the result, which could be a good or bad thing depending on your use case.

We ended up using it as our "fallback" parser, writing more contextually aware ones for document types of greater importance to our use case (PDFs were high on the list).

tonitosou · on Sept 14, 2020

so how did u do to understand the "form" of the document such as table of contents and co.

nayuki · on Sept 14, 2020

For reference, it's https://news.ycombinator.com/item?id=24460142 ; https://filingdb.com/b/pdf-text-extraction

iav · on Sept 14, 2020

I’ve found Tika’s PDF to HTML parser to be pretty good. My only complaint is that in a double spaced document where there is equal amount of space between paragraphs and normal lines, it labels every line as a separate paragraph.

lazycrazyowl · on Sept 14, 2020

Apache UIMA played a key roles in the data intelligence and analytic proficiency of the IBM Watson supercomputer, playing against human champions on the TV show "Jeopardy!” and uses Tika for UIMA annotation.

https://blogs.apache.org/foundation/entry/apache_innovation_...

akerro · on Sept 14, 2020

Is there any website, table or whatever that compiles all/most apache projects and described them in one or two sentences?

Nifi, Flink, Mesos, Kafka, Cascoon, Cordova, Hadoop.

capableweb · on Sept 14, 2020

Saw the other comments here, didn't find a good place either. Got inspired by the discussion here to write a quick Figma plugin to take JSON input and generate elements based on it. Using it together with the instances is amazing, very quick to prototype stuff.

But, hosting the output was more tricky. PDF export ended up 140MB. PNG ended up 10MB but poor quality. SVG is perfect! But only found one service to host it, and seems even this SVG is too large for them to handle! (50MB)

So, anywhere I can host big SVGs? Then I could publish the quick hack I made to get the title, programming language + description from the projects list and into something you can quickly scan over.

Edit: Aha, found one host that was OK with the file size of a bigger PNG (image is 16806x9984, keep in mind) https://de.catbox.moe/espesr.png Also keep in mind, I'm not a designer, so it is what it is.

Will try to upload the SVG that is a bit more friendly on the data and performance. Edit2: SVG version https://de.catbox.moe/qppy6t.svg

arafalov · on Sept 14, 2020

It is a VERY long list: https://projects.apache.org/projects.html

akerro · on Sept 14, 2020

I'm also aware of this one, doesn't have descriptions.

arafalov · on Sept 14, 2020

The descriptions are behind the link, driven by the DOAP files each project maintain.

But I agree, it would have been nice to have a version of this page with first para of the description inlined in the listing itself.

drej · on Sept 14, 2020

https://en.wikipedia.org/wiki/List_of_Apache_Software_Founda...

akerro · on Sept 14, 2020

Yea, I know about this one where 1/3 is undocumented and sentences are non for technical people or super generic.

dariusj18 · on Sept 14, 2020

Be the change you want to see in the world.

dariusj18 · on Sept 14, 2020

Found this

https://projects.apache.org/json/foundation/projects.json

as the data source for this page

https://projects.apache.org/projects.html

liability · on Sept 14, 2020

I just tried this out on a handful of PDFs, comparing it to Calibre's `ebook-convert`

They seem roughly equivalent, neither better than the other. Particularly, both fail the same dehyphenations, a category of error that's extremely frustrating for text-to-speech users. By default tika seems more aggressive when joining split lines, but without good dehyphenation that's not worth much, and some of the lines it joins shouldn't be joined.

Calibre's is also several times faster, but the difference between 1 second and 7 seconds isn't really a big deal for my purposes.

merricksb · on Sept 14, 2020

If curious, see (small) discussion on HN from 8 years ago:

https://news.ycombinator.com/item?id=3878054

mikhailfranco · on Sept 15, 2020

Is Tika any good for MS Word structure?

Apache POI is amazingly complex and undocumented. It is one of the few Java libraries I have ever seen, where classes have setters but no getters, and the hacks you find on Stackoverflow involve reflective traversals and coercion of access modifiers.

nunorbatista · on Sept 15, 2020

Last month I had to do a PDF parser and searched a lot for a solution like Tika, but strangely this didn't come up. Cool, I'll test.

heffer · on Sept 14, 2020

I only learned of Tika through Dovecot which can use it to include email attachments in it's Full Text Search index using it. Pretty neat stuff.