Is uvx just part of uv? I keep a few python packages around via pipx (itself via homebrew) but am a big fan of uv for python projects… Do I just need to install uv globally (via brew?) to do this? Is there a mechanism to also have the installed utils available in my PATH (so I can invoke them without a uvx prefix)?
I worked on an in-house version of this feature for my employer (turning files into LLM-friendly text). After reading the source code, I can say this is a pretty reasonable implementation of this type of thing. But I would avoid using it for images, since the LLM providers let you just pass images directly, and I would also avoid using it for spreadsheets, since LLMs are very bad at interpreting Markdown tables.
There are a lot of random startups and open source projects who try to make this space sound fancy, but I really hope the end state is a simple project like this, easy to understand and easy to deploy.
I do wish it had a knob to turn for "how much processing do you want me to do." For PDF specifically, you either have to get a crappy version of the plain text using heuristics in a way that is very sensitive to how the PDF is exported, or you have to go full OCR, and it's annoying when a project locks you into one or the other. I'm also not sure I'd want to use the speech-to-text features here since they might have very different performance characteristics than the text-to-text stuff.
The reason there's a lot of startups in the OCR space (us being one of them) is the classic 80/20 rule. Any solution that's 80% accurate just doesn't work for most applications.
Converting a clean .docx into markdown is 10 lines of python. But what about the same document with a screenshot of an excel file? Or complex table layouts? The .NORM files that people actually use. Definitely agree with having a toggle between rules-based/ocr. But if you're looking at company wide docs, you won't always know which to pick.
The response from MarkItDown seems pretty barebones. I expected it to convert the clean pdf table element into a markdown table, but it just pulls the plaintext, which drops the header/column relationship.
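For illustration, here is a minimal sketch of the output I was hoping for. It assumes the table has already been extracted as a header plus rows; `rows_to_markdown` and the sample data are hypothetical, not anything MarkItDown exposes.

```python
def rows_to_markdown(header, rows):
    """Render a header and rows as a pipe-delimited Markdown table."""

    def fmt(cells):
        # Escape literal pipes so they don't split a cell in two.
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)


# Made-up table contents, standing in for a clean PDF table element.
print(rows_to_markdown(["item", "qty"], [["widget", 3], ["gadget", 7]]))
```

Keeping the header row is what preserves the header/column relationship that plain-text extraction drops.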
> Any solution that's 80% accurate just doesn't work for most applications.
And yet people use LLMs, for which "80% accuracy" is still mostly an aspiration. :-)
I think it's reasonably likely most companies end up using open source libraries, at least partly because it lets them avoid adding another GDPR sub-processor. Unstructured.io, one of your competitors, goes as far as having an AWS Marketplace setup so customers can use their own infrastructure but still pay them.
LLMs might get better at consuming badly-formatted data, so the data only needs to meet that minimum bar, vs the admittedly very nice output you showed.
> LLMs might get better at consuming badly-formatted data
Oh agreed. There's definitely a meeting in the middle between better ingestion and smarter models. LLMs are already a great fuzzing layer for that type of interpretation. And even with a perfect WYSIWYG text extraction, you're still limited by how coherent the original document was in the first place.
I don't think tabular data of any sort is a particularly good fit for LLMs at the moment. What are you trying to do with it?
If you want to answer questions like "how many students does Everglade High School have?" and you have a spreadsheet of schools where one of the columns is "number of students" I guess you could feed that into an LLM, but it doesn't feel like a great tool for the job.
I'd instead use systems like ChatGPT Code Interpreter where the LLM gets to load up that data programmatically and answer questions by running code against it. Text-to-SQL systems could work well for that too.
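As a rough sketch of that approach: load the spreadsheet into an in-memory SQLite database so the model only has to produce a query and the database does the lookup. The table name, column names, and numbers below are made up for illustration.

```python
import sqlite3

# Hypothetical rows standing in for the spreadsheet of schools.
ROWS = [
    ("Everglade High School", 1250),
    ("Cypress Middle School", 640),
    ("Mangrove Elementary", 410),
]


def build_db(rows):
    """Load (name, number_of_students) rows into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE schools (name TEXT, number_of_students INTEGER)")
    conn.executemany("INSERT INTO schools VALUES (?, ?)", rows)
    return conn


def student_count(conn, school_name):
    """Run the kind of query a text-to-SQL system might generate."""
    cur = conn.execute(
        "SELECT number_of_students FROM schools WHERE name = ?",
        (school_name,),
    )
    row = cur.fetchone()
    return row[0] if row else None


conn = build_db(ROWS)
print(student_count(conn, "Everglade High School"))
```

The point is that the answer comes from an exact lookup rather than from the model reading cells out of a long block of text.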
For me personally, a lot of times it's for table augmentation purposes. Appending additional columns to a dataset, such as a cleaned/standardized version of another field, extracting a value from another field, or appending categorization attributes (sometimes pre-seeded and sometimes just giving it general direction).
Or sometimes I'll manually curate a field like that, and then ask it to generate an Excel function that can be used to produce as similar a result as possible for automated categorization in the future.
So in most cases I both want to provide it with tabular data, and also want tabular data back out. In general I've gotten decent results for these sorts of use cases, but when it falls down it's almost always addressable by tinkering with the formatting related instructions – sometimes by tweaking the input and sometimes by tweaking the instructions for the desired output.
Give it the data as separate columns. For each cell give it the row index and the data.
That way it's just working with lists, but it can easily key on the row index, e.g. to see that all of this data is in row 3. Tell it to correlate data by the first value in each pair.
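A minimal sketch of that layout, assuming the table is already in memory as a header plus rows. The function names and sample data are hypothetical, and the exact prompt format is a guess rather than a tested recipe.

```python
def table_to_indexed_columns(header, rows):
    """Return {column name: [(row_index, value), ...]} with 1-based row indices."""
    columns = {name: [] for name in header}
    for row_index, row in enumerate(rows, start=1):
        for name, value in zip(header, row):
            columns[name].append((row_index, value))
    return columns


def render_for_prompt(columns):
    """Render each column as its own list of (row index, value) pairs."""
    lines = []
    for name, cells in columns.items():
        lines.append(f"Column: {name}")
        lines.extend(f"  ({row_index}, {value!r})" for row_index, value in cells)
    return "\n".join(lines)


# Made-up sample data.
HEADER = ["school", "students"]
ROWS = [
    ["Everglade High School", 1250],
    ["Cypress Middle School", 640],
    ["Mangrove Elementary", 410],
]

columns = table_to_indexed_columns(HEADER, ROWS)
print(render_for_prompt(columns))
```

Every value carries its row index, so the model can join columns on that index instead of counting positions.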
> I say "decent" because most of the available training data for Pandas does things in a naive way.
They're around the level of the median user, which is pretty bad as pandas is a big and complicated API with many different approaches available (as is base R, in case people think I'm just hating on pandas).
I've seen enough examples of an LLM misinterpreting a column or row - resulting in returning the incorrect answer to a question because it was off by one in one of the directions - that I'm nervous about trusting them for this.
JSON objects are different - there the key/value relationship is closer in the set of tokens which usually makes it more reliable.
yeah... so, you want to two-step it. Parse the table into something structured, then answer the question. For a lot of LLM "problems", it's about the same as teaching a kid a multi-step problem in math - if you try to do it in one step, you are going to have a hard time.
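Here's a sketch of the first step, under the assumption that the input is a simple pipe-delimited Markdown table with no escaped pipes or merged cells. `parse_markdown_table` and the sample table are hypothetical.

```python
import json


def split_row(line):
    """Split one pipe-delimited table line into stripped cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def parse_markdown_table(text):
    """Parse a simple Markdown table into a list of {header: value} dicts."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = split_row(lines[0])
    # lines[1] is the |---|---| separator row, so data starts at lines[2].
    return [dict(zip(header, split_row(ln))) for ln in lines[2:]]


# Made-up sample table.
TABLE = """
| school | students |
|---|---|
| Everglade High School | 1250 |
| Cypress Middle School | 640 |
"""

records = parse_markdown_table(TABLE)
print(json.dumps(records, indent=2))
```

The second step then works on the JSON records, where each value sits right next to its key, instead of on the raw table.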
The only reason I'm not immediately answering is because I need to check whether it's a trade secret. We do our own thing that I haven't seen anywhere else and works super well. Sorry for being mysterious, I'll try to get an OK to share.
LLM providers also let you send PDFs directly, too.
OTOH, sometimes you are the LLM provider, and you may not be using a multimodal LLM. (Or, even though feeding an LLM is a common use, you may be using the markdown for another purpose.)
Pandoc (https://pandoc.org) can be used to convert a .docx file to markdown and other file formats like djot and typst. I don't think pandoc can convert powerpoint and excel files.
The hard part about document conversion is not finding a tool which can convert the formats but the tool which does it best. I wonder how MarkItDown ranks for the tasks for the various types.
The README of MarkItDown mentions "indexing and text analysis" as the two motivating features, whereas Pandoc is more interested in document preparation via conversion that maintains rich text formatting.
Since my personal use leans towards the latter, I'm hesitant to believe that this tool will work better for me but others may have other priorities.
That was the first thing I checked, and it looks like they’re using some existing python package to parse docx files. I wonder if they contributed to it or vetted it strongly
Looking at the code, it looks like they used existing Python packages to read and parse MS Office formats. That's not what I expected: seeing that the repo is in Microsoft's org on GitHub, I expected them to have used Microsoft's "official" libraries for parsing these formats, through Component Object Model (COM).
They used Mammoth for docx (Word) [1][2]
python-pptx for pptx (PowerPoint) [3][4]
and Pandas for XLSX (Excel) [5]
I index a lot of tabletop RPG books in PDF format, which often have complex visual layouts and many tables that parsers typically have difficulty with. If this is just a wrapper around PDFMiner, as noted in another comment, I don't see any value added by this tool.
This handles them... fine. It either doesn't recognize or never attempts to handle tables, which makes it fundamentally a non-starter for my typical usage, but to its credit it seems to have at least some sense of table cells; it organizes columns in a manner that isn't fully readable but isn't as broken as some other solutions, either.
It otherwise handles text that's in variable-width columns or wrapped in complex ways around art work rather well. It inserts extraneous spaces on fully justified text, which is frustrating but not unusual, and sometimes adds extraneous line breaks on mid-sentence column breaks.
The biggest miss, though, is how it completely misses headings! This seems fundamental for any use case, including grooming sources for LLM training. It doesn't identify a single heading in any PDF I've thrown at it so far.
Nary a mention of LLMs in the readme. That was an unexpected but pleasant surprise, when the idea of converting something to markdown for LLMs is floated as if it's new and the greatest thing since sliced bread. https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
It's interesting to read the code. It's mostly glue code, and most of it is in a single 1101-line file. But it does indeed do what the README says it does. Here is the special handling for Wikipedia: https://github.com/microsoft/markitdown/blob/main/src/markit...
Edit: good to see the one from yesterday flagged. I tried to assume good intent, but also wondered if it was a place to draw a line in the sand. https://news.ycombinator.com/item?id=42405758
Edit 2: ah, it came down to simple violation of the Show HN rules. I didn't notice, but yeah, that's definitely the case.
Yep, and that’s fine. It’s just that there are a lot of false assumptions and magical thinking going around about LLMs and Markdown and I was glad to not find any in the README.
In my use, docling has not involved an LLM. There are a few choices for OCR, but I don't think a vision model is one of them.
It's certainly touted as a solution to digest documents into plain text for LLM use, but (unless I just haven't run into that part of it) it does not employ an LLM for its functions.
This is amazing and really useful, love the idea; but let me tell you a story, it's a bit of a tangent but relevant enough:
In an online language class we were sending the assignments to our teacher via slack, the teacher would then mark our mistakes and send it back.
I, as a true hater of all the heavy weight text formats for everyday communications, autonomously fired up the terminal, wrote my assignment in my_name.md and happily sent it without giving it any thought. This is what I hear the next session:
"... and everybody did a great job! Although someone just sent me their assignment in a stupid format. I don't know what it was! I could neither highlight it or make the text bold or anything. Don't do that to me again please".
Before that I never dreamed of meeting someone who preferred a word document _after_ opening a .md file, and I also learned if I had chosen product design as a career, everyone would've suffered immensely (or maybe not, I would've just ended up jobless).
> Before that I never dreamed of meeting someone who preferred a word document _after_ opening a .md file
That's like 90% of the people I know outside of computer/engineering circles. Most people have probably never opened a plaintext file in their life. They would have no idea what to do with a `.md` file.
In fact, some older engineers would not know what markdown is either, since it's only been around for two decades or so, but they can probably work with it anyway (the strength of plain text).
i got into a similar predicament helping format some course outline documents for a university friend of mine with rsi issues... I foolishly assumed a series of documents that are to be viewed online would be better as a markdown or html format... before realising by the end of the day i had unwittingly thrown myself into the gears of war between paper and digital. Modern universities are essentially an elaborate microsoft word shilling scheme with an obsession with virtual paper!
If you are talking about an online language class as in "I'm learning Yiddish" then I don't understand why it would be confusing that someone who isn't a coder or writer (and that's a big if) doesn't know what the heck markdown is and hence wouldn't want to deal with it, since they're used to MS Word or another word processor app. That's probably like 95% of the population at least.
This is... interesting. From my understanding - and people can correct me if I'm wrong, but didn't Microsoft spend an extremely large amount of effort essentially trying to screw people who made things like this in the 2000s? Interoperability and the Open Office movement were pretty hard fought. It's kind of crazy to see MSFT do this today. Did I just misunderstand and the underlying formats (docx etc) were actually pretty friendly, or have the formats evolved a lot since then? Or is it more a case of "It doesn't matter if it looks terrible because we're feeding it to the AI beast anyway"
A cynic might say it became suddenly easy when MSFT had a reason to allow you to generate markdown to feed into its AI?
Microsoft filed a covenant not to sue and made all the formats open ~20 years ago. A lot of people bitched at the time but there's a long list of software that supports the format now. It is complicated because the apps themselves are complicated and decades old, and imperfect because the format or app you're converting to likely doesn't support all of the features and certainly none of the quirks.
Though it promises to convert everything to Markdown, it seems to be a worse version of what the already existing tools such as PDFtotext, docx2txt, pptx2md, ... collected [here] do without even pretending to export to Markdown.
Looking at its [source], it indeed seems to be a wrapper to python variants of those. Making the pool smaller can hardly improve the output.
I can't tell if you're trolling or what but the idea of most business users (a) knowing markdown (b) reverting to html for the damn near infinite layout and/or styling things that markdown doesn't support (c) ignoring mail merge (d) wanting change tracking ... makes your comment laughable
Are there any good libraries for the opposite, going from markdown to pdf or docx? Pandoc gets most of the way there but struggles with certain things like tables.
I will try it with some complex layout PDFs or documents with tables. These documents have real business use cases for automation — insurance, banking, etc.
Anyone here who wants to convert PDF documents or scanned images as-is, preserving the layout, do try LLMWhisperer - https://unstract.com/llmwhisperer/
If the source document is anything half decent, this would serve to lose information, as markdown is far from flexible and powerful enough to represent all kinds of formatting and layout present in source documents. If all you need is the text information, then that might be just what you want, lossily compressing documents.
Was just yesterday working on chaining together `xlsx` and `tablemark` to accomplish this. I found `uvx markitdown my-excel.XLSX | sed 's/ NaN/ /g' > my-markdown.md` to be just what I needed to get my spreadsheet into a reasonably-legible markdown table when rendered by GitLab.
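If sed isn't handy, the same cleanup can be sketched in Python. This is a slightly stricter variant than the sed substitution: it only blanks cells whose entire content is `NaN`. `blank_nan_cells` and the sample table are hypothetical.

```python
import re

# A pipe, optional whitespace, "NaN", optional whitespace, then the next pipe.
# The lookahead leaves the closing pipe in place so adjacent NaN cells match too.
NAN_CELL = re.compile(r"\|\s*NaN\s*(?=\|)")


def blank_nan_cells(markdown):
    """Replace Markdown table cells that contain only 'NaN' with empty cells."""
    return NAN_CELL.sub("| ", markdown)


# Made-up sample of converter output for a sparse spreadsheet.
SAMPLE = "| name | notes |\n| --- | --- |\n| widget | NaN |\n| NaN | spare |"
print(blank_nan_cells(SAMPLE))
```

A cell like `NaNcy` is left alone because the pattern requires the next pipe to follow the `NaN` directly, apart from whitespace.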
Since it’s Microsoft maybe it will do a half decent job on Outlook HTML and .docx. I have evaluated most of them out there, paid included and haven’t found one that I thought was good enough to run in production. Definitely will be giving this a try.
Org-mode. Emacs Org mode has tables with formulas, being able to make use of many programming languages. By default Calc (I believe GNU Calc) and Elisp. However using sbe you can make it use code blocks written in any language that you have support for using org-babel. For example I have time tracking spreadsheets using source blocks of GNU Guile code for time calculation.
Of course you can put that under version control easily, since it is just a text document.
Your comment has me very curious what exactly you are looking for in a "markdown equivalent" for spreadsheets. Do you want Excel to be able to export the spreadsheet in a Markdown-like format (including formulas, etc)? Or do you want to build the spreadsheet in a text editor using Markdown++ syntax and then use some GUI application to render it? Or do you simply want an ASCII version of Excel that works in a terminal?
This is BS, it doesn't support Office documents, it supports only Microsoft's broken office documents which don't obey their own custom specs. Why doesn't this work on ODF files?
I've tried it against HTML and PDFs so far and it seems pretty decent.