MarkItDown: Python tool for converting files and office documents to Markdown (github.com/microsoft)
329 points by Handy-Man 45 days ago | 81 comments



If you have uv installed, you can run this against a file without installing anything first, like this:

    uvx markitdown path-to-file.pdf
(This will cache the necessary packages the first time you run it, then reuse those cached packages on future invocations.)

I've tried it against HTML and PDFs so far and it seems pretty decent.


Is uvx just part of uv? I keep a few python packages around via pipx (itself via homebrew) but am a big fan of uv for python projects… Do I just need to install uv globally (via brew?) to do this? Is there a mechanism to also have the installed utils available in my PATH (so I can invoke them without a uvx prefix)?


You can install to your path with 'uv tool install'.

uvx is just an alias for 'uv tool run'.
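
For example (markitdown is also the name of the PyPI package, matching the uvx example above):

    uv tool install markitdown
    markitdown path-to-file.pdf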


Thank you! I should explore the uv docs properly.


Wow that is magic! I just installed uv because of your comment.


I worked on an in-house version of this feature for my employer (turning files into LLM-friendly text). After reading the source code, I can say this is a pretty reasonable implementation of this type of thing. But I would avoid using it for images, since the LLM providers let you just pass images directly, and I would also avoid using it for spreadsheets, since LLMs are very bad at interpreting Markdown tables.
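
For the image case, a minimal sketch of passing the image straight through (assuming the OpenAI Python client; the model name and image URL are just illustrative):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Send the image itself instead of a lossy text conversion of it.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize what this chart shows."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }],
    )
    print(response.choices[0].message.content)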

There are a lot of random startups and open source projects who try to make this space sound fancy, but I really hope the end state is a simple project like this, easy to understand and easy to deploy.

I do wish it had a knob to turn for "how much processing do you want me to do." For PDF specifically, you either have to get a crappy version of the plain text using heuristics in a way that is very sensitive to how the PDF is exported, or you have to go full OCR, and it's annoying when a project locks you into one or the other. I'm also not sure I'd want to use the speech-to-text features here since they might have very different performance characteristics than the text-to-text stuff.


The reason there are a lot of startups in the OCR space (us being one of them) is the classic 80/20 rule. Any solution that's 80% accurate just doesn't work for most applications.

Converting a clean .docx into markdown is 10 lines of python. But what about the same document with a screenshot of an excel file? Or complex table layouts? The .NORM files that people actually use. Definitely agree with having a toggle between rules-based/ocr. But if you're looking at company wide docs, you won't always know which to pick.

Example with one of our test files:

Input: https://omni-demo-data.s3.us-east-1.amazonaws.com/zerox/Omni...

MarkItDown: https://omni-demo-data.s3.us-east-1.amazonaws.com/zerox/mark...

Ours: https://omni-demo-data.s3.us-east-1.amazonaws.com/zerox/omni...

The response from MarkItDown seems pretty barebones. I expected it to convert the clean pdf table element into a markdown table, but it just pulls the plaintext, which drops the header/column relationship.


> Any solution that's 80% accurate just doesn't work for most applications.

And yet people use LLMs, for which "80% accuracy" is still mostly an aspiration. :-)

I think it's reasonably likely most companies end up using open source libraries, at least partly because it lets them avoid adding another GDPR sub-processor. Unstructured.io, one of your competitors, goes as far as having an AWS Marketplace setup so customers can use their own infrastructure but still pay them.

LLMs might get better at consuming badly-formatted data, so the data only needs to meet that minimum bar, vs the admittedly very nice output you showed.


> LLMs might get better at consuming badly-formatted data

Oh agreed. There's definitely a meeting in the middle between better ingestion and smarter models. LLMs are already a great fuzzing layer for that type of interpretation. And even with a perfect WYSIWYG text extraction, you're still limited by how coherent the original document was in the first place.


From your experience, what would be the best way to handle spreadsheets?


I don't think tabular data of any sort is a particularly good fit for LLMs at the moment. What are you trying to do with it?

If you want to answer questions like "how many students does Everglade High School have?" and you have a spreadsheet of schools where one of the columns is "number of students" I guess you could feed that into an LLM, but it doesn't feel like a great tool for the job.

I'd instead use systems like ChatGPT Code Interpreter where the LLM gets to load up that data programmatically and answer questions by running code against it. Text-to-SQL systems could work well for that too.
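
A rough sketch of the "run code against it" approach, assuming pandas and a hypothetical schools.xlsx with "School Name" and "Number of Students" columns:

    import pandas as pd

    # Load the spreadsheet programmatically instead of pasting it into a prompt.
    df = pd.read_excel("schools.xlsx")  # needs openpyxl installed for .xlsx

    # Answer "how many students does Everglade High School have?" with a filter.
    row = df[df["School Name"] == "Everglade High School"]
    print(int(row["Number of Students"].iloc[0]))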


This is an active area of research: https://github.com/SpursGoZmy/Awesome-Tabular-LLMs is a good starting point!


For me personally, a lot of times it's for table augmentation purposes. Appending additional columns to a dataset, such as a cleaned/standardized version of another field, extracting a value from another field, or appending categorization attributes (sometimes pre-seeded and sometimes just giving it general direction).

Or sometimes I'll manually curate a field like that, and then ask it to generate an Excel function that can be used to produce as similar a result as possible for automated categorization in the future.

So in most cases I both want to provide it with tabular data and want tabular data back out. In general I've gotten decent results for these sorts of use cases, but when it falls down it's almost always addressable by tinkering with the formatting-related instructions – sometimes by tweaking the input and sometimes by tweaking the instructions for the desired output.
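
A minimal sketch of that augmentation pattern, assuming pandas and the OpenAI Python client; the file, column names, and prompt are all made up:

    import pandas as pd
    from openai import OpenAI

    client = OpenAI()
    df = pd.read_csv("vendors.csv")  # hypothetical input with a "description" column

    def categorize(description: str) -> str:
        # One short completion per row; fine for small tables, batch for big ones.
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content":
                       f"Categorize this vendor as Hardware, Software, or Services: {description}"}],
        )
        return resp.choices[0].message.content.strip()

    # Append the new column and hand tabular data back out.
    df["category"] = df["description"].apply(categorize)
    df.to_excel("vendors_categorized.xlsx", index=False)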


Give it the data as separate columns. For each cell give it the row index and the data.

That way it's just working with lists, but it can easily keep track that, e.g., all this data is in row 3, etc. Tell it to correlate data across columns using the row index in each pair.
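
Something like this, as a sketch (pandas assumed; the serialization is just what's described above, not an established format):

    import pandas as pd

    df = pd.read_csv("schools.csv")  # hypothetical input

    # One block per column; every cell carries its row index so the model can
    # correlate values across columns by row number.
    lines = []
    for column in df.columns:
        lines.append(f"Column: {column}")
        for idx, value in df[column].items():
            lines.append(f"  row {idx}: {value}")
        lines.append("")

    print("\n".join(lines))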


LLMs are decent at Pandas.

I say "decent" because most of the available training data for Pandas does things in a naive way.

OTOH, they are horrible at Polars. (I figure this is mostly a lack of training data.)


> I say "decent" because most of the available training data for Pandas does things in a naive way.

They're around the level of the median user, which is pretty bad as pandas is a big and complicated API with many different approaches available (as is base R, in case people think I'm just hating on pandas).


Many LLMs are ok with json and html tables. Not perfect, but not terrible.


I've seen enough examples of an LLM misinterpreting a column or row - resulting in returning the incorrect answer to a question because it was off by one in one of the directions - that I'm nervous about trusting them for this.

JSON objects are different - there the key/value relationship is closer in the set of tokens which usually makes it more reliable.
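
For example, a row-wise JSON rendering keeps each column name right next to its value (a sketch assuming pandas; the input file is hypothetical):

    import pandas as pd

    df = pd.read_csv("schools.csv")

    # Each row becomes a self-contained object, so the key sits beside the value
    # instead of several hundred tokens away in a markdown header row.
    print(df.head(3).to_json(orient="records", indent=2))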


Yeah... so, you want to two-step it. Parse the table into something structured, then answer the question. For a lot of LLM "problems", it's about the same as teaching a kid a multi-step problem in math - if you try to do it in one step, you are going to have a hard time.


Markdown isn’t suitable for most spreadsheets in the first place, IMO.


The only reason I'm not immediately answering is because I need to check whether it's a trade secret. We do our own thing that I haven't seen anywhere else and works super well. Sorry for being mysterious, I'll try to get an OK to share.

Edit: yeah I can't talk about it, sorry


LLM providers let you send PDFs directly, too.

OTOH, sometimes you are the LLM provider, and you may not be using a multimodal LLM. (Or, even though feeding an LLM is a common use, you may be using the markdown for another purpose.)


> LLMs are very bad at interpreting Markdown tables

Which table format is better for LLMs? Do you have some insights there?


For PDFs it's entirely a wrapper around https://pdfminersix.readthedocs.io/en/latest/tutorial/highle... - https://github.com/microsoft/markitdown/blob/main/src/markit...

So if that's your use case, PDFMiner might be better to integrate with directly!
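
If that's the route you take, the high-level API is only a couple of lines (a sketch using pdfminer.six's documented entry point):

    from pdfminer.high_level import extract_text

    # Plain-text extraction, roughly the layer MarkItDown leans on for PDFs.
    text = extract_text("path-to-file.pdf")
    print(text[:500])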


or just use pymupdf


pymupdf has a commercial licence that could be a problem if used in a company.


Pandoc (https://pandoc.org) can be used to convert a .docx file to markdown and other file formats like djot and typst. I don't think Pandoc can convert PowerPoint or Excel files.
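
The docx-to-markdown case is a one-liner (gfm is Pandoc's GitHub-flavored Markdown writer):

    pandoc report.docx -t gfm -o report.md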


The hard part about document conversion is not finding a tool which can convert the formats but the tool which does it best. I wonder how MarkItDown ranks for the tasks for the various types.


The README of MarkItDown mentions "indexing and text analysis" as the two motivating features, whereas Pandoc is more interested in document preparation via conversion that maintains rich text formatting.

Since my personal use leans towards the latter, I'm hesitant to believe that this tool will work better for me but others may have other priorities.


MarkItDown feels like running strings; the output is great for text extraction and processing, not for reading by humans.


Yeah that was the interesting part to me, at least. Plus, it's Microsoft so hopefully it will work for their files.


That was the first thing I checked, and it looks like they’re using some existing python package to parse docx files. I wonder if they contributed to it or vetted it strongly


Wow, I dunno if that's good or bad, certainly it's not what I expected.


Looking at the code, it seems they used existing Python packages to read and parse MS Office formats, which is not what I expected. Seeing that the repo is in Microsoft's org on GitHub, I expected them to have used Microsoft's "official" libraries for parsing these formats, through the Component Object Model (COM).

They used Mammoth for docx (Word) [1][2], python-pptx for pptx (PowerPoint) [3][4], and pandas for XLSX (Excel) [5].

[1] https://github.com/microsoft/markitdown/blob/70ab149ff1657c3... [2] https://pypi.org/project/mammoth/ [3] https://github.com/microsoft/markitdown/blob/70ab149ff1657c3... [4] https://pypi.org/project/python-pptx/ [5] https://github.com/microsoft/markitdown/blob/70ab149ff1657c3...
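
For reference, mammoth can also emit markdown directly on its own (a sketch using mammoth's documented API; this isn't necessarily how MarkItDown wires it up, and the filename is made up):

    import mammoth

    with open("report.docx", "rb") as docx_file:
        result = mammoth.convert_to_markdown(docx_file)

    print(result.value)     # the generated markdown
    print(result.messages)  # any conversion warnings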


COM requires you to interact with the files through the associated MS Office applications, whereas these libs parse the ooxml file format directly.


...I did not catch that it was from Microsoft. I was wondering why a random markdown converter was so notable.


I index a lot of tabletop RPG books in PDF format, which often have complex visual layouts and many tables that parsers typically have difficulty with. If this is just a wrapper around PDFMiner, as noted in another comment, I don't see any value added by this tool.

This handles them... fine. It either doesn't recognize or never attempts to handle tables, which makes it fundamentally a non-starter for my typical usage, but to its credit it seems to have at least some sense of table cells; it organizes columns in a manner that isn't fully readable but isn't as broken as some other solutions, either.

It otherwise handles text that's in variable-width columns or wrapped in complex ways around art work rather well. It inserts extraneous spaces on fully justified text, which is frustrating but not unusual, and sometimes adds extraneous line breaks on mid-sentence column breaks.

The biggest miss, though, is headings: it ignores them completely! This seems fundamental for any use case, including grooming sources for LLM training. It doesn't identify a single heading in any PDF I've thrown at it so far.


Nary a mention of LLMs in the readme. That was an unexpected but pleasant surprise, given how often the idea of converting something to markdown for LLMs is floated as if it's new and the greatest thing since sliced bread. https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

It's interesting to read the code. It's mostly glue code, and most of it is in a single 1,101-line file. But it does indeed do what the README says it does. Here is the special handling for Wikipedia: https://github.com/microsoft/markitdown/blob/main/src/markit...

Edit: good to see the one from yesterday flagged. I tried to assume good intent, but also wondered if it was a place to draw a line in the sand. https://news.ycombinator.com/item?id=42405758

Edit 2: ah, it came down to simple violation of the Show HN rules. I didn't notice, but yeah, that's definitely the case.


> Nary a mention of LLMs in the readme. That was an unexpected but pleasant surprise

No surprise it has still managed to come up in the comments in spite of that!


Yep, and that's fine. It's just that there are a lot of false assumptions and magical thinking going around about LLMs and Markdown, and I was glad to not find any in the README.


Quite curious how this compares to docling - https://github.com/DS4SD/docling

docling uses an LLM IIRC, so that's already a difference in approach


In my use, docling has not involved an LLM. There are a few choices for OCR, but I don't think a vision model is one of them.

It's certainly touted as a solution to digest documents into plain text for LLM use, but (unless I just haven't run into that part of it) it does not employ an LLM for its functions.


docling does not use LLMs...


This is amazing and really useful, love the idea; but let me tell you a story, it's a bit of a tangent but relevant enough:

In an online language class we were sending the assignments to our teacher via slack, the teacher would then mark our mistakes and send it back.

I, as a true hater of all the heavy weight text formats for everyday communications, autonomously fired up the terminal, wrote my assignment in my_name.md and happily sent it without giving it any thought. This is what I hear the next session:

"... and everybody did a great job! Although someone just sent me their assignment in a stupid format. I don't know what it was! I could neither highlight it or make the text bold or anything. Don't do that to me again please".

Before that I never dreamed of meeting someone who preferred a word document _after_ opening a .md file, and I also learned if I had chosen product design as a career, everyone would've suffered immensely (or maybe not, I would've just ended up jobless).


> Before that I never dreamed of meeting someone who preferred a word document _after_ opening a .md file

That's like 90% of the people I know outside of computer/engineering circles. Most people have probably never opened a plaintext file in their life. They would have no idea what to do with a `.md` file.

In fact, some older engineers would not know what markdown is either, since it's only been around for two decades or so, but they can probably work with it anyway (the strength of plain text).


Exactly! Hence the "please don't try product design role" advice for me. I seem to live in an all-engineers bubble.


Engineers are people too. Engineers use products as well. Maybe you would have gone into a saner direction than most products go.


I got into a similar predicament helping format some course outline documents for a university friend of mine with RSI issues... I foolishly assumed a series of documents that are to be viewed online would be better in a markdown or HTML format... before realising by the end of the day I had unwittingly thrown myself into the gears of war between paper and digital. Modern universities are essentially an elaborate Microsoft Word shilling scheme with an obsession with virtual paper!


If you are talking about an online language class as in "I'm learning Yiddish", then I don't understand why it would be confusing that someone who isn't a coder or writer (and that's a big if) doesn't know what the heck markdown is and hence wouldn't want to deal with it, since they're used to MS Word or another word processor app. That's probably like 95% of the population at least.


It doesn't confuse anyone, quite the opposite. The irony for me was my own isolation with the non-tech folks.


This is... interesting. From my understanding - and people can correct me if I'm wrong - didn't Microsoft spend an extremely large amount of effort essentially trying to screw people who made things like this in the 2000s? Interoperability and the Open Office movement were pretty hard-fought. It's kind of crazy to see MSFT do this today. Did I just misunderstand and the underlying formats (docx etc.) were actually pretty friendly, or have the formats evolved a lot since then? Or is it more a case of "It doesn't matter if it looks terrible because we're feeding it to the AI beast anyway"?

A cynic might say it became suddenly easy when MSFT had a reason to let you generate markdown to feed into its AI?


Microsoft filed a covenant not to sue and made all the formats open ~20 years ago. A lot of people bitched at the time but there's a long list of software that supports the format now. It is complicated because the apps themselves are complicated and decades old, and imperfect because the format or app you're converting to likely doesn't support all of the features and certainly none of the quirks.

https://en.wikipedia.org/wiki/Office_Open_XML

It took browsers 15 years just to render HTML whitespace nearly consistently, so keep that in mind as you read that history.


I don't think that's a cynical take considering the description

> (e.g., for indexing, text analysis, etc.)


Though it promises to convert everything to Markdown, it seems to be a worse version of what the already existing tools such as PDFtotext, docx2txt, pptx2md, ... collected [here] do without even pretending to export to Markdown. Looking at its [source], it indeed seems to be a wrapper to python variants of those. Making the pool smaller can hardly improve the output.

[here] https://github.com/Konfekt/vim-office [source] https://github.com/microsoft/markitdown/blob/main/src/markitdown/_markitdown.py


Why is the repository 95% "HTML" code?


There's some very large HTML files in the test directory, including an offline version of the Microsoft Wikipedia page


And the core code mostly calls other libraries for heavy lifting -- eg `mammoth`: https://github.com/mwilliamson/python-mammoth


tests


Never thought I'd see the day. Yet... not surprising because plain text is the ideal format for analysis, LLM training, etc.

The question businesses will start to ask is why are we putting our data into .docx files in the first place?


I can't tell if you're trolling or what, but the idea of most business users (a) knowing markdown, (b) reverting to HTML for the damn near infinite layout and/or styling things that markdown doesn't support, (c) ignoring mail merge, (d) wanting change tracking... makes your comment laughable.


Are there any good libraries for the opposite, going from markdown to pdf or docx? Pandoc gets most of the way there but struggles with certain things like tables.


It would be cool if Word just had that implemented inside the product, like Google Docs does.


I will try it with some complex layout PDFs or documents with tables. These documents have real business use cases for automation — insurance, banking, etc.

Anyone here who wants to convert PDF documents or scanned images as-is, preserving the layout, do try LLMWhisperer - https://unstract.com/llmwhisperer/


So we convert from rich formats with metadata & advanced features to a format without the former & severely lacking in the latter.


If the source document is anything half decent, this would serve to lose information, as markdown is far from flexible and powerful enough to represent all kinds of formatting and layout present in source documents. If all you need is the text information, then that might be just what you want, lossily compressing documents.


Very timely, thanks!

Was just yesterday working on chaining together `xlsx` and `tablemark` to accomplish this. I found `uvx markitdown my-excel.XLSX | sed 's/ NaN/ /g' > my-markdown.md` to be just what I needed to get my spreadsheet into a reasonably-legible markdown table when rendered by GitLab.


I made a version that can run entirely within the browser

https://www.html.zone/markitdown/


Since it’s Microsoft maybe it will do a half decent job on Outlook HTML and .docx. I have evaluated most of them out there, paid included and haven’t found one that I thought was good enough to run in production. Definitely will be giving this a try.


I wish we had a markdown equivalent for spreadsheets. Markdown tables ain’t it.


Org-mode. Emacs Org mode has tables with formulas that can make use of many programming languages: by default Calc (GNU Calc, I believe) and Elisp, but using `sbe` you can have formulas call code blocks written in any language you have org-babel support for. For example, I have time tracking spreadsheets that use source blocks of GNU Guile code for time calculations.

Of course you can put that under version control easily, since it is just a text document.
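
A tiny example of what that looks like in plain text, with a Calc formula summing the column (the values are made up):

    | Task  | Hours |
    |-------+-------|
    | Setup |     2 |
    | Docs  |     3 |
    | Total |     5 |
    #+TBLFM: @4$2=vsum(@2..@3)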


Your comment has me very curious what exactly you are looking for in a "markdown equivalent" for spreadsheets. Do you want Excel to be able to export the spreadsheet in a Markdown-like format (including formulas, etc)? Or do you want to build the spreadsheet in a text editor using Markdown++ syntax and then use some GUI application to render it? Or do you simply want an ASCII version of Excel that works in a terminal?


the second one.

I want a human-editable plain text spreadsheet format. CSV and TSV ain’t it, not least because they don’t have formulae.


I don't think it works if you try installing it using pip. Can anyone confirm? I ended up downloading it manually, making a venv, and then running it.


I wonder how a PowerPoint can be converted to markdown.


Oh thank god. I can finally retire my docx to pandoc to markdown tool chain. I can’t believe M$ was the big one to go first. Good on ya.


Converters like this are much more useful if they are bi-directional, even if the two directions aren't exactly inverses.


Why not Pandoc?


Pandoc does not have a PDF reader.


This is BS: it doesn't support Office documents, it supports only Microsoft's broken office documents, which don't obey their own custom specs. Why doesn't this work on ODF files?


Anyone get the Bing search DocumentConverter working? It keeps giving me null results.


any idea how it compares to Docling?



