Hi HN!
I'm stoked to share a project I've been working on called PDF to Podcast. It's a free, open-source tool that automatically converts PDF documents into engaging, informative podcast-style audio content using large language models and text-to-speech tech.
Inspiration:
The idea for this project came from the NotebookLM demo at Google I/O, where they showcased generating audio dialogue from uploaded PDFs and other sources. However, that audio feature hasn't been publicly released yet, and I wanted to challenge myself to build something similar using existing tools and APIs.
How it works:
The user uploads a PDF
The tool extracts the text and feeds it into Google's Gemini Flash language model
Gemini Flash generates a natural, engaging podcast dialogue script based on the key information in the document
This script is then converted to audio using OpenAI's text-to-speech API
The user can listen to the generated "podcast episode" and read along with the transcript
I chose to use Gemini Flash for the language model because it's good at writing high-quality prose while being fast and cheap. We use OpenAI's TTS API to then bring the dialogue to life.
Under the hood, it's built with Python, FastAPI, Gradio for the web UI, and my own library, promptic, for calling the LLM and getting structured output. The code is open-source and available on GitHub.
Apart from the tool's practical utility, I'm hoping this project can serve as a helpful example for others looking to build applications on top of large language models. It demonstrates an end-to-end flow from document intake to language model usage to audio output, with a simple web interface on top.
I would love to hear any feedback or ideas from the HN community! I think there's a lot of potential to expand on this concept and make all sorts of written content more accessible and engaging through audio conversion. Let me know what you think :)
It starts like this:
The way this uses different OpenAI TTS voices for the different roles is really neat!