Why are cancer guidelines stuck in PDFs? (seangeiger.substack.com)
288 points by huerne 1 day ago | 147 comments





I’d rather have the pdf than a custom tool. Especially considering the tool will be unique to the practice or emr. And likely expensive to maintain.

PDFs suck in many ways but are durable and portable. If I work with two oncologists, I use the same pdf.

The author means well but his solution will likely be worse because only he will understand it. And there’s a million edge cases.


Hey author here! Appreciate the feedback! Agreed on importance of portability and durability.

I'm not trying to build this out or sell it as a tool to providers. Just wanted to demo what you could do with structured guidelines. I don't think there's any reason this would have to be unique to a practice or emr.

As sister comments mentioned, I think the ideal case here would be if the guideline institutions released the structured representations of the guidelines along with the PDF versions. They could use a tool to draft them that could export in both formats. Oncologists could use the PDFs still, and systems could lean into the structured data.


The cancer reporting protocols from the College of American Pathologists are available in structured format (1). No major laboratory information system vendor implements them properly, and their implementation errors cause some not-insignificant problems with patient care (oncologists calling the lab asking for clarification, etc). This has pushed labs to make policies disallowing the use of those modules, and individual pathologists have reverted to their own non-portable templates in Word documents.

The medical information systems vendors are right up there with health insurance companies in terms of their investment in ensuring patient deaths. Ensuring. With an E.

(1) https://www.cap.org/protocols-and-guidelines/electronic-canc...


It doesn't look like the XML data is freely accessible.

If I could get access to this data as a random student on the internet, I'd love to create an open source tool that generates an interactive visualization.


The problem is that a bug could kill people.

> The medical information systems vendors are right up there with health insurance companies in terms of their investment in ensuring patient deaths. Ensuring. With an E.

Can you expand on this?


Medical information system vendors only care about making a profit, not implementing actual solutions. The discrepancies between systems can lead to bad information which can cost people their life.

As an analogy, imagine if the consequence of Oracle doing Oracle-as-usual things was worse medical outcomes. But they did them anyway for profit.

That's basically medical information system vendors.

The fact that the US hasn't pushed open source EMRs through CMS is insane. It's literally the perfect problem for an open solution.


It's worse than that. VistA is a world-class open source EMR that the VA has been trying to kill for decades.

VistA was useful in its time but it's hardly world class anymore. There were fundamental problems with the platform stack and data model which made it effectively impossible to keep moving forward.

Since Oracle bought Cerner a few years ago, no imagination is needed. Sadly so, since Cerner has lots of good people who want to make good products.

I love open source EMRs, but has any major country adopted open source EMRs?

I know OpenMRS exists but is mainly used within developing nations.

The US has VistA, made by the VA, and it is a beast and no one really wants to use it.


If I understand correctly, Estonia made their own EMR/EHR from scratch. The government produced (and commissioned?) software is all open source. https://koodivaramu.eesti.ee/explore

EMR software seems like something that shouldn't be that hard. It's fundamentally a CRUD app. Sure, there's a lot of legacy to interface with, but medical software seems like a deeply dysfunctional and probably corrupt industry.


It’s a famous “should be easy” use case. I think “easy” is wrong, if only because no one has actually done it.

>The fact that the US hasn't pushed open source EMRs through CMS is insane. It's literally the perfect problem for an open solution.

It's not insane, it's because the US is an oligarchy. And it's about to go even more oligarchy on steroids in the next year.


What explains most other democracies not doing it?

Is Sweden an oligarchy, too? Or France? Etc etc


It wouldn't be appropriate for the federal government to push any particular product. They have certified open source EHRs. It's not at all clear that increased adoption of those would improve patient outcomes.

https://chpl.healthit.gov/#/search


People could potentially properly implement them if they were open and available:

"Contact the CAP for more information about licensing and using the CAP electronic Cancer Protocols for cancer reporting at your institution."

This stinks of the same gate-keeping that places like NIST and ISO do, charging you for access to their "standards".


Aren’t all NIST standards free as they are a government body?

For liability reasons alone, you can't just have random people working on health/lab stuff, and the requisite vendors already have access to these standards.

According to what killjoywashere said, the vendors do not want to implement these standards. So if CAP wants the standards to be relevant, they should release them for random people to implement.

I mean, you're attributing malice, but it could just be that reliably implementing the formats is a really really hard problem?

How about fixing the format? Something that is obviously broken and resulting in patient deaths should really be considered a top priority. It's either malice or massive incompetence. If these protocols were open, there would definitely be volunteers willing to help fix it.

You seem to think that the default assumption is that fixing the format is easy/feasible, and I don't see why. Do you have domain knowledge pointing that way?

It's a truism in machine learning that curating and massaging your dataset is the most labor-intensive and error-prone part of any project. I don't see why that would stop being true in healthcare just because lives are on the line.


I think there are more options than malice or incompetence. My theory is difficulty.

There are multiple countries with socialized medicine and no profit motive, and it’s still not solved.

I think it’s just really complex with high negative consequences from a mistake. It takes lots of investment with good coordination to solve and there’s an “easy workaround” with pdfs that distributes liability to practitioners.


Healthcare suffers from strict regulatory requirements, underinvestment in organic IT capabilities, and huge integration challenges (system-to-system).

Layering any sort of data standard into that environment (and evolving it in a timely manner!) is nigh impossible without an external impetus forcing action (read: government payer mandate).


Incompetence at this level is intentional: it means someone doesn't think they'll see ROI from investing resources into improving it. Calling it malice is appropriate, I feel.

If there is no ROI, investing further resources would be charity work. I don’t think it’s accurate to call a company not doing so malicious.

Not actively malicious perhaps, but prioritising profits over lives is evil. Either you take care to make sure the systems you sell lead to the best possible outcomes, or you get out of the sector.

The company not existing at all might be worse though? I think it’s too easy to make blanket judgments like that from the outside, and it would be the job of regulation to counteract adverse incentives in the field.

You're making a lot of unsupported assumptions. There's no reliable evidence that this is causing patient deaths, or that a different format would reduce the death rate.

I believe you have good intentions, but someone would need to build it out and sell it. And it requires lots of maintenance. It’s too boring for an open source community.

There’s a whole industry that attempts to do what you do and there’s a reason why protocols keep getting punted back to pdf.

I agree it would be great to release structured representations. But I don’t think there’s a standard for that representation, so it’s kind of tricky as to who will develop and maintain the data standard.

I worked on a decision support protocol for Ebola and it was really hard to get code sets released in Excel. Not to mention the actual decision gates in a way that is computable.

I hope we make progress on this, but I think the incentives are off for the work needed to build the necessary data structures.


>Agreed on importance of portability and durability.

I think "importance" is understating it, because permanent consistency is practically the only reason we all (still) use PDFs in quite literally every professional environment as a lowest common denominator industrial standard.

PDFs will always render the same, whether on paper or a screen of any size connected to a computer of any configuration. PDFs will almost always open and work given a PDF reader, and these days one is simply embedded in Chrome.

PDFs will almost certainly Just Work(tm), and Just Working(tm) is a god damn virtue in the professional world because time is money and nobody wants to be embarrassed handing out unusable documents.


PDFs generally will look close enough to the original intent that they will almost always be usable, but will not always render the same. If nothing else, there are seemingly endless font issues.

In this day and age that seems increasingly like a solved problem to most end users, often a client-side issue or using a very old method of generating a PDF?

Modern PDF supports font embedding of various kinds (legality is left as an exercise to the PDF author) and supports 14 standard font faces which can be specified for compatibility, though more often document authors probably assume a system font is available or embed one.

There are still problems with the format as it foremost focuses on document display rather than document structure or intent, and accessibility support in documents is often rare to non-existent outside of government use cases or maybe Word and the like.

A lot of usability improvements come from clients that make an attempt to parse the PDF to make the format appear smarter. macOS Preview can figure out where columns begin and end for natural text selection, Acrobat routinely generates an accessible version of a document after opening it, including some table detection. Honestly creative interpretation of PDF documents is possibly one of the best use cases of AI that I’ve ever heard of.

While a lot about PDF has changed over the years the basic standard was created to optimize for printing. It’s as if we started with GIF and added support to build interactive websites from GIFs. At its core, a PDF is just a representation of shapes on a page, and we added metadata that would hopefully identify glyphs, accessible alternative content, and smarter text/line selection, but it can fall apart if the PDF author is careless, malicious or didn’t expect certain content. It probably inherits all the weirdness of Unicode and then some, for example.


I would assume these decision tree PDFs use a commonly available font. Layout and interpreted outcomes should be the same.

I think there’s value if it can scale down.

Community oncologists have limited technology resources as compared to a national cancer center. If we can make their lives easier, it can only be a good thing.

That said, I like published documents like PDFs - systems usually make it hard to tell the June release from the September release.


I agree. However, since the PDF format supports structured data, one could in principle have it both ways, within a single file.

^ This. See, e.g., https://lab6.com/ for some interesting tricks with the PDF format.

Exactly. The PDFs work. They won't break. You can see all the information with your own eyes. You can send them by e-mail.

A wizard-type system hides most of the information from you, it might have bugs you aren't aware of, if you want to glance at an alternative path you can't, it's going to be locked into registered users, the system can go down.

I think much more intelligent computer systems are the future in health care, but I doubt the way to start is with yet another custom tool designed specifically for cancer guidelines and nothing else.


> it's going to be locked into registered users, the system can go down

I didn't see anything in the screenshots presented that wouldn't be doable in a single HTML file containing the data, styles and scripts?

This is a countercultural idea but it fits so many use cases; it's a tragedy we don't do this more often. Instead, the two options are always either PDF or SaaS.
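
A minimal sketch of that single-file idea (the guideline fragment and file names below are invented, not real NCCN content):

    import json

    # Invented guideline fragment -- placeholder data only.
    guideline = {
        "question": "Tumor resectable?",
        "yes": {"treatment": "Surgery, then adjuvant therapy"},
        "no": {"treatment": "Systemic therapy, reassess"},
    }

    html = f"""<!doctype html>
    <html><head><meta charset="utf-8"><title>Guideline</title></head>
    <body>
    <div id="app"></div>
    <script type="application/json" id="data">{json.dumps(guideline)}</script>
    <script>
      // Data, styles and logic all live in this one file; works from file://
      const g = JSON.parse(document.getElementById("data").textContent);
      document.getElementById("app").textContent = g.question;
    </script>
    </body></html>"""

    with open("guideline.html", "w") as f:
        f.write(html)

Everything needed to view it travels in the one file, so it can be emailed around just like a PDF.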


> The PDFs work. They won't break.

Not just that, PDFs are one of the few formats where I'm willing to bet my own money that they'll still work in 10 or 20 years.

Even basic html has changed, layouts look different depending on many factors, and even the <blink>-ing doesn't work anymore.


A case-specific PDF could be created and stored in the patient's electronic records. Such a PDF could just highlight the decision tree path.

Sure, PDF/A is an ISO-standardized subset of the larger PDF spec designed expressly for archival purposes. You could do that with HTML but then how would you get your crypto mining AI chat bot powered by WASM to work?

You say this, but on the other hand, the author alleges that the places that use these custom tools achieve better outcomes. You didn't address this point one way or the other.

Do you think this is a completely fabricated non-explanation? It's not like the link says "the worst places use these custom tools."


Totally valid concerns. If you have time, I would like to show you my solution to get your thoughts, as I believe I have found ways to mitigate all of your concerns. Currently I am using STCC (Schmitt-Thompson Clinical Content). I have sent you some of the PDFs we use for testing.

The author is proposing that the DAG representation be in addition to the PDF:

>The organizations drafting guidelines should release them in structured, machine-interpretable formats in addition to the downloadable PDFs.

My opinion: Ideally the PDF could be generated from the underlying DAG -- that would give you confidence that everything in the PDF has been captured in the DAG.
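
To make that concrete, a rough sketch of one-source/two-exports using the graphviz Python package (node names are invented placeholders; the Graphviz binaries must be installed for the render step):

    import json
    from graphviz import Digraph

    # One source of truth: the edges of the decision graph (invented names).
    edges = [
        ("Stage I-II", "Surgery"),
        ("Stage III", "Chemoradiation"),
        ("Surgery", "Adjuvant therapy"),
    ]

    # Machine-readable export.
    with open("guideline.json", "w") as f:
        json.dump({"edges": edges}, f, indent=2)

    # Human-readable export rendered from the very same data.
    dot = Digraph(comment="Example guideline fragment")
    for src, dst in edges:
        dot.edge(src, dst)
    dot.render("guideline", format="pdf", cleanup=True)

Since both exports come from the same edge list, nothing can appear in the PDF that isn't in the structured data.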


You could generate the document from the graph and then attach it as data.

> could generate the document from the graph and then attach it as data

Much easier for doctors to draft PDFs than graphs.


I have not drafted a PDF myself and I doubt doctors will. They work in a word processor or spreadsheet application and then export or print to PDF, would be my guess. An interactive interface could spit out the PDF with the decision tree in the end. This solution would still mean the decision tree source is in some software package.

It would, I imagine, be much easier to generate a PDF from the tool's internal flowchart representation than the other way around.

The OP will be pleased to know that they’re not the first person to think of this idea. Searching for “computable clinical guidelines” will unearth a wealth of academic literature on the subject. A reasonable starting point would be this paper [1]. Indeed people have been trying since the 70s, most notably with the famous MYCIN expert system. [2]

As people have alluded to and the history of MYCIN shows, there’s a lot more subtlety to the problem than appears on the surface, with a whole bunch of technical, psychological, sociological and economic factors interacting. This is why cancer guidelines are stuck in PDFs.

Still, none of that should inhibit exploration. After all, just because previous generations couldn’t solve a problem doesn’t mean that it can’t be solved.

[1] https://pmc.ncbi.nlm.nih.gov/articles/PMC10582221/

[2] https://www.forbes.com/sites/gilpress/2020/04/27/12-ai-miles...


To the author:

The above is a high quality comment with worthy areas to study.

Additionally, I would draw your attention to NCCN’s “Developer API”, which is interesting not technologically but for how it reflects the IP landscape.

https://www.nccn.org/developer-api


Why isn't all human knowledge in one big JSON file? Guidelines are decision trees, but they're not written to be applied by rote, because the identical patients with identical cancers posited in the hypothetical don't exist. The guidelines are not written for maverick clinician movie protagonists to navigate the decision tree in real time while racing against an oncological clock, they're for teams of clinicians who are very well trained and used to working with each other, and who have the skills to notice where the guidelines might be wrong or need expansion or modification. That is, they're abstractions of the state-of-the-art.

Now it'd be nice if these could be treated like source code on some sort of medical VCS, to be easily modded or even forked by sub-specialists. But wetware dependencies are way more complicated than silicon ones, and will remain harder to discretize for some time to come.

It's not that the author's aspirations are misguided; they're great. But I believe progress in this area is most easily realized as part of a relevant team, because what looks conceptually simple from outside the system only seems that way because of a lack of resolution.


Because JSON is fucking horrible for humans?

If there's anything I want to read it's searchable paper.


>Because JSON is fucking horrible for humans?

It's better than XML at least.


JSON has never been better than XML. People are just terrible at XML and don't want to learn.

"People are just terrible at C++ and don't want to learn."

That too.

"At their core, guidelines are decision trees"

That's wishful and perhaps not even helpful as a goal. Guidelines rarely have the data to cover all possible legs of decisions. They report on well-supported findings, offer expert opinions on some interpolated cases, and perhaps list factors to consider for some of the remainder. If you reduced this to a decision tree, you'd find many branches are not covered, and most experts could identify factors that should lead to a more complex tree.

The reason is that branches are rarely definitive. It's more like quantum probabilities: you have to hold them all at once, and only when treatment works or doesn't does the disease (here cancer) declare itself as such.

Until the true information architecture of guidelines is captured, they will be conveyed as authoritative and educational statements of the standard of care.

In almost all cases, it's more important to reduce latency and increase transparency (i.e., publish faster but with references) than to simplify or operationalize in order to improve uptake. Most doctors in dynamic fields don't need the simplification; they rely on life-long self-discipline and diligence to overcome difficulty in the material, and use guidelines at most as a framework for communication and completion, i.e., for knowing when they've addressed known concerns.

Structured guidelines mainly enable outsiders to observe and control in ways that are likely to be unproductive.


The fundamental idea here is that doctors find it difficult to ensure that their recommendations are actually up-to-date with the latest clinical research.

Further, that by virtue of being at the centre of action in research, doctors in prestige medical centres have an advantage that could be available to all doctors. It's a pretty important point, sometimes referred to as the dissemination of knowledge problem.

Currently, this is best approached by publishing systematic reviews according to the Cochrane Criteria [0]. Such reviews are quite labour-intensive and done all too rarely, but are very valuable when done.

One aspect of such reviews, when done, is how often they discard published studies for reasons such as bias, incomplete datasets, and so forth.

The approach described by Geiger in the link is commendable for its intentions but the outcome will be faced with the same problem that manual systematic reviews face.

I wonder if the author considered including rules-based approaches (e.g. Cochrane guidelines) in addition to machine learning approaches?

[0] https://training.cochrane.org/handbook


Hey author here--Cochrane reviews are great.

NCCN guidelines and Cochrane Reviews serve complementary roles in medicine - NCCN provides practical, frequently updated cancer treatment algorithms based on both research and expert consensus, while Cochrane Reviews offer rigorous systematic analyses of research evidence across all medical fields with a stronger focus on randomized controlled trials. The NCCN guidelines tend to be more immediately applicable in clinical practice, while Cochrane Reviews provide a deeper analysis of the underlying evidence quality.

My main goal here was to show what you could do with any set of medical guidelines that was properly structured. You can choose any criteria you want.


It amazes me that AI isn't a borderline requirement for being a doctor. Think of how much info is outdated or just wrong.

There's a lot of outdated and wrong info coming from AI tools as well. It's the natural outcome of training on a large historical dataset.

> doctors find it difficult to ensure that their recommendations are actually up-to-date with the latest clinical research

Doctors care about this about as much as software engineers care about the latest computer science research. A few curious ones do. But the general attitude is that they already did their tough years of school, so they don’t have to anymore.


I worked with oncologists and this isn’t true.

Oncology has a rapidly changing treatment landscape and it’s common for oncologists to be discussing the latest paper that has come out.

If you’re an oncologist and not keeping up with the literature you’re going to be out of date in your decisions in about 6 months from graduation.


Funnily enough, that last paragraph is also said of software engineers. Neither is true.

Unlike oncologists, people won't die if you don't keep up on the latest programming techniques.

Yeah, non-programmers seem to think everything is changing so quickly all the time yet here I am writing in a 40 year old language against UNIX APIs from the 70s ¯\_(ツ)_/¯

Same reason why datasheets are still PDFs. It's a reliable, long lasting and portable format. And while it's kind of ridiculous that we are basically emulating paper, no other format fills that niche.

It's the niche HTML should be able to fill, since that was its original purpose, but it doesn't, because all focus over the last 20 or so years has been on everything except making HTML a better format for information exchange.

Trivial things like bundling up a complex HTML document into a single file don't have standard solutions. Cookies stop working when you are dealing with file:// URLs, and a lot of other really basic stuff just doesn't work or doesn't exist. Instead you get offshoot formats like ePUB that are mostly HTML, but not actually supported by most browsers.


I really wish there was a standard way to package html with JavaScript for offline viewing, basically to treat it like a pdf but with fancy things like interactive controls.

This is how we end up with 300MB HTML files, because companies offload datasheet creation to bootcamp-mill grads instead of technical writers.

> With properly structured data, machines should be able to interpret the guidelines. Charting systems could automatically suggest diagnostic tests for a patient. Alarm bells and "Are you sure?" modals could pop up when a course of treatment diverges from the guidelines. And when a doctor needs to review the guidelines, there should be a much faster and more natural way than finding PDFs.

I have implemented this computerized process twice at two different startups over the past decade.

I would not want the NCCN to do it.

The NCCN guidelines are not stuck in PDFs, they are stuck in the heads of doctors.

Once the NCCN guidelines get put into computerized rules, doctors start to be guided by those computerized rules, a second influence that takes them away from the fundamental science.

So while I totally agree that there should be systematization of the rules, it should be entirely secondary and subservient to the best frontier knowledge about cancer, which changes extremely frequently. After every annual ASCO (the major pan-cancer conference) and every disease-specific conference (e.g. the San Antonio breast cancer conference), and occasionally during the year when landmark clinical trials are published, doctors need to update their knowledge from the latest trials and their continuing medical education, an entire body of knowledge that is complementary to the edges of what the NCCN publishes.

Having spanned both computer science and medicine for my entire career, I trust doctors to be able to update their rules far faster than the programmers and databases.

Please do not get the NCCN guidelines stuck in spaghetti code that a few programmers understand, rather than open in PDFs with lots of links that anybody can go and chase after.

Edit: though give me a week digesting this article and I may change my mind. Maybe the NCCN should be standardizing clinical variables enough that the guidelines can trivially be turned into rules. That would require that the hypotheses of a clinical trial fit into those rules, however, and that's why I need a week of digestion to see if it may even be possible...


Decision trees work for making decisions...

But they don't work as well as other decisionmaking techniques... Random forests, linear models, neural nets, etc. are all decision making techniques at their core.

And decision trees perform poorly for complex systems where lots of data exists - i.e., human health.

So why are we using a known-inferior technique simply because it's easier to write down in a PDF file, reason about in a meeting, or explain to someone?

Shouldn't we be using the most advanced mathematical models possible with the highest 'cure' probability, even if they're so complex no human can understand them?


> complex systems where lots of data exists

Not a lot of high quality data exists for human health. Clinical guidelines for many diseases are built around surprisingly scant evidence many times.

> even if they're so complex no human can understand them?

That’ll be wonderful to explain in court when they figure out it was just data smuggling or whatever other bias.


In cancer there's an abundance of clinical trials with high quality data, but it is all very complex in terms of encoding what the clinical trial actually encoded.

Go to a clinical cancer conference and you will see the grim reality of 10,000s of people contributing to the knowledge discovery process with their cancer care. There is an inverse relationship between the number of people in a trial and the amount of risk that goes into that trial, but it is still a massive amount of data that needs to be codified into some sensible system, and it's hard enough for a person to do it.

> That’ll be wonderful to explain in court when they figure out it was just data smuggling or whatever other bias.

What do you mean by this? I'm not aware of any data smuggling that has ever happened in a clinical trial. The "bias" is that any research hypothesis comes from the fundamentally biased position of "I think the data is telling me this" but I've seen very little bias of truly bad hypotheses in cancer research like those that have dominated, say Alzheimer's research. Any research malfeasance should be prosecuted to the fullest, but I don't think cancer research has much of it. This was a huge scandal, but I don't think it pointed to much in the way of bad research in the end:

https://www.propublica.org/article/doctor-jose-baselga-cance...


By smuggling and bias I meant in an ML model. Smuggling was a bit informal, but referring to models overfit on unintended features or artifacts.

But we have well-established ways to deal with those... test/validation sets, n-fold validation, etc.

Even if there was some overfitting or data contamination that was undetected, the result would most probably still be better than a hand-made decision tree over the same data...


OK, until you can sue the AI, you need to find a doctor OK with putting their license behind saying “I have no idea how this shiny thing works”. There are indeed some that will, but not a consensus.

Great. Let's have the few that are willing to put their license behind that do so, and then a study showing that those people get better cure rates than those who do not...

And then the decision tree can be rewritten as "Do ${New_method}".


Hand-made decision trees are open to inspection, comprehension, and adaption. There is no way to adapt an opaque ML model to new findings / an experimental treatment except by producing a new model.

Evidence generation is usually based on decision tree models as well, so they match the resolution of the available data.

The practice of real world medicine often interpolates between these data points.


Models too complex for humans to understand don't, in practice, have a high 'cure' probability.

As a cancer researcher myself, I'd point out that some branches of the decision trees in the NCCN guidelines are based on studies in which multiple options were not statistically significantly different, but all were better than the placebo. In those cases, the clinician is free to use other factors to decide which arm to take. A classic example of this is surgery vs radiation for prostate cancer. Both are roughly equally effective, but very different experiences.

Cool tool. From my experience the PDF was easy to traverse.

The hardest part for me was understanding that treatment options could differ (i.e. between the _top_ hospitals treating the cancer). And there were a few critical options to consider. NCCN paths were traditional, but there are in-between decisions to make, or alternative paths. ChatGPT was really helpful in that period. "2nd" opinions are important... but again, you ask the top 2 hospitals and they differ in opinion, and any other hospital is typically in one of those camps.


It's so much worse than you could possibly imagine. I worked for a healthcare startup working on patient enrollment for clinical oncology trials. The challenges are amazing. Quite frankly it wouldn't matter if the data were in plaintext. The diagnostic codes vary between providers, the semantic understanding of the diagnostic information has different meanings between providers, electronic health records are a mess, things are written entirely in natural language rather than some kind of data structure. Anyone who's worked in healthcare software can tell you way more horror stories.

I do hope that LLMs can help straighten some of it out, but as anyone who's done healthcare software knows, the problems are not technical; they are quite human.

That being said one bright spot is we've (my colleagues, not me) made a huge step forward using category theory and Prolog to discover the provably optimal 3+3 clinical oncology dose escalation trial protocol[1]. David gave a great presentation on it at the Scryer Prolog meetup[2] in Vienna.

It's kind of amazing how in the dark ages we are with medicine. Even though this is the first EXECUTABLE/PROGRAMMABLE SPEC for a 3+3 cancer trial, he is still fighting to convince his medical colleagues and hospital administrators that this is the optimal trial because -- surprise -- they don't speak software (or statistics).

[1]: https://arxiv.org/abs/2402.08334

[2]: https://www.digitalaustria.gv.at/eng/insights/Digital-Austri...


David has since also made extensive progress with his category-theoretic approach, available in a new repository:

https://github.com/Precisfice/DEDUCTION


Have you read Jake Seliger’s pieces on oncology clinical trials? https://jakeseliger.com/

Oh wow. No, that's heartbreaking. I'll have to read up on this. Reminds me of David explaining the interesting and somewhat surprisingly insensitive language the oncology literature uses towards folks going through this. It's there for historical reasons but slow to change.

It also shows how important getting dose escalation trials right is. The whole point is finding the balance point where the "cure is NOT worse than the disease". A bad dose can be worse than the cancer itself, and conducting the trials correctly is extremely important... and this really underscores the human cost. Truly heartbreaking :(


This is a fascinating idea!

I find the web (HTML/CSS) the most open format for sharing. PDFs are hard to consume on smaller devices and much harder for machines to read. I am working on a feature at Jaunt.com to convert PDFs to HTML. It shows up as a reader mode icon. Please try it out and see if it is good enough. I personally think we need to do a much better job. https://jaunt.com

PDFs can be notoriously difficult to work with on smaller devices.

Oncologists making treatment decisions are generally using real computers, not toy mobile devices.

I know it's not the same, but in many areas we have this "follow the arrows" system in many guidelines. For some examples, see the EULAR guidelines with their flowcharts for treatments, and also the AO Surgery Reference with a graphical approach to selecting treatments based on fracture pattern, available materials and skill set.

I think that's a logical and necessary step to joining medical reasoning and computer helpers; we need easier access to new information and, more importantly, to present clinically relevant facts from the literature in a way that helps actual patient care decision making.

I'm just not too sure we can have generic approaches to all specialties, but it’s nice seeing efforts in this area.


Software that gives treatment instructions may be a medical device requiring FDA approval. You may be breaking the law if you give it to a medical professional without such approval.

Why can this not just be a website? Isn't this a perfect use case for HTML and hyperlinks?

In the UK, guidelines are published by the National Institute for Health and Care Excellence (NICE). Guidance is available to all in HTML and PDF formats.[0]

[0] https://www.nice.org.uk/guidance


Profit.

I use these guidelines all the time in the PDF format for free, and I'd love to have these in a structured format. For $3000/year you could get 50 users access to PDF prescription templates to speed up their work. That's not bad, but it's still all PDF.

For the nice low price of "contact us for pricing," though, you could have EMR integration. They couldn't justify $$$$ for EMR integration if all this information were easily accessible.

https://www.nccn.org/compendia-templates/nccn-templates-main...


The real problem is that the guidelines are written for humans in the first place. Workarounds like this shouldn't be needed, to go from a machine friendly layout to a human friendly one is usually quite easy.

And from what he says a decision tree isn't really the right model in the first place. What about no tree, just a heap of records in a SQL database. You do a query on the known parameters, if the response comes back with only one item in the treatment column you follow it. If it comes back with multiple items you look at what would be needed to distinguish them and do the test(s).
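
A minimal sketch of that query-on-known-parameters idea with the stdlib sqlite3 module (columns, stages, and regimen names are all invented for illustration):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE guideline
                   (stage TEXT, her2 TEXT, menopausal TEXT, treatment TEXT)""")
    con.executemany("INSERT INTO guideline VALUES (?, ?, ?, ?)", [
        ("II", "positive", "pre",  "Regimen A"),
        ("II", "positive", "post", "Regimen A"),
        ("II", "negative", "pre",  "Regimen B"),
        ("II", "negative", "post", "Regimen C"),
    ])

    # Known parameters so far: stage II, HER2-negative; menopausal status unknown.
    rows = con.execute(
        "SELECT DISTINCT treatment FROM guideline WHERE stage = ? AND her2 = ?",
        ("II", "negative")).fetchall()

    if len(rows) == 1:
        print("Follow:", rows[0][0])  # single answer: just follow it
    else:
        # Several candidates: the missing column tells you which test to order.
        print("Need a distinguishing test; candidates:", [r[0] for r in rows])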


The idea of adding hallucination to medical advice seems very dangerous.

There’s also a regression-to-the-mean problem: the systems really shouldn’t optimize just for the easier cases. I wonder if that’s a direct tradeoff; I think maybe it is, with the kinds of things I see used to tweak out hallucinations.

Forgive me if I'm mistaken, but isn't this exactly what the FHIR standard is meant to address? Not only does it enable global inter-health communication using standardized resources, but it's already adopted in several national health services, including (though not broadly) America's. Isn't this simply a reimplementation, minus the broad iterations of HL7?

Right, it would make more sense to use HL7 FHIR (possibly along with CQL) as a starting point instead of reinventing the wheel. Talk to the CodeX accelerator about writing an Implementation Guide in this area. The PlanDefinition resource type should be a good fit for modeling cancer guidelines.

https://codex.hl7.org/

https://www.hl7.org/fhir/plandefinition.html
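
For a sense of the shape, here is a minimal PlanDefinition fragment as FHIR R4 defines it, written as a Python dict of the JSON; the title, library reference, and CQL expression name are invented placeholders:

    # Minimal FHIR R4 PlanDefinition fragment, shown as a Python dict of the
    # JSON. Title, library reference, and expression name are invented.
    plan_definition = {
        "resourceType": "PlanDefinition",
        "status": "draft",
        "title": "Example guideline fragment",
        "library": ["Library/example-guideline-logic"],  # the CQL lives here
        "action": [{
            "title": "Consider adjuvant therapy",
            "condition": [{
                "kind": "applicability",
                "expression": {
                    "language": "text/cql",
                    "expression": "IsStageTwoOrHigher",  # defined in the library
                },
            }],
            "definitionCanonical": "ActivityDefinition/example-order",
        }],
    }

Nested "action" elements give you the tree structure, and the applicability conditions point into a CQL library that holds the actual logic.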


This is the comment I was looking for.

You would aim to use CQL expressions inside of a PlanDefinition, in my estimation. This is exactly what AHRQ's CDS Connect project (AHRQ is part of HHS) aims to create / has created. They publish freely accessible computable decision support artifacts here: https://cds.ahrq.gov/cdsconnect/repository

When they are fully computable, they are FHIR PlanDefinitions (+ other resources like Questionnaire, etc) and CQL.

Here's an example of a fully executable Alcohol Use Disorder Identification Test: https://cds.ahrq.gov/cdsconnect/artifact/alcohol-screening-u...

There's so much other infrastructure around the EHR here to understand (and take advantage of). I think there's a big opportunity in proving that multimodal LLMs can reliably generate these artifacts from other sources. It's not the LLM actually being a decision support tool itself (though that may well be promising), but rather the ability to generate standardized CDS artifacts in a highly scalable, repeatable way.

Happy to talk to anyone about any of these ideas - I started exactly where OP was.


I downloaded and opened a CDS artifact for osteoporosis from the link (a disease in my specialty). I needed an API key to view what a "valueset" entails, so in practice I couldn't verify whether the recommendation aligns with clinical practice, nor does the provided CQL have any scientific references (even a textbook or a weak recommendation from a guideline would be sufficient; I don't think the algorithm should be the primary source of the knowledge).

I tried to see if HL7 was approachable for small teams, but I personally became exhausted from reading it and trying to think how to implement a subset of it. I know it's "standard", but all this is kinda unapproachable.


You can register for a free NLM account to access the value sets (VSAC). HL7 standards are approachable for small teams but due to the inherent complexity of healthcare it can take a while to get up to speed. The FHIR Fundamentals training course is a good option for those who are starting out.

https://www.hl7.org/training/fhir-fundamentals.cfm?ref=nav

It might seem tempting to avoid the complexity of FHIR and CQL by inventing your own simple schema or data formats for a narrow domain. But I guarantee that what you thought was simple will eventually grow and grow until you find that you've reinvented FHIR — badly. I've seen that happen over and over in other failed projects. Talk to the CodeX accelerator I linked above and they should be able to get you pointed in the right direction.


I parsed some mind maps that were constructed with a tool and exported as PDFs (the original sources were lost a long time ago). I used Python with Tesseract for the text, plus OpenCV, and it worked alright. I am curious why the author went with LLMs, but I guess with the mentioned amount of data it wasn't hard to recheck everything later.
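
Roughly that pipeline, as a sketch (assuming pdf2image plus the poppler binaries, opencv-python, and pytesseract; the filename is a placeholder):

    import cv2
    import numpy as np
    import pytesseract
    from pdf2image import convert_from_path

    # Rasterize each PDF page, binarize it, then OCR it.
    for i, page in enumerate(convert_from_path("mindmap.pdf", dpi=300)):
        img = cv2.cvtColor(np.array(page), cv2.COLOR_RGB2GRAY)
        _, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        print(f"--- page {i} ---")
        print(pytesseract.image_to_string(img))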

The real question is: why is everything stuck in PDFs, and the more important meta-question is: why don't PDFs support meta-data (they do, somewhat). So much of what we do is essentially machine-to-machine, but trapped in a format designed entirely for human-to-human (also lump in a bit of machine-to-human).

Adobe has had literally a third of a century to recognize this need and address it. I don't think they're paying attention :-/


PDFs can have arbitrary files embedded, like XML and JSON. It also supports a logical structure tree (which doesn’t need to correspond to the visual structure) which can carry arbitrary attributes (data) on its structure elements. And then there’s XML Forms. You can really have pretty much anything machine-processable you want in a PDF. One could argue that it is too flexible, because any design you can come up with that uses those features for a particular application is unlikely to be very interoperable.
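
As a small sketch of the embedded-file route with pypdf (filenames and the JSON payload are placeholders):

    import json
    from pypdf import PdfReader, PdfWriter

    writer = PdfWriter()
    writer.append(PdfReader("guideline.pdf"))  # the human-readable half
    writer.add_attachment(                     # the machine-readable half
        "guideline.json",
        json.dumps({"edges": [["Stage III", "Chemoradiation"]]}).encode())
    with open("guideline_with_data.pdf", "wb") as f:
        writer.write(f)

The result still opens in any PDF viewer, but a machine can pull the structured data back out of the attachment.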

Nice! I looked for metadata and found only an anemic thing, but embedding a whole file with structured data makes perfect sense.

But of course this only pushes the responsibility back a step: why the heck isn’t Adobe pushing developers to include structured data in their output?

Every time you “Save as PDF” there should be a checkbox defaulted on to “Save Data to PDF”.

They’ve already done the first thing; why not do the second thing to make everyone’s life easier?


Adobe wants you to purchase Acrobat to be able to do that. Their strategy is to give you Reader for free, but for authoring they want you to buy their software. However there’s also third-party PDF software one can use for that. And apparently Google Drive supports it too: https://www.wikihow.com/Attach-a-File-to-a-PDF-Document

PDFs are essentially compressed Postscript, which is Turing complete, so a PDF in theory can do anything you want.

The FHIR clinical reasoning specs provide machine-oriented formats to talk about diagnostic flows. But the actual science is going to be PDFs for a while, and translating them into machine-friendly formats is going to be extra work for years and years.

Because writers don't think about readers. PDF is one of the worst formats for science/technical info, and yet it persists. I've dumped a lot of papers from arXiv because they were formatted as two-column, non-zoomable PDFs.

> The whole set of guidelines for a type of cancer breaks down into a few disjointed directed graphs

Nothing undermines medicine quite so thoroughly as yet another astronaut trying to force it into a data structure.


Comically, I worked in this space and initially tried to get decision support working with data structures and code sets and such.

I ended up only really contributing the addition of version numbers to the PDFs. So at least people knew they had the latest and same versions. And that took a year, just to get version numbers added to guideline PDFs.


That is wild; one would think versioning is extremely important. They tend to just put the timestamp in the filename (sometimes), which I guess is better than nothing.

Don't signed PDFs include a timestamp, however?


Getting it in the file name was kind of easy. But I meant adding it visually in the PDF guidance so readers could tell. Just numbers in the lower left corner. Or maybe right.

The guideline was available via url so the filename couldn’t change.


Yeah I know what you meant, I agree, it's awful.

Would love to take a look at the code, in particular at how the data extraction and transformation is implemented.

As a side note, the German associations of oncology publish their guidelines here (HTML and SVG graphs): https://www.onkopedia.com/de/onkopedia/guidelines


Excellent read. This consolidated and catalyzed my scattered thoughts around personal information management. The input is generally Markdown/PDF, but over time it is highly useless for a single person. There would be value if it were passed through such a system over time.

Shouldn't the end goal be just to train an AI on all the PDFs and give the doctors an interface to plug in all the details and get a treatment plan generated by that AI?

Working on the data structure feels like an intermediate solution on the way to that AI, which is not really necessary. Or am I missing something?


I am not sure patients and doctors are interested in adding hallucination generators to the list of their problems.

AI/ML techniques in medicine have been applied clinically since at least the 90s. Part of the reason you don't see them used ubiquitously is a combination of a) it hasn't worked all that well in many scenarios so far and b) medicine is by nature quite conservative, for a mix of good and not so good reasons.

Your end goal, maybe. Not patients' or doctors' goal, for sure.

How does your treatment AI get its liability insurance?

What you're missing is there is no evidence that using AI models results in better patient outcomes than simpler clinical decision support tools.

The author makes a great case for machine-interpretable standards but there is an enormous amount of work out there devoted to this, it’s been a topic of interest for decades. There’s so much in the field that a real problem is figuring out what solutions match the requirements of the various stakeholders, more than identifying the opportunities.

Why is anything stuck in PDFs?

PDFs are good for just one thing: printing. Data stored as PDF is meant to be printed, not processed by any other means.


You might be interested in checking out the WHO SMART Guidelines. Nothing on cancer yet AFAIK, but it's evolving.

I was also thinking about FHIR and SMART guidelines.

But the whole system is a mess. And the whole SMART guideline system is controlled by 2-3 gatekeepers who don’t listen to any ideas other than their own.


GraphViz has some useful graph schema languages that could be reused for something like this. There's DOT, a delightful DSL, and some kind of JSON format as well. You can then generate a bunch of different output formats and it will lay out the nodes for you.

Of all the challenges with this, graph layout is beyond trivial. It does not rank as a problem, an intellectual challenge, or even something that interesting.

The challenges are all about what goes in the nodes, how to define it, how to standardize it across different institutions, how to compare it to what was tested in two different clinical trials, etc. And if the computerized process goes into clinical practice, how is that node and its contents robustly defined so that a clinician sitting with a patient can instantly understand what is meant by its yes/no/multiple-choice question, in terms that have been used in recent years at the clinician's conferences.

Addressing the challenges of constructing the graph requires deep understanding of the terms, deep knowledge of how 10 different people from different cultural backgrounds and training locations interpret highly technical terms with evolving meanings, and deep knowledge of how people could misunderstand language or logic.

These guidelines codify evolving scientific knowledge where new conceptions of the disease get invented at every conference. It's all at the edge of science where every month and year we have new technology to understand more than we ever understood before, and we have new clinical trials that are testing new hypotheses at the edge of it.

Getting a nice visual layout is necessary, but in no way sufficient for what needs to be done to put this into practice.


Not ... even that interesting?

Modularity is an excellent way of attacking complex problems. We can all play with algorithms that can carry on realistic conversations and create synthetic 3D movies, because people worked on problems like making transistors the size of 10 atoms, figuring out how processors can predict branches with 99% accuracy, giving neural nets self-attention, deploying inexpensive and ridiculously fast networks all over the planet, and a lot of other stuff.

For many of us, curing cancer may someday become more important than almost anything else a computer can help us to do. It's just there are so many building blocks to solving truly complex problems; we must respect all that.


I have to ask: did the author contact any medical professional when writing this article? Is this really something that needs to be fixed, and will his solution actually fix it?

It seems to me that ignoring the guideline is a physician decision, and when it is ignored (for good or for bad), it is not because the guidelines are not available in json.


Gee, before talking about complex stuff like decision trees, how about we start with something really simple like not requiring a login to download the stupid PDF from NCCN?

super cool post until i saw you used an llm to do this

This is all predicated on the guidelines actually reflecting best practices

PDFs are a universal, machine readable format.

PDFs are the opposite of machine-readable if you want to do anything other than render them as images on paper or a screen. They're only slightly more machine-readable than binary executables.

I hate, hate, hate, hate, hate the practice of using PDFs as a system of record. They are intended to be a print format for ensuring consistent typesetting and formatting. For that, I have no quarrel. But so much of the world economy is based on taking text, docx (XML), spreadsheets, or even CSV files, rendering them out as PDFs, and then emailing them around or storing them in databases. They've gone from being simply a view layer to infecting the model layer.

PDFs are a step better than passing around screenshots of text as images - when they don't literally consist of a single image, that is. But even for reasonably-well-behaved, mostly-text PDFs, finding things like "headers" and "sections" in the average case is dependent on a huge pile of heuristics about spacing and font size conventions. None of that semantic structure exists, it's just individual characters with X-Y coordinates. (My favorite thing to do with people starting to work with PDFs is to tell them that the files don't usually contain any whitespace characters, and then watch the horror slowly dawn as they contemplate the implications.) (And yes, I know that PDF/A theoretically exists, but it's not reliably used, and certainly won't exist on any file produced more than a couple years ago.)
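
To make that concrete, a tiny sketch with PyMuPDF (the fitz package; the filename is arbitrary) showing what extraction actually hands you: word boxes at X-Y coordinates, with "headings" and "sections" left for you to reconstruct:

    import fitz  # PyMuPDF

    doc = fitz.open("guideline.pdf")
    page = doc[0]
    # Each entry is (x0, y0, x1, y1, word, ...) -- coordinates, no structure.
    for x0, y0, x1, y1, word, *rest in page.get_text("words")[:10]:
        print(f"({x0:6.1f}, {y0:6.1f})  {word}")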

Now, with multi-modal LLMs and OCR reaching near-human levels, we can finally... attempt to infer structured data back out from them. So many megawatt-hours wasted in undoing what was just done. Structure to unstructure to structure again. Why, why, why.

As for universality... I mean, sure, they're better than some proprietary format that can only be decrypted or parsed by one old rickety piece of software that has to run in Win95 compatibility mode. But they're not better than JSON or XML if the source of truth is structured, and they're not better than Markdown or - again - XML if the source is mostly text. And there are always warts that aren't fully supported depending on your viewer.


They’re only machine-readable in the very weak sense that all computer files are machine-readable.

Funny, I just had the thought the other day about how we as a society need to move past the PDF format, or even just update it to be editable in traditional document software. The fact that Google Docs will export as a PDF and not have it saved in the document is proof it's gotten to a point of inefficiency, and that's just one example.

a decision tree is just a csv trapped in amber. share the actual data
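
Taken literally, a sketch of that flattening, with a toy tree and invented labels (one CSV row per root-to-leaf path):

    import csv

    # Toy tree, invented labels.
    tree = {"question": "HER2 positive?",
            "yes": {"treatment": "Regimen A"},
            "no": {"question": "Premenopausal?",
                   "yes": {"treatment": "Regimen B"},
                   "no": {"treatment": "Regimen C"}}}

    def paths(node, prefix=()):
        """Yield one row per root-to-leaf path through the tree."""
        if "treatment" in node:
            yield prefix + (node["treatment"],)
        else:
            for answer in ("yes", "no"):
                yield from paths(node[answer], prefix + (node["question"], answer))

    with open("tree.csv", "w", newline="") as f:
        csv.writer(f).writerows(paths(tree))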

WAIT ... Hold up... what have we here: https://www.nccn.org/compendia-templates/compendia/nccn-comp...

TLDR: The NCCN surely has a clean pretty database of these algorithms. They output these junky pdfs for free. Want cleaner "templates" data? Pay the toll please.

What we have here is a walled garden. Want the treatment algorithm? Here, muck through this huge disaster of 999-page PDFs. Oh, you want the underlying data? Well, well, it's going to cost you.

What we have here is not so much different than the paywalls of an academic journal. Some company running a core service to an altruistic industry and skimming a price. OP is just writing an algorithm to unskim it. And nobody can really use it without making the thing bulletproof lest a physician mistreat a cancer.

To my sentiment this is yet another unethical topic in healthcare. These clunky algorithms, if a physician uses them, slows the process and introduces a potential source of error, ultimately harming patients. Harming patients for increased revenue. The physicians writing and maintaining the guidelines look the other way given they get a paycheck off it, plus the prestige of it all, similar to some scenarios in medicine itself.

The natural thing to do is crack open the database and let algorithms utilize it. This whole thing of dumping data in an abstruse and machine-challenging format and then building a Rube Goldberg machine to reverse the transformation is not right.

Anyway I mention this because there seems to be a thought of "these pdfs are messy lets clean them" without looking at what's really going on here.


OP is talking about the NCCN Guidelines, which don't seem to be available in other formats or via an API. From their website:

NCCN Clinical Practice Guidelines in Oncology (NCCN Guidelines®): The NCCN Guidelines® document evidence-based, consensus-driven management to ensure that all patients receive preventive, diagnostic, treatment, and supportive services that are most likely to lead to optimal outcomes.

Format(s) Available for Licensing: PDF. API not available.


> With properly structured data, machines should be able to interpret the guidelines.

Yeah, right. And then say "Die". /s

The guidelines shall be structured properly. It is not rocket science.



