I enjoy seeing projects like this! If you expand beyond arxiv, keep in mind sinc...

Quizzical4230 · 2024-12-25T15:02:29 1735138949

Thank you for the appreciation and great feedback!

| If you expand beyond arxiv, keep in mind since coverage matters for lit reviews,

I do have PaperMatchBio [^1] for bioRxiv and PaperMatchMed [^2] for medRxiv, however I do agree having multiple sites for domains isn't ideal. And I am yet to create a synchronization pipeline for these two so the results may be a little stale.

| unfortunately the big publishers (Elsevier and Springer) are forcing other indices like OpenAlex, etc. to remove abstracts so they're harder to get.

This sounds like a real issue in expanding the coverage.

| Have you checked out other tools like undermind.ai, scite.ai, and elicit.org?

I did, but maybe not thoroughly enough. I will check these and add complementing features.

| You might consider what else a dedicated product workflow for lit reviews includes besides search

Do you mean a reference management system like Mendeley/Zotero?

[1]: https://papermatchbio.mitanshu.tech/ [2]: https://papermatchmed.mitanshu.tech/

eric-burel · 2024-12-25T15:16:27 1735139787

Unusual use case but I write literature reviews for French R&D tax cut system, and we specifically need to: focus on most recent papers, stay on topic for a very specific problematic a company has, potentially include grey literature (tech blog articles from renowned corp), be as exhaustive as possible when it comes to freely accessible papers (we are more ok with missing paid papers unless they are really popular). A "dedicated product workflow" could be about taking business use cases like that into account. This is a real business problem, the Google Scholar lock up is annoying and I would pay for something better than what exists.

dbmikus · 2024-12-26T01:40:38 1735177238

Hey, I'm not OP, but I'm working on what seems to be the exact problem you mentioned. We (https://fixpoint.co/) search and monitor web data about companies. We are indexing patents and academic papers right now, plus we can scrape and monitor just about any website (some social media sites not supported).

We have users with very similar use cases to yours. Want to email me? dylan@fixpoint.co. I'm one of the founders :)

Quizzical4230 · 2024-12-25T15:56:13 1735142173

This is quite unique. I believe a custom solution might help you better than Google Scholar.

eric-burel · 2024-12-25T20:42:25 1735159345

This can be seen as technology watch, as opposed to a thesis literature review for instance. Google Scholar gives the best results but sadly doesn't really want you to build products on top of it : no api, no scraping. Breaking this monopoly would be a huge step forward, especially when coupled with semantic search.

mattigames · 2024-12-26T00:22:07 1735172527

"|" it's a terrible character for signaling quotes, as it looks a bit too much like "I" or "l" and sometimes even "1" or "i" depending on the font used. I believe the greater-than symbol (>) is better suited for this task.

Quizzical4230 · 2024-12-26T05:08:57 1735189737

So true ;-; I was following the Gmail protocol. I will use > from now on. Happy Holidays :D

zackmorris · 2024-12-26T16:34:41 1735230881

Edit: I moved this here from top level.

The Cloudflare challenge screen at the beginning is a dealbreaker.

Random question - does anyone know why so many papers are missing from ArXiv? Do they need to be submitted manually, perhaps by their author(s)? I'll often find papers on mathematics, physics and computer science. But papers on biology, chemistry and medicine are usually missing.

I think a database of all paper ids in existence and where they're posted or missing could be at least as useful as this. Because no papers written with any level of public funding (meaning most of them) should ever be missing.

Quizzical4230 · 2024-12-26T19:01:50 1735239710

> The Cloudflare challenge screen at the beginning is a dealbreaker.

I understand your concern, however, I do not have the know-how to properly combat bots that keep spamming the server and this seemed the easiest way for me to have a functional site. I would love to know some resources for beginners in this regard, if you have them.

>Random question...

arXiv is generally for submitting CS, maths and physics papers. There are alternate preprint repositories like biorxiv.org, chemrxiv.org and medrxiv.org for such purposes. Note: arxiv is the largest, in terms of papers hosted, among these.

zackmorris · 2024-12-27T19:33:06 1735327986

Edit: thanks for those links! I'm somewhat out of the loop academically, so have been relying on search engines whose quality seems to be in decline.

-

Combatting bots with the Cloudflare challenge screen is an X/Y problem.

The central issue is that the web has been rolled out improperly, and the way that we build websites is incorrect. The web should have been decentralized, meaning that all public-facing pages would be public domain and hosted on a peer to peer (P2P) network that grows more powerful with the number of users, similarly to how BitTorrent works. We wouldn't concern ourselves with servers at the edge, since they would already be distributed around the world and implement the caching strategies that are already part of HTTP.

Which means for example that regions in AWS would be unnecessary, and Cloudflare and other content distribution networks (CDNs) would have no business model. Coral CDN was a free working example of automatic caching that ran up until a few years ago:

https://wiki.opensourceecology.org/wiki/Coral_CDN

https://en.wikipedia.org/wiki/Coral_Content_Distribution_Net...

https://cachedview.com

https://news.ycombinator.com/item?id=19020978

Note how it's mostly been erased from history due to ensh@ttification by FAANG.

It also means that web technologies we think of as core to how external resources are included are also incorrect. Rather than Cross-Origin Resource Sharing (CORS), we should be using Subresource Integrity (SRI). That would allow us to include scripts and other media files by hash instead of just location. That also removes most of the need for build processes like Webpack, Grunt, Gulp, etc, since scripts would import other scripts directly and let the Just in Time (JIT) compiler decide what is needed.

I can go on pretty much forever with this. In 1995 I was a student at the University of Illinois in Urbana-Champaign (UIUC) where NCSA Mosaic was developed, which Netscape copied the year before when it took the internet mainstream. Stuff like Server-Side Includes (SSI) showed promise in avoiding build tools by letting developers reuse code from other servers. But there wasn't full understanding then of how hashing makes strong security guarantees. In the meantime, Marc Andreessen and other billionaires took the quick and easy path, rolling out easier (but not simpler) technologies that maximize short-term profits instead of long-term prosperity and ease of maintenance through automation.

Without a true distributed web, the endgame of all this looks like what we're seeing today. Sites that can't be scraped by alternative search engines or machine learning tools. Sites that can't be viewed securely or anonymously with Tor Browser. Sites that keep everything behind a paywall or in walled gardens, which will cause most of today's human-produced media to eventually be lost to the digital dark age.

Fixing all of this is straightforward, but it would probably require us to return to traditional values. Basically contributing some of our incomes to universities and other institutions via our taxes, so that they can work to protect the interests of the masses, who have no benefactor because it's not profitable to help them.

Billionaires and other moneyed interests don't want this, so have done everything in their power to dismantle the commons, not just on the web, but through regulatory capture to sell off public lands and other resources currently owned by everyone:

https://www.snopes.com/fact-check/elon-musk-stop-donating-wi...

Which means that this is really a cultural issue, so many of us can't see the problems or solutions without challenging our most closely-held beliefs, which creates cognitive dissonance. So even though the fixes appear obvious, they are effectively out of reach for the foreseeable future because it's easier to sabotage the system than reform it.

None of this helps you immediately though. You might be able to move from Cloudflare to a free and open source alternative like CloudFIRE, although it looks like they are copying many of its same mistakes, for example "fake browser detection and blocking" which is at the top of their list of priorities:

https://github.com/coinkite/cloudfire

I'm having trouble finding other alternatives:

https://news.ycombinator.com/item?id=34800182

So this is what I mean. If you are really interested in empowering large groups of people with free access to information, then you will be running up against the full might and momentum of the status quo.

Something that gives me hope is that most hackers and makers were originally drawn to tech as a lifeline out of subjugation doing mundane and pointless work. Tech is inherently antiauthoritarian. So all it would take is a single wealthy individual, a single internet lottery winner, to fund efforts to reevaluate what underpins the status quo from first principles. It might not take much to deliver tech which can't be unseen, which routes around artificial scarcity. We can imagine providing resources through automation, outside of any profit motive. Until then, large groups of individuals will have to keep contributing to these efforts on their own dime at a snail's pace, with what little motivation they have left after working their lives away to make rent and enrich the already wealthy.

Apologies for the wall of text, but it's the holidays so why not.

shishy · 2024-12-26T19:58:31 1735243111

There are other preprint servers. But to your question, there are centralized indices that track all papers.

DOI is the primary identifier and preprints are also issuing them now.

Crossref has papers by DOI. OpenAlex and SemanticScholar also have records, with different id types supported (doi, pmid, etc).

immibis · 2024-12-26T15:53:45 1735228425

There's always [redacted due to copyright infringement policy].se?