
"Overview of SHARD: A System for Highly Available Replicated Data" is the first paper to introduce the concept of database sharding. It was published in 1988 by the Computer Corporation of America.

It is referenced hundreds of times in many classic papers.

But, here's the thing. It doesn't exist.

Everyone cites Sarin, DeWitt & Rosenb[e|u]rg's paper but none have ever seen it. I've emailed dozens of academics, libraries, and archives - none of them have a copy.

So it blows my mind that something so influential is, effectively, a myth.




Huh. OK, here's something that might be interesting. I found another paper[1] that cites SHARD, but the citation is slightly different. Instead of being a CCA memo, it shows it as a Xerox memo:

Sunil Sarin, Mark DeWitt, and Ronni Rosenberg, "Overview of SHARD: A System for Highly Available Replicated Data," Technical Report 162, Xerox Advanced Information Technology (May 1988).

EDIT:

OK, I think I get this now. I had read the Wikipedia blurb about CCA being acquired by Rocket earlier, but only just now did I keep reading further down to find this bit:

in 1984, CCA was purchased by Crowntek, a Toronto-based company.[8] Crowntek sold Computer Corporation of America's Advanced Information Technology division to Xerox Corporation in 1988.[9] The balance of CCA was acquired by Rocket Software, a Boston-based developer of enterprise infrastructure products,[2] in April 2010.

So it seems like the portion of CCA that would be of interest here is probably the bit that went to Xerox. Maybe somebody at Xerox could help turn up the missing document?

I doubt it will help, but I took a stab at pinging them on <strike>Twitter</strike> X.

https://fogbeam.com/tweet_xerox_cca_paper.png

[1]: https://apps.dtic.mil/sti/pdfs/ADA209126.pdf


I found LinkedIn profiles for Sunil Sarin, Mark Dewitt, and Ronni Rosenberg who all worked at CCA during this time period.

I've gone ahead and sent them each a message asking if they might be able to make the paper available.

If you'd like to get in contact with them yourself and are having trouble finding their LinkedIn, shoot me an email and I'll be happy to provide you links.


I received a response from Dr. Rosenberg:

> Yes, I was involved, 35 years ago! I believe it was an internal CCA paper. I don't have a copy and I have no idea how to get it. Sorry about that. It does seem to be the earliest reference to "shard" in the DB context. (The other early reference pointed to in Wikipedia is from much later, 1997.)

> Fortunately, you need not go back 35 years to read about sharding; it's easy to get current info. Cheers.

I've now sent a message to Andy Youniss, CEO of Rocket Software to see if he can help.


> I've now sent a message to Andy Youniss, CEO of Rocket Software to see if he can help.

I suspect that if that memo lives on anywhere, it's somewhere in the bowels of Xerox. I say that based on observing that:

1. In the paper by Ronni L. Rosenberg at https://apps.dtic.mil/sti/pdfs/ADA209126.pdf the citation to the SHARD paper changed to

Sunil Sarin, Mark DeWitt, and Ronni Rosenberg, "Overview of SHARD: A System for Highly Available Replicated Data," Technical Report 162, Xerox Advanced Information Technology (May 1988).

2. Per Wikipedia[1] Crowntek sold Computer Corporation of America's Advanced Information Technology division to Xerox Corporation in 1988.

To me this suggests that it was the "Advanced Information Technology division" specifically which would have had the paper in question, and that bit of CCA wound up with Xerox.

That said, it can't hurt to reach out to anybody connected to this in any way. You never know who will wind up "knowing a guy who knows a gal, who knows a ..." or whatever.

[1]: https://en.wikipedia.org/wiki/Computer_Corporation_of_Americ...


I also sent a note to another former Computer Corporation of America employee that I found on LinkedIn. I don't know them personally, but we live near each other and have some common connections, so maybe they will at least receive my unsolicited message with some favor.


FWIW, I did get a response from the individual mentioned above. However they were unable to help with finding a copy of the paper. The search continues...


Someone who isn't me is getting in touch with a computer science librarian at a top 10 university about this paper. I'll update here with whatever I (don't) learn.


No dice. The search continues..


Going through the bibliography of other people's papers and theses, looking for papers that you better cite "for good luck", or because "you gotta cite that one" is a classic PhD student behavior (I've done it) and it's not terribly surprising that something like this can happen. In fact I'd expect it to be much more widespread...


Feeling the need to cite a particular work out of convention or for social reasons is understandable and very common AFAIK, but I'd consider it a part of academic rigour to at least take a look at the work one is citing. Blindly citing without ever even laying one's eyes on the work doesn't sound quite right.

Of course if nobody can get their hands on a particular work, as seems to be the case here, that makes things kind of hard. But I'd expect most works you need to cite in a fast-moving field such as CS to be available at least somewhere, even if it takes a bit of effort.


One of the only notes I got from my MS thesis defense was one of the professors being annoyed that I had cited a result from someone else's paper that he had reported (effectively the same but derived differently and less conclusively) in one of his own papers. I added a note referring to his result and a citation to his paper and everybody went home happy.


Sometimes it's just an attempt to do the due diligence of citing the primary reference rather than the reference that cited the reference that cited the primary reference. I experienced this recently with a very widely cited basic fact in a very hard-to-come-by technical report from the early 80s. I'd have bet dollars to cents it did in fact contain the statement everyone claimed it did. However, just in case, I made an inter-library loan request to actually read those couple of sentences.


> Sometimes it's just an attempt to do the due diligence of citing the primary reference rather than the reference that cited the reference that cited the primary reference.

True, but in this case you should include both references.


Funny, in my high school literature class I clearly remember being chastised for having sources in my works cited but not warranting their inclusion with an actual reference in the work.

It's kind of wild that LLMs and other models/sequences will be able to quickly suss out which papers have high levels of referential integrity.


>which papers have high levels of referential integrity

The problem with this, as I see it, is that there could be references that are cited due to their influence on the thought process/writing process of the work - thus citing them gives contextual zeitgeist - and this is something that AI would not be able to muster...

So a LACK of referential integrity should show that it is written by a human as opposed to an AI.


LaTeX will only include the sources in the bibliography if they are used at least once in the document.


I think this is your blog, so I'm going to post it here because it is so great to look at!

https://shkspr.mobi/blog/2021/06/where-is-the-original-overv...


> It is referenced hundreds of times in many classic papers.

According to Google Scholar, it's cited a measly 11 times.

https://scholar.google.com/scholar?cites=1491448744595502026...


While it may be fair to say the number isn't "hundreds" (or maybe it is?) I will say that I'd take that Google Scholar number with a grain of salt. Just poking around looking for stuff in the spirit of this sub-thread, I've found 4 or 5 additional documents that cite the SHARD paper, and which aren't on that Google Scholar list.

I've found that Google Scholar's coverage gets a little sketchy on older stuff, and since we're talking way back in the 1980s here, I don't think it's surprising that some things are missing.


I am infinitely disappointed to discover you are also the only person who seems to care about this online. I found a website, but it’s you apparently.

Now I’m going to be bugged by this too! Great trivia, and also a heck of a mystery.


It seems like he is not the only one (in case he did not use a pseudonym here): https://en.wikipedia.org/wiki/Talk:Shard_(database_architect...


(Replying directly to increase the chances you see this.) I received a couple more responses, from Sunil Sarin and from Mark Dewitt. Sunil had this to say, which was certainly surprising to me:

> John, I'll have to look in my hard copy archives, which are very disorganized and will take me some time. But I don't believe you need this exact paper. I can probably point you at other papers that were in fact published (and are not just company TRs).

> Furthermore, our concoction of the acronym SHARD is different from "database sharding" as currently used. We were referring to the network, not the data, being "sharded", i.e., partitioned, and ensuring high database availability (with some loss of consistency - see "The CAP Theorem").

Mark Dewitt shared:

> Hi John, I believe that was an unpublished report we wrote for our DARPA funding agency. I may still have a copy, but it's currently buried under some stuff in the garage. There is at least one published paper that describes the SHARD architecture for a partially replicated database. It's not entirely obvious because I think the SHARD acronym didn't start getting used until after the paper was published in 1985. By 1987, Sunil Sarin had published another paper with Nancy Lynch in which they refer to SHARD and reference the 1985 paper.

> Here are the two citations:

> S. K. Sarin and N. A. Lynch, "Discarding Obsolete Information in a Replicated Database System," in IEEE Transactions on Software Engineering, vol. SE-13, no. 1, pp. 39-47, Jan. 1987, doi: 10.1109/TSE.1987.232564.

> S. K. Sarin, B. T. Blaustein and C. W. Kaufman, "System architecture for partition-tolerant distributed databases," in IEEE Transactions on Computers, vol. C-34, no. 12, pp. 1158-1163, Dec. 1985, doi: 10.1109/TC.1985.6312213.


> It is referenced hundreds of times in many classic papers.

Wait, you mean people include papers they haven't even opened in their references?!


Happens far more than you think. It can be an innocent (sort of) mistake, where authors see the citation in a previous paper and simply copy it into their own.


This feels like something that would, at large scale, be unhealthy for science as a whole. While existing papers have already gone through their own quality checks, this enables bad, misleading, or false statements to propagate which can end up being a blow to the credibility of the entire model. Shouldn't there be an ethical duty to do one's due diligence?


To be honest, this is probably one of the less unhealthy common behaviours in academia :^)


> where authors see the citation in a previous paper and simply copy it into their own.

Without opening the paper to even read the abstract? To me it doesn't sound “innocent” at all, and borderline malpractice…


There's a crowd who tilt the other way -- if I might possibly have hinted at the idea before you then it's borderline malpractice to not reference me. In many fields it's common to directly reference what appear to be the bigger transitive references then, even if they didn't directly influence this work in particular. I'd personally want to see a twidge more evidence before bringing out the pitchforks.


Funny. I read your previous comment (with the ?! ending) as sarcastic. Now I see you were serious. I would be astonished if most authors have actually read even a fraction of the sources they cite.


It's innocent in the sense that it's (usually) not checked out of laziness/complacency as opposed to malicious and intentional citation fraud.


This is really interesting, thanks for posting it here!

On a semi-related topic, I love mysteries like these - mystery songs, those Japanese kanji in Unicode that nobody knows what they mean or where they came from, paper towns on maps.

If anyone else has anything else to read along similar lines, please post it!


Is this an example of reference rot or did it never exist?


If it did exist, there's some delicious irony in an original paper on replicating data in a highly-available manner being lost.


An endless runaway replica wave. Poetic.


Here's something that seems related. Maybe one of these authors would have a copy of the other paper? Not sure if they would be among the set of folks you've already tried or not...

https://apps.dtic.mil/sti/pdfs/ADA171427.pdf


Interesting, thanks. Wonder if it was a coincidence that UO adopted that terminology, or if somebody there knew about this.


It's a good question. I mean, it's not entirely out of the question that the UO folks independently developed the same term without being aware of the SHARD research. But OTOH, it's entirely possible they were aware. Without talking to somebody that was there, I doubt we'll ever know for sure.


Raph Koster claims parallel invention of the term "shard" for separate worlds in MMOs. https://www.raphkoster.com/2009/01/08/database-sharding-came...


They should have kept the paper as highly available replicated data.


I guess I'm not too surprised, this seems like a corporate tech report. Some companies were good at having public archives of these (like Bell Labs) but I'm sure it takes a lot of resources to keep that up. It's essentially some company's internal Wiki page.


1988? As far as anyone can tell, the use of the term "shard" in the context of database replication originated with Ultima Online, which was released in 1997, and which used the term in connection with its underlying mythos (the idea of representing world instances as shards of Mondain's shattered gem).

So a documented reference to sharding that's earlier than that would be interesting to see.

(Disagree? Instead of downvoting, consider posting a citation that actually resolves to a real paper.)


You can find reports from before 1988 mentioning SHARD being in development, like the one from June 1986 linked in this sister comment: https://news.ycombinator.com/item?id=36849634


Some more, from 1989[1][2][3]. Which again, reference the missing "SHARD" paper, but contain enough detail to make it clear that the idea of SHARD existed, regardless of the status of that particular document.

[1]: https://apps.dtic.mil/sti/tr/pdf/ADA214478.pdf

[2]: https://apps.dtic.mil/sti/tr/pdf/ADA216523.pdf

[3]: https://apps.dtic.mil/sti/tr/pdf/ADA209437.pdf


(I didn't downvote your comment)

"SHARD" is the name of the software - it was common back then to name systems using acronyms. It's not clear whether the paper/report actually uses the term "shard" in the sense that it is now used in distributed systems, or even whether it uses it at all.


One of the related papers I stumbled across, while not the SHARD paper, does go into a fair amount of detail about SHARD and the problem they were trying to address. One bit of verbiage here might be illuminating:

The new SHARD (System for Highly Available Replicated Data) system under development at Computer Corporation of America (CCA) is designed to address the problems described above. It provides highly available distributed data processing in the face of communication failures (including network partitions). It does not guarantee serializability, nor does it preserve integrity constraints, but it does guarantee many practical and interesting properties of the database.

The reader is referred to [SBK] for a detailed description of the architecture of the SHARD system. Briefly the main ideas are as follows. The network consists of a collection of nodes, each of which has a copy of the complete database. (Full replication is a simplifying assumption we have used for our initial prototype, many of our ideas seem extendible to the case of partial replication, but this extension remains to be made.) Replication allows transactions to be processed locally, thus reducing communication costs and delays, and providing high availability.

So it sounds to me like their main concern was availability through replication, and not so much horizontal scalability (which seems to be more the "point" of modern day "sharding"). Yet I would probably claim that there is enough conceptual overlap to say that SHARD does relate to the modern use of sharding in some sense. Although it's hard to be sure without that original paper.
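
To make that distinction concrete, here's a minimal Python sketch contrasting the two placement strategies. It's an illustration only and isn't taken from any of the SHARD papers: the node names, the function names, and the use of SHA-256 for hash partitioning are all my own assumptions.

```python
import hashlib

# Hypothetical three-node cluster; the names are made up for illustration.
NODES = ["node-a", "node-b", "node-c"]


def replicated_placement(key: str) -> list[str]:
    """Full replication (the SHARD prototype's stated assumption):
    every node holds a copy of every key."""
    return list(NODES)


def sharded_placement(key: str) -> list[str]:
    """Hash partitioning (one common form of modern "sharding"):
    each key lives on exactly one node, chosen by hashing the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(NODES)
    return [NODES[index]]


def readable_nodes(placement: list[str], reachable: set[str]) -> list[str]:
    """Nodes that can still serve the key when only `reachable` nodes
    can be contacted, e.g. during a network partition."""
    return [node for node in placement if node in reachable]


if __name__ == "__main__":
    reachable = {"node-b"}  # pretend a partition cut off the other two nodes
    key = "customer:42"
    print("replicated:", readable_nodes(replicated_placement(key), reachable))
    print("sharded:   ", readable_nodes(sharded_placement(key), reachable))
```

With full replication the key stays readable from whichever side of the partition you're on, at the cost of storing everything everywhere. With hash partitioning each node only stores a slice of the data, but the key is reachable only if its one owning node is.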


This is wild if true. Surely someone has to have a copy of this. How is it even being referenced if it is non existent?


I don't know why parent comment is stirring up drama but:

1. Not available online doesn't mean the paper's existence is made up. It's a very bold claim to make for the authors that they cite work that is fabricated.

From the available information, this looks like a technical report by a, probably now defunct, company back in the 80s. If this was its only form of publication, and it did not appear in some conference proceedings for example, it would only be available in select university libraries as a physical copy. But most important,

2. This isn't even as impactful a paper as the parent comment states. Or if its proposed concept is, the original idea is probably derived from some other paper that is indeed the one that is highly cited and most definitely available online.

The cumulative citation count from Google Scholar and IEEE Xplore doesn't exceed fifteen for this particular paper, though.

https://scholar.google.com/scholar?cites=1491448744595502026...


> Not available online doesn't mean the paper's existence is made up.

True, but note that the post you're referring to does say:

> I've emailed dozens of academics, libraries, and archives - none of them have a copy.

So this isn't somebody just saying "I couldn't find it with Google, therefore it doesn't exist."

> From the available information, this looks like a technical report by a, probably now defunct, company back in the 80s.

Yeah, I think that's the key point. An internal technical memo from a private company, from that far back, isn't likely to be easy to find. It's quite possible that it's never been digitized and put on the 'net, and if it wasn't published in a journal, it may never have been archived by any university libraries or such-like.

That said, I'd be a little surprised if a copy didn't turn up somewhere, even if it means a former employee of CCA finding a copy in a desk drawer and providing it. But who knows?


Still searching then?

https://shkspr.mobi/blog/2021/06/where-is-the-original-overv...

I can only find the Oracle reference to Sharding, which might be the same thing or not. https://docs.oracle.com/en/database/oracle/oracle-database/1...

Along with the Wikipedia reference. https://en.wikipedia.org/wiki/Shard_(database_architecture)

And a Science Direct reference. https://www.sciencedirect.com/topics/computer-science/shardi...

Along with Facebook's reference. https://engineering.fb.com/2020/08/24/production-engineering...

And Wolverhampton's reference to Oracle Sharding. http://ora-srv.wlv.ac.uk/oracle19c_doc/shard/sharding-overvi...

And Amazon's. https://aws.amazon.com/what-is/database-sharding/

So is the original paper a myth or was/is this demonstrating the closed circuit nature of the dissemination of knowledge?

How many different ways do you cut up the data?


It's funny in that it was such a pivotal work looking back at it. It's possibly in some filing cabinet at Xerox or elsewhere (from what I gathered in the thread). That said, it may have simply seemed somewhat obvious to many at the time and nobody bothered to hold on to or archive it. I suspect it's true of a lot of now commonplace software design strategies.


> it may have simply seemed somewhat obvious to many at the time and nobody bothered to hold on to or archive it.

I suspect this has been the case for many, especially last century. Everything was moving so fast and there was so much choice; it really was a free-for-all with no major players dominating the marketplace, and some people didn't know the significance of what they were doing or building. People then retire or get moved to different projects, and the knowledge gets lost.


That's pretty interesting actually. Someone must have a copy somewhere. Seems like a real failure of scholarship if it's truly lost, and a serious argument against walled-garden-style publishing.


Have you contacted authors of papers citing the one you're looking for, especially of papers that appeared shortly after / in the 90s? Maybe one of them still has a paper copy lying around somewhere.

I was in a similar situation before with some math paper from the 50s that's nowhere to be found (neither online nor in library indices) and you'd be surprised how many professors still use paper copies.


Do Sarin, DeWitt & Rosenb[e|u]rg exist? Are they still alive? Tracking them down and going directly to the source would seem to be the way to go. Perhaps even enlisting some "big names" in the industry to ask around?


Which papers cite it? Are they old papers, when perhaps it still existed, or recent ones?

Very interesting either way!


Wow this is fascinating. Are any of these authors still alive?



