Cayley – An open-source graph database (github.com/cayleygraph)
203 points by iamjeff on April 8, 2017 | 68 comments



Maintainer here, good to see Cayley on HN again :)

We've got a lot of new features on master (GraphQL support, Gephi interfaces, recursive iterators, etc.) and are cutting a release next week.

Active work in the coming releases on tightening down the indexing and really bringing it into prod.

EDIT: Feel free to join the new Slack or the Discourse mailing list/discussion board!


Do you have documentation on how to get started with Cayley? I was reading this -> https://cayley.io/1-getting-started/ It's a little bit too short for my taste. :)



What's the difference between Cayley and, let's say, OrientDB / ArangoDB?


One difference is that Cayley and ArangoDB don't provide index-free adjacency (IFA), while OrientDB does. It's arguably a moot point: I'm not sure it's even worth considering, since hash indexes are quite fast and give O(1) lookups.
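
To make the distinction concrete, here is a minimal Python sketch. It is not code from Cayley, ArangoDB, or OrientDB, and every class and function name is hypothetical. With index-free adjacency each vertex holds direct references to its neighbours; with an index, neighbours are found through a hash lookup keyed by vertex id. Both cost O(1) expected time per hop, which is the commenter's point.

```python
from collections import defaultdict


class Vertex:
    """Index-free-adjacency style: neighbours are direct references."""

    def __init__(self, name):
        self.name = name
        self.out = []  # direct pointers to neighbouring Vertex objects


def neighbours_ifa(vertex):
    # One hop = follow the pointers stored on the vertex itself.
    return [v.name for v in vertex.out]


class HashIndexedGraph:
    """Index-based style: edges live in a hash index keyed by vertex id."""

    def __init__(self):
        self.edges = defaultdict(list)  # vertex id -> list of neighbour ids

    def add_edge(self, src, dst):
        self.edges[src].append(dst)

    def neighbours(self, vertex_id):
        # One hop = one hash lookup, O(1) expected time.
        return list(self.edges.get(vertex_id, []))


alice, bob, carol = Vertex("alice"), Vertex("bob"), Vertex("carol")
alice.out.extend([bob, carol])

g = HashIndexedGraph()
g.add_edge("alice", "bob")
g.add_edge("alice", "carol")
```

Both `neighbours_ifa(alice)` and `g.neighbours("alice")` return the same neighbour list; the difference is only whether a lookup structure sits between a vertex and its edges.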


If you want help with a Cassandra backend, hit me up (my HN username @ gmail or apache)


Emailed!


What's the difference between Cayley and Neo4J?


Neo4j has control end to end. Cayley's storage is handled by other databases. You can say one is a graph database and the other is a graph library on top of another database. Same as Titan, now Janus.


Think about Datomic. It can use Riak as a backend or something else, but it is still seen as Datomic to the outside world. (You can also compare it to Lucene vs Solr.)


Wait, Datomic with Riak as the backend? That seems a ridiculous level of abstraction, since Riak itself uses pluggable backend stores (Bitcask or LevelDB). Like creating a programming language whose interpreter runs on the JVM.


If you add a newline after the </p> in the README.md that should fix your links.

Cool to see cayley on HN again :) Pretty excited to use it some time.


Hah! Done, thank you! That probably changed when we added the Slack link


We have Elasticsearch as a generic document search engine; each document has a non-trivial number of properties (let's say 50 or so). It's incredibly performant for all sorts of searches, the details of each of which this solution wasn't specifically designed for, hence me calling it a generic search engine. Every time I contemplate bringing the graph relationships that exist between these documents into the mix, I get stuck. Elasticsearch doesn't quite do graph (natively), but the graph databases I tried don't do properties too well (OrientDB, Neo4J). I'm not talking about one or two properties, but multiple properties across multiple hops in the graph that I envision querying for. Let alone full text searches. I emailed back and forth with the helpful folks at Orient, but it always came down to optimising for specific queries, and the "generic" was gone. Is anyone solving that problem? Cayley?


(author of Dgraph)

Dgraph's retrieval is pretty fast, so looking up properties is trivial. It also supports indexing various data types: supports full-text search, term matching and regexps on strings, inequality and sorting on ints, floats, dates etc. https://docs.dgraph.io/v0.7.4/query-language/#functions

One of our users is on the path to switch to Dgraph from Elastic Search. So, I'd say try Dgraph out and see if that'd help your use-case. I think it should. And if Dgraph is missing something that you need from ES, feel free to file an issue. Happy to address it.


Worth taking a look at https://arangodb.com or http://janusgraph.org/ (Fork of Titan graph db)


Sounds like something ArangoDB could be a good solution for. Full disclosure I'm from ArangoDB team and happy to help. If you like just drop me a line to jan.stuecke (at) arangodb.com


Being a graph database user, I always have to manage a replica of the "molecules" of my graph in ES for a user-friendly search experience. Can ArangoDB help with such a use case? Or maybe Dgraph?


With ArangoDB you can choose between synchronous replication and asynchronous. With the Agency of ArangoDB you also have a RAFT based consensus protocol which holds the state of the cluster. My team mate wrote a nice article about our approach. You might want to have a look: https://www.arangodb.com/2017/01/reaching-harnessing-consens.... In single instance you have full transactional guarantees with multi collection and multi document transactions. In cluster mode we provide single document transactions. More guarantees will follow.


Dgraph does automatic replication to provide fault tolerance. It's baked pretty deep into Dgraph. We use a consensus algorithm, so all your writes are atomically consistent (not eventually consistent). https://docs.dgraph.io/v0.7.4/deploy/


If you'd be okay with a directed acyclic graph, then SQL can work. Basically "modeling trees in SQL".

Specifically, you could use postgresql for the edge traversing and its jsonb column to store searchable attributes.
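
A minimal sketch of that idea, with two loud assumptions: SQLite (via Python's `sqlite3`) stands in for PostgreSQL so the example is self-contained, and the table names, column names, and sample rows are all made up. In PostgreSQL the `attrs` column would be `jsonb` and the attribute filter could move into the SQL itself; here it is done in Python.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, attrs TEXT);  -- attrs holds JSON
    CREATE TABLE edges (src INTEGER, dst INTEGER);
""")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?)",
    [
        (1, json.dumps({"title": "root", "lang": "en"})),
        (2, json.dumps({"title": "child", "lang": "de"})),
        (3, json.dumps({"title": "grandchild", "lang": "en"})),
        (4, json.dumps({"title": "unrelated", "lang": "en"})),
    ],
)
conn.executemany("INSERT INTO edges VALUES (?, ?)", [(1, 2), (2, 3)])

# Recursive CTE: collect node 1 and everything reachable from it.
rows = conn.execute("""
    WITH RECURSIVE reachable(id) AS (
        SELECT 1
        UNION
        SELECT e.dst FROM edges e JOIN reachable r ON e.src = r.id
    )
    SELECT n.id, n.attrs FROM nodes n JOIN reachable r ON n.id = r.id
    ORDER BY n.id
""").fetchall()

# Attribute filter applied in Python; with jsonb it could be a WHERE clause.
english = [node_id for node_id, attrs in rows if json.loads(attrs)["lang"] == "en"]
```

The traversal only ever touches the narrow `edges` table, while the wide, searchable attributes stay on `nodes`.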


I guess you've considered doing the graph search and the text search separately and joining the results?


Yes, but I can easily reason myself into scenarios where the intermediate result would be prohibitively large.


What exactly is this? The GitHub page speaks of different backends, and those appear to just be databases or key-value stores in themselves (e.g, Postgres and Bolt).

Is Cayley basically a query rewriter? That is, it has some tables in the backend and, when queried, Cayley then goes to the "real" (for lack of a better word) database? Cayley's query language might be more full featured, but it isn't a storage mechanism in itself?

There are two things from that:

1. There is no way for Cayley to take the graph structure of the data into account when laying it out on disk or when executing the query. Is this the long-term decision, or is this just a stop-gap until a storage mechanism can be done?

2. This would seem to imply that the abstraction layer from Cayley to the backend storage would be relatively slim. How difficult is it to add another storage driver for another SQL database or for one with a custom query language?

Another thing I noticed:

> query -- films starring X and Y -- takes ~150ms

Even on two year old hardware that seems dog slow - less than 7 queries a second - for a very simple query.


Cayley's graph data layout is most similar to a Hexastore-style [1] triple store, though IIRC it doesn't do the full six-way index that the original Hexastore paper describes. The Redis page on secondary indexing [2] has a great quick intro to what this actually entails (search the page for Hexastore).

As you might guess from the Redis link, this style of graph lends itself well to KV stores, so the answer to your question #1 might be that it's a long-term decision, but the style of graph is really designed for a KV store anyway. But I haven't discussed this at all with the Cayley devs so I can't actually speak for them.

I'm using it with the BoltDB backend and have been pleased with the performance overall. I haven't looked at the backends for more complex databases like Postgres in detail, but it does appear that the backend interface has potential for predicate pushdown as well. The repository's graph directory [3] contains the various backends if you want to check it out. Overall it doesn't look very difficult to add another backend type, but I haven't tried it yet. Looking at the existing SQL backend, it appears to already support MySQL, PostgreSQL, and CockroachDB (but I've tried none of these with Cayley).

[1] http://www.vldb.org/pvldb/1/1453965.pdf [2] https://redis.io/topics/indexes [3] https://github.com/cayleygraph/cayley/tree/master/graph
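
For readers who have not met the Hexastore idea, here is a small Python sketch of it. This is not Cayley's actual layout (the comment above notes Cayley does not keep all six indexes); the class, its methods, and the sample data are hypothetical. Each triple is stored under all six orderings of subject, predicate, and object, so any pattern with some positions bound becomes a prefix scan over one sorted index, which is exactly what an ordered KV store is good at.

```python
import bisect
from itertools import permutations

ORDERINGS = ["".join(p) for p in permutations("spo")]  # spo, sop, pso, pos, osp, ops


class Hexastore:
    def __init__(self):
        # One sorted list of re-ordered triples per ordering.
        self.indexes = {name: [] for name in ORDERINGS}

    def add(self, s, p, o):
        triple = {"s": s, "p": p, "o": o}
        for name, index in self.indexes.items():
            key = tuple(triple[c] for c in name)
            pos = bisect.bisect_left(index, key)
            if pos == len(index) or index[pos] != key:
                index.insert(pos, key)  # keep sorted, skip duplicates

    def match(self, s=None, p=None, o=None):
        bound = {k: v for k, v in {"s": s, "p": p, "o": o}.items() if v is not None}
        # Pick an ordering whose leading positions are exactly the bound ones.
        name = next(n for n in ORDERINGS if set(n[:len(bound)]) == set(bound))
        prefix = tuple(bound[c] for c in name[:len(bound)])
        index = self.indexes[name]
        start = bisect.bisect_left(index, prefix)
        results = []
        for key in index[start:]:
            if key[:len(prefix)] != prefix:
                break  # left the prefix range
            row = dict(zip(name, key))
            results.append((row["s"], row["p"], row["o"]))
        return results


store = Hexastore()
store.add("alice", "follows", "bob")
store.add("carol", "follows", "bob")
store.add("alice", "likes", "dgraph")
followers_of_bob = store.match(p="follows", o="bob")
```

The price of this flexibility is write amplification: every triple is written six times, which is one reason an implementation might keep only a subset of the orderings.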


Speaking of which, take a look at this Redis module that marries Hexastore and neo4j-like queries: https://github.com/RedisLabsModules/redis-module-graph


You can't derive rate from latency alone. The query might be IO bound. If that's the case, you can run queries concurrently.
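
As a rough illustration, Little's law relates the two: throughput equals the number of requests in flight divided by latency. The sketch below assumes latency stays at 150 ms as concurrency rises, which only holds while the backend is not saturated; the concurrency of 32 is an arbitrary example value, not a measurement.

```python
def throughput_qps(latency_seconds, concurrency=1):
    """Little's law rearranged: throughput = requests in flight / latency."""
    return concurrency / latency_seconds


serial = throughput_qps(0.150)                    # one query at a time
parallel = throughput_qps(0.150, concurrency=32)  # 32 queries in flight
```

One query at a time gives the "less than 7 queries a second" figure from the parent comment; with 32 in flight the same latency would correspond to a bit over 200 queries a second.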


I tend to care more about latency than throughput. FedEx has the best throughput in the world if you can wait a couple days.

True though. I'm guessing with a latency number like that, the throughput is pretty bad too.


Exactly what https://dgraph.io is built for. Low latency, high throughput.


Looks interesting.

That semi-declarative query language is hideous though. Reminds me of attempts to make XML into a programming language.


(author of Dgraph)

It's based on GraphQL, which is definitely not a hack like XML => language and is currently catching on faster than a forest fire. Dgraph's derivation, GraphQL+- (for lack of a better name), is a lot more powerful than Cypher or Gremlin. Both of the latter allow only returning lists of results, while GraphQL+- returns an entire subgraph. Thus, all the relationships are maintained. It also allows expressing complex joins using function-like variable blocks, feeding results from one into another like you'd do in any popular language.

You can try running some complex queries with GraphQL+-, it might change your view. http://play.dgraph.io/


Cool af

Edit: I still think it looks hideous. Def try to figure it out more tomorrow.


Good work! An alternative is Dgraph https://dgraph.io/ which I am considering for my next project.


Note that they changed the license from APL to AGPL just recently (13 days ago, based on their commits).


Here are my experiences with Cayley. 100% positive for building graph microservices.

1. Use Cayley as a library 2. Put metadata in separate nodes.


This is the approach I ended up using too. Works great.


Anyone use Cayley in prod? An old job used Neo4j, and the graph concept was great for specific use cases. As a lightweight graph store, Cayley was really exciting when it came out, but I haven't had a need for it since I left that job. It strikes me as really well made, and I'd love to hear any war stories.


Tried to use it in production a couple of years ago hosting a mirror copy of Freebase with mixed results:

- There were a couple of issues loading the data, which we fixed and contributed back as patches

- Loading the data was really slow, and it got slower every time a new entry was added (Loading the full freebase dump required 1 week on a very beefy machine with SSD. Used LevelDB)

- Then the queries were relatively slow. Without going too much into details, we were using the data to analyze texts and extract entities, and the relationship between them, and even parallelizing the queries, they were relatively slow (depending on complexity between 0.1 and 1 sec on average). We solved the issue implementing a robust caching layer in front of it and carefully planning the queries.

- In general, it was stable and performant enough for a backend service. But we were really pushing the envelope of what it could do.

All in all, I would say that I was happy with it. In comparison, I tried a year earlier to use Neo4J in a similar role and I gave up after 2 weeks because I wasn't even able to get it to load part of the dataset without crashing on similar hardware.
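
The comment does not describe how its caching layer was built, so the following is only a generic sketch of the idea using Python's standard `functools.lru_cache`. `run_graph_query` is a hypothetical stand-in for a slow round trip to the graph store, and the call counter exists only to show that the second lookup never reaches it.

```python
from functools import lru_cache

calls = {"count": 0}


def run_graph_query(query):
    """Hypothetical stand-in for a slow (0.1 to 1 s) query against the graph store."""
    calls["count"] += 1
    return ("result for", query)


@lru_cache(maxsize=10_000)
def cached_query(query):
    # Identical query strings are answered from memory after the first call.
    return run_graph_query(query)


first = cached_query("films starring X and Y")
second = cached_query("films starring X and Y")
```

This only pays off when the same queries repeat and the underlying data changes rarely, which is plausible for a read-only mirror of a static dump.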


What's the best way to load Freebase in 2017? Cayley with Postgres storage? Or some other RDF/graph DB? Or ElasticSearch? Or dump it in Postgres/MySQL? I am not interested in complex queries, just simple queries that execute reasonably fast.


We have it loaded on a Dgraph instance, in case you want to play around with it: https://play.dgraph.io


The movie subset, or the whole Freebase?

The Freebase film data has only 21M facts; the full Freebase has 1.9 billion facts.


This is just the film data.


I would be interested if Dgraph can handle the full Freebase dataset. (250 GB RDF)

How long does it take to load? What's the average query response time for very simple searches (like who is the US president)?


(Dgraph author) That's a good point. I think I'll load one instance up with the entire Freebase data, run it on freebase.dgraph.io, and blog about how and whys etc. Expect that in the next couple of weeks.


How is Dgraph licensed? I see both Apache and AGPL in GitHub.


Dgraph follows MongoDB licensing. The clients are all in Apache, and the server code is AGPL. This doesn't affect anyone using Dgraph for commercial purposes; but if they make changes to the server code, they'll have to release them under AGPL. Blog post here: https://open.dgraph.io/post/licensing/


Looking at the commit, they switched from asl to agpl.


Benchmarks for loading freebase data in Cayley vs Dgraph. https://discuss.dgraph.io/t/differences-between-dgraph-and-c...

Dgraph was 10X faster.


What's up with this toy dataset? The movie subset is just 21 million facts. (21million.rdf.gz)

Can someone run the benchmark for the real Freebase (1.9 billion facts)?

Also, LevelDB/Bolt is not suitable for this; better to use MongoDB, Postgres, or MySQL as the Cayley data store.


Expect freebase.dgraph.io in a couple of weeks.


I can't understand how to use the query language. It all seems so magical!

I tried building something with Cayley once but couldn't fetch all the data I wanted in a single query, or didn't know how to, then got frustrated and deleted everything.


Feel free to ask more questions through whatever channel you like; we need better docs for sure, but if you're lost we have a really friendly community that's happy to help.


Which of the three query languages are you having trouble with? All of them? MQL has been around a long long time (2006). Gizmo is new but based on & very similar to Gremlin (2009). GraphQL is the newest (2015). Did you try them all? Or is one in particular rough?


I'm talking about that Gizmo/Gremlin.


Very nice! Any plans to use it as a backend for Google's badwolf [1] (a temporal graph store)?

[1] https://github.com/google/badwolf


The thing badwolf brings to the table (and respect to the author -- super nice fellow) is adding metadata (namely, a timestamp) to the links.

The topic of 'reification' found throughout our recent discussion is how we can generally add metadata to links, thereby making it a lot easier to fit the two models together.
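
For anyone unfamiliar with the term, a minimal sketch of reification in Python: the link itself becomes a node, and metadata such as a timestamp hangs off that node as ordinary links. The function name, the `subject`/`predicate`/`object` labels, and the sample values are illustrative assumptions, not Cayley's or badwolf's actual schema.

```python
def reify(subject, predicate, obj, statement_id, **metadata):
    """Turn one link into a statement node plus triples describing it."""
    triples = [
        (statement_id, "subject", subject),
        (statement_id, "predicate", predicate),
        (statement_id, "object", obj),
    ]
    # Each piece of metadata becomes an ordinary link on the statement node.
    triples.extend((statement_id, key, value) for key, value in metadata.items())
    return triples


triples = reify("alice", "follows", "bob", "stmt1", since="2017-04-08T00:00:00Z")
```

The single link `alice follows bob` turns into four triples about `stmt1`, so a plain triple store can carry the timestamp without changing its data model.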


awesome :) thanks


Being on HN while doc is lorem ipsum (https://cayley.io/1-getting-started/), damned !


Yeah, we're still getting the marketing site up, complete with docs. Til then, there's https://github.com/cayleygraph/cayley/tree/master/docs with the content


There is an asterisk in the docs behind "inspired" but no footnote for it. What does it mean?


Can anyone give concrete examples of datasets that are better suited for a graph database and why?


Anything Social. Product or Person hierarchies. Network datasets. Ancestry (genetic or data), etc.

They are better suited for Graph Databases because the queries tend to be many joins traversing paths both deep and wide.
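
A small sketch of what "deep and wide" means in practice, using a made-up follower graph. In a relational schema each extra hop would be another self-join on the edge table; as a graph traversal it is a breadth-first walk with a depth limit. The function and data are illustrative only.

```python
from collections import deque


def within_hops(adjacency, start, max_hops):
    """Breadth-first traversal: every node reachable in at most max_hops edges."""
    seen = {start: 0}  # node -> number of hops from start
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in adjacency.get(node, []):
            if neighbour not in seen:
                seen[neighbour] = seen[node] + 1
                queue.append(neighbour)
    seen.pop(start)
    return seen


follows = {"alice": ["bob"], "bob": ["carol"], "carol": ["dave"]}
reach = within_hops(follows, "alice", 2)
```

Here `reach` contains `bob` at one hop and `carol` at two, while `dave` (three hops away) is excluded.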


Any screenshots of the visualizer?


How mature is it? Has someone used it for big datasets? Last time I tried, it wasn't ready to cope with a 250 GB N-Triples RDF file. (two years ago)


I used it in the past with the Freebase dump and it was amazingly stable (also two years ago). I was using LevelDB on a pretty beefy machine. The big issue was the time for loading the data. At the time the MongoDB backend wasn't good enough. I posted my experience in a response in the thread.


I see Cayley now supports other backends beside LevelDB. "PostgreSQL and MongoDB for distributed stores" - that's good to read.


is there a description of the RDBMS low-level model? Is it something like a single s,p,o table, with indexes (s, so, spo, , po, etc)?


Exactly so, for the RDBMS model. Yeah, this has its own issues, but it's the most direct method. We do a little extra trick with the indexing: joining on fixed hashes instead of the full value, but nothing crazy.
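
A rough sketch of that layout, not Cayley's actual schema: SQLite via Python's `sqlite3` is used only to keep the example self-contained, the table and index names are invented, and SHA-1 is an arbitrary choice of fixed-width hash. The triple table stores hashes only, so indexes and joins work on short fixed-size keys, and the full values are looked up in a separate table at the end.

```python
import hashlib
import sqlite3


def node_hash(value):
    # Fixed-width key derived from the value; SHA-1 is an arbitrary choice here.
    return hashlib.sha1(value.encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (hash TEXT PRIMARY KEY, value TEXT);
    CREATE TABLE triples (s TEXT, p TEXT, o TEXT);  -- each column holds a node hash
    CREATE INDEX triples_spo ON triples (s, p, o);
    CREATE INDEX triples_pos ON triples (p, o, s);
    CREATE INDEX triples_osp ON triples (o, s, p);
""")


def add_triple(s, p, o):
    for value in (s, p, o):
        conn.execute("INSERT OR IGNORE INTO nodes VALUES (?, ?)", (node_hash(value), value))
    conn.execute("INSERT INTO triples VALUES (?, ?, ?)", (node_hash(s), node_hash(p), node_hash(o)))


add_triple("alice", "follows", "bob")
add_triple("carol", "follows", "bob")

# Filter on hashes, then join back to nodes only to recover readable values.
followers = [row[0] for row in conn.execute("""
    SELECT n.value FROM triples t JOIN nodes n ON n.hash = t.s
    WHERE t.p = ? AND t.o = ? ORDER BY n.value
""", (node_hash("follows"), node_hash("bob")))]
```

Long literal values (URLs, text) then appear once in `nodes` rather than in every index that mentions them.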



