Titan 0.3.0 Released: Geo, full-text, edge indexing on billion edge graphs (github.com/thinkaurelius)
76 points by okram on March 29, 2013 | 29 comments



Titan is a new real-time, distributed, transactional graph database that can use either Cassandra or HBase as its distributed data store.

Titan 0.3 was stress tested with Cassandra at 120 billion edges and is capable of loading 1.2 million edges per second on a 16-machine hi1.4xl cluster (https://twitter.com/aureliusgraphs/status/316255164719828992).

This release provides a complete performance-driven redesign of many core components, and the primary new feature is advanced indexing.

Here are the new indexing features:

* Geo: Search for elements using shape primitives within a 2D plane.

* Full-text: Search elements for matching string and text properties.

* Numeric range: Search for elements with numeric property values using intervals.

* Edge: Edges can be indexed as well as vertices.
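
For anyone new to range indexing, here is a minimal sketch of what an interval lookup over a numeric property means. It is a toy in plain Python, not Titan's API: the class name `NumericRangeIndex` and its methods are made up for illustration, and a real index backend would do far more.

```python
import bisect


class NumericRangeIndex:
    """Toy sorted index from a numeric property value to element ids.

    Hypothetical illustration only; this is not how Titan exposes indexing.
    """

    def __init__(self):
        self._values = []  # property values, kept sorted
        self._ids = []     # element ids, parallel to self._values

    def add(self, element_id, value):
        # Insert at the sorted position so lookups can use binary search.
        pos = bisect.bisect_right(self._values, value)
        self._values.insert(pos, value)
        self._ids.insert(pos, element_id)

    def query(self, low, high):
        """Return ids of elements whose value lies in the closed interval [low, high]."""
        start = bisect.bisect_left(self._values, low)
        end = bisect.bisect_right(self._values, high)
        return self._ids[start:end]


index = NumericRangeIndex()
for element_id, age in [("a", 5), ("b", 10), ("c", 15), ("d", 10)]:
    index.add(element_id, age)

matches = index.query(10, 15)
```

The point is only that a sorted structure answers an interval query with two binary searches instead of a scan over every element.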

See http://thinkaurelius.com/news/


Hey espeed, how well would the SybilGuard[0] algorithm run on Titan? Right now, we're evaluating Twitter's in-memory graph library (Cassovary[1]), but that obviously requires a big machine, and we already use Cassandra elsewhere, so we would prefer to stay with that.

Up to, say, 5 second response times on large graphs would be acceptable.

[0] http://www.math.cmu.edu/~adf/research/SybilGuard.pdf

[1] https://github.com/twitter/cassovary


Titan is one piece of the Aurelius Graph Cluster (http://thinkaurelius.com/subscription/).

1. Titan is the OLTP piece, and it's very fast at running local-rank algorithms (http://markorodriguez.com/2011/03/30/global-vs-local-graph-r...).

2. Faunus is an OLAP graph-analytics engine that integrates Titan with Hadoop for global analysis of Titan graphs. Graphs are analyzed using a MapReduce implementation of the Gremlin graph traversal language. General use-cases include computing graph derivations/transformations and global graph statistics. You can then feed the global-algo results back into Titan.

3. Fulgora is an in-memory, compression-based, transaction-less OLAP graph processor capable of storing billions of edges within the memory confines of a single machine. Fulgora is optimized for the execution of massively threaded, global graph algorithms. It will come out later this year, and you can connect it to Faunus or feed it directly from Titan.

If you can construct a local SybilGuard algo, you can run it in Titan and get an immediate response. Otherwise, for global-graph algos, you would feed Faunus from Titan directly and query Faunus' in-memory graph. There are also things in the works that will blur these distinctions. More details to come later this month in a series of blog posts -- stay tuned.
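
To make "local" concrete, here is a rough sketch of a local-rank style computation: it scores only the vertices within a few hops of a start vertex, so the work is bounded by the neighborhood size rather than the graph size. This is not SybilGuard and not Titan or Gremlin code; the function name `local_rank` and the plain adjacency dict are assumptions for illustration.

```python
from collections import defaultdict


def local_rank(adjacency, start, max_hops=2):
    """Score vertices near `start` by the number of walks of length
    1..max_hops that reach them. Only the neighborhood is touched."""
    scores = defaultdict(int)
    frontier = {start: 1}  # vertex -> number of walks arriving at it this hop
    for _ in range(max_hops):
        next_frontier = defaultdict(int)
        for vertex, walks in frontier.items():
            for neighbor in adjacency.get(vertex, ()):
                next_frontier[neighbor] += walks
        for vertex, walks in next_frontier.items():
            scores[vertex] += walks
        frontier = next_frontier
    scores.pop(start, None)  # do not rank the start vertex itself
    return dict(scores)


graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
ranks = local_rank(graph, "a", max_hops=2)
```

A global algorithm, by contrast, needs a pass over the whole graph, which is the case the OLAP pieces are meant for.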

Marko or Matthias will have more insight on how to best run a SybilGuard-type algo. Right now they're about to jump on a flight to Austin for Data Day Texas (http://datadaytexas.com/), but I'm sure they'll respond when they have a free moment.

See Marko's YOW! interview (http://channel9.msdn.com/posts/YOW-2012-Marko-Rodriguez-Grap...) and Matthias' Titan/Cassandra talk (http://www.youtube.com/watch?v=ZkAYA4Kd8JE) for more details on the architecture.


CORRECTION -- that should say: "Otherwise, for global-graph algos, you would feed Fulgora from Titan directly and query Fulgora's in-memory graph" (sorry if I confused anyone).


Very exciting stuff! Looking forward to seeing you guys at DDT.


Thanks espeed, the links were especially helpful.

I'm looking forward to Fulgora, sounds great!


No problem. I sent a tweet to @erichocean -- is that you? If you want, let's chat about algo options next week when everyone is back in town.


Excellent documentation - this is what a 'getting started' page should look like. All projects should have this depth of introductory material.


One of the most interesting parts seems not to be mentioned: the Apache license.

So far it's the only real, all-features-included graph database with a permissive open source license, or am I missing something?


Yeah, it's Apache 2.0 (https://github.com/thinkaurelius/titan/blob/master/LICENSE.t...), and it's the first native Blueprints implementation so it integrates with the entire TinkerPop stack (http://www.tinkerpop.com/), which is BSD.


Well, I am not a Big Data guy, so I don't know if it's "real" enough in terms of capabilities or maturity, but when looking for a graph database to mess around with, I came across OrientDB, which also uses the Apache license. Clearly it's not built for the same use cases or scale as Titan but it seems to have some case studies and commercial support. I haven't played with it yet, so I suppose the project could be an elaborate hoax.

http://www.orientdb.org/


No, OrientDB is not an "elaborate hoax" :) -- it's very real and developed by Luca Garulli (https://github.com/lvca). OrientDB is one of the primary Blueprints implementations (https://github.com/tinkerpop/blueprints) so it integrates with the TinkerPop stack as well.


Apache Jena is a graph database; it has been an Apache project since 2011 and graduated from the incubator in April 2012.

http://jena.apache.org/


One tangential direction we want to go with Titan is benchmarking it as a distributed RDF store. With edge indexing now in Titan 0.3.0 and Blueprints Sail (GraphSail), Titan can represent an RDF graph and be queried using SPARQL.

   https://github.com/tinkerpop/blueprints/wiki/Sail-Ouplementation


I'd love to see a more detailed write-up about the performance. I'm working on a natural language parsing problem and have had some success using graphs to perform chunking in the past.

+1 for using Gremlin! Do you know of any Python implementations of it?


> I'd love to see a more detailed write-up about the performance.

Over the next month there will be a series of blog posts detailing the performance numbers and the key concepts behind Titan's design and architecture.

This blog post from August 2012 (http://thinkaurelius.com/2012/08/06/titan-provides-real-time...) has some details about early tests from Titan 0.1, but notice that today's Titan 0.3 numbers dwarf the 0.1 numbers (the details of how this was accomplished will be explained in the upcoming blog posts).


> +1 for using Gremlin! Do you know of any Python implementations of it?

Note that Marko is the creator of Gremlin :)

There are Gremlin implementations in various stages of development for almost every major JVM language:

Gremlin-Java (base implementation) - https://github.com/tinkerpop/gremlin/wiki/Using-Gremlin-thro...

Gremlin-Groovy (original) - https://github.com/tinkerpop/gremlin/wiki/Using-Gremlin-thro...

Gremlin-JavaScript - https://github.com/entrendipity/gremlin-js

Gremlin-Clojure - https://github.com/zmaril/ogre

Gremlin-Scala - https://github.com/mpollmeier/gremlin-scala

Within the next year (before the TinkerPop book comes out), it would be cool to have all the languages covered, including Gremlin-Jython and Gremlin-JRuby.

Gremlin-Java and Gremlin-Groovy are maintained by TinkerPop.

Gremlin-JavaScript, Gremlin-Clojure, and Gremlin-Scala are being developed/maintained by community members.

To create a Gremlin implementation, you essentially wrap Gremlin-Java in the target language's idiomatic style.

If you are interested in helping develop Gremlin-Jython or Gremlin-JRuby (or any other implementation currently in development), please post to the Gremlin Users Group (https://groups.google.com/forum/?fromgroups=#!forum/gremlin-...).

Right now most people use Gremlin-Groovy (it's the original) regardless of what language they're developing in (think of Gremlin as a domain-specific language like SQL you use in conjunction with your primary language).

For example, Bulbs (http://bulbflow.com) is a Python library I wrote that supports Rexster, Neo4jServer, and Titan. In Bulbs, you edit Gremlin scripts in Groovy text files, and when you create a Python Graph object, Bulbs sources your Groovy files and caches the Gremlin scripts in a library so they are readily available for when you want to execute them on the server.

See http://bulbflow.com/docs/api/bulbs/rexster/gremlin/
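
The general idea of sourcing a Groovy file into a script library can be sketched roughly like this. It is not Bulbs' actual loader: `load_scripts`, the regex, and the two Gremlin functions in `SOURCE` are hypothetical, and the brace matching is naive (it ignores braces inside strings and comments).

```python
import re

# Matches the opening line of a top-level Groovy function, e.g. "def name(args) {".
DEF_PATTERN = re.compile(r"^def\s+(\w+)\s*\(([^)]*)\)\s*\{", re.MULTILINE)


def load_scripts(groovy_source):
    """Split Groovy source text into a {function_name: function_source} dict."""
    library = {}
    for match in DEF_PATTERN.finditer(groovy_source):
        depth = 1
        pos = match.end()
        # Walk forward until the brace opened by the definition is closed.
        while depth > 0 and pos < len(groovy_source):
            char = groovy_source[pos]
            if char == "{":
                depth += 1
            elif char == "}":
                depth -= 1
            pos += 1
        library[match.group(1)] = groovy_source[match.start():pos]
    return library


# Hypothetical contents of a Gremlin-Groovy script file.
SOURCE = """
def get_friends(id) {
  g.v(id).out('knows')
}

def count_edges() {
  g.E.count()
}
"""

library = load_scripts(SOURCE)
```

With the scripts cached by name, the client can look one up and send its text to the server along with a parameter map whenever it needs to run it.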


> Gremlin-Clojure, and Gremlin-Scala are being developed/maintained by community members

Would be nice if other Groovy-based products like Gradle enabled a Scala and/or Clojure frontend to appeal to those of us who are fussier about what shell language we use.


Really happy to see this. Testing with Titan at the moment and very happy with it so far.


The coolest thing about this is the getting started narration. I love Greek mythology!


Comparison with neo4j?


I haven't tried Titan, but I've played around with neo4j a little bit in the past and found it a little frustrating to get started with.

The good part is the ACID transactions and the (beautiful) browser-based interface to it. The bad part is that it was far too easy to get the database into a locked state when you perform a bad query (which you tend to do a lot as you're learning).

Graph databases are extremely powerful and will likely change the way you look at structuring a lot of data mining problems. I'd recommend trying one of them out if you ever get the chance!


On Titan you can choose your storage backend (currently BerkeleyDB, Cassandra, and HBase) according to your needs: https://github.com/thinkaurelius/titan/wiki/Storage-Backend-...


Titan is distributed (but can be run in single-server mode). Neo4j is master/slave.


Also note that neo4j requires a license to run in high-availability/multi-server mode.


I'd love to see this from someone outside the project. It's pretty clear from blog posts, etc. that Titan was made to handle larger graphs, but what are the tradeoffs, and where are the pain points?


I am not outside the project, but I use Titan/Faunus with clients and here are the pain points I notice.

1. Difficult to deal with clusters: while you can run Titan on a single machine (and that is easy to deal with), handling a multi-machine cluster can be frustrating and requires some DevOps skills. Pull in Faunus/Hadoop and you are in a world unto itself.

2. Lots of options: with Titan there are numerous configurations and you get into this "n choose m" scenario. However, I typically just run Titan/CassandraEmbedded + (now) ElasticSearch. For instance, I gave up on Titan/BerkeleyDB, decided that Titan/CassandraEmbedded was sufficient for most needs, and that's that.

3. Lots of access points into Titan/TinkerPop: You can run Blueprints Java native, Gremlin, Rexster RexPro, Rexster REST, ... Unfortunately, depending on your situation, one approach is better than another. I typically do Rexster Extensions + Gremlin. In the future though, with RexPro (haven't gotten into it personally), I will probably just use RexPro as I can send Gremlin in and get results out.

4. There are so many bodies of code: Titan is Apache2. This is good. It feeds on a lot of excellent open source projects -- HBase, Cassandra, Lucene, ElasticSearch, Gremlin, Blueprints, Rexster, Frames, ... There is no one source of documentation. I'm either reading TinkerPop documentation or Titan documentation. Then when there are issues, I'm Googling the Cassandra mailing list. Luckily, I'm an expert in TinkerPop and since Titan is native TinkerPop my knowledge transfers. However, there are lots of peculiarities (Titan typing, configurations, exceptions, cluster setup, etc.) that I have to learn and master.

I am fortunate enough to work on projects that use Titan/TinkerPop and thus, as I see these pain points, I continually work to solve them. Over time, we will see many of these complexities hidden. Right now, Titan is a young project and some good packaging polish will happen in good time.


Any plans to support other storage backends? PostgreSQL and Riak come to mind.


Anyone can implement new Titan storage adapters by implementing a few interface classes.

Look at the Titan/BDB storage adapter for a simple example:

https://github.com/thinkaurelius/titan/tree/master/titan-ber...

It implements the KeyValueStore interfaces (there are other interfaces for different types of DBs, such as KeyValueColumnStore, etc):

https://github.com/thinkaurelius/titan/blob/master/titan-cor...

https://github.com/thinkaurelius/titan/blob/master/titan-cor...

https://github.com/thinkaurelius/titan/blob/master/titan-cor...
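
To give a feel for the shape of such an adapter, here is a rough in-memory sketch of an ordered key-value store. It is written in Python purely for illustration; Titan's real interfaces are Java, and the class and method names below (`InMemoryOrderedStore`, `put`, `get`, `delete`, `scan`) are assumptions, not the actual KeyValueStore signatures.

```python
import bisect


class InMemoryOrderedStore:
    """Hypothetical ordered key-value store: point reads and writes plus a
    key-range scan, the kind of operations a backend adapter has to supply."""

    def __init__(self):
        self._keys = []  # keys kept sorted so range scans are cheap
        self._data = {}  # key -> value

    def put(self, key, value):
        if key not in self._data:
            bisect.insort(self._keys, key)
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        if key in self._data:
            del self._data[key]
            self._keys.pop(bisect.bisect_left(self._keys, key))

    def scan(self, start, end):
        """Return (key, value) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(key, self._data[key]) for key in self._keys[lo:hi]]


store = InMemoryOrderedStore()
store.put(b"v2", b"two")
store.put(b"v1", b"one")
store.put(b"v3", b"three")
first_scan = store.scan(b"v1", b"v3")
store.delete(b"v2")
second_scan = store.scan(b"v1", b"v9")
```

A real adapter would map these operations onto the target datastore's client calls and also deal with transactions and configuration, which this sketch leaves out.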

If you are interested in implementing a Titan storage adapter for a new backend datastore and you have questions, you can discuss it in the Aurelius Graphs group:

https://groups.google.com/forum/?fromgroups#!forum/aureliusg...



