For what it's worth, I started putting together a visualization of big data systems and how they interact. There are so many systems that it's difficult to get a grasp on how they relate to each other. I got distracted with more important things, so it's only partially complete.
Ambari looks nice from the outside, but it's some kind of zombie that is full of OS-specific stuff, opaque Puppet stuff, Python, Java, and it even brings Nagios and Ganglia to the party. If it works it's probably fine, but have fun debugging that stuff, though.
If you haven't had the chance yet I suggest you try Cloudera Manager & CDH instead of Ambari. I use both with clients and CM is years ahead of Ambari in terms of functionality and stability.
What about provisioning clusters that don't require Hadoop? I suppose this could be akin to the comment about Redis -- we're working on deploying Kafka, Storm, and ZooKeeper (none of which need Hadoop), and provisioning and node management (membership, leader election) in a dynamic environment (e.g. AWS autoscaling) is not at all obvious. There's also a paucity of substantive information about scaling these clusters dynamically.
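That said, ZooKeeper itself covers the membership and leader-election piece, and Apache Curator's recipes handle most of the tricky edge cases for you. A minimal sketch of leader election with Curator's LeaderLatch, where the connection string and latch path are made up:

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.leader.LeaderLatch;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class LeaderElectionSketch {
        public static void main(String[] args) throws Exception {
            // "zk1:2181" and "/myapp/leader" are placeholder values.
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zk1:2181", new ExponentialBackoffRetry(1000, 3));
            client.start();

            LeaderLatch latch = new LeaderLatch(client, "/myapp/leader");
            latch.start();
            latch.await();  // blocks until this node is elected leader

            try {
                // do leader-only work here, e.g. assign partitions to workers
            } finally {
                latch.close();  // relinquish leadership
                client.close();
            }
        }
    }

Ephemeral znodes make node departures visible automatically, which helps under autoscaling; the harder part is rebalancing work when the member set changes.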
I'm especially excited about Zeppelin. Using IPython for SciPy and smaller datasets is great. I would love to see the big data space I work in and Python's tooling come together more.
I recently looked into notebooks and found Beaker (http://beakernotebook.com) to be especially interesting in its support for passing data across languages.
That was misleading to me as well. Lots of entries have no dependency on the Hadoop stack, other than the fact that you can use them in front of, alongside, and sometimes even in lieu of Hadoop projects.
Sure. But you could have chosen _any_ datastore to cache your results in, right? Redis doesn't integrate with Hadoop in any special way. Some datastores do. For example, it does make sense to say Cassandra is a part of the Hadoop ecosystem, due to the features in https://wiki.apache.org/cassandra/HadoopSupport
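For the curious: that integration works at the Hadoop InputFormat/OutputFormat level, so MapReduce jobs can read rows straight out of Cassandra instead of HDFS. A rough, untested sketch against the older Thrift-era ColumnFamilyInputFormat; the keyspace and column family names are placeholders:

    import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class CassandraInputSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "read-from-cassandra");

            // Point the job at a Cassandra cluster; "ks" and "cf" are placeholders.
            ConfigHelper.setInputInitialAddress(job.getConfiguration(), "127.0.0.1");
            ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
            ConfigHelper.setInputPartitioner(job.getConfiguration(),
                    "org.apache.cassandra.dht.Murmur3Partitioner");
            ConfigHelper.setInputColumnFamily(job.getConfiguration(), "ks", "cf");

            // Mappers then receive Cassandra rows as their input records.
            // (The old Thrift API also wants a slice predicate set via ConfigHelper.)
            job.setInputFormatClass(ColumnFamilyInputFormat.class);
            // ... set mapper/reducer/output as with any other MapReduce job ...
        }
    }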
I don't understand why Cascading is missing. It's by far one of the easiest batch flow controllers on the platform. You can test it in memory locally. When you deploy to a real cluster, you know it will just work.
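To illustrate the local-testing point, here's a minimal sketch of a Cascading 2.x flow wired up with local taps; the file paths and the filter are made up:

    import cascading.flow.Flow;
    import cascading.flow.FlowDef;
    import cascading.flow.local.LocalFlowConnector;
    import cascading.operation.regex.RegexFilter;
    import cascading.pipe.Each;
    import cascading.pipe.Pipe;
    import cascading.scheme.local.TextLine;
    import cascading.tap.Tap;
    import cascading.tap.local.FileTap;
    import cascading.tuple.Fields;

    public class LocalFlowSketch {
        public static void main(String[] args) {
            // Local taps read plain files; on a cluster you'd use Hfs taps instead.
            Tap in = new FileTap(new TextLine(new Fields("line")), "input.txt");
            Tap out = new FileTap(new TextLine(new Fields("line")), "output.txt");

            // Keep only lines containing "ERROR".
            Pipe pipe = new Pipe("errors");
            pipe = new Each(pipe, new Fields("line"), new RegexFilter("ERROR"));

            FlowDef flowDef = FlowDef.flowDef()
                    .addSource(pipe, in)
                    .addTailSink(pipe, out);

            // Runs entirely in-process, no Hadoop daemons needed.
            Flow flow = new LocalFlowConnector().connect(flowDef);
            flow.complete();
        }
    }

Swapping LocalFlowConnector for HadoopFlowConnector (and FileTap for Hfs) is what moves the same pipe assembly onto a cluster.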
I don't agree with the last statement. Based on experience with Hadoop (5+ years now), running locally is a poor indicator of running on the cluster. Many sleepless nights have been spent trying to figure out why a job that runs locally doesn't want to run on Hadoop.
I do like Cascading and Scalding, though. There are only so many times in life you want to implement job flows, filters, and joins by hand.
Maybe the statement is a bit hyperbolic, but it can be tested. I tested large, complex flows locally, 59 steps, within JUnit. These tests ran with every build. The whole build took 6 minutes for 75 fraud models, but I could easily focus on just my unit in Eclipse.
I haven't tried PigUnit for a while, but last time I did, it didn't support macros and took minutes for what would take seconds in Cascading.
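For comparison, a PigUnit test looks roughly like this, if I remember the API right; the script and rows here are invented:

    import org.apache.pig.pigunit.PigTest;
    import org.junit.Test;

    public class TopScoresTest {
        @Test
        public void filtersHighScores() throws Exception {
            // Inline Pig script; LOAD/STORE locations are overridden by PigUnit.
            String[] script = {
                "data = LOAD 'input' AS (name:chararray, score:int);",
                "top = FILTER data BY score > 90;",
                "STORE top INTO 'output';",
            };
            PigTest test = new PigTest(script);

            String[] input    = { "alice\t95", "bob\t80" };
            String[] expected = { "(alice,95)" };

            // Feed 'input' into the alias 'data' and check the alias 'top'.
            test.assertOutput("data", input, "top", expected);
        }
    }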
It's this difference that cemented in my mind that Cascading is for repeatable processes while Pig is for probing and experimentation. This is not to say you can't reuse Pig scripts. I mean that I have greater confidence in the things I can create repeated tests for.
The ecosystem has grown so large that it is nearly impossible for anyone to have meaningful experience with all of it. Not that it's a bad thing, though; choice is always good.
Trafodion is Apache Trafodion (incubating), providing fully distributed, transactional ANSI SQL on top of HBase for OLTP and operational workloads. The link is incorrect as well; use the Apache link instead: http://trafodion.apache.org
Nice list, but Spark is treated superficially, and the Spark entries are extremely outdated (Shark, Bagel).
SparkSQL should be in the SQL-on-Hadoop section.
MLlib+ML should be in Machine Learning section.
If we include Storm and Giraph, we should include SparkStreaming and GraphX.
http://jonathanmace.github.io/bigdatasurvey/