For what it's worth, I started putting together a visualization of big data systems and how they interact. There are so many systems that it's difficult to get a grasp on how they relate to each other. I got distracted with more important things, so it's only partially complete.
Ambari looks nice from the outside, but it's some kind of zombie that is full of OS-specific stuff, opaque Puppet stuff, Python, Java, and it even brings Nagios and Ganglia to the party. If it works it's probably fine, but have fun debugging that stuff, though.
If you haven't had the chance yet I suggest you try Cloudera Manager & CDH instead of Ambari. I use both with clients and CM is years ahead of Ambari in terms of functionality and stability.
What about provisioning clusters that don't require Hadoop? I suppose this could be akin to the comment about Redis -- we're working on deploying Kafka, Storm, and ZooKeeper (none of which need Hadoop), and provisioning and node management (membership, leader election) in a dynamic environment (e.g. AWS autoscaling) is not at all obvious. There's also a paucity of substantive information about scaling these clusters dynamically.
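That said, ZooKeeper itself covers the membership and leader-election piece, and Apache Curator's recipes handle most of the tricky edge cases for you. A minimal sketch of leader election with Curator's LeaderLatch, where the connection string and latch path are made up:

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.leader.LeaderLatch;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class LeaderElectionSketch {
        public static void main(String[] args) throws Exception {
            // "zk1:2181" and "/myapp/leader" are placeholder values.
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zk1:2181", new ExponentialBackoffRetry(1000, 3));
            client.start();

            LeaderLatch latch = new LeaderLatch(client, "/myapp/leader");
            latch.start();
            latch.await();  // blocks until this node is elected leader

            try {
                // do leader-only work here, e.g. assign partitions to workers
            } finally {
                latch.close();  // relinquish leadership
                client.close();
            }
        }
    }

Ephemeral znodes make node departures visible automatically, which helps under autoscaling; the harder part is rebalancing work when the member set changes.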
I'm especially excited about Zeppelin. Using IPython for SciPy and smaller datasets is great. I would love to see the big data space I work in and Python's tooling come together more.
I recently looked into notebooks and found Beaker (http://beakernotebook.com) to be especially interesting in its support for passing data across languages.
That was misleading to me as well. Lots of entries have no dependency on the Hadoop stack, other than the fact that you can use them in front of, alongside, and sometimes even in lieu of Hadoop projects.
Sure. But you could have chosen _any_ datastore to cache your results in, right? Redis doesn't integrate with Hadoop in any special way. Some datastores do. For example, it does make sense to say Cassandra is a part of the Hadoop ecosystem, due to the features in https://wiki.apache.org/cassandra/HadoopSupport
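For the curious: that integration works at the Hadoop InputFormat/OutputFormat level, so MapReduce jobs can read rows straight out of Cassandra instead of HDFS. A rough, untested sketch against the older Thrift-era ColumnFamilyInputFormat; the keyspace and column family names are placeholders:

    import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class CassandraInputSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "read-from-cassandra");

            // Point the job at a Cassandra cluster; "ks" and "cf" are placeholders.
            ConfigHelper.setInputInitialAddress(job.getConfiguration(), "127.0.0.1");
            ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
            ConfigHelper.setInputPartitioner(job.getConfiguration(),
                    "org.apache.cassandra.dht.Murmur3Partitioner");
            ConfigHelper.setInputColumnFamily(job.getConfiguration(), "ks", "cf");

            // Mappers then receive Cassandra rows as their input records.
            // (The old Thrift API also wants a slice predicate set via ConfigHelper.)
            job.setInputFormatClass(ColumnFamilyInputFormat.class);
            // ... set mapper/reducer/output as with any other MapReduce job ...
        }
    }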
I don't understand why Cascading is missing. It's by far one of the easiest batch flow controllers on the platform. You can test it in memory locally. When you deploy to a real cluster, you know it will just work.
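To illustrate the local-testing point, here's a minimal sketch of a Cascading 2.x flow wired up with local taps; the file paths and the filter are made up:

    import cascading.flow.Flow;
    import cascading.flow.FlowDef;
    import cascading.flow.local.LocalFlowConnector;
    import cascading.operation.regex.RegexFilter;
    import cascading.pipe.Each;
    import cascading.pipe.Pipe;
    import cascading.scheme.local.TextLine;
    import cascading.tap.Tap;
    import cascading.tap.local.FileTap;
    import cascading.tuple.Fields;

    public class LocalFlowSketch {
        public static void main(String[] args) {
            // Local taps read plain files; on a cluster you'd use Hfs taps instead.
            Tap in = new FileTap(new TextLine(new Fields("line")), "input.txt");
            Tap out = new FileTap(new TextLine(new Fields("line")), "output.txt");

            // Keep only lines containing "ERROR".
            Pipe pipe = new Pipe("errors");
            pipe = new Each(pipe, new Fields("line"), new RegexFilter("ERROR"));

            FlowDef flowDef = FlowDef.flowDef()
                    .addSource(pipe, in)
                    .addTailSink(pipe, out);

            // Runs entirely in-process, no Hadoop daemons needed.
            Flow flow = new LocalFlowConnector().connect(flowDef);
            flow.complete();
        }
    }

Swapping LocalFlowConnector for HadoopFlowConnector (and FileTap for Hfs) is what moves the same pipe assembly onto a cluster.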
I don't agree with the last statement. Based on experience with Hadoop (5+ years now), running locally is a poor indicator of running on the cluster. Many sleepless nights have been spent trying to figure out why a job that runs locally doesn't want to run on Hadoop.
I do like Cascading and Scalding, though. There are only so many times in life you want to implement job flows, filters, and joins by hand.
Maybe the statement is a bit hyperbolic, but it can be tested. I tested large, complex flows locally, 59 steps, within JUnit. These tests ran with every build. The whole build took 6 minutes for 75 fraud models, but I could easily focus on just my unit in Eclipse.
I haven't tried PigUnit for a while, but last time I did, it didn't support macros and took minutes for what would take seconds in Cascading.
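For comparison, a PigUnit test looks roughly like this, if I remember the API right; the script and rows here are invented:

    import org.apache.pig.pigunit.PigTest;
    import org.junit.Test;

    public class TopScoresTest {
        @Test
        public void filtersHighScores() throws Exception {
            // Inline Pig script; LOAD/STORE locations are overridden by PigUnit.
            String[] script = {
                "data = LOAD 'input' AS (name:chararray, score:int);",
                "top = FILTER data BY score > 90;",
                "STORE top INTO 'output';",
            };
            PigTest test = new PigTest(script);

            String[] input    = { "alice\t95", "bob\t80" };
            String[] expected = { "(alice,95)" };

            // Feed 'input' into the alias 'data' and check the alias 'top'.
            test.assertOutput("data", input, "top", expected);
        }
    }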
It's this difference that cemented in my mind that Cascading is for repeatable processes while Pig is for probing and experimentation. This is not to say you can't reuse Pig scripts. I mean that I have greater confidence in the things I can create repeated tests for.
The ecosystem has grown so large that it is nearly impossible for anyone to have meaningful experience with all of it. Not that it's a bad thing, though; choice is always good.
Trafodion is Apache Trafodion (incubating), providing fully distributed, transactional ANSI SQL on top of HBase for OLTP and operational workloads. The link is incorrect as well; use the Apache link instead: http://trafodion.apache.org
Nice list, but Spark is treated superficially, and the Spark entries are extremely outdated (Shark, Bagel).
SparkSQL should be in the SQL-on-Hadoop section.
MLlib+ML should be in Machine Learning section.
If we include Storm and Giraph, we should include SparkStreaming and GraphX.
http://jonathanmace.github.io/bigdatasurvey/