Hacker News
The Hadoop Ecosystem Table (hadoopecosystemtable.github.io)
101 points by jmngomes on Dec 22, 2015 | 27 comments



For what it's worth, I started putting together a visualization of big data systems and how they interact. There are so many systems that it's difficult to get a grasp on how they relate to each other. I got distracted with more important things, so it's only partially complete.

http://jonathanmace.github.io/bigdatasurvey/


My recommendations...

For automated cluster building - https://ambari.apache.org/

For analysing your data, dynamically building queries and sharing this with other people in your company - https://zeppelin.incubator.apache.org/

And coming soon - https://www.zeppelinhub.com/


Ambari looks nice from the outside, but under the hood it's some kind of zombie, full of OS-specific logic, opaque Puppet manifests, Python, and Java, and it even brings Nagios and Ganglia to the party. If it works it's probably fine; have fun debugging that stuff, though.

We are happy with https://github.com/saltstack-formulas/hadoop-formula and PXE booting an image.

You can configure all aspects out of a single pillar file.

Still full of warts but at least you have full control.


If you haven't had the chance yet I suggest you try Cloudera Manager & CDH instead of Ambari. I use both with clients and CM is years ahead of Ambari in terms of functionality and stability.


I looked pretty hard at Zeppelin around 6 months ago, comparing it to IPython/Jupyter for use with Spark.

I found Zeppelin hard to install (I'm a Java programmer and Zeppelin is in Scala/Java so I expected the opposite). It was also extremely buggy.

Jupyter OTOH worked straight away, and even getting Spark integration working was straightforward compared to getting Zeppelin just working.

Zeppelin looks nicer, and some of the features look great. It just isn't there for production use atm though.


What about provisioning clusters that don't require Hadoop? I suppose this could be akin to the comment about Redis -- we're working on deploying Kafka, Storm, and Zookeeper (none of which need Hadoop), and provisioning and node management (membership, leader election) in a dynamic environment (e.g. AWS autoscaling) is not at all obvious. There's also a paucity of substantive information about scaling these clusters dynamically.
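The leader-election part, at least, has a well-worn recipe: ZooKeeper's ephemeral sequential znodes. Here's a toy pure-Python simulation of that recipe; the `Coordinator` class is a hypothetical stand-in for a ZooKeeper ensemble, and a real deployment would use a client library (e.g. kazoo) against an actual ZooKeeper quorum.

```python
# Toy simulation of ZooKeeper-style leader election via ephemeral
# sequential znodes. Coordinator is a hypothetical stand-in for a
# ZooKeeper ensemble; it is NOT a real ZooKeeper client.

class Coordinator:
    """Hands out monotonically increasing sequential node names."""
    def __init__(self):
        self._seq = 0
        self.nodes = {}  # node name -> member id

    def create_ephemeral_sequential(self, member_id):
        name = f"member-{self._seq:010d}"
        self._seq += 1
        self.nodes[name] = member_id
        return name

    def leader(self):
        # The member holding the lowest sequence number is the leader.
        return self.nodes[min(self.nodes)]

    def remove(self, name):
        # Simulates session expiry: the ephemeral node vanishes and
        # leadership passes to the next-lowest surviving node.
        del self.nodes[name]

coord = Coordinator()
a = coord.create_ephemeral_sequential("node-a")
b = coord.create_ephemeral_sequential("node-b")
print(coord.leader())  # node-a (lowest sequence number wins)
coord.remove(a)        # node-a's session expires, e.g. instance terminated
print(coord.leader())  # node-b takes over
```

The nice property for autoscaling is that an instance that disappears takes its ephemeral node with it, so failover needs no out-of-band cleanup.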


I'm especially excited about Zeppelin. Using IPython for SciPy and smaller datasets is great. I would love for the big data space I work in and Python's tooling to come together more.


IPython/Jupyter works well against Spark. We have it working in production like that, and both Google[1] and IBM[2] do the same.

[1] https://cloud.google.com/datalab/overview

[2] https://www.ng.bluemix.net/docs/services/AnalyticsforApacheS...


I recently looked into notebooks and found Beaker (http://beakernotebook.com) to be especially interesting in its support for passing data across languages.


Some of the entries in the table (e.g. Redis) seem to have nothing to do with Hadoop.


That was misleading to me as well. Lots of entries have no dependency on Hadoop stack, other than the fact that you can use them in front, on the sides, and sometimes even in lieu of Hadoop projects.


Redis has been used alongside Hadoop by at least a few people.

We've used it for caching intermediate results during a Spark job.
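The pattern is simple enough to sketch: key an intermediate result by a hash of the stage name and its inputs, and skip recomputation on a hit. A plain dict stands in for Redis below; with a real server you'd swap it for a `redis.Redis()` client and use `get`/`set` (with a TTL). All the function names here are hypothetical, not Spark or redis-py API.

```python
# Hedged sketch of caching intermediate job results, keyed by a hash of
# the inputs. The `cache` dict stands in for a Redis instance.
import hashlib
import json

cache = {}  # stand-in for Redis

def cache_key(stage_name, params):
    payload = json.dumps({"stage": stage_name, "params": params},
                         sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_stage(stage_name, params, compute):
    key = cache_key(stage_name, params)
    if key in cache:                 # with Redis: cache.get(key)
        return json.loads(cache[key])
    result = compute(params)
    cache[key] = json.dumps(result)  # with Redis: cache.set(key, ..., ex=3600)
    return result

calls = []
def expensive(params):
    calls.append(params)             # track how often we actually compute
    return [x * 2 for x in params["xs"]]

r1 = cached_stage("double", {"xs": [1, 2, 3]}, expensive)
r2 = cached_stage("double", {"xs": [1, 2, 3]}, expensive)  # cache hit
print(r1, len(calls))  # [2, 4, 6] 1
```

Serializing through JSON keeps the cached values language-neutral, which matters when more than one job reads them.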


Sure. But you could have chosen _any_ datastore to cache your results in, right? Redis doesn't integrate with Hadoop in any special way. Some datastores do. For example, it does make sense to say Cassandra is a part of the Hadoop ecosystem, due to the features in https://wiki.apache.org/cassandra/HadoopSupport


I don't understand why Cascading is missing. It's by far one of the easiest batch flow controllers on the platform. You can test it in memory locally. When you deploy to a real cluster, you know it will just work.


I don't agree with the last statement. Based on experience with Hadoop (5+ years now), running locally is a poor indicator of running on the cluster. Many sleepless nights have been spent trying to figure out why the job that runs locally doesn't want to run on Hadoop.

I do like Cascading and Scalding though. There are only so many times in life you want to implement job flows, filters, and joins by hand.


Maybe the statement was a bit hyperbolic, but the point is that it can be tested. I tested large, complex flows locally (59 steps) within JUnit, and these tests ran with every build. The whole build took 6 minutes for 75 fraud models, but I could easily focus on just my unit in Eclipse.

I haven't tried PigUnit for a while, but last time I did, it didn't support macros and took minutes for what would take seconds in Cascading.

It's this difference that's cemented in my mind that Cascading is for repeatable processes while Pig is for probing and experimentation. This is not to say you can't reuse Pig scripts. I mean that I have greater confidence in the things I can create repeated tests for.
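The in-memory flow-testing pattern being described can be sketched generically (in Python here rather than Cascading's Java API, and with hypothetical step names): a "flow" is just a pipeline of filter/join steps, so a unit test can run the whole thing on a handful of in-memory tuples, no cluster required.

```python
# Generic sketch of testing a batch flow in memory, in the spirit of
# Cascading's local mode. filter_step/join_step/flow are hypothetical.

def filter_step(rows, pred):
    return [r for r in rows if pred(r)]

def join_step(left, right, key):
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

def flow(transactions, customers):
    big = filter_step(transactions, lambda r: r["amount"] > 100)
    return join_step(big, customers, "cust_id")

# In-memory "unit test" of the whole flow on tiny fixture data:
txs = [{"cust_id": 1, "amount": 250}, {"cust_id": 2, "amount": 50}]
custs = [{"cust_id": 1, "name": "alice"}, {"cust_id": 2, "name": "bob"}]
out = flow(txs, custs)
print(out)  # only the >100 transaction survives, joined to its customer
```

Running this kind of test on every build is what gives you the "repeatable process" confidence, even if (per the comment above) local runs can't catch every cluster-only failure mode.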


The ecosystem has grown so large that it is nearly impossible for anyone to have meaningful experience with all of it. Not that that's a bad thing, though; choice is always good.


Trafodion is Apache Trafodion (incubating), providing fully distributed, transactional ANSI SQL on top of HBase for OLTP and operational workloads. The link is incorrect as well; use the Apache link instead: http://trafodion.apache.org .


Nice list, but Spark is treated superficially. It's also extremely outdated (Shark, Bagel).

Spark SQL should be in the SQL-on-Hadoop section. MLlib and ML should be in the Machine Learning section. If we include Storm and Giraph, we should include Spark Streaming and GraphX.


For databases comparison, I really like Kristof Kovacs page: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis


Great table. Under machine learning, it should include http://deeplearning4j.org. (Co-creator here.) We run on Hadoop and Spark.


You might want to look at http://db-engines.com -- you'll then have plenty more DBs to cover!


If the author is present, I recommend putting a clickable TOC at the top that takes you to the relevant section.


Too bad most of it is in Java.



Says something about Java aye?


Turd of a language, but very popular.



