
I think that this is useful.

One question though. Is there a reason you can't use the Spark Window functions? https://databricks.gitbooks.io/databricks-spark-reference-ap...



I haven't studied the code very carefully, but I think this is mostly a PR piece. Not only is using Spark from Java not a very good idea, but the "canonical" way to do transformations like these is with DataFrames or Datasets (introduced in Spark 1.6, with some improvements over DataFrames).

Take it with a grain of salt.


You're not going to get clean out-of-order processing semantics with any mode of Spark transformations. If you actually take the time to read the article, there's a section discussing the Java/Scala angle. The difference in code size is really secondary (though it is a difference). The main point is the difficulty of maintaining and evolving your pipeline over time using Spark, given the way important concepts become conflated in its API (any version of it). This all comes across much more clearly for those who actually take the time to read the words.


> If you actually take the time to read the article, there's a section discussing the Java/Scala angle.

They claim this isn't about the length of code, yet they select the most verbose way to use Spark and proudly display how long it is.

I mean sure, the lack of event-time-based processing is a known limitation of Spark (and a pretty annoying one, though it is supposedly being worked on), but there are ways to write about it without code made to look bad on purpose.

EDIT: come to think of it, this whole article is "Spark Streaming can't do event time" written in thousands of words with contrived examples attached.


I think the length of the code is a side effect that results from the primary argument, and "can't do event time" is one of its symptoms. Neither is the primary argument of the blog post.

The primary argument is demonstrated by color-coding the different logical bits, which end up clearly portable and elegantly distinct in Dataflow.

This is demonstrated in two ways:

1. The "juicy value add" code that does the aggregation is labeled yellow, and doesn't change across any of the Dataflow samples. With Spark, it needs to be rewritten for every use case. The same goes for the other colors.

2. In Dataflow all the colors are separate. This makes expressing your logic easier. In Spark, the colors mix in dramatic ways with every demonstrated use case.

As Tyler said, all this is described in the blog post itself, but I don't blame you for missing it, since it's a really long post :)


What do you think about Cloudera's Spark backend to Google Cloud Dataflow pipelines[1]?

[1]: https://github.com/cloudera/spark-dataflow


Hi, I'm also an engineer on the Dataflow team. This is a very exciting project! However it is still constrained by Spark's limitations outlined in this post in terms of what Dataflow pipelines it can run. For example, currently it doesn't support session windows, it windows based on processing time instead of event time, and it doesn't seem to support triggers (the word "trigger" doesn't appear in the code base). AFAIK Spark has plans to add support for event time, and then it will be possible to make this runner run more pipelines consistently with the semantics of the Dataflow model. It is also likely that the capabilities of this runner will expand greatly as https://wiki.apache.org/incubator/BeamProposal progresses.
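For readers unfamiliar with the term, a session window groups events into one window per burst of activity, closing the window once a gap longer than some threshold appears. Here's a minimal plain-Java sketch of the idea (deliberately not the Dataflow SDK — `sessions` and `gapMillis` are made-up names for illustration), assuming timestamps arrive sorted:

```java
import java.util.ArrayList;
import java.util.List;

class SessionDemo {
    // Group sorted millisecond timestamps into [start, end] sessions:
    // an event extends the current session if it falls within gapMillis
    // of the previous event; otherwise it starts a new session.
    static List<long[]> sessions(List<Long> sortedTimestamps, long gapMillis) {
        List<long[]> result = new ArrayList<>();
        long start = -1, end = -1;
        for (long ts : sortedTimestamps) {
            if (start < 0) {
                start = ts; end = ts;          // first event opens a session
            } else if (ts - end < gapMillis) {
                end = ts;                      // extend the current session
            } else {
                result.add(new long[]{start, end});
                start = ts; end = ts;          // gap exceeded: new session
            }
        }
        if (start >= 0) result.add(new long[]{start, end});
        return result;
    }

    public static void main(String[] args) {
        // Events at 0s, 2s, 3s, then 60s, with a 10s gap: two sessions.
        List<long[]> s = sessions(List.of(0L, 2_000L, 3_000L, 60_000L), 10_000L);
        System.out.println(s.size()); // 2
    }
}
```

The point of the Dataflow model is that this grouping is driven by event time, so a late-arriving event can still merge two sessions together — which is exactly what a processing-time-based runner can't reproduce.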


I think using Java as a comparison is fine. It's an officially supported platform, and while Scala is better, we do a lot of work using Java without problems.

The Datasets point is more interesting. There is no doubt that the way they unify the DataFrame/RDD programming models is better, but it is so new (1.6 only) that we certainly haven't migrated to it yet. The documentation isn't extensive, either: http://spark.apache.org/docs/latest/sql-programming-guide.ht...


I can't remember Google ever giving an open source project a kicking like the one it feels they've been giving Apache Spark recently; it seems very unlike them.


It's less Google and more Google engineers, and from that standpoint it is totally normal for them to kick the living daylights out of any code, particularly Google code.

Spark streaming really feels operationally immature compared to a lot of other stream processing frameworks, even Dataflow. The criticism is both unsurprising and warranted.


What other stream processing frameworks do you prefer in place of Spark Streaming? Storm is pretty mature, but does not play very well with YARN, last I tried. Flink looks pretty good, but is fairly new. Samza is another one. I'm curious to know if you have any specific issues with Spark Streaming's operational immaturity. Some things I don't like in general are:

1. The backpressure algorithm is fairly new, though it is pluggable.

2. It does not handle stragglers very well, in spite of backpressure; this comes from treating everything as batch.

3. Events from the system are not very rich and cannot be customized.

4. Error handling is very unclear, and does not offer a lot of flexibility.

In spite of these shortcomings, it has pretty good Kafka integration, mostly uses the same paradigm as batch, and plays well with Hadoop infrastructure. That makes it a decent choice for many use cases.
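For reference, the pluggable backpressure mentioned above is controlled through Spark configuration properties (available since Spark 1.5). A sketch of the relevant settings — the rate values here are illustrative, not recommendations:

```
# Rate-limit receivers based on observed scheduling delay
spark.streaming.backpressure.enabled=true
# Upper bound on records/sec per receiver while the algorithm calibrates
spark.streaming.receiver.maxRate=10000
# Equivalent cap for the direct Kafka API, per partition
spark.streaming.kafka.maxRatePerPartition=1000
```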


Storm has the community around it, though it shows signs of decline (perhaps the release of the YARN-friendly Heron will help). There are a lot of suitors for developer affection. Flink is relatively green. Samza looks good, but is still picking up momentum in the community. DataTorrent seems to have some momentum with its Apache Apex.

Regardless, most of these stream processing frameworks are still very much in early days and lack a lot of the sophistication you find in custom in house systems (such as found at... Google ;-). The open source world will no doubt catch up and overtake those systems, but right now there is still enough of a gap that it is rather painful.

Spark streaming pains:

1. Backpressure & stragglers. Duh.

2. Setup & teardown is still rough, even compared to Storm.

3. The whole context singleton thing means you need a new VM for each job, which annoys the #@$@#$ out of me.

4. Error handling isn't just unclear, it's kind of disastrous.

5. You can feel its "batch" heritage in lots of places, not just the stragglers. For some that is a feature; for me, a bug, even though with Storm I use Trident.

6. When a job runs amok, it's a pain to recover from it. Storm is no picnic either, but it is indeed better.


When there's an obvious misrepresentation of the competition like in this piece, it's a big red flag for me. I think it hurts PR in the end.


Spark windowing functions (like pretty much all notions of time in Spark streaming) are based on processing time, not event time.
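To make the distinction concrete, here's a small plain-Java sketch (deliberately not a Spark API — `Event`, `countByWindow`, and the field names are made up for illustration) showing how the same two events land in different fixed windows depending on which timestamp you window by:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WindowingDemo {
    // Hypothetical event: the timestamp it occurred (eventTime)
    // vs. when the pipeline first saw it (arrivalTime), both in ms.
    static class Event {
        final long eventTime, arrivalTime;
        Event(long eventTime, long arrivalTime) {
            this.eventTime = eventTime;
            this.arrivalTime = arrivalTime;
        }
    }

    // Count events per fixed 10-second window, keyed by window start (ms).
    static Map<Long, Integer> countByWindow(List<Event> events, boolean useEventTime) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (Event e : events) {
            long ts = useEventTime ? e.eventTime : e.arrivalTime;
            long windowStart = (ts / 10_000) * 10_000;
            counts.merge(windowStart, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Event> events = new ArrayList<>();
        // An event from t=5s that arrived late, at t=25s,
        // plus an on-time event at t=12s.
        events.add(new Event(5_000, 25_000));
        events.add(new Event(12_000, 13_000));

        // Event-time windows put the late event back where it belongs, [0s, 10s);
        // processing-time windows count it wherever it happened to arrive.
        System.out.println(countByWindow(events, true));  // {0=1, 10000=1}
        System.out.println(countByWindow(events, false)); // {10000=1, 20000=1}
    }
}
```

Processing-time results depend on network delays and restarts, so re-running the same data can give different counts; event-time windowing is what makes the counts reproducible.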


So weird/annoying that it isn't at least event count, if not event time.



