Launch HN: BuildFlow (YC W23) – The FastAPI of data pipelines

vosper · on March 15, 2023

Would you see Buildflow as a competitor to Dagster, Flink, or Spark Streaming?

I'm about to build a pipeline that needs to pass thousands of docs a minute through a variety of enrichments (ML models, third-party APIs, etc) and then dump the final enriched doc in ES.

There are so many pipeline products and workflow engines and MLOps solutions that I'm very confused about what technologies I should be looking at. I think something looks good (Temporal) but then read it's not really for large-volumes of streaming data. Or I look at Flink that can handle massive volumes but it doesn't seem like it's as easy to wire up as other options. I think Dagster looks nice but can't find any answer (even in their Slack) about what kind of volumes it can handle...

TankeJosh · on March 15, 2023

You can think of BuildFlow as a lightweight alternative to Flink / Spark Streaming. These streaming frameworks are great when you want to react to events in realtime (i.e. you want to trigger some processing logic every time a file is uploaded to cloud storage). Dagster is more focused on scheduling jobs, and might be a good fit if you have some batch jobs you want to trigger occasionally.

BuildFlow can run a simple PubSub -> light processing -> BigQuery pipeline at about 5-7k messages / second on a 4core VM (tested on GCP’s n1-standard-4 machines). For your case, you might be able to get away with running on a single machine with 4-8 cores.

I’d be happy to connect outside of HN if you’d like me to dig into your use case more! You can reach me at josh@launchflow.com

edit: You can also reach out on our discord: https://discordapp.com/invite/wz7fjHyrCA

vosper · on March 16, 2023

Thanks for that. Sounds like it might fit what we want. I'll reach out if I have any more questions.

Are you tied to GCP services like pubsub and BiqQuery? We're in AWS, not GCP.

TankeJosh · on March 16, 2023

AWS support is in the queue, but we only have GCP services at the moment. What services on AWS do you need access to? We can move them to the front of the queue to help out.

Feel free to reach out even if this doesn’t work with your timeline. I might be able to help you come up with another solution, and I’m always interested to hear new use cases!

faizshah · on March 15, 2023

Should we think of BuildFlow as an alternative to workflow managers like Prefect or kubeflow or is it a higher level library for stream processing like Beam?

calebtv · on March 15, 2023

More of a higher level library like Beam, and I could see it being plugged into a Prefect workflow.

faizshah · on March 15, 2023

I see, what fault tolerance mechanisms does it provide?

I don’t see anything on snapshotting or checkpointing like Flink. Is this just for stateless jobs?

calebtv · on March 15, 2023

We don't support any snapshotting or checkpointing directly in BuildFlow at the moment, but these are great features we should support.

But we do have some fault tolerance baked into our I/O operations. Specifically for Google Cloud Pub/Sub the acks don't happen until the data has been successfully processed and written to the sink, so if there is a bug or some transient failure the message will be resent later depending on your subscriber configuration.

calebtv · on March 15, 2023

I should also mention BuildFlow does support stateful processing with the Processor class API: https://www.buildflow.dev/docs/processors/overview#processor...

jcnnghm · on March 15, 2023

Is there an underlying stream processor (e.g. Flink)? How many messages per second can it process?

calebtv · on March 15, 2023

All of our processing is done via Ray (https://www.ray.io/). Our early benchmarks are about 5k mesesages per second on a single 4 core VM, but we believe we can increase the with some more optimizations.

This bench mark was consuming a Google Cloud Pub/Sub stream and outputting to BigQuery.

dmatrixjsd · on March 16, 2023

Delighted to hear your choice of Ray and building atop Ray.

lysecret · on March 15, 2023

Congrats! I had something quite similar in mind (also working a lot in python based streaming ETL). I am unfamiliar with Ray. A few questions: 1. Ray seems to be focussed on ML usecases, are you as well or are you a more generic streaming ETL framework. Can you explain the reasoning behind choosing Ray. 2. I see you deeply integrate Infrastructure (like BQ and Pub/Sub) what is your story on evolution of this infra? What happens if i have deployed infra through your code and I want to edit it? How do you deal with Dev/Prod/Qa stage divide. 3. What is your story on deployment of the "glue" code that runs your pipeline? Do you also handle multi stage pipelines?

calebtv · on March 15, 2023

Thanks! These are all great questions, apologies for the wall of text

1. We're definitely more of a generic streaming framework. But I could see ML being one of those use cases as well.

Why Ray? One of our main drivers was how "pythonic" ray feels, and that was a core principal we wanted in our framework. Most of my prior experience has been working with Beam, and Beam is great but it is kind of a whole new paradigm you have to learn. Another thing I really like about ray is how easy it is to run locally on your machine and get some real processing power. You can easily have ray use all of your cores and actually see how things scale without having to deploy to a cluster. I could probably go on and on haha, but those are the first two that come to mind.

2. We really want to support a bunch of frameworks / resources. We mainly choose BQ and Pub/Sub because of our prior experience. We have some github issues to support other resources across multiple clouds, and feel free to file some issues if you would like to see support for other things! With BuildFlow we deploy the resources to a project you own so you are free to edit them as you see fit. BuildFlow won't touch already created resource beyond making sure it can access them. In BuildFlow we don't really want to bake in environment specific logic, I think this is probably best handled with command line arguments to a BuildFlow pipeline. But happy to hear other thoughts here!

3. I'm not sure I understand what you mean by "glue", so apologies if this doesn't answer your question. The BuildFlow code gets deployed with your pipeline so it doesn't need to run remotely at all. So if you were deploying this to a single VM, you can just execute the python file on the VM and things will be running. We don't have great support for multi-stage pipelines at the moment. What you can do is chain together processors with a Pub/Sub feed. But we do really want to support chaining together processors themselves.

robertnishihara · on March 16, 2023

Congratulations on the launch, I love the focus on ease of use and making it easy to get started, and it's exciting to see impressive products being built with Ray!

I'm one of the Ray developers. It is true that Ray focuses a lot on ML applications (in particular, the main libraries built on top of Ray are for workloads like training, serving, and batch processing / inference). That said, one of our long-term goals with Ray is to be a great general-purpose way to build distributed applications, so I hope it is working out for you :)

calebtv · on March 16, 2023

Thanks for the kind words Robert! Our experience with Ray has been great so far, we're excited to see how we can use ray to help improve stream processing.

amath · on March 15, 2023

Cool, nice idea. Can you sub in different backend like bytewax (https://github.com/bytewax/bytewax) for stateful processing?

calebtv · on March 15, 2023

Thanks! Currently you can't, right now your only option is to use our ray runner. But we have talked about supporting different runner options similar to how Beam can be run on Spark, Dataflow, etc. And ultimately it would be nice if folks could implement their own runners, but I think we're still a ways out on that.

calebtv · on March 15, 2023

I should also mention BuildFlow does support stateful processing with the Processor class API: https://www.buildflow.dev/docs/processors/overview#processor...

Orangeair · on March 15, 2023

I think your site could use some copy editing. I was confused by the nested schema example [1], I don't even see NestedScema referenced after it's defined, maybe the float field should have used that type? Also noticed an instance of "BigQuer" on that page (not a particularly egregious typo, but I was on your site for all of thirty seconds).

[1] https://www.buildflow.dev/docs/schema-validation#examples

TankeJosh · on March 15, 2023

Thanks for the catch! We just pushed a fix.

0xDEF · on March 15, 2023

How does this compare with Airflow and Dagster?

calebtv · on March 15, 2023

Good question, I would say we're more focused on being a data pipeline engine as opposed to workflow orchestration. So you could use something like Airflow or Dagster to trigger your BuildFlow pipeline.

Kalanos · on March 15, 2023

by not addressing them and prefect in your initial post, it's a bit hit to credibility

TankeJosh · on March 15, 2023

We chose not to reference them because we are mainly focused on streaming use cases, which don't fit well in the prefect & dagster models.

Kalanos · on March 16, 2023

nice. that's an easy statement to make about competitive advantage since airflow explicitly states (or did state) that they are not for streaming

brap · on March 15, 2023

Congrats!

Just out of curiosity, it seems like the process function which you define has to run remotely on workers. How does it get serialized? Are there limitations to the process function due to serialization?

calebtv · on March 15, 2023

Thanks! The process function runs as a Ray Actor (https://docs.ray.io/en/latest/ray-core/actors.html). So we have the same serialization requirements as Ray (https://docs.ray.io/en/latest/ray-core/objects/serialization...)

I think the most common limitation will be ensure that your output is serializable. Typically returning python dictionaries or dataclasses is fine.

But if you had a specific limitation in mind let me know happy to dive into it!

calebtv · on March 15, 2023

One other thing I should mention that's relevent, we do also have a class abstraction instead of a decorator: https://github.com/launchflow/buildflow/blob/main/buildflow/...

This can help with things like setting up RPC clients. But it all boils down to the same runner whether you're using the class or decorator.

fhenrywells · on March 15, 2023

Do you see this as a direct competitor to Ray's built-in workflow abstraction https://docs.ray.io/en/latest/workflows/management.html

Exciting to see more libraries built on Ray in any case!

calebtv · on March 15, 2023

Great question! We actually looked at using the workflow abstraction for batch processing in our runner, but ultimately didn't because it was still in alpha (we use the dataset API for batch flows).

I think one area where we differ is our focus on streaming processing which I don't think is well supported with the workflow abstraction, and also having more resource management / use case driven IO.

fhenrywells · on March 17, 2023

Makes a ton of sense! I was present at the demo for this at last year's Ray conference and I definitely got the sense that a lot of the orchestration details were still being thought through, and that it was not yet a first-class streaming product.

Definitely like seeing more streaming-focused orchestration tools out there - it's a growing niche with not enough alternatives to Beam

TankeJosh · on March 17, 2023

We're thinking about attending this year's conference, so maybe we'll see you there :)

Kalanos · on March 15, 2023

will you support notebook execution and docker containers?

TankeJosh · on March 15, 2023

Notebook execution should already work! The flow.run(...) call returns the output collection with this use case in mind. We're currently working on a docker manager module which will let users easily dockerize / run their pipeline locally. The system (repl) debugger tool in our VSCode extension will manage all of the docker bits for the user.