I switched from web dev to data science some years ago, and surprisingly couldn't find a streaming parallelizer for Python -- every package assumes you've loaded the whole dataset into memory. Had to write my own.
Same with video: parsers and tooling frequently expect a whole mp4 file, or a whole video, to be there before they can parse it, yet the gstreamer/ffmpeg APIs deliver the content as a stream of buffers that you have to process one at a time.
Traditionally, ffmpeg would build the mp4 container while the transcoded media is written to disk (in a single contiguous mdat box after the ftyp) and then put the track descriptions and sample tables in a moov at the end of the file. That's efficient because you can't precisely size the moov before you've processed all the media (in one pass).
But when you load such a file into a <video> element, it of course needs to buffer the entire file to find the moov box needed to decode the NAL units (in the case of avc1).
A simple solution was then to repackage the file, moving the moov from the end to before the mdat (adjusting the chunk offsets). Back in the day, that would make your video start instantly!
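For illustration, a minimal sketch of that repackaging step driven from Python, using ffmpeg's faststart flag (file names are made up; it assumes an ffmpeg binary on PATH, and -c copy remuxes without re-encoding):

import subprocess

# Remux only: -c copy keeps the encoded media untouched, +faststart makes the
# muxer do a second pass over the output so the moov ends up before the mdat.
subprocess.run(
    ["ffmpeg", "-i", "slow_start.mp4", "-c", "copy",
     "-movflags", "+faststart", "fast_start.mp4"],
    check=True,
)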
This is basically what CMAF is: the ftyp and moov get sent at the beginning (and are frequently written out as an init segment), and then the rest of the stream is a continuous sequence of moofs and mdats, chunked per gstreamer/ffmpeg specifics.
I was thinking of progressive MP4, with the sample table in the moov. But yes, CMAF and other fragmented MP4 profiles have the ftyp and moov at the front, too.
Rather than putting the media in one contiguous blob, CMAF interleaves it with moofs that hold the sample byte ranges and timing. Moreover, while this interleaving allows most of a CMAF file to be progressively streamed to disk as the media is created, it has the same catch-22 as the "progressive" MP4 file: the index (sidx, in the case of CMAF) cannot be written at the start of the file until all the media it indexes has been processed.
When writing CMAF, ffmpeg will usually omit the segment index, which makes fast seeking painful. To insert the `sidx` (after ftyp+moov but before the moof+mdat pairs) you need to repackage (but not re-encode).
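As a rough sketch of what that repackaging can look like (not necessarily how any given pipeline does it): ffmpeg's mov muxer has a global_sidx movflag that writes a single sidx near the front when remuxing into a fragmented file, though exact behaviour depends on the ffmpeg version. File names here are made up:

import subprocess

# Remux without re-encoding: refragment at keyframes and ask the muxer to write
# one global sidx near the start of the output.
subprocess.run(
    ["ffmpeg", "-i", "cmaf_no_index.mp4", "-c", "copy",
     "-movflags", "+frag_keyframe+global_sidx", "indexed.mp4"],
    check=True,
)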
It is possible that this is not a fault of the parser or tooling. In some cases, specifically when the video file is not targeted for streaming, the moov atom is at the end of the mp4. The moov atom is required for playback.
That's intentional, and it can be very handy. Zip files were designed so that you make an archive self-extracting. They made it so that you could strap a self-extraction binary to the front of the archive, which - rather obviously - could never have been done if the executable code followed the archive.
But the thing is that the executable can be anything, so if what you want to do is to bundle an arbitrary application plus all its resources into a single file, all you need to do is zip up the resources and append the zipfile to the compiled executable. Then at runtime the application opens its own $0 as a zipfile. It Just Works.
Also, it makes it easier to append new files to an existing zip archive. No need to adjust an existing header (and potentially slide the whole archive around if the header size changes), just append the data and append a new footer.
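A minimal sketch of both tricks with Python's stdlib zipfile (file and resource names are made up): zipfile locates the central directory by scanning from the end, and mode "a" on a non-zip file appends a fresh archive to it, which is exactly the exe+zip bundling trick.

import shutil
import sys
import zipfile

# Bundle: copy the compiled executable, then append a zip of the resources.
shutil.copyfile("myapp", "myapp.bundled")
with zipfile.ZipFile("myapp.bundled", "a") as zf:   # the exe bytes become prefix data
    zf.write("resources/config.json", "config.json")

# Later: appending another file never disturbs the existing entries.
with zipfile.ZipFile("myapp.bundled", "a") as zf:
    zf.write("resources/logo.png", "logo.png")

# At runtime, the bundled application would open its own $0 as a zip archive:
with zipfile.ZipFile(sys.argv[0]) as zf:
    config = zf.read("config.json")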
I’ve found the Rust ecosystem to be very good about never assuming you have enough memory for anything, and it usually supports streaming styles of use where possible.
Ha! I was literally thinking of the libs for parsing h264/5 and mp4 in Rust (so not using unsafe gstreamer/ffmpeg code) when moaning a little here.
Generally I find the Rust libraries and crates to be well designed around readers and writers.
My experience over the last few weeks led me to a similar belief, somewhat. For rather uninteresting reasons I decided I wanted to create mp4 videos of an animation programmatically.
The first solution suggested when googling around is to just create all the frames, save them to disk, and then let ffmpeg do its thing from there. I would have just gone with that for a one-off task, but it's a pretty bad solution if the video is long, or high res, or both. Plus, what I really wanted was to build something more "scalable/flexible".
Maybe I didn't know the right keywords to search for, but there really didn't seem to be many options for creating frames, piping them straight to an encoder, and writing just the final video file to disk. The only one I found that seemed like it could maybe do it the way I had in mind was VidGear[1] (Python). I had figured that with the popularity of streaming, and video in general on the web, there would be so much more tooling for these sorts of things.
I ended up digging way deeper into this than I had intended, and built myself something on top of Membrane[2] (Elixir).
It sounds like a misunderstanding of the MPEG concept. For an encode to be made efficiently, the encoder needs to see more than one frame of video at a time. Sure, I-frame-only encoding is possible, but it's not efficient and the result isn't really distributable. The encoder wants to see multiple frames at a time so that P and B frames can be used. Also, the way to get the best bang for the bandwidth buck is to use multipass encoding. Can't do that if all of the frames don't exist yet.
You have to remember how old the technology you are trying to use is, and then consider the power of the computers available when it was made. MPEG-2 encoding used to require a dedicated expansion card because the CPUs didn't have decent instructions for the encoding. Now that's all native to the CPU, which makes those old code bases feel archaic.
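To make the multipass point concrete, a hedged sketch of a classic two-pass encode driven from Python (file names and the 2M target bitrate are made up; it assumes ffmpeg with libx264 on PATH). Pass one has to see every frame before pass two can spend the bits optimally:

import subprocess

common = ["ffmpeg", "-y", "-i", "in.mp4", "-c:v", "libx264", "-b:v", "2M"]

# Pass 1: analyse the whole video, write a stats file, throw the output away.
subprocess.run(common + ["-pass", "1", "-an", "-f", "null", "/dev/null"], check=True)
# Pass 2: use the stats to distribute the bitrate across all frames.
subprocess.run(common + ["-pass", "2", "out.mp4"], check=True)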
No doubt that my limited understanding of these technologies came with some naive expectations of what's possible and how it should work.
Looking into it, and working through it, part of my experience was a lack of resources at the level of abstraction I was trying to work at. It felt like I was missing something: on one end there are video editors that power billion-dollar industries, on the other end you directly embed the ffmpeg libs into your project and do things in a way that requires full understanding of all the parts and how they fit together, and there is little to nothing in between.
Putting a glorified powerpoint in an mp4 to distribute doesn't feel to me like it is the kind of task where the prerequisite knowledge includes what the difference between yuv420 and yuv422 is or what Annex B or AVC are.
My initial expectation was that there has to be some in-between solution. Before I set out, what I thought would happen is that I `npm install` some module, create frames with node-canvas, stream them into this lib, and get an mp4 out the other end that I can send to disk or S3 as I please.* Worrying about the nitty-gritty details, like how efficient it is, how many frames it buffers, or how optimized the output is, would come later.
Going through this whole thing, I now wonder how Instagram/TikTok/Telegram and co. handle the initial rendering of their video stories/reels, because I doubt it's anywhere close to the process I ended up with.
* That's roughly how my setup works now, just not in JS. I'm sure it could be another 10x faster at least, if done differently, but for now it works and lets me continue with what I was trying to do in the first place.
This sounds like "I don't know what a wheel is, but if I chisel this square to be more efficient it might work". Sometimes, it's better to not reinvent the wheel, but just use the wheel.
Pretty much everyone serving video uses DASH or HLS so that there are many versions of the encoding at different bit rates, frame sizes, and audio settings. The player determines if it can play the streams and keeps stepping down until it finds one it can use.
Edit:
>Putting a glorified powerpoint in an mp4 to distribute doesn't feel to me like it is the kind of task where the prerequisite knowledge includes what the difference between yuv420 and yuv422 is or what Annex B or AVC are.
This is the beauty of using mature software. You don't need to know this any more. Encoders can now set the profile/level and bit depth to whatever is appropriate. I don't have the charts memorized for when to use what profile at what level. In the early days, the decoders were so immature that you absolutely needed to know the decoder's abilities to ensure a compatible encode was made. Now the decoders are so mature, and even native to the CPU, that the only real limitation is bandwidth.
Of course, all of this is strictly talking about the video/audio. Most people are totally unaware that you can put programming inside of an MP4 container that allows for interaction similar to DVD menus: jump to different videos, select different audio tracks, etc.
> This sounds like "I don't know what a wheel is, but if I chisel this square to be more efficient it might work". Sometimes, it's better to not reinvent the wheel, but just use the wheel.
I'm not sure I can follow. This isn't specific to MP4 as far as I can tell. MP4 is what I cared about, because it's specific to my use case, but it wasn't the source of my woes. If my target had been a more adaptive or streaming friendly format, the problem would have still been to get there at all. Getting raw, code-generated bitmaps into the pipeline was the tricky part I did not find a straightforward solution for. As far as I am able to tell, settling on a different format would have left me in the exact same problem space in that regard.
The need to convert my raw bitmap from rgba to yuv420 among other things (and figuring that out first) was an implementation detail that came with the stack I chose. My surprise lies only in the fact that this was the best option I could come up with, and a simpler solution like I described (that isn't using ffmpeg-cli, manually or via spawning a process from code) wasn't readily available.
> You don't need to know this any more.
To get to the point where an encoder could take over, pick a profile, and take care of the rest was the tricky part that required me to learn what these terms meant in the first place. If you have any suggestions of how I could have gone about this in a simpler way, I would be more than happy to learn more.
Using ffmpeg as the example, you can put -f in front of -i to describe what the incoming format is, so your homebrew exporter can write to stdout and pipe into ffmpeg, which reads from stdin with '-i -'. More specifically, '-f bmp -i -' would expect the incoming data stream to be in the BMP format. You can select any format for the codecs installed ('ffmpeg -codecs').
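For instance, a minimal sketch of that pipeline in Python, using raw RGB frames instead of BMP to keep it dependency-free (frame size, rate, frame count, and output name are made up; it assumes ffmpeg with libx264 on PATH):

import subprocess

WIDTH, HEIGHT, FPS, FRAMES = 640, 360, 30, 300

ffmpeg = subprocess.Popen(
    [
        "ffmpeg",
        "-f", "rawvideo",            # input format: raw frames on stdin
        "-pix_fmt", "rgb24",
        "-s", f"{WIDTH}x{HEIGHT}",
        "-r", str(FPS),
        "-i", "-",                   # read from stdin
        "-c:v", "libx264",
        "-pix_fmt", "yuv420p",       # widely compatible output pixel format
        "out.mp4",
    ],
    stdin=subprocess.PIPE,
)

for n in range(FRAMES):
    # Dummy animation: a flat colour that shifts a little every frame.
    frame = bytes([n % 256, 0, 255 - n % 256]) * (WIDTH * HEIGHT)
    ffmpeg.stdin.write(frame)

ffmpeg.stdin.close()
ffmpeg.wait()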
In a way, that's good. The few hundred video encoding specialists who exist in the world have, per person, had a huge impact on the world.
Compare that to web developers, who in total have had probably a larger impact on the world, but per head it is far lower.
Part of engineering is to use the fewest people possible to have the biggest benefit for the most people. Video did that well - I suspect partly by being 'hard'.
There are many packages that can do that, like Vaex [1] and Dask [2]. I don't know your exact workflow. But concurrency in Python is limited to multiprocessing, which is much more expensive than the threads that a typical streaming parallelizer would use outside the Python world.
I've looked into the samples and recall what the problem was: geopandas support was in an experimental branch, and you had to lock yourself into Dask -- plus, the geopandas code had to be rewritten completely for Dask. So I wrote my own processor that applies the same function in a map-and-reduce fashion and keeps the code compatible with Jupyter notebooks -- you decorate functions to make them parallelizable, but can still import them and call them normally.
https://github.com/culebron/erde
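Not erde's actual API (see the repo above for that), just a stdlib-only sketch of the general pattern being described: decorate a function so a parallel map-and-reduce version is attached, while the plain function stays importable and callable as usual.

from concurrent.futures import ProcessPoolExecutor

def parallelizable(chunk_size=10_000, workers=4):
    """Attach a .parallel() helper that maps the function over chunks in worker processes."""
    def decorator(func):
        def parallel(items):
            chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
            with ProcessPoolExecutor(max_workers=workers) as pool:
                per_chunk = list(pool.map(func, chunks))   # "map" step
            # "Reduce" step: flatten the per-chunk results back together.
            return [x for chunk in per_chunk for x in chunk]
        func.parallel = parallel   # attach rather than replace the plain function
        return func
    return decorator

@parallelizable(chunk_size=2, workers=2)
def double(rows):
    return [r * 2 for r in rows]

if __name__ == "__main__":
    print(double([1, 2, 3]))                 # normal call, runs in-process
    print(double.parallel([1, 2, 3, 4, 5]))  # chunked across worker processes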
This is actually the old SAX vs DOM xml parsing discussion in disguise.
SAX is harder but has at least two key, strongly related benefits: 1. It can handle a continuous firehose. 2. Processing can start before the load is complete (because it might never complete), so the time to first useful action can be greatly reduced.
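A small sketch of the streaming side in Python, using xml.etree.ElementTree.iterparse rather than a literal SAX handler (the file name and the <record> tag are made up for illustration): elements are processed and discarded as they arrive instead of materialising a whole DOM.

import xml.etree.ElementTree as ET

count = 0
for event, elem in ET.iterparse("records.xml", events=("end",)):
    if elem.tag == "record":
        count += 1      # do the per-record work here, as soon as it is available
        elem.clear()    # drop the element so memory stays flat
print(count)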
> On an average desktop/server system the OS would automatically take care of putting whatever fits in RAM and the rest on disk.
This is not true, unless you’re referring to swap (which is a configuration of the system and may not be big enough to actually fit it either; many people run with only a small amount of swap or disable it altogether.)
You may be referring to mmap(2), which will map the on-disk dataset to a region of memory that is paged in on-demand, but somehow I doubt that’s what OP was referring to either.
If you just read() the file into memory and work on it, you’re going to be using a ton of RAM. The OS will only put “the rest on disk” if it swaps, which is a degenerate performance case, and it may not even be the dataset itself that gets swapped (the kernel may opt to swap everything else on the system to fit your dataset into RAM. All pages are equal in the eyes of the virtual memory layer, and the ranking algorithm is basically an LRU cache.)
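To make that distinction concrete in Python terms (the file name is made up): read() materialises a copy of the data in the process's heap, while mmap just maps the file and lets pages come and go on demand.

import mmap

# Naive: the entire file becomes a bytes object in this process's heap.
with open("bigdata.bin", "rb") as f:
    data = f.read()

# Mapped: nothing is copied up front; pages are faulted in as they are touched,
# and the kernel is free to drop them again under memory pressure.
with open("bigdata.bin", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        total = sum(mm[i] for i in range(0, len(mm), 4096))  # touch one byte per page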
Fair enough, it’s totally possible that’s what they meant. But the complaint of “every package assumes you loaded the whole dataset in memory” seems to imply the package just naively reads the file in. I mean, if the package was mmapping it, they probably wouldn’t have had much trouble with memory enough for it to be an issue they’ve had to complain about. Also, you may not always have the luxury of mmap()’ing, if you’re reading data from a socket (network connection, stdout from some other command, etc.)
I don’t do much python but I used to do a lot of ruby, and it was rare to see anyone mmap’ing anything, most people just did File.read(path) and called it a day. If the norm in the python ecosystem is to mmap things, then you’re probably right.
That’s a really misleading thing to say. If the kernel already has the thing you’re read()’ing cached, then yes the kernel can skip the disk read as an optimization. But by reading, you’re taking those bytes and putting a copy of them in the process’s heap space, which makes it no longer just a “cache”. You’re now “using memory”.
read() is not mmap(). You can’t just say “oh I’ll read the file in and the OS will take care of it”. It doesn’t work that way.
> If the kernel already has the thing you’re read()’ing cached
which it would do, if you have just downloaded the file.
> But by reading, you’re taking those bytes and putting a copy of them in the process’s heap space
i mean, you just downloaded it from the network, so unless you think mmap() can hold buffers on the network card, there's definitely going to be a copy going on. now it's downloaded, you don't need to do it again, so we're only talking the one copy here.
> You can’t just say “oh I’ll read the file in and the OS will take care of it”. It doesn’t work that way.
i can and do. and you've already explained swap sufficiently that i believe you know you can do exactly that also.
Please keep in mind the context of the discussion. A prior poster made a claim that they can read a file into memory, and that it won’t actually use any additional memory because the kernel will “automatically take care” of it somehow. This is plainly false.
You come in and say something to the effect of “but it may not have to read from disk because it’s cached”, which… has nothing to do with what was being discussed. We’re not talking about whether it incurs a disk read, we’re talking about whether it will run your system out of memory trying to load it into RAM.
> i mean, you just downloaded it from the network, so unless you think mmap() can hold buffers on the network card, there's definitely going to be a copy going on. now it's downloaded, you don't need to do it again, so we're only talking the one copy here.
What in god’s holy name are you blathering about? If I “just downloaded it from the network”, it’s on-disk. If I mmap() the disk contents, there’s no copy going on, it’s “mapped” to disk. If I read() the contents, which is what you said I should do, then another copy of the data is now sitting in a buffer in my process’s heap. This extra copy is now "using" memory, and if I keep doing this, I will run the system out of RAM. This is characteristically different from mmap(), where a region of memory maps to a file on-disk, and contents are faulted into memory as I read them. The reason this is an extremely important distinction, is that in the mmap scenario, the kernel is free to free the read-in pages any time it wants, and they will be faulted back again in if I try to read them again. Contrast this with using read(), which makes it so the kernel can't free the pages, because they're buffers in my process's heap, and are not considered file-backed from the kernel's perspective.
> i can and do. and you've already explained swap sufficiently that i believe you know you can do exactly that also.
Swap is disabled on my system. Even if it wasn’t, I’d only have so much of it. Even if I had a ton of it, read()’ing 100GB of data and relying on swap to save me is going to grind the rest of the system to a halt as the kernel tries to make room for it (because the data is in my heap, and thus isn’t file-backed, so the kernel can’t just free the pages and read them back from the file I read it from.) read() is not mmap(). Please don’t conflate them.
Yep I do the same. If I have a server with hundreds of GB or even TB of RAM (not uncommon these days) I'm not setting up swap. If you're exhausting that much RAM, swap is only going to delay the inevitable. Fix your program.
> A prior poster made a claim that they can read a file into memory, and that it won’t actually use any additional memory because the kernel will “automatically take care” of it somehow. This is plainly false.
Nobody made that claim except in your head.
Why don't you read it now:
Just curious: why would you not load the entire dataset "into memory" ("into memory" from a Python perspective)?
Look carefully: There's no mention of the word file. For all you or I know the programmer is imagining something like this:
>>> data=loaddata("https://...")
Or perhaps it's an S3 bucket. There is no file, only the data set. That's more or less exactly what I do.
On an average desktop/server system the OS would automatically take care of putting whatever fits in RAM and the rest on disk.
You know exactly that this is what is meant by swap: we just confirmed that. And you know it is enabled on every average desktop/server system, because you
> Swap is disabled on my system
are the sort of person who disables the average configuration! Can you not see you aren't arguing with anything but your own fantasies?
> If I “just downloaded it from the network”, it’s on-disk.
That's nonsense. It's in ram. That's the block cache you were just talking about.
> If I mmap() the disk contents, there’s no copy going on, it’s “mapped” to disk
Every word of that is nonsense. The disk is attached to a serial bus. Even if you're using fancy nvme "disks", there's a queue that operates in a (reasonably) serial fashion. The reason mmap() is referred to as zero-copy is because it can reuse the block cache if it has been recently downloaded -- but if the data is paged out, there is absolutely a copy, and it's more expensive than just read() by a long way.
> Even if it wasn’t, I’d only have so much of it. Even if I had a ton of it, read()’ing 100GB of data and relying on swap to save me is going to grind the rest of the system to a halt as the kernel tries to make room for it
You only have so much storage, this is life, but I can tell you as someone who does operate a 1tb of ram machine that downloads 300gb of logfiles every day, read() and write() work just fine -- I just can't speak for python (or why python people don't do it) because i don't like python.
You're basically just gish galloping at this point and there's no need to respond to you any more. All your points about swap are irrelevant to the discussion. All your points about disk cache are irrelevant to the discussion. You have a very, very, very incorrect understanding of how operating system kernels work if you think mmap() is just a less efficient read():
> Every word of that is nonsense. The disk is attached to a serial bus. Even if you're using fancy nvme "disks", there's a queue that operates in a (reasonably) serial fashion. The reason mmap() is referred to as zero-copy is because it can reuse the block cache if it has been recently downloaded -- but if the data is paged out, there is absolutely a copy, and it's more expensive than just read() by a long way.
Please do a basic web search for what "virtual memory" means. You seem to think that handwavey nonsense about disk cache means that read() doesn't copy the data into your working set. You should look at the manpage for read() and maybe ponder why it requires you to pass your own buffer. This buffer would have to be something you've malloc()'d ahead of time. Hence why you're using more memory by using read() than you would using mmap()
> You only have so much storage, this is life, but I can tell you as someone who does operate a 1tb of ram machine that downloads 300gb of logfiles every day, read() and write() work just fine -- I just can't speak for python (or why python people don't do it) because i don't like python.
You should definitely learn what the mmap syscall does and why it exists. You really don't need to use 300gb of RAM to read a 300gb log file. You should probably look up how text editors like vim actually work, and why you can seek to the end of a 300gb log file without vim taking up a bunch of RAM. Maybe you've never been curious about this before, I dunno.
Try making a 20gb file called "bigfile", and run this C program:
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    char *mmap_buffer;
    int fd;
    struct stat sb;
    ssize_t s;

    fd = open("./bigfile", O_RDONLY);
    fstat(fd, &sb); // get the size

    // do the mmap
    mmap_buffer = mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

    uint8_t sum = 0; // initialized so the checksum is well-defined
    // Ensure the whole buffer is touched by doing a dumb math operation on every byte
    for (size_t i = 0; i < sb.st_size; ++i) {
        sum += mmap_buffer[i]; // overflow is fine, sorta a dumb checksum
    }
    fprintf(stderr, "done! sum=%d\n", sum);
    sleep(1000);
}
And wait for "done!" to show up in stderr. It will then sleep 1000 seconds, waiting for you to ctrl+c. At this point we will have (1) mmap'd the entire file into memory, and (2) read the whole thing sequentially, adding each byte to a `sum` variable (with expected overflow.)
While it's sleeping, check /proc/<pid>/status and take note of the memory stats. You'll see that VmSize is as big as the file you read in (for me, bigfile is more than 20GB):
VmSize: 20974160 kB
But the actual resident set is 968kb:
VmRSS: 968 kB
So, my program is using 968 kB even though it has a 20GB in-memory buffer that just read the whole file in! My system only has 16GB of RAM and swap is disabled.
How is this possible? Because mmap lets you do this. The kernel will read in pages from bigfile on demand, but is also free to free them at any point. There is no controversy here, every modern operating system has supported this for decades.
Compare this to a similar program using read():
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    int fd;
    struct stat sb;
    ssize_t s;
    char *read_buffer;

    fd = open("./bigfile", O_RDONLY);
    fstat(fd, &sb); // get the size

    // do the read
    read_buffer = malloc(sb.st_size);
    read(fd, read_buffer, sb.st_size);

    uint8_t sum = 0; // initialized so the checksum is well-defined
    // Ensure the whole buffer is touched by doing a dumb math operation on every byte
    for (size_t i = 0; i < sb.st_size; ++i) {
        sum += read_buffer[i]; // overflow is fine, sorta a dumb checksum
    }
    fprintf(stderr, "done! sum=%d\n", sum);
    sleep(1000);
}
And do the same thing (in this instance I shrunk bigfile to 2 GB because I don't have enough physical RAM to do this with 20GB). Check /proc/<pid>/status again, and this time the resident set is the size of the whole file.
Oops! I'm using 2 GB of resident set size. If I were to do this with a file that's bigger than RAM, I'd get OOM-killed.
This is why you shouldn't read() in large datasets. Your strategy is to have servers with 1TB of ram and massive amounts of swap, and I'm telling you that you don't need that to process files this big. mmap() does so without requiring everything to be read into RAM ahead of time.
Oh, and guess what: Take out the `sleep(1000)`, and the mmap version is faster than the read() version:
$ time ./mmap
done! sum=225
./mmap 6.89s user 0.28s system 99% cpu 7.180 total
$ time ./read
done! sum=246
./read 6.86s user 1.72s system 99% cpu 8.612 total
Why is it faster? Because we don't have to needlessly copy the data into the process's heap. We can just read the blocks directly from the mmap'd address space, and let page faults read them in for us.
(Edit: because I noticed this too, and it got me curious, the reason for the incorrect result is that I didn't check the result of the read() call, and it was actually reading fewer bytes, by a small handful. read() is allowed to do this, and it's up to callers to call it again. It was reading 2147479552 bytes when the file size was 2147483648 bytes. If anything this should have made the read implementation faster, but mmap still wins even though it read more bytes. A fixed version follows, and now produces the same "sum" as the mmap):
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    int fd;
    struct stat sb;
    ssize_t s;
    ssize_t read_result;
    ssize_t bytes_read;
    char *read_buffer;

    fd = open("./bigfile", O_RDONLY);
    fstat(fd, &sb); // get the size

    // do the read, looping until read() reports EOF (0) or an error (-1)
    read_buffer = malloc(sb.st_size);
    bytes_read = 0;
    do {
        read_result = read(fd, read_buffer + bytes_read, sb.st_size - bytes_read);
        bytes_read += read_result;
    } while (read_result > 0);

    uint8_t sum = 0; // initialized so the checksum is well-defined
    // Ensure the whole buffer is touched by doing a dumb math operation on every byte
    for (size_t i = 0; i < sb.st_size; ++i) {
        sum += read_buffer[i]; // overflow is fine, sorta a dumb checksum
    }
    fprintf(stderr, "done! sum=%d\n", sum);
}
Depending on the source of the data, that is not as good as an actual streaming implementation. That is, if the data is coming from a network API, waiting for it to be "in memory" before processing it still means that you have to store the whole stream on the local machine before you even start. Even if we assume that you are storing it on disk and mmapping it into memory that's still not a good idea for many use cases.
Not to mention, if the code is not explicitly designed to work with a streaming approach, then even for local data early steps might accidentally end up touching the whole dataset (e.g. looking for a closing } in something like a 10GB JSON document) in unexpected places, costing orders of magnitude more than they should.
JSON discourages but does not forbid duplicate keys. In the case of duplicate keys, browsers generally let the last instance win. So if you want to be compatible with that, you must always read the whole document.
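A tiny illustration of that "last key wins" behaviour, using Python's json module (which behaves the same way as browsers' JSON.parse in this respect):

import json

doc = '{"a": 1, "a": 2}'
print(json.loads(doc))   # {'a': 2} -- the later duplicate silently wins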
Example: I'm querying a database, producing a file and storing it in object storage (S3). The dataset is 100 gigabytes in size. I should not require 100 gigabytes of memory or disk space to handle this single operation. It would be slower to write it to disk first.
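A hedged sketch of one way to do that kind of streaming export with S3's multipart upload API via boto3. The bucket, key, `cursor` object, and 64 MB part size are made up for illustration; only a buffer of roughly one part ever lives in memory, and nothing is written to local disk.

import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "export/results.csv"
PART_SIZE = 64 * 1024 * 1024   # multipart parts must be >= 5 MB (except the last)

upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts, part_number, buf = [], 1, bytearray()

for row in cursor:   # hypothetical database cursor yielding rows one at a time
    buf += (",".join(map(str, row)) + "\n").encode()
    if len(buf) >= PART_SIZE:
        resp = s3.upload_part(Bucket=bucket, Key=key, PartNumber=part_number,
                              UploadId=upload["UploadId"], Body=bytes(buf))
        parts.append({"ETag": resp["ETag"], "PartNumber": part_number})
        part_number += 1
        buf.clear()

if buf:   # flush whatever is left as the final (possibly small) part
    resp = s3.upload_part(Bucket=bucket, Key=key, PartNumber=part_number,
                          UploadId=upload["UploadId"], Body=bytes(buf))
    parts.append({"ETag": resp["ETag"], "PartNumber": part_number})

s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=upload["UploadId"],
                             MultipartUpload={"Parts": parts})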