Working with Binary Data in Python

joshvm · on Dec 7, 2015

If you've got something simple like a binary file full of uints (e.g. a raw image), then numpy provides fromfile which is incredibly simple to use. If you have tabular data in a known format, you can also pass fairly complicated datatypes.

import numpy as np
np.fromfile('data.dat', dtype='uint8')

https://docs.scipy.org/doc/numpy/reference/generated/numpy.f...

EDIT: Obviously you have the added benefit that your data is returned into a numpy array for easy reshaping etc
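
A minimal sketch of the "fairly complicated datatypes" case. The record layout (a little-endian uint16 id followed by a float32 value) and the scratch file name are hypothetical, chosen only for illustration. The file is written first so the snippet runs on its own:

```python
import os
import tempfile

import numpy as np

# Hypothetical record layout: little-endian uint16 id, then float32 value.
record = np.dtype([("id", "<u2"), ("value", "<f4")])

# Write a few records to a scratch file so there is something to read back.
data = np.array([(1, 0.5), (2, 1.5), (3, 2.5)], dtype=record)
path = os.path.join(tempfile.mkdtemp(), "records.dat")
data.tofile(path)

# fromfile interprets the raw bytes according to the structured dtype.
loaded = np.fromfile(path, dtype=record)

# Fields are addressable by name, like columns of a table.
ids = loaded["id"]
values = loaded["value"]
```

For the raw-image case, the flat uint8 array can be given its shape with something like `.reshape(height, width)`, assuming you know the dimensions from elsewhere.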

robzyb · on Dec 7, 2015

This is a good article, but what I've been struggling with recently is dealing with literal binary data in Python.

What I mean by that is I literally have data stored in memory, and I want to deal with it in arbitrary groups of bits.

Sometimes 1 bit, sometimes 7 bits, rarely 13 bits.

My Google searches haven't turned up much precedent.
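
For reference, a stdlib-only sketch of that kind of access. The helper name read_bits and the MSB-first bit numbering are assumptions made for the example, not an established API. It converts the whole buffer to one int, which is fine for small buffers but not tuned for large ones:

```python
def read_bits(data: bytes, bit_offset: int, width: int) -> int:
    """Return `width` bits starting at `bit_offset` as an unsigned int.

    Bits are numbered MSB-first: bit 0 is the top bit of data[0].
    """
    total_bits = len(data) * 8
    if width < 0 or bit_offset < 0 or bit_offset + width > total_bits:
        raise ValueError("bit range outside buffer")
    # Treat the buffer as one big-endian integer, then shift and mask.
    as_int = int.from_bytes(data, "big")
    shift = total_bits - bit_offset - width
    return (as_int >> shift) & ((1 << width) - 1)


buf = bytes([0b10110011, 0b01010101, 0b11110000])
one_bit = read_bits(buf, 0, 1)         # the first bit
seven_bits = read_bits(buf, 1, 7)      # the rest of the first byte
thirteen_bits = read_bits(buf, 8, 13)  # spans the second and third bytes
```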

</off topic complaint>

RazerM · on Dec 7, 2015

Have you tried bitarray[1]?

[1]: https://pypi.python.org/pypi/bitarray

robzyb · on Dec 7, 2015

It was that easy? Man. Looking forward to giving these a shot...

tonyarkles · on Dec 7, 2015

I've used it quite heavily to parse data that comes in 85- and 1275-bit chunks and it works awesome.

Edit: sorry, it was bitstring, not bitarray. bitstring.BitArray :).

toyg · on Dec 7, 2015

Obligatory XKCD: https://xkcd.com/353/

_yy · on Dec 7, 2015

+1 from me. Excellent library

switch007 · on Dec 7, 2015

Also https://pythonhosted.org/bitstring

robzyb · on Dec 7, 2015

Why couldn't I figure this out by myself?! Thanks!

ddorian43 · on Dec 7, 2015

or https://pypi.python.org/pypi/intbitset/2.2.1

moreati · on Dec 7, 2015

It's rough, but I recently explored/compared various struct-like packages https://github.com/moreati/structlytyped

posborne · on Dec 7, 2015

Very cool. Any conclusions from the experiment?

Disclosure: I'm the primary author/maintainer of suitcase (https://github.com/digidotcom/python-suitcase). In your example, you can shave off a line by doing the following:

    example = FIDOAttestation.from_data(a)

moreati · on Dec 8, 2015

No conclusions yet, the script (together with https://github.com/moreati/parsedover) is mainly a dumping ground/personal notes while I try various parsing approaches. https://github.com/moreati/structlytyped was a diversion into parsing binary data.

I'm trying various not-regex syntaxes. My motivation is: we all agree regex is bad, but we keep using it because it's available everywhere and hence the least common denominator (a Schelling point). What would it take to get beyond regex as the go-to for pattern matching/parsing? At least for casual use.

kinow · on Dec 7, 2015

I found construct quite good for parsing binary data too. I had used ANTLR with Java but needed something simpler, and for Python.

https://pypi.python.org/pypi/construct

js2 · on Dec 7, 2015

Here's an example of fairly extensive use of construct:

http://amnesia.gtisc.gatech.edu/~moyix/minidump.py

Explanation:

http://moyix.blogspot.com/2008/05/parsing-windows-minidumps....

mvermaat · on Dec 7, 2015

Thanks for the link, I didn't know about construct!

A colleague of mine recently developed a similar (from the looks of it) library, in Python and Javascript:

https://github.com/jfjlaros/bin-parser

luch · on Dec 7, 2015

Unfortunately it hasn't been under active development for several years. Lots of good ideas though.

joiguru · on Dec 7, 2015

While I understand the ease of use pure Python provides, I think when manipulating real binary or in memory data, a little bit of C would really simplify and speed things up.

Writing raw C extensions can be a pain, but with Cython or CFFI using C from Python has become extremely easy.

deathanatos · on Dec 7, 2015

> when manipulating real binary or in memory data, a little bit of C would really simplify

In what sense? Accessing an octet in a buffer and performing basic bitwise operations on it is equivalent:

   # C _and_ Python
   some_buffer[idx] & 0x80

Reading a buffer in Python is slightly less error-prone, as errors are automatically converted to exceptions, which can be caught at a more centralized location in the code; in C, one must check for errors (and manually propagate those errors) at every read; this applies a bit less in C++, though even there, the stdlib leaves some things to be desired (but generally speaking, they can be wrapped away).

Text encoding/decoding, struct packing, and base64 aren't in C or its standard library.
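
As a rough illustration of the Python side of this, here is a sketch using the standard struct and base64 modules. The header layout (little-endian u16 id, u8 flags, u32 length) and the name parse_header are made up for the example. The point is that a short buffer surfaces as an exception rather than as a return code to check at every read:

```python
import base64
import struct

# Hypothetical header: little-endian u16 id, u8 flags, u32 length (7 bytes).
HEADER = struct.Struct("<HBI")


def parse_header(buf: bytes):
    """Return (id, high_flag_set, length) or raise ValueError if truncated."""
    try:
        rec_id, flags, length = HEADER.unpack_from(buf, 0)
    except struct.error as exc:
        # One place to translate low-level failures into a domain error.
        raise ValueError(f"truncated header: {exc}") from exc
    return rec_id, bool(flags & 0x80), length


packed = HEADER.pack(513, 0x81, 1024)
parsed = parse_header(packed)

# base64 round trip of the same bytes, e.g. for embedding in text.
encoded = base64.b64encode(packed)
decoded = base64.b64decode(encoded)
```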

Now, in terms of speed, C wins hands down. (I wrote a minecraft file parser in both C++ and Python, and the difference was several orders of magnitude.)

icebraining · on Dec 7, 2015

Distributing that becomes much more of a pain, though. Unless it's a major component of the software, I'd really rather keep it in pure Python.

rch · on Dec 7, 2015

This seems intuitively obvious, but can you think of an example that wouldn't be answered with a competitive implementation using Cython or Numba?

aswanson · on Dec 7, 2015

This is one of those articles I want to "save" in my hn account but I don't know how.

pault · on Dec 7, 2015

If you upvote a story, it will be saved on your profile page under the "saved stories" link.

rudolf0 · on Dec 7, 2015

Can't believe I haven't heard of this before. Thanks!

mataug · on Dec 7, 2015

I wish I had this while doing my assignment on implementing DES. I switched to Java for the next assignment, which involved manipulating binary data.

polar · on Dec 7, 2015

This type of work is much easier in Python 3.
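
For instance, a few of the Python 3 conveniences this presumably refers to (that is my assumption, since the comment doesn't list them): indexing bytes yields an int, and int.from_bytes, int.to_bytes and bytes.hex are built in. The payload here is arbitrary:

```python
payload = bytes([0xDE, 0xAD, 0xBE, 0xEF])

# Indexing bytes gives an int in Python 3 (a 1-character str in Python 2).
first = payload[0]
high_bit_set = bool(first & 0x80)

# Whole-buffer integer conversion without struct or manual shifting.
value = int.from_bytes(payload, "big")
round_trip = value.to_bytes(4, "big")

# Hex text for debugging (bytes.hex needs Python 3.5+).
hex_text = payload.hex()
```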

dozzie · on Dec 7, 2015

It's still very cumbersome, especially compared with state of the art. You may want to look at how it works in Erlang.

jnbiche · on Dec 7, 2015

Erlang's binary matching is beyond awesome, and OCaml has a pretty good implementation of it as well (Bitstring) [1].

Cool thing about Elixir is of course that awesome Erlang binary handling is available and very easy to use.

Lots of people have been lobbying for something like this in Rust, given its low-level orientation, but there's been resistance to the idea from the core devs, who want it to be done using syntax extension.

1. https://rwmj.wordpress.com/2010/02/03/on-the-awesomeness-of-...

danbolt · on Dec 7, 2015

Agreed. I remember unpacking binary data in Python 2.7 during my late undergrad and finding it much more cumbersome. I've bookmarked the article for the next time I'm doing something like that.

daveguy · on Dec 7, 2015

All Python 3 does is add an onion layer around Unicode. Fortunately it looks like these techniques will work for Python 2.7 or 3.x, with 2.7 obviously being the preference for pretty much everyone who isn't a Python 3 developer.

coherentpony · on Dec 7, 2015

The cases in which one would work with binary data are few and far between. And for the purposes of preserving sanity across architectures, I often find using a cross-platform file format like HDF5 helpful.

gh02t · on Dec 7, 2015

I disagree. The cases where I would do so voluntarily are limited, but I've been forced to deal with binary data on several occasions. HDF5 is also my go-to if I am in control, but sometimes you're given data that is only in some obscure binary format.

Unless you were implying that such work is best done in another language. I can buy that, but I found Python to be a convenient choice for it.