Working with Binary Data in Python (devdungeon.com)
202 points by signa11 on Dec 7, 2015 | 32 comments



If you've got something simple like a binary file full of uints (e.g. a raw image), then numpy provides fromfile which is incredibly simple to use. If you have tabular data in a known format, you can also pass fairly complicated datatypes.

np.fromfile('data.dat', dtype='uint8')

https://docs.scipy.org/doc/numpy/reference/generated/numpy.f...

EDIT: Obviously you have the added benefit that your data is returned into a numpy array for easy reshaping etc
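As a rough sketch of the "fairly complicated datatypes" point: a structured dtype lets fromfile read fixed-size records straight into named columns. The record layout (a uint16 id plus two float32 values) and the file name below are made up for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical record layout: little-endian uint16 id, then two float32 values.
record = np.dtype([("id", "<u2"), ("x", "<f4"), ("y", "<f4")])

# Write three sample records first so the example is self-contained.
sample = np.array([(1, 0.5, 1.5), (2, 2.5, 3.5), (3, 4.5, 5.5)], dtype=record)
path = os.path.join(tempfile.mkdtemp(), "records.dat")
sample.tofile(path)

# Read the file back; each named field behaves like a column.
table = np.fromfile(path, dtype=record)
ids = table["id"].tolist()
xs = table["x"].tolist()
```

For a raw image, the same call with dtype='uint8' followed by .reshape(height, width) should do, assuming you know the dimensions up front.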


This is a good article, but what I've been struggling with recently is dealing with literal binary data in Python.

What I mean by that is I literally have data stored in memory, and I want to deal with it in arbitrary groups of bits.

Sometimes 1 bit, sometimes 7 bits, rarely 13 bits.

My Google searches haven't turned up much precedent.

</off topic complaint>
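For what it's worth, a minimal standard-library sketch of reading arbitrary bit groups might look like the following. The BitReader class is a hypothetical helper, not part of any library, and it assumes the bits are packed most-significant-bit first.

```python
class BitReader:
    """Read unsigned integers of arbitrary bit width from bytes, MSB first."""

    def __init__(self, data: bytes):
        # Treat the whole buffer as one big integer and slice it with shifts.
        self._value = int.from_bytes(data, "big")
        self._total = len(data) * 8
        self._pos = 0

    def read(self, nbits: int) -> int:
        if nbits < 0 or self._pos + nbits > self._total:
            raise ValueError("not enough bits left")
        shift = self._total - self._pos - nbits
        self._pos += nbits
        return (self._value >> shift) & ((1 << nbits) - 1)


reader = BitReader(bytes([0b10110011, 0b01010101, 0b11110000]))
flag = reader.read(1)    # a 1-bit field
small = reader.read(7)   # a 7-bit field
wide = reader.read(13)   # a 13-bit field that crosses byte boundaries
```

Converting the whole buffer to a single int keeps the code short but is only sensible for data that fits comfortably in memory; for large streams a dedicated library is probably the better choice.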


Have you tried bitarray[1]?

[1]: https://pypi.python.org/pypi/bitarray


It was that easy? Man. Looking forward to giving these a shot...


I've used it quite heavily to parse data that comes in 85- and 1275-bit chunks and it works awesome.

Edit: sorry, it was bitstring, not bitarray. bitstring.BitArray :).


Obligatory XKCD: https://xkcd.com/353/


+1 from me. Excellent library



Why couldn't I figure this out by myself?! Thanks!



It's rough, but I recently explored/compared various struct-like packages https://github.com/moreati/structlytyped


Very cool. Any conclusions from the experiment?

Disclosure: I'm the primary author/maintainer of suitcase (https://github.com/digidotcom/python-suitcase). In your example, you can shave off a line by doing the following:

    example = FIDOAttestation.from_data(a)


No conclusions yet, the script (together with https://github.com/moreati/parsedover) is mainly a dumping ground/personal notes while I try various parsing approaches. https://github.com/moreati/structlytyped was a diversion into parsing binary data.

I'm trying various not-regex syntaxes. My motivation is: we all agree regex is bad, but we keep using it because it's available everywhere and hence the least common denominator (a Schelling point). What would it take to get beyond regex as the go-to for pattern matching/parsing? At least for casual use.


I found construct quite good for parsing binary data too. I had used ANTLR with Java but needed something simpler, and for Python.

https://pypi.python.org/pypi/construct



Thanks for the link, I didn't know about construct!

A colleague of mine recently developed a similar (from the looks of it) library, in Python and Javascript:

https://github.com/jfjlaros/bin-parser


Unfortunately it hasn't been under active development for several years. Lots of good ideas though.


While I understand the ease of use pure Python provides, I think when manipulating real binary or in memory data, a little bit of C would really simplify and speed things up.

Writing raw C extensions can be a pain, but with Cython or CFFI, using C from Python has become extremely easy.


> when manipulating real binary or in memory data, a little bit of C would really simplify

In what sense? Accessing an octet in a buffer and performing basic bitwise operations on it is equivalent:

    # C _and_ Python
    some_buffer[idx] & 0x80

Reading a buffer in Python is slightly less error-prone: errors are automatically converted to exceptions, which can be caught at a more centralized location in the code. In C, one must check for errors (and manually propagate them) at every read. This applies a bit less in C++, though even there the stdlib leaves some things to be desired (but generally speaking, those gaps can be wrapped away).

Text encoding/decoding, struct packing, and base64 aren't in C or its standard library.

Now, in terms of speed, C wins hands down. (I wrote a Minecraft file parser in both C++ and Python, and the difference was several orders of magnitude.)
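To make the points about exceptions and the standard library concrete, here is a small sketch. The record format (a flags byte, a big-endian uint16 length, then a UTF-8 payload) and both function names are invented for illustration.

```python
import base64
import struct

# Hypothetical header: 1-byte flags, then a big-endian uint16 payload length.
HEADER = struct.Struct(">BH")


def parse_record(buf: bytes) -> dict:
    # unpack_from raises struct.error if buf is shorter than the header.
    flags, length = HEADER.unpack_from(buf, 0)
    payload = buf[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    return {
        "high_bit": bool(flags & 0x80),
        # decode raises UnicodeDecodeError (a ValueError) on invalid bytes.
        "text": payload.decode("utf-8"),
    }


def parse_base64_record(encoded: str):
    # One centralized handler instead of a check after every read.
    try:
        return parse_record(base64.b64decode(encoded))
    except (struct.error, ValueError) as exc:
        return f"bad record: {exc}"


good = base64.b64encode(HEADER.pack(0x81, 5) + b"hello").decode("ascii")
bad = base64.b64encode(b"\x81").decode("ascii")
result = parse_base64_record(good)
error = parse_base64_record(bad)
```

None of the individual reads checks a return code; a short buffer, a bad length, or invalid text all end up in the single except clause.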


Distributing that becomes much more of a pain, though. Unless it's a major component of the software, I'd really rather keep it in pure Python.


This seems intuitively obvious, but can you think of an example that wouldn't be answered with a competitive implementation using Cython or Numba?


This is one of those articles I want to "save" in my HN account but I don't know how.


If you upvote a story, it will be saved on your profile page under the "saved stories" link.


Can't believe I haven't heard of this before. Thanks!


I wish I had this while doing my assignment on implementing DES. I switched to Java for the next assignment, which involved manipulating binary data.


This type of work is much easier in Python 3.
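A few of the things that make it easier, as a quick sketch (the sample bytes are arbitrary): bytes is its own type, indexing it gives an int rather than a one-character string, and int.from_bytes and bytes.hex (the latter from Python 3.5) cover conversions that used to need struct or binascii.

```python
import struct

data = bytes([0x12, 0x34, 0xAB, 0xCD])

first = data[0]                          # indexing bytes gives an int
word = int.from_bytes(data[:2], "big")   # big-endian 16-bit value, 0x1234
lo, hi = struct.unpack("<HH", data)      # two little-endian uint16 values
as_hex = data.hex()                      # hex string of the whole buffer
masked = first & 0x0F                    # bitwise ops work directly on the int
```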


It's still very cumbersome, especially compared with the state of the art. You may want to look at how it works in Erlang.


Erlang's binary matching is beyond awesome, and OCaml has a pretty good implementation of it as well (Bitstring) [1].

Cool thing about Elixir is of course that awesome Erlang binary handling is available and very easy to use.

Lots of people have been lobbying for something like this in Rust, given its low-level orientation, but there's been resistance to the idea from the core devs, who want it to be done using syntax extension.

1. https://rwmj.wordpress.com/2010/02/03/on-the-awesomeness-of-...


Agreed. I remember unpacking binary data in Python 2.7 during my late undergrad and finding it much more cumbersome. I bookmarked the article for the next time I'm doing something like that.


All Python 3 does is add an onion layer around Unicode. Fortunately, it looks like these techniques will work for Python 2.7 or 3.x, with 2.7 obviously being the preference for pretty much everyone who isn't a Python 3 developer.


The cases in which one would work with binary data are few and far between. And for the purposes of preserving sanity across architectures, I often find using a cross-platform file format like HDF5 helpful.


I disagree. The cases where I would do so voluntarily are limited, but I've been forced to deal with binary data on several occasions. HDF5 is also my go-to if I am in control, but sometimes you're given data that is only in some obscure binary format.

Unless you were implying that such work is best done in another language. I can buy that, but I found Python to be a convenient choice for it.



