Hacker News
Why can’t you pickle generators in Python? A pattern for saving training state (metaoptimize.com)
17 points by bravura on Dec 22, 2009 | hide | past | favorite | 6 comments



Pickling generators is one step from being able to pickle a continuation or save the state of a running function, which is actually a useful thing to do. A while back I wrote a script that spawned several processes, each of which had to post multiple jobs to Mechanical Turk in a certain sequence. My language Sleep supports continuations (and generators) and lets me serialize (pickle) my continuations. I just spawned a thread for each "process" and periodically saved that executing function to disk. Later, when I wanted to restart the program, it would look on disk for the saved functions, load them, and execute them. Very fun.

I wrote an article about how to do this kind of stuff a while back:

http://today.java.net/pub/a/today/2008/07/24/fun-with-contin...


Seems like a misguided way of solving a basic refactoring problem. What he actually wants to pickle is a data structure that the generator depends on. The use of an ALL-CAPS magic variable name and global state rings all sorts of warning bells in my head when I look at it.

However, it's hard to tell why he actually needs to pickle generators from looking at his examples. I can think of some good hypothetical reasons, but his examples don't justify it - it looks equivalent to 'I want to pickle a suspended thread' to me.
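
For background on why the question comes up at all: CPython refuses outright to pickle a live generator, since doing so would mean serializing a suspended stack frame. A minimal demonstration:

```python
import pickle

def numbers():
    for i in range(3):
        yield i

gen = numbers()
next(gen)  # the generator now holds live frame state

try:
    pickle.dumps(gen)
    picklable = True
except TypeError:
    picklable = False
# picklable is False: CPython cannot serialize a generator's stack frame
```

This is exactly the "pickle a suspended thread" problem: the interpreter would have to capture instruction pointer, locals, and stack, which `pickle` does not do.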


I am the original author.

The example holds without the ALL CAPS magic variable name, "HYPERPARAMETERS". However, I included HYPERPARAMETERS because this is the actual code I use. Hyperparameters are global, read-only variables that specify the particular experimental condition being tested. I can't claim to have the best solution to this particular aspect of experimental control; I might write a blog post about it in the future to solicit feedback on improved methods. However, I have refined my current approach over several years, and I work under the assumption of one experiment per process, and hence one set of hyperparameters, specified at invocation, per process. This assumption has saved me a lot of pain. As I said, I am interested in discussing alternatives.

The training state, however, is not global. You can pickle the training state objects individually. I was considering an ugly global way to refactor, but instead I used this pattern, which is the reason I wrote the article.

I do come to the conclusion that you must pickle the data structure that the generator depends on. This is why I refactor it into a class with __getstate__ and __setstate__ methods.

The reason I want to pickle generators is as follows: generators are the easiest way to write methods that stream input. If you stop and restart, you want to stream from where you left off, so refactoring these generators is on the critical path to persisting your experimental state.
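
The refactoring pattern the author describes can be sketched roughly as follows. This is a minimal illustration, not the article's actual code: the class name `CorpusStream` is hypothetical, and it assumes a deterministic, replayable data source. The object pickles a position counter instead of the generator, then replays to that position on unpickling:

```python
import pickle

class CorpusStream:
    """Restartable stream: pickles a position counter, not the generator."""

    def __init__(self, data):
        self.data = data          # assumed deterministic source
        self.position = 0         # number of items already yielded
        self._gen = self._make_gen()

    def _make_gen(self):
        gen = iter(self.data)
        # Replay already-consumed items so we resume where we left off.
        for _ in range(self.position):
            next(gen)
        return gen

    def __iter__(self):
        return self

    def __next__(self):
        item = next(self._gen)
        self.position += 1
        return item

    def __getstate__(self):
        # Drop the unpicklable generator; keep only the counter and source.
        state = self.__dict__.copy()
        del state["_gen"]
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._gen = self._make_gen()  # rebuild by replaying

stream = CorpusStream([10, 20, 30])
next(stream), next(stream)            # consume two items
clone = pickle.loads(pickle.dumps(stream))
resumed = next(clone)                 # 30: resumes where we left off
```

The design choice is exactly what the comment above says: the pickled state is the data structure the generator depends on (source plus counter), not the generator's frame.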


Nice! I haven't needed to pickle generators, but that should come in handy when I do.


If I'm reading it correctly, he's not really pickling the generator itself. Instead, he's counting the number of times he called the generator and pickling that instead. Then, to "unpickle", he retrieves the count and calls the generator that many times.

Two major limitations of this approach: (1) You can only pickle generators that generate the same sequence every time they are restarted. (2) All the work the generator did prior to pickling must be performed again on unpickling.


Good criticisms.

(1) You can only pickle generators that generate the same sequence every time they are restarted.

I don't know how you can persist state if you do not make this assumption.

(2) All the work the generator did prior to pickling must be performed again on unpickling.

Something faster would be to use file.tell() to capture the stream position and file.seek() to restore it. Since the "unpickling" is not a bottleneck, I didn't optimize this.
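
That tell/seek alternative might look like the sketch below. The corpus file and its one-example-per-line layout are hypothetical; the point is that the entire resumable state is a single byte offset, so restoring is O(1) rather than a full replay:

```python
import os
import pickle
import tempfile

# Toy corpus: one training example per line (hypothetical layout).
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w") as f:
    f.write("example-1\nexample-2\nexample-3\n")

f = open(path)
f.readline()                  # stream the first example
offset = f.tell()             # the whole "training position" is one offset
f.close()

saved = pickle.dumps(offset)  # cheap to save...

f = open(path)
f.seek(pickle.loads(saved))   # ...and cheap to restore: no replaying
resumed = f.readline()
f.close()
os.remove(path)
```

Note that in text mode, tell() returns an opaque cookie that is only valid to pass back to seek() on the same file; that is all this pattern needs.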





