That looks great so far! Pair it with open cloze with translation once you determine that the user should know the word, and you've got magic going. You could even "beep" the word out of the recording (I'm assuming the recordings are TTS).
We use Microsoft TTS. I was trying to figure out whether it can give time ranges for beeping out words, but didn't see it. The holy grail would be just generating audio files to listen to, with no front end at all. :)
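For what it's worth, the beeping part is simple once you have a time range for the word from somewhere. Here's a rough sketch, assuming the audio is 16-bit mono PCM already loaded as a list of integer samples, and that the start and end times (in seconds) come from whatever timing data you can get. The function name `beep_out` and its parameters are made up for illustration and are not part of any TTS API.

```python
import math

def beep_out(samples, sample_rate, start_s, end_s, freq_hz=1000.0, volume=0.3):
    """Return a copy of 16-bit mono PCM samples with [start_s, end_s) replaced by a sine tone.

    Hypothetical helper: the time range is assumed to be supplied by the caller.
    """
    out = list(samples)  # copy so the original audio is left untouched
    # Convert the time range to sample indices, clamped to the clip.
    first = max(0, int(start_s * sample_rate))
    last = min(len(out), int(end_s * sample_rate))
    peak = int(volume * 32767)  # tone amplitude as a fraction of the 16-bit maximum
    for i in range(first, last):
        t = (i - first) / sample_rate  # seconds since the beep started
        out[i] = int(peak * math.sin(2 * math.pi * freq_hz * t))
    return out
```

You'd call it like `beeped = beep_out(samples, 16000, 1.20, 1.65)` to cover a word spoken between 1.20 s and 1.65 s in a 16 kHz clip, then write the samples back out with whatever audio library you already use.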