The lack of an intermediate pose estimation step is interesting to me. It seems like the network would need to learn something similar on its own. Maybe it effectively recognizes different image patches and their relevant movements, so the full pose is just extra information that makes things less efficient?
Is the third dimension of the CNN for time?
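To make the question concrete: if the third dimension is time (an assumption on my part, not something I have confirmed), then each kernel spans a few consecutive frames as well as a spatial patch. Here is a minimal sketch of the shape arithmetic for an unpadded 3D convolution. The clip size and kernel size are made-up values for illustration only.

```python
def conv3d_output_shape(input_shape, kernel_shape, stride=(1, 1, 1)):
    """Return the (time, height, width) output size of an unpadded 3D convolution."""
    return tuple(
        (size - k) // s + 1
        for size, k, s in zip(input_shape, kernel_shape, stride)
    )


# Hypothetical clip: 16 frames of 112x112 pixels (time, height, width).
clip_shape = (16, 112, 112)

# Hypothetical kernel: looks at 3 consecutive frames and a 3x3 spatial patch.
kernel_shape = (3, 3, 3)

out_shape = conv3d_output_shape(clip_shape, kernel_shape)
print(out_shape)  # (14, 110, 110)
```

With a temporal kernel size of 1 this reduces to an ordinary per-frame 2D convolution, so the temporal extent of the kernel is what would let the filters respond to motion rather than just appearance.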
I wonder if it could be useful to have a pose estimate that complements the pixels, rather than throwing out the pixels after pose estimation. That way, if the pose wasn't detected perfectly or there was other relevant information in the scene, it would all be considered, while a correct pose estimate could still help the network.
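One way this could look, as a rough sketch rather than anything from the work being discussed: render each detected keypoint as a Gaussian heatmap and append those heatmaps to the RGB channels, so the network sees both. The function names `keypoint_heatmap` and `stack_pose_with_pixels`, the Gaussian rendering, and the tiny 4x4 frame are all assumptions for illustration.

```python
import math


def keypoint_heatmap(height, width, cx, cy, sigma=1.0):
    """Render one keypoint at column cx, row cy as a 2D Gaussian bump."""
    return [
        [
            math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


def stack_pose_with_pixels(rgb_channels, keypoints, sigma=1.0):
    """Append one heatmap channel per keypoint to the image channels.

    rgb_channels: list of channels, each a height x width list of floats.
    keypoints: list of (cx, cy) pixel coordinates from a pose estimator.
    """
    height = len(rgb_channels[0])
    width = len(rgb_channels[0][0])
    heatmaps = [
        keypoint_heatmap(height, width, cx, cy, sigma) for cx, cy in keypoints
    ]
    # The pixels are kept as-is; the pose is added as extra channels.
    return rgb_channels + heatmaps


# Hypothetical 4x4 RGB frame (all zeros) and two detected keypoints.
frame = [[[0.0] * 4 for _ in range(4)] for _ in range(3)]
stacked = stack_pose_with_pixels(frame, keypoints=[(1, 2), (3, 0)])
print(len(stacked))  # 5 channels: 3 for RGB plus 2 pose heatmaps
```

If the pose detector misses a joint, the corresponding heatmap is just less informative, and the raw pixels are still there for the network to fall back on.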