Here's an anecdote. My friend's apartment was burglarized many years ago, and when we looked at the security footage we could clearly see the thieves taking everything, but their faces were impossible to recognize due to the low resolution. Ever since then I've kept thinking about a video codec that would store the whole video in low-res but detect faces and encode those regions in ultra high-res. I hope research like this can lead to better security.
Super resolution can only do so much: basically a 2x or 4x improvement at best, especially on features more complex than text. Also, as soon as you compress the video stream you lose a lot of the information these algorithms would actually rely on.
If a high-res stream is available, it's much better to use it directly. A basic face detection algorithm that regularly saves snapshots of detected faces would go a long way, and it's really simple to implement; see the sketch below.
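To illustrate how little is needed, here's a minimal sketch assuming OpenCV and its bundled Haar cascade (just one of many possible detectors; the one-second throttle and file naming are arbitrary choices):

```python
# Minimal sketch: save high-res face crops whenever a face appears in the stream.
import time
import cv2

cap = cv2.VideoCapture(0)  # the high-res camera stream
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

last_saved = 0.0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    # Throttle snapshots to roughly one per second to keep storage small.
    if len(faces) > 0 and time.time() - last_saved > 1.0:
        for i, (x, y, w, h) in enumerate(faces):
            cv2.imwrite(f"face_{int(time.time())}_{i}.png", frame[y:y+h, x:x+w])
        last_saved = time.time()
```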
Don't they do something like that to boost the resolution of space telescopes? Also I seem to remember reading something about processing a stream of images from an earthbound telescope to cancel out atmospheric distortion.
Space telescopes don't do super-resolution of the type used for video, but they do something a little similar. They use a technique called aperture synthesis, which combines signals from a collection of instruments so that together they have the same angular resolution as a single much larger virtual instrument.
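As a rough back-of-the-envelope illustration (the wavelength, dish size, and baseline below are assumed values on the order of a large radio interferometer, not any particular telescope), the synthesized resolution scales with the longest baseline rather than the dish diameter:

```python
# Toy comparison: diffraction limit of a single dish vs. the resolution of a
# synthesized aperture whose size is set by the longest baseline in the array.
import math

wavelength = 0.21        # metres (21 cm line, just an example)
dish_diameter = 25.0     # metres, a single instrument
max_baseline = 36_000.0  # metres, separation of the farthest pair in the array

single = 1.22 * wavelength / dish_diameter   # radians, ~lambda / D
synth = wavelength / max_baseline            # radians, ~lambda / B_max

arcsec = 180 / math.pi * 3600
print(f"single dish : {single * arcsec:9.1f} arcsec")
print(f"synthesized : {synth * arcsec:9.2f} arcsec")
```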
Many video conferencing/calling applications already do this, allocating more bits to regions recognized as faces when encoding. You can test it in FaceTime, for example, by showing your face versus obscuring enough of it that face detection fails.
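I don't know what any given app does internally, but the general idea can be sketched as a per-macroblock quantization offset map built from detected face boxes (purely conceptual, not a real encoder API; the 16-pixel blocks and the offset values are made-up assumptions):

```python
# Conceptual sketch: negative QP offsets make the encoder spend more bits
# on face blocks, positive offsets let the background degrade.
import numpy as np

def qp_offset_map(frame_h, frame_w, face_boxes, block=16,
                  face_offset=-6, background_offset=4):
    """Per-macroblock QP offset grid; negative means higher quality."""
    rows, cols = frame_h // block, frame_w // block
    offsets = np.full((rows, cols), background_offset, dtype=np.int8)
    for (x, y, w, h) in face_boxes:  # pixel-space bounding boxes
        r0, r1 = y // block, (y + h) // block + 1
        c0, c1 = x // block, (x + w) // block + 1
        offsets[r0:r1, c0:c1] = face_offset
    return offsets

# Example: one detected face in a 720p frame -> a 45x80 grid of offsets.
print(qp_offset_map(720, 1280, [(500, 200, 220, 260)]).shape)
```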
The dynamic gaze example really convinced me that eye tracking will be necessary for immersive VR. If you can achieve a 1+ order of magnitude improvement in rendering performance with no noticeable loss in quality... it would be very difficult to leave that on the table.
Not necessarily. The "lower framerate" outside our fovea isn't perceived as a stuttering sequence of frames but as a blurry, smooth flow. Simply rendering the periphery at a lower framerate would still be noticeable.
Unless you could engineer a display technology that reproduces that.
Dropping certain pixels is a peculiar way of reducing input quality. Why was that method chosen?
For 3D rendering I guess that's a kind of DLSS, but the paper focuses on video compression.
For video streams that doesn't seem to make sense. Video codecs are not pixel-based but block/frequency-based, so you can't save any bandwidth by dropping individual pixels. Raw pixels don't compress well, especially sparse, weakly correlated samples like these, so I wouldn't be surprised if sending just the reduced input for this algorithm cost more than sending a full video stream (see the toy comparison below). And existing video codecs can already vary quality within the frame very effectively by varying block sizes and quantization.
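To make the comparison concrete, here's a toy example (nothing like a real codec; the block contents, the 3-bytes-per-sample assumption, and the quantization step are all made up): the sparse route has to pay for positions as well as values, while the transform route keeps only a handful of quantized coefficients that then entropy-code well.

```python
# Toy illustration: 10% random samples of an 8x8 block vs. a coarsely
# quantized 2-D DCT of the same block.
import numpy as np
from scipy.fft import dctn

x, y = np.meshgrid(np.arange(8), np.arange(8))
block = 128 + 60 * np.sin(x / 3) + 40 * np.cos(y / 4)  # a smooth test block

# Sparse route: each surviving sample needs a position plus a value.
n_sparse = int(0.10 * block.size)            # ~6 samples
sparse_cost = n_sparse * 3                   # assume ~3 bytes each (x, y, value)

# Transform route: quantize the DCT and count the surviving coefficients.
q = 32
coeffs = np.round(dctn(block, norm="ortho") / q)
nonzero = np.count_nonzero(coeffs)           # these still get entropy-coded

print(f"sparse samples to store  : {n_sparse} (~{sparse_cost} bytes)")
print(f"nonzero DCT coefficients : {nonzero}")
```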
"Compression" is probably just poorly chosen wording. This has more to do with reducing the number of required samples in applications like eye-tracking VR, where you can choose to render a dense image for the part that the user is looking at, while reducing detail in peripheral vision. Current implementations use some fractional resolution(s) for the periphery and blend pixels using more traditional methods, which results in blurryness and/or aliasing artifacts.
Those are some pretty incredible results. For any single frame I found it hard to spot a significant quality loss between the DeepFovea frame and the reference (while looking at the foveal target and trying to compare peripheral quality, obviously), but in motion there was a lot of interframe noise / aliasing / jitter.
While I'm sure they'll improve on those issues, I'm currently wondering what kind of peripheral visual trade-offs I'd make; if I had a demo in front of me, I'd bet I'd prefer running at higher foveal settings / fidelity with peripheral artifacts to running at lower overall settings / fidelity to avoid them.
Had this idea at least 10 years ago. Have many ideas, that said...
Fovea-oriented compression can be useful for optimized bandwidth usage in video conferencing, too.
One could even implement auto-reframing of the video feed when several participants are in the same room, without needing a mechanically moving camera. Or something like liquid rescaling (seam carving) to still get a glimpse of the rest of the full frame.
Perhaps those ideas have since been patented, or even developed?
I haven't read the paper yet, but this is a silly demo. "Turn 10% of pixels black" is not a good baseline; they should use nearest-neighbor interpolation (or something similar) to fill the holes in the "sparse" video for a fair comparison (see the sketch below). Also, you can clearly see in the HD video that it's temporally unstable ("shimmering"), which is the same problem NVIDIA has had with DLSS forever; they need to build in temporal smoothing or users will hate it.
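For what it's worth, that fairer baseline is only a few lines; this sketch assumes SciPy and is just nearest-neighbour hole filling, not anything the paper itself compares against:

```python
# Fill the dropped pixels with the value of the nearest surviving pixel.
import numpy as np
from scipy.ndimage import distance_transform_edt

def nearest_neighbour_fill(sparse_frame, known_mask):
    # For every position, get the indices of the closest known pixel.
    iy, ix = distance_transform_edt(~known_mask, return_distances=False,
                                    return_indices=True)
    return sparse_frame[iy, ix]

# Example: keep a random 10% of a frame and reconstruct the rest.
rng = np.random.default_rng(0)
frame = rng.integers(0, 256, (1080, 1920), dtype=np.uint8)
mask = rng.random(frame.shape) < 0.10
filled = nearest_neighbour_fill(np.where(mask, frame, 0), mask)
```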
Can this be used to replace Photoshop's Content-aware fill as well? Or does it require some sparse sampling of the whole area that needs to be reconstructed?
OTOH, government surveillance...