In the May issue of the SMPTE Motion Imaging Journal, Aaron Chadha, Mohammad Ashraful Anam, Matthias Treder, Ilya Fadeev, and Yiannis Andreopoulos present their article, “Toward Generalized Psychovisual Preprocessing for Video Encoding.” The engineers, all with London, England’s iSIZE Technologies, explore the use of deep perceptual preprocessing to mimic key elements of human vision, automatically removing details from input video that are perceptually unimportant yet incur a large rate and complexity overhead during encoding. In rigorous tests, they demonstrate that their solution delivers significant bitrate savings across several generations of video encoders without breaking standards or requiring any changes to encoding, delivery, or decoding devices.
The iSIZE team notes that numerous studies have shown the signal-to-noise ratio to be a poor indicator of visual quality in video coding. Instead, the research community is moving toward perceptual optimization metrics that encapsulate elements of human perception, perceptual modeling of encoding artifacts, and awareness of the viewing setup. Current perceptual optimization approaches, however, have downsides: they require multiple encoding passes, optimize for a single metric to the detriment of other quality metrics, fail to encapsulate characteristics of human vision in a data-driven and learnable manner, or carry a complexity that makes them prohibitively costly for real-time application.
They propose that a generalized video preprocessing framework must satisfy five principles to be practical for video coding and streaming systems:
To this end, they present a deep neural network-based framework that meets all five principles. Their experiments with four different perceptual quality metrics show that this framework offers consistent bitrate savings over multiple state-of-the-art encoders for various content types. It also allows for very efficient runtime on processing hardware, making it applicable for a wide range of deployment scenarios.
Training phase: Their proposed generalized psychovisual preprocessor consists of three steps:
Deployment phase: Once the preprocessor has been optimized, both the virtual codec and psychovisual model are discarded. The trained preprocessor can now be deployed with any standard codec and setting, operating with a single pass over all resolutions and bitrates.
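The core idea of this training setup can be illustrated with a deliberately tiny sketch. The real system trains a deep neural network against a differentiable virtual codec and learned psychovisual models; here, as a labeled simplification, a single smoothing weight `w` stands in for the preprocessor, high-frequency energy stands in for the virtual codec's rate estimate, and mean squared deviation from the source stands in for the perceptual model. None of these proxies come from the article; they only show the rate-versus-fidelity trade-off being optimized.

```python
import numpy as np

rng = np.random.default_rng(0)
frame = rng.standard_normal(256)  # stand-in 1-D "frame" signal (hypothetical)

def preprocess(x, w):
    # Toy preprocessor: blend each sample with its neighbors' mean.
    # w=0 keeps all detail, w=1 smooths fully.
    neighbor_mean = (np.roll(x, 1) + np.roll(x, -1)) / 2
    return (1 - w) * x + w * neighbor_mean

def rate_proxy(x):
    # Virtual-codec stand-in: high-frequency energy drives up encoding cost.
    return np.mean(np.diff(x) ** 2)

def distortion_proxy(x, ref):
    # Psychovisual-model stand-in: penalize deviation from the source.
    return np.mean((x - ref) ** 2)

def loss(w, lam=0.5):
    # Joint objective: preserve perceived fidelity while lowering the rate proxy.
    y = preprocess(frame, w)
    return distortion_proxy(y, frame) + lam * rate_proxy(y)

# Minimize by scanning w; the real system instead runs gradient descent
# over the weights of a deep network.
ws = np.linspace(0.0, 1.0, 101)
best_w = ws[np.argmin([loss(w) for w in ws])]
```

The optimum lands strictly between 0 and 1: some detail is removed because it is expensive to encode, but not so much that fidelity collapses. This mirrors, in miniature, why the trained preprocessor can then be deployed alone, with the virtual codec and psychovisual model discarded.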
Since psychovisual preprocessing can be applied to many types of content under diverse encoding conditions, the team first evaluated their approach with three categories of 1080p content, the typical resolution users receive from premium online services: XIPH video, user-generated live music and sports, and user-generated game streaming applications. They found that savings were greatest for more challenging content like sports, gaming, and high-motion music videos, where motion-compensated prediction in encoders typically struggles to encapsulate the scene dynamics. In further tests over five resolutions and two different encoding technologies, the psychovisual preprocessing + encoder solution achieved bitrate savings of 11.5% with an AVC encoder and 20.5% with a VP9 encoder, compared to the use of the encoder alone. All experiments showed that their approach leads to bitrate savings for all metrics, all encoders, and all content types, with a low delay of less than 2 ms of inference time per frame.
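For readers unfamiliar with how figures like the 11.5% and 20.5% above are expressed, a percentage bitrate saving is simply the relative reduction in bitrate at matched quality. The helper below is a generic illustration of that arithmetic; the bitrate values in the usage line are made up and do not come from the article.

```python
def bitrate_savings_pct(baseline_kbps: float, preprocessed_kbps: float) -> float:
    """Percent reduction in bitrate versus encoding without preprocessing,
    assuming both encodes reach the same perceptual quality level."""
    return 100.0 * (baseline_kbps - preprocessed_kbps) / baseline_kbps

# Hypothetical measurements: 5000 kbps without preprocessing, 4425 kbps with it.
print(bitrate_savings_pct(5000.0, 4425.0))  # 11.5
```

In practice such savings are averaged across many sequences and quality levels (e.g., via Bjøntegaard-delta-style rate comparisons) rather than from a single pair of encodes.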
In short, the team demonstrated that their approach meets all five principles for a generalized psychovisual preprocessing system applicable to both video on demand and live content – without imposing any changes on encoding, packaging, transport, or decoding.
For a deep dive on the topic, read the complete article in this month’s SMPTE Motion Imaging Journal.
Also be sure to check out SMPTE’s newest virtual course, “Essentials of Video and Audio Compression,” beginning June 6. It’s a perfect way to gain a comprehensive background in the technology, standards, and typical workflows, as well as the complex hosting issues, performance considerations, and use cases associated with implementing compression solutions.
Keywords: AI, deep learning, deep neural networks, perceptual optimization, video delivery, compression, video encoding, bitrate