In the May issue of the SMPTE Motion Imaging Journal, Aaron Chadha, Mohammad Ashraful Anam, Matthias Treder, Ilya Fadeev, and Yiannis Andreopoulos present their article, “Toward Generalized Psychovisual Preprocessing for Video Encoding.” The engineers, all with London, England’s iSIZE Technologies, explore the use of deep perceptual preprocessing to mimic key elements of human vision, automatically removing details from input video that are perceptually unimportant yet incur a large rate and complexity overhead during encoding. In rigorous tests, they demonstrate that their solution delivers significant bitrate savings across several generations of video encoders without breaking standards or requiring any changes to encoding, delivery, or decoding devices.
The iSIZE team notes that numerous studies have shown the signal-to-noise ratio to be a poor indicator of visual quality in video coding. Instead, the research community is moving toward perceptual optimization metrics that encapsulate elements of human perception, perceptual modeling of encoding artifacts, and awareness of the viewing setup. Current perceptual optimization approaches, however, have downsides: they require multiple encoding passes, optimize for a single metric to the detriment of other quality metrics, fail to encapsulate characteristics of human vision in a data-driven and learnable manner, or carry a complexity that makes them prohibitively costly for real-time application.
They propose that a generalized video preprocessing framework must satisfy five principles to be practical for video coding and streaming systems:
To this end, they present a deep neural network-based framework that meets all five principles. Their experiments with four different perceptual quality metrics show that this framework offers consistent bitrate savings over multiple state-of-the-art encoders for various content types. It also allows for very efficient runtime on processing hardware, making it applicable for a wide range of deployment scenarios.
Training phase: Their proposed generalized psychovisual preprocessor consists of three steps:
Deployment phase: Once the preprocessor has been optimized, both the virtual codec and psychovisual model are discarded. The trained preprocessor can now be deployed with any standard codec and setting, operating with a single pass over all resolutions and bitrates.
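The core idea of this training setup can be illustrated with a deliberately tiny sketch. The real system trains a deep neural network against a differentiable virtual codec and learned psychovisual models; here, as a labeled simplification, a single smoothing weight `w` stands in for the preprocessor, high-frequency energy stands in for the virtual codec's rate estimate, and mean squared deviation from the source stands in for the perceptual model. None of these proxies come from the article; they only show the rate-versus-fidelity trade-off being optimized.

```python
import numpy as np

rng = np.random.default_rng(0)
frame = rng.standard_normal(256)  # stand-in 1-D "frame" signal (hypothetical)

def preprocess(x, w):
    # Toy preprocessor: blend each sample with its neighbors' mean.
    # w=0 keeps all detail, w=1 smooths fully.
    neighbor_mean = (np.roll(x, 1) + np.roll(x, -1)) / 2
    return (1 - w) * x + w * neighbor_mean

def rate_proxy(x):
    # Virtual-codec stand-in: high-frequency energy drives up encoding cost.
    return np.mean(np.diff(x) ** 2)

def distortion_proxy(x, ref):
    # Psychovisual-model stand-in: penalize deviation from the source.
    return np.mean((x - ref) ** 2)

def loss(w, lam=0.5):
    # Joint objective: preserve perceived fidelity while lowering the rate proxy.
    y = preprocess(frame, w)
    return distortion_proxy(y, frame) + lam * rate_proxy(y)

# Minimize by scanning w; the real system instead runs gradient descent
# over the weights of a deep network.
ws = np.linspace(0.0, 1.0, 101)
best_w = ws[np.argmin([loss(w) for w in ws])]
```

The optimum lands strictly between 0 and 1: some detail is removed because it is expensive to encode, but not so much that fidelity collapses. This mirrors, in miniature, why the trained preprocessor can then be deployed alone, with the virtual codec and psychovisual model discarded.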
Since psychovisual preprocessing can be applied to many types of content under diverse encoding conditions, the team first evaluated their approach with three categories of 1080p content, the typical resolution users receive from premium online services: XIPH video, user-generated live music and sports, and user-generated game streaming applications. They found that savings were greatest for more challenging content like sports, gaming, and high-motion music videos, where motion-compensated prediction in encoders typically struggles to encapsulate the scene dynamics. In further tests over five resolutions and two different encoding technologies, the psychovisual preprocessing + encoder solution achieved bitrate savings of 11.5% with an AVC encoder and 20.5% with a VP9 encoder, compared to the use of the encoder alone. All experiments showed that their approach leads to bitrate savings for all metrics, all encoders, and all content types, with a low delay of less than 2 ms of inference time per frame.
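For readers unfamiliar with how figures like the 11.5% and 20.5% above are expressed, a percentage bitrate saving is simply the relative reduction in bitrate at matched quality. The helper below is a generic illustration of that arithmetic; the bitrate values in the usage line are made up and do not come from the article.

```python
def bitrate_savings_pct(baseline_kbps: float, preprocessed_kbps: float) -> float:
    """Percent reduction in bitrate versus encoding without preprocessing,
    assuming both encodes reach the same perceptual quality level."""
    return 100.0 * (baseline_kbps - preprocessed_kbps) / baseline_kbps

# Hypothetical measurements: 5000 kbps without preprocessing, 4425 kbps with it.
print(bitrate_savings_pct(5000.0, 4425.0))  # 11.5
```

In practice such savings are averaged across many sequences and quality levels (e.g., via Bjøntegaard-delta-style rate comparisons) rather than from a single pair of encodes.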
In short, the team demonstrated that their approach meets all five principles for a generalized psychovisual preprocessing system applicable to both video on demand and live content – without imposing any changes on encoding, packaging, transport, or decoding.
For a deep dive on the topic, read the complete article in this month’s SMPTE Motion Imaging Journal.
Also be sure to check out SMPTE’s newest virtual course, “Essentials of Video and Audio Compression,” beginning June 6. It’s a perfect way to gain a comprehensive background in the technology, standards, and typical workflows, as well as the complex hosting issues, performance considerations, and use cases associated with implementing compression solutions.
Keywords: AI, deep learning, deep neural networks, perceptual optimization, video delivery, compression, video encoding, bitrate