
AI-Driven Audio Engine in the Cloud Delivers Live XR Experiences

August 29, 2022

In the August issue of the SMPTE Motion Imaging Journal, Rob G. Oldfield, Max S. S. Walley, Ben G. Shirley, and Doug L. Williams present their research on “Cloud-Based AI for Automatic Audio Production for Personalized Immersive XR Experiences.” The 5G Edge-XR research project, led by British Telecommunications, explores how a combination of 5G connectivity and Graphics Processing Unit (GPU) cloud capability at the network edge can deliver immersive experiences – augmented, virtual, and mixed reality, collectively extended reality, or “XR” – to consumer devices. The article focuses primarily on the project’s audio system, which uses a machine-learning approach for automatic audio source recognition/extraction, composition, and mixing.

The AV content in XR experiences is typically made up of volumetric video (recorded so it appears to take up 3D space), broadcast production content, data, and additional feeds generated by devices including cameras and microphones. This content is encoded and uploaded over 5G to the edge GPU for processing, realtime scene compilation, and rendering that reflects the pose and orientation of the user’s headset or other device. The rendered scene is then delivered back over the 5G network and decoded at the end-user device. At all times, the viewer can freely change their viewpoint on an AR headset, creating a personalized, immersive, interactive experience.
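As a rough illustration of that per-frame exchange, the sketch below shows the kind of pose state a client device might send to the edge renderer on each frame. The struct and field names are assumptions for illustration, not details from the paper.

```cpp
#include <array>
#include <cstdint>

struct ViewerPose {
    std::array<float, 3> position;     // metres, in device-tracking coordinates
    std::array<float, 4> orientation;  // unit quaternion (x, y, z, w)
};

struct FrameRequest {
    std::uint64_t frame_index;      // monotonically increasing frame counter
    std::uint64_t capture_time_us;  // client timestamp, useful for latency measurement
    ViewerPose    pose;             // viewpoint the edge GPU should render
};

int main() {
    // Each display refresh, the client samples the headset pose and sends a
    // FrameRequest upstream over 5G; the edge compiles and renders the scene
    // for that pose and streams the encoded result back for decode and display.
    FrameRequest request{0, 0, {{0.0f, 1.6f, 0.5f}, {0.0f, 0.0f, 0.0f, 1.0f}}};
    (void)request;
    return 0;
}
```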

The project’s use case involves the generation of a photorealistic hologram of a boxing match that, using AR, appears to be situated on the viewer’s coffee table. The presentation is synchronized with the live TV broadcast feed and contains interactive elements that allow viewers to personalize it by selecting a replay or interacting with graphics for more information. Because viewers can walk around the hologram, the audio must change to match their changing perspective. Viewers hear an ambient audio “bed” carrying the background audio feed, but as they personalize content and navigate within a scene, they also receive a bespoke audio feed that matches the visual representation. This might include the sound of a punch, the shout of a trainer, or the blast of a referee’s whistle.
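One way to picture how that bespoke feed tracks the viewer is the small sketch below, which recomputes an object’s gain and azimuth from the viewer’s position on each frame. It is illustrative only, assuming simple inverse-distance attenuation, and is not drawn from the paper’s renderer.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>

struct Vec3 { float x, y, z; };

// Simple inverse-distance attenuation with a minimum-distance clamp.
float distanceGain(const Vec3& listener, const Vec3& source) {
    const float dx = source.x - listener.x;
    const float dy = source.y - listener.y;
    const float dz = source.z - listener.z;
    const float d = std::sqrt(dx * dx + dy * dy + dz * dz);
    return 1.0f / std::max(d, 0.25f);
}

// Azimuth of the source relative to the listener's facing direction (radians).
float relativeAzimuth(const Vec3& listener, float listenerYaw, const Vec3& source) {
    const float absolute = std::atan2(source.x - listener.x, source.z - listener.z);
    return absolute - listenerYaw;
}

int main() {
    const Vec3 listener{1.0f, 0.0f, 2.0f};
    const Vec3 punch{0.0f, 0.0f, 0.0f};  // e.g. a punch landing in the table-top ring
    std::printf("gain %.2f, azimuth %.2f rad\n",
                distanceGain(listener, punch),
                relativeAzimuth(listener, 0.0f, punch));
    return 0;
}
```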

The challenge from an audio perspective is to extract each sound source/object and localize it in space in realtime within a noisy environment. At the same time, it’s imperative to capture and preserve as much content, information, and metadata about the content as possible. The project therefore adopted an “object-based audio” paradigm. Audio objects are typically discrete audio sources in the scene with accompanying metadata describing their location, type, loudness, duration, format, and other attributes and signal statistics. These audio objects are rendered at specific points in space and overlaid on one or more ambient beds – for instance a base crowd sound, a home crowd sound, and an away crowd sound – which make up the immersive background.
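The sketch below gives one possible C++ shape for such an asset: a discrete source plus descriptive metadata, alongside the ambient beds that form the background. The field names are assumptions for illustration, not the paper’s actual schema.

```cpp
#include <string>
#include <vector>

// A discrete sound source plus the metadata needed to place and describe it.
struct AudioObject {
    std::vector<float> samples;        // mono PCM for this source
    float              position[3];    // metres, scene coordinates
    std::string        type;           // e.g. "punch", "whistle", "trainer_shout"
    float              loudness_lufs;  // measured loudness
    float              duration_s;     // length of the event
    int                sample_rate_hz; // format information
};

// A multichannel background stem, e.g. base crowd, home crowd, away crowd.
struct AmbientBed {
    std::vector<std::vector<float>> channels;
    std::string                     label;
};

// The complete scene: objects rendered at points in space, overlaid on beds.
struct AudioScene {
    std::vector<AudioObject> objects;
    std::vector<AmbientBed>  beds;
};
```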

Object-based audio differs from traditional channel-based approaches in that, instead of the audio content being mixed down to a specific output format, the audio components and their metadata are composed at the capture end but retained as discrete assets through the entire production chain. At any point, sources can be added, removed, altered, or panned around, and additional processing stages and systems can be employed, right up until the point of bespoke rendering at the user end. When multiple personalized mixes are required at the render end, an automated mixing stage within the system architecture is essential to compile the immersive, personalized mix.
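A minimal sketch of what such an automated, render-time mixing stage could look like follows. It assumes a simple interleaved stereo output and hypothetical per-object gain and pan values derived elsewhere from the viewer’s pose and selections; it is not the authors’ mix engine.

```cpp
#include <cstddef>
#include <vector>

struct ObjectFeed {
    std::vector<float> samples;  // discrete mono source, kept separate until render
    float gain;                  // derived from the viewer's pose and selections
    float pan;                   // -1 = full left, +1 = full right
};

// Compile one personalized stereo block: start from the ambient bed, then add
// each active object with its viewer-specific gain and pan. Because mixing
// happens here, sources can still be added, removed, or re-panned per viewer.
std::vector<float> mixBlock(const std::vector<float>& bedStereo,  // interleaved L/R
                            const std::vector<ObjectFeed>& objects,
                            std::size_t frames) {
    std::vector<float> out(bedStereo.begin(), bedStereo.begin() + 2 * frames);
    for (const ObjectFeed& obj : objects) {
        const float left  = obj.gain * (1.0f - obj.pan) * 0.5f;
        const float right = obj.gain * (1.0f + obj.pan) * 0.5f;
        for (std::size_t i = 0; i < frames && i < obj.samples.size(); ++i) {
            out[2 * i]     += left  * obj.samples[i];
            out[2 * i + 1] += right * obj.samples[i];
        }
    }
    return out;
}
```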

Volumetric capture generates huge amounts of data, which must be analyzed, encoded, and transmitted efficiently to the cloud to be rendered, frame by frame, into a photorealistic hologram of the virtual camera view. In the past, realtime lossless audio processing in the cloud was difficult due to bandwidth limitations; a 5G network, however, greatly increases the available bandwidth, allowing transmission of multichannel uncompressed audio into the cloud. The 5G edge architecture also allows GPU acceleration to greatly increase the processing power available for more complex tasks such as realtime audio object analysis/extraction, localization, and semantic analysis of the incoming streams. Furthermore, the cloud-based audio mix engine enables automated content composition for different audiences.
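To give a sense of scale for the audio alone, the snippet below works through an illustrative bit-rate calculation for uncompressed multichannel contribution. The channel count and word length are assumed figures, not values from the paper.

```cpp
#include <cstdio>

int main() {
    const int channels       = 32;     // assumed microphone/feed count
    const int sample_rate_hz = 48000;  // broadcast-standard sample rate
    const int bit_depth      = 24;     // uncompressed PCM word length

    // bit rate = channels x sample rate x bit depth
    const double mbps = static_cast<double>(channels) * sample_rate_hz * bit_depth / 1.0e6;
    std::printf("Uncompressed audio uplink: %.1f Mb/s before any video or data\n", mbps);
    return 0;
}
```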

For audio extraction, the team employed machine learning to analyze various representations of the audio signal, learning complex patterns that allow the system to detect when specific audio events occur. The audio events were also analyzed for various acoustical characteristics, including transience and salience, and the characteristics showing the greatest differences between an event and the background audio were selected. This helps the neural network pick up on the relevant patterns, ultimately improving its detection. A second, non-neural filter may also be applied to transient events to limit the number of false positives produced by the network.
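One plausible form for that second, non-neural check is an energy-based transient test like the sketch below. It is an illustrative stand-in under that assumption, not the filter described in the paper.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Keep a candidate event only if the short-term energy around the detected
// onset rises well above the energy of the audio that immediately precedes it.
bool isTransient(const std::vector<float>& samples, std::size_t onset,
                 std::size_t window = 1024, float ratio_db = 12.0f) {
    if (onset < window || onset + window > samples.size()) return false;

    auto rms = [&](std::size_t begin, std::size_t end) {
        double acc = 0.0;
        for (std::size_t i = begin; i < end; ++i) acc += samples[i] * samples[i];
        return std::sqrt(acc / static_cast<double>(end - begin));
    };

    const double event_level      = rms(onset, onset + window);        // at the detection
    const double background_level = rms(onset - window, onset) + 1e-9; // just before it
    return 20.0 * std::log10(event_level / background_level) > ratio_db;
}
```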

The AI for this project was written in the C++ programming language using the open-source TensorFlow software library, so it is cross-platform compatible. In this case, however, the team deployed the AI on a Linux edge compute server to take advantage of GPU acceleration, which performs complex calculations more efficiently. This is vital because AI processing is computationally expensive and must run on multiple audio channels simultaneously.
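As a rough picture of what such a deployment can look like in code, the sketch below loads a TensorFlow SavedModel from C++ and runs it on one block of audio features. The model path, tensor names, and input shape are assumptions, not details from the paper.

```cpp
#include <vector>

#include "tensorflow/cc/saved_model/loader.h"
#include "tensorflow/cc/saved_model/tag_constants.h"
#include "tensorflow/core/framework/tensor.h"

int main() {
    tensorflow::SavedModelBundle bundle;
    tensorflow::SessionOptions session_options;  // GPU devices are used automatically
                                                 // when TensorFlow is built with CUDA
    tensorflow::RunOptions run_options;

    // Hypothetical export directory for an audio-event detection model.
    tensorflow::Status status = tensorflow::LoadSavedModel(
        session_options, run_options, "/models/audio_event_detector",
        {tensorflow::kSavedModelTagServe}, &bundle);
    if (!status.ok()) return 1;

    // One block of audio features, shape {1, 1024} (illustrative shape only).
    tensorflow::Tensor input(tensorflow::DT_FLOAT, tensorflow::TensorShape({1, 1024}));
    input.flat<float>().setZero();  // placeholder; real code fills in samples/features

    // Input/output tensor names are assumptions; they depend on the exported graph.
    std::vector<tensorflow::Tensor> outputs;
    status = bundle.GetSession()->Run({{"serving_default_audio:0", input}},
                                      {"StatefulPartitionedCall:0"}, {}, &outputs);
    return status.ok() ? 0 : 1;
}
```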

The authors consider the 5G Edge-XR project to be an excellent test of what is achievable when fast 5G networks unlock the processing power of GPUs in the cloud. They believe the project can open the door to future applications of AI in the cloud that use enhanced methods to produce higher-quality content at lower cost for current workflows.

For a much deeper dive into the 5G Edge-XR project, read the complete article in this month’s SMPTE Motion Imaging Journal.
