
AI for Audio Content Creation

April 29, 2021

As a longtime researcher into the mysteries of signal processing, speech, acoustics, and human auditory perception, Sunil Bharitkar, principal research scientist for AI research at Samsung Electronics, is particularly excited about the potential for machine learning/artificial intelligence to further the creative agenda of the audio industry. When it comes to the creation of audio content, in particular, Bharitkar suggests that potential is huge, but likely to take some time to fulfill because, among other things, researchers are only now beginning to understand the best ways to apply and maximize the full capabilities of this new technology.

“Some of the things we see the future holding, for example, involve improving the role of the mixer when creating audio content for the cinema,” Bharitkar says. “We want to better understand how the mixer authors content, and what parts of that process we could automate to a certain extent—not to eliminate the [human element], but rather, to improve the efficiency of the people involved in content creation in terms of leveraging some of their production tools using AI.

“Could AI basically mimic as best as possible what the content creator or author wants to do, taking [basics of the process] to a certain point before the mixer fine-tunes the mix? Since certain mixes are two-channel mixes that are basically standardized before the mixer goes in and tweaks particular things like range compression based on the signal, maybe we could automate that process and free the mixer to concentrate only on the creative tasks. The idea is to make the mixer’s creative job easier by freeing more time for them to maximize the artistic contributions.”
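As a purely illustrative sketch of the kind of standardized step he describes, the Python snippet below applies feed-forward dynamic range compression with a threshold derived from the signal's own statistics, the sort of pre-computed starting point a mixer might then fine-tune. The function name and every parameter choice here are assumptions for the example, not a description of any actual production tool.

```python
# Minimal illustrative sketch (not a production tool): a feed-forward dynamic
# range compressor whose threshold is derived from the signal's own level,
# i.e., a standardized step that could be pre-computed before a mixer
# fine-tunes the result. All parameter choices are assumptions.
import numpy as np

def auto_compress(x, sample_rate, ratio=4.0, attack_ms=10.0, release_ms=100.0):
    """Attenuate peaks above a threshold chosen from the signal's RMS level."""
    eps = 1e-12
    rms_db = 20.0 * np.log10(np.sqrt(np.mean(x ** 2)) + eps)
    threshold_db = rms_db + 6.0  # assumed rule of thumb: compress 6 dB above RMS

    # One-pole envelope follower (attack/release smoothing of the level).
    atk = np.exp(-1.0 / (attack_ms * 1e-3 * sample_rate))
    rel = np.exp(-1.0 / (release_ms * 1e-3 * sample_rate))
    env = np.zeros_like(x)
    level = 0.0
    for n, sample in enumerate(np.abs(x)):
        coeff = atk if sample > level else rel
        level = coeff * level + (1.0 - coeff) * sample
        env[n] = level

    env_db = 20.0 * np.log10(env + eps)
    over_db = np.maximum(env_db - threshold_db, 0.0)
    gain_db = -over_db * (1.0 - 1.0 / ratio)  # reduce level only above threshold
    return x * (10.0 ** (gain_db / 20.0))

# Usage: compressed = auto_compress(mono_signal, 48000)
```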

Bharitkar points to a recent article titled “Context-Aware Intelligent Mixing Systems” in the Journal of the Audio Engineering Society (AES), penned by European researchers, that addresses so-called Intelligent Mixing Systems (IMS). The article suggests that human creative skill and AI tools could potentially co-exist nicely as long as context is factored into the collaboration, meaning human decision-making needs to be essentially the controlling factor in how, when, and to what degree IMS technology is used in the creative process. The same group published a second article on one approach to intelligent music production tools aided by machine learning—in this case, the potential for a deep neural network to automate much of the process of creating drum mixes.

“There is work happening in this area, but it’s not an easy task to work out how best to use [AI] for these kinds of processes,” he adds. “In my view, we need to start small to figure out the right approach to better understanding how a mix is working, and then scale it up appropriately with more content and more artists trying it, but it’s still early work.”

This is just one example of several creative areas where machine learning could help automate certain processes without inherently changing the human nature of the creative process. However, Bharitkar strongly emphasizes that the basic problem with traveling this road is the fact that, at this point in time, “AI remains a black box that you cannot interpret, nor reproduce over other data sets very easily.”

By that, he means that the nature of machine learning involves training a computer to understand, master, and intuitively know when to deploy, or not deploy, certain processes. But, he adds, the models used to train an AI system sink or swim based on the nature of the data the system is fed to begin with, and even then, researchers do not always glean a good understanding of why AI systems produce the results they produce.

“There is a lot of research happening in the area of deep neural networks, and a lot of different variations and models are coming out,” Bharitkar states. “These are extremely computationally intensive models, and they are best utilized in the cloud, on dedicated accelerators, or on computing systems designed for content creation for production. But one of the big challenges that exists with these kinds of AI models is that they work very well with the data set that they are trained on, but there is a big problem regarding reproducibility on other data sets. It’s a work in progress to get better and broader data sets so that, hopefully, we can take that specific model and use it reasonably well over various arbitrary data sets.

“There is also the issue of the quality of the data that you use to create your model. If that data and its pre-processing are not of good quality, you may end up with a model that won’t generalize well to arbitrary data and only works well with a very limited data set. So, the model is only as good as the data that you provide and the pre-processing that you do.”

These challenges all connect to the overriding issue of trying to understand why a complex AI system makes the decisions that it makes. Bharitkar points out the industry is producing techniques to address these issues, but they are likewise works in progress, particularly in terms of how best to apply them to particular creative processes. He points, for instance, to a technique known as “layer-wise relevance propagation,” which is essentially designed to visually explain a neural network’s output within the domain of its input, sometimes with a pixel map that illustrates which pixels contributed to the machine’s determination and by how much.

“This method basically visualizes deep learning model decisions,” Bharitkar explains. “The hope is to better understand why the model is giving us such results so that it will be less of a black box.”
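As a rough illustration of the idea (offered as an assumed toy example, not a description of any particular research tool), relevance propagation walks an output score backwards through a network until every input dimension carries a share of the credit for the decision. The sketch below applies a simplified epsilon rule to a small, randomly weighted fully connected ReLU network.

```python
# Toy sketch of layer-wise relevance propagation (simplified epsilon rule) for
# a small fully connected ReLU network. It shows how an output score can be
# redistributed back onto the inputs; the network, weights, and input are all
# arbitrary stand-ins for illustration.
import numpy as np

def lrp_epsilon(weights, biases, x, eps=1e-6):
    # Forward pass, keeping every layer's activations.
    activations = [x]
    for W, b in zip(weights, biases):
        x = np.maximum(W @ x + b, 0.0)  # ReLU layers
        activations.append(x)

    # Start from the output activations as the relevance to be explained.
    relevance = activations[-1].copy()

    # Propagate relevance backwards, layer by layer.
    for W, b, a in zip(reversed(weights), reversed(biases), reversed(activations[:-1])):
        z = W @ a + b + eps        # stabilized pre-activations
        s = relevance / z          # per-neuron relevance "rates"
        relevance = a * (W.T @ s)  # redistribute onto the layer below
    return relevance               # one relevance score per input dimension

# Usage with a random two-layer network and a random input vector:
rng = np.random.default_rng(0)
weights = [rng.standard_normal((8, 16)), rng.standard_normal((4, 8))]
biases = [np.zeros(8), np.zeros(4)]
print(lrp_epsilon(weights, biases, rng.standard_normal(16)))
```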

This overall challenge, though, relates to the importance of having “domain knowledge” when feeding data to AI models, as Bharitkar puts it, meaning it’s important to have specific knowledge of the various types of data you enter into the system. There is, for example, time-domain data, which analyzes functions, patterns, or signals with respect to time. Frequency-domain data, by contrast, analyzes those same patterns with respect to frequency, and there are other domains besides.
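To make that distinction concrete, the short sketch below looks at the same (synthetic, assumed) audio buffer both ways: as samples over time, and, after an FFT, as energy per frequency bin.

```python
# The same audio buffer viewed in the two domains mentioned above: as samples
# over time (time domain) and as magnitude per frequency bin (frequency
# domain). The 440 Hz test tone and sample rate are assumptions.
import numpy as np

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate        # one second, time axis
signal = 0.5 * np.sin(2 * np.pi * 440.0 * t)    # time-domain data

spectrum = np.fft.rfft(signal)                  # frequency-domain data
freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
magnitude_db = 20.0 * np.log10(np.abs(spectrum) + 1e-12)

peak_bin = int(np.argmax(magnitude_db))
print(f"strongest component near {freqs[peak_bin]:.1f} Hz")  # ~440 Hz
```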

“The best solutions come from integrating domain knowledge with AI techniques,” he says. “What kind of features do you want to use, or what type of model can extract a meaningful representation of the targeted data set? What kind of pre-processing do you do before you feed the data to the AI model? And there is even more to it than that. When you start doing pre-processing of audio data, you need to take human perception into account, as well—how the human ear is organized, among other things. So, you need domain knowledge expertise, signal processing skills, and acoustic and auditory perception information combined with your AI system in order to come up with a good solution.”
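One common example of that kind of perceptual domain knowledge, offered here purely as an illustration, is warping a linear frequency axis onto the mel scale, which roughly follows how the ear resolves frequency: finely at low frequencies, coarsely at high ones. The band layout below is an assumed configuration, not a recommendation.

```python
# Perceptually motivated pre-processing sketch: band edges spaced evenly on
# the mel scale rather than in hertz, roughly matching how the human ear
# resolves frequency. The frequency range and band count are assumptions.
import numpy as np

def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)      # standard mel formula

def mel_band_edges(f_min_hz, f_max_hz, n_bands):
    """Band edges equally spaced on the mel scale, returned in hertz."""
    mels = np.linspace(hz_to_mel(f_min_hz), hz_to_mel(f_max_hz), n_bands + 2)
    return 700.0 * (10.0 ** (mels / 2595.0) - 1.0)    # inverse mel -> Hz

# 40 bands from 20 Hz to 8 kHz: the lowest bands are only tens of hertz wide,
# while the highest span more than a kilohertz each.
print(np.round(mel_band_edges(20.0, 8000.0, 40)))
```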

Regarding the issue of integrating human perception into an AI model, Bharitkar adds that researchers often apply signal processing techniques that utilize Head Related Transfer Function (HRTF) data, as previously discussed in Newswatch. An HRTF essentially captures how the shape and properties of a listener’s ears, head, and other physical characteristics filter incoming sound, so that a signal can be customized to a typical listener.

The problem with the process, however, is that it is hard to build an HRTF model that works for most people, let alone everyone, since those characteristics vary widely from person to person. In 2019, however, Bharitkar published a paper on how to use AI techniques to model various differences into an HRTF, and essentially come up with what he calls “a base, fundamental HRTF that can work reasonably well for a large population set. For that, I had to acquire large data sets of various HRTFs and then create an AI algorithm to come up with a baseline.”

“That’s an example of how you can map human perception when creating audio signals, and fuse it together with AI tools,” he emphasizes.
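The paper’s actual algorithm is not described in this article, but as a purely illustrative sketch of deriving a population baseline from many measured responses, one could average a set of HRTF magnitude responses and use principal components to capture how individuals deviate from that average. Everything below, including the random stand-in data, is an assumption for the example.

```python
# Illustrative sketch only (not the method from the 2019 paper): derive a
# population "baseline" from many measured HRTF magnitude responses by
# averaging them, then use PCA to summarize how individuals deviate from it.
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_freq_bins = 200, 256
hrtf_mag_db = rng.normal(0.0, 3.0, size=(n_subjects, n_freq_bins))  # stand-in data

baseline = hrtf_mag_db.mean(axis=0)            # population-average response
centered = hrtf_mag_db - baseline
# Principal components describe the dominant ways listeners differ from the baseline.
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)
explained = singular_values ** 2 / np.sum(singular_values ** 2)

print("baseline shape:", baseline.shape)
print("variance explained by first 3 components:", np.round(explained[:3], 3))
```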

Bharitkar elaborates that there are other areas in the audio world currently either benefitting from AI technologies, or having the potential to do so. One of them, for instance, is the use of AI tools to help design object-based home listening systems to account for a range of environmental conditions. Mostly, he adds, more traditional computer vision techniques are used in this area today, but AI’s potential to assist is clear.

“Several AI models have been developed with computer vision in mind,” he says. “Not so much with audio as with picture right now. But, of course, there is an area called transfer learning where people take existing models and map them to use for other, related applications—in this case, it could be applied to audio and speech. But these are all areas of research.”
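In broad strokes, the transfer-learning pattern he describes reuses a network trained on one domain and retrains only a small part of it for another. The sketch below assumes a spectrogram-as-image setup, an arbitrary class count, and torchvision's pre-trained ResNet-18; it shows the general pattern rather than any specific system mentioned here.

```python
# Transfer-learning sketch: reuse an image-trained network and retrain only
# its final layer for an audio task where spectrograms are treated as images.
# The model choice, class count, and input tensor are assumptions.
import torch
import torch.nn as nn
from torchvision import models

n_audio_classes = 10                                   # hypothetical audio task
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

for param in model.parameters():                       # freeze the image-trained layers
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, n_audio_classes)  # new trainable head

# A batch of 3-channel "spectrogram images" shaped (batch, channels, height, width).
spectrograms = torch.randn(8, 3, 224, 224)
logits = model(spectrograms)
print(logits.shape)                                    # torch.Size([8, 10])
```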

Bharitkar adds that as AI marches forward, industry standards work will follow in various categories. Indeed, that work has already started. He points to a new non-profit organization founded by Leonardo Chiariglione, the longtime head of MPEG, called MPAI, which stands for Moving Picture, Audio, and Data Coding by Artificial Intelligence. MPAI, Bharitkar says, is already hard at work pursuing a path for standards related to neural networks and AI-based models for certain applications, including video, audio, and speech.

He cautions that the Covid-19 pandemic did, in certain ways, slow down the process over the last year. But nonetheless, he is anticipating exciting developments in the coming months and years.

“We have to do listening tests based on algorithm outputs in our labs,” he says. “The pandemic put that behind schedule, as we need to be in the lab for those. That has been an industry-wide problem when you need to do experiments that require sophisticated equipment in a lab. But [researchers] have high-powered GPUs at home to run AI algorithms on, so in terms of algorithm development, that work is still going on at a reasonable pace, and I expect things will get back on schedule soon for all of us working on these issues.”


Michael Goldman
