
AI Enables Generative Audio

May 2, 2022

As discussed in 2020 and 2021 in Newswatch, exciting developments are occurring in the world of machine learning/artificial intelligence when it comes to certain aspects of content creation generally, and audio content creation in particular. Until recently, however, many of those developments were computationally challenging and not always practical, still in early theoretical stages, or geared toward technical rather than creative solutions within the larger content-creation chain, such as speeding up encoding and compression. But Zohaib Ahmed, CEO at Canada-based Resemble AI, suggests that more creatively oriented, yet practical, applications for AI technology are now starting to take off. In the case of Resemble’s work, such breakthroughs are coming in the area of audio creation related to mimicking or replicating human voices and other sounds at the creation or recording stage.

The idea, says Ahmed, is to allow content creators to utilize intelligent software algorithms to create “conversational experiences” for media content without having a great deal of technical expertise or infrastructure for doing so. He says the concept builds on, but goes further than, how AI technology has been used in the corporate world to create so-called “virtual conversational assistants” for phone, computer, or social media use.  

“We want to give [content creators] creative freedom, to let directors or producers and others have control over the output, to iterate over the output, just like a visual effects person does when they change the way smoke appears or the way ice breaks in an image, and to let them do that simply with the click of a couple buttons,” Ahmed explains. “We can use AI by taking even simple, or limited, data sets to create a human voice model, and then ‘direct’ the performance, even getting the voice to speak in languages the real person or actor couldn’t speak in, and so on. I think this is just one example of how we can apply AI to entertainment-related things to unlock all kinds of creative opportunities.” 

Resemble developed AI technology to create what it calls “tunable synthetic voices with low-latency generative AI modeling”—a user-friendly approach applied at first to non-media-related uses such as interactive voice response (IVR) for virtual assistants and spokespeople. Ahmed says this was a good place to begin, considering that most AI development in this space had previously revolved around things like speech recognition and dialogue intent mapping rather than the development of realistic voices. But entertainment-related applications were always looming on the horizon, he adds.

“When I first got into this, not a lot of focus was being given to audio, voice, and speech, and I thought there could be a lot of opportunity in that area,” he says. “We felt that it would be possible to use existing architectures and machine-learning techniques that were already widely in use for computer vision and graphics and apply them to audio.

“I remember attending a meeting at Pixar and watching a presentation on their visual-effects process, and the tools they had for that work. I asked someone what they did for voiceovers, for speech, and they said, ‘Of course, we use a studio and record voices and then use normal audio engineering tools to edit the audio.’ I said, ‘What if you don’t like the audio?’ They said, ‘Then we record it again and keep going until we do like it.’ Those kinds of early conversations validated that there was a need for this kind of technology.”

Since that time, Resemble’s technology and approach have been used on several projects, most extensively on the recent Netflix documentary, The Andy Warhol Diaries. For that project, as recently explained in a Wired article, filmmakers had access to only 3 minutes and 12 seconds of audio of the famed pop-art icon’s actual voice. Yet they were able to clone Warhol’s voice believably and at a scale large enough to have “the character” read about 30 pages of text from the real Warhol’s actual diaries over the course of six episodes.

“That was our first big use case,” Ahmed relates. “It took about a year and a half to get there, with lots of challenges, including the limited amount of data we had available to us. But the creative freedom that allowed the filmmakers to type in what they wanted him to say, click a few buttons, and create accurate speech in the way they wanted it delivered was important. It was as if [Warhol] were standing in the studio and the director was talking to him, telling him how he wanted him to pronounce his lines, including emphasizing certain words, pausing in certain places, having emotion in his voice, and so on.” 

Ahmed says this technique is called “generative audio” and comes out of a larger category called “generative AI,” which he defines as “a sub-section of artificial intelligence. Historically, AI and machine learning have been used for things like data analysis, predictive analysis. Now, we have generative AI, which is relatively new. But it is a form of AI that is capable of producing things from scratch, which is quite interesting. You are no longer just editing or manipulating or predicting what comes next—you are actually creating things from scratch. This illustrates that we can use AI not only to improve [automation], speed things up, or analyze or sift through data or predict things—we can use it to literally create ‘new things.’”

As to the quality of such audio, Ahmed and colleague Ollie McCarthy published a paper in 2020 on the development of what they call a “neural vocoder,” which has helped improve and showcase the quality of synthetic audio. What he finds really exciting on the technical side, though, is that Resemble and others are working hard to remove the technical constraints that have long made such AI applications impractical, including what he refers to as “the feedback loop.”
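For readers unfamiliar with the term, a vocoder is the stage of a speech-synthesis pipeline that turns an acoustic representation, typically a mel spectrogram, back into an audible waveform. The short sketch below illustrates only that stage, using librosa’s classical Griffin-Lim inversion as a stand-in; it is not Resemble’s neural vocoder, and the file name is a placeholder.

```python
# Conceptual sketch of the "vocoder" stage in a text-to-speech pipeline:
# a mel spectrogram goes in, a waveform comes out. Griffin-Lim inversion
# stands in for a neural vocoder; "reference.wav" is a placeholder file.
import librosa
import soundfile as sf

wav, sr = librosa.load("reference.wav", sr=22050)

# In a real TTS system this mel spectrogram would be predicted from text
# by an acoustic model; here it is derived from recorded speech.
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=80)

# "Vocoding": reconstruct an audible waveform from the mel spectrogram.
reconstructed = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_iter=60)

sf.write("reconstructed.wav", reconstructed, sr)
```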

“This is the area we have made the most strides in,” he relates. “One clear constraint had always been data—most people don’t have dozens of hours of clean usable data for normal machine learning models, and they don’t have the ability to easily clean that data themselves. We have been able to ingest whatever data they have, clean it up automatically, and prepare it for our AI model in ways that [the filmmakers] don’t have to be concerned with, for the most part. And, we have removed the feedback loop. Previously, you would have to wait a long time for the computations to be done to get a voice model ready. We brought that down to minutes, meaning you can do modifications to the voice, add emotion, and so on, in a [creative] session, and you can do it with the data set being needed only as the seed to get things started, while the system extrapolates the data and makes it more useful.” 

Furthermore, as the media landscape evolves, particularly when it comes to user-driven, interactive applications such as the so-called Metaverse, Ahmed believes the need for techniques to create, alter, and direct the performance of synthetic voices will only increase. He further sees such techniques being born in business, videogame, or Metaverse-style applications and then, eventually, being adopted by filmmakers and other traditional content creators.

“With the concept of the Metaverse, you have characters you create that can transition between platforms and digital worlds,” he says. “So, you might need the character to be able to function in a movie and also as an online social media presence, for example. One of the things we were lacking was voice—it was a limiting factor. Visually, you can take a character out of a cartoon and put it into a theme-park ride, but you would always have to record the voice from scratch [for a new application]. Now, you can make Bugs Bunny say something that the character never said in a film or cartoon. So, this opens up a lot of categories where these industries are blending together.  

“I think it will be similar to how gaming tools work right now. We can take a blend of tools used by game developers and make them available to movie creators. You can use the same tools for a product demo, a commercial, a trailer, or even a film. That’s how I view what we are doing—creating tools that can cross boundaries. Today, a person making a voice for a call center wants that voice to be engaging and meaningful and creative in some way. That’s the same requirement a person making a documentary or a movie has, after all. So, I expect these kinds of requirements and the ability to meet them will carry over from one platform or medium to another, even if you are building vastly different experiences.” 

It should be noted that, for The Andy Warhol Diaries, the filmmakers had permission from Warhol’s estate to replicate his voice, and the documentary carefully informs viewers that the diary readings were artificially created. This matters because one of the potential hazards of conversational AI models is that viewers may not know which voices are real recordings and which are synthetic. Indeed, another Wired article from last year documents a case in which the synthetic voice of the late celebrity chef Anthony Bourdain was strategically inserted into a documentary without informing viewers. Questions of the ethics of such things are serious matters, Ahmed emphasizes. The eventual limiter of such behavior, he suggests, could well be the adoption of a rigorous set of ethical standards for such work.

“Standards are a huge discussion in the world of generative AI,” says Ahmed. “I think we are eventually going to have standards, definitely. In our case, we watermark all audio files that we generate, so that we can detect ourselves if the audio was generated by our system. We certainly want to detect ‘deep fakes’ and deep-fake audio.” 
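Ahmed does not describe how Resemble’s watermark actually works, but the general idea behind many audio watermarks can be sketched simply: mix a faint, pseudo-random signature keyed by a secret seed into the generated waveform, then later check for that signature by correlation. The toy example below is purely illustrative and is not Resemble AI’s method.

```python
# Toy spread-spectrum audio watermark: embed a faint pseudo-random
# signature keyed by a secret seed, then detect it by correlation.
# Purely illustrative; not Resemble AI's actual watermarking scheme.
import numpy as np

def embed_watermark(audio: np.ndarray, seed: int, strength: float = 0.05) -> np.ndarray:
    signature = np.random.default_rng(seed).standard_normal(audio.shape)
    return audio + strength * signature

def detect_watermark(audio: np.ndarray, seed: int, threshold: float = 0.025) -> bool:
    signature = np.random.default_rng(seed).standard_normal(audio.shape)
    # Estimate how much of the secret signature is present in the audio;
    # unmarked audio correlates with the signature only by chance.
    score = float(np.dot(audio, signature) / np.dot(signature, signature))
    return score > threshold

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.standard_normal(3 * 22050)       # 3 s of noise as stand-in "audio"
    marked = embed_watermark(clean, seed=42)
    print(detect_watermark(marked, seed=42))     # True: watermark found
    print(detect_watermark(clean, seed=42))      # False: no watermark
```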

He further points to an open-source project published by Resemble AI on GitHub to develop baseline techniques for differentiating real recordings from artificial ones. “I think there is a bunch of standardization needed in terms of how all companies involved with generative AI, whether producing images, video, text, voice, or whatever—how to standardize figuring out what is generated versus what is authentic,” he adds.
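The interview does not name the project, but Resemble AI’s open-source Resemblyzer library is one published baseline of this kind: it derives speaker embeddings that can be compared between trusted recordings of a speaker and a questioned clip. A minimal sketch, assuming Resemblyzer’s published API and using placeholder file names:

```python
# Hedged sketch: compare a questioned clip against known-real reference
# recordings of the same speaker using speaker embeddings. Assumes the
# open-source Resemblyzer library; all file names are placeholders.
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

# Embeddings are L2-normalized, so a dot product is a cosine similarity.
real_refs = [encoder.embed_utterance(preprocess_wav(path))
             for path in ["real_clip_01.wav", "real_clip_02.wav"]]
questioned = encoder.embed_utterance(preprocess_wav("questioned_clip.wav"))

similarity = float(np.mean([np.dot(questioned, ref) for ref in real_refs]))
print(f"Mean similarity to real references: {similarity:.3f}")
# Markedly lower similarity than genuine recordings of the same speaker
# is a signal, not proof, that the questioned clip may be synthetic.
```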

In any case, when it comes to generative AI for both audio and picture applications, Ahmed suggests “things are moving very quickly—we have seen amazing work just between 2021 and now, for instance.” He fully expects that pace of improvement to continue. What he hopes to see next, however, is improved reach and practicality for such technology inside the entertainment space.

“What really needs to improve is access,” he explains. “Right now, a lot of this stuff has been living in academia, in research facilities. I think we are going to see a lot more productization of it in the next year—many more use cases where it will be feasible for someone that is non-technical to go ahead and use some of these tools to actually create something that is production ready. In that sense, I expect generative AI to start taking off in the coming year or two.” 

Tag(s): AI, Featured, News, Newswatch

Michael Goldman
