In the April issue of the SMPTE Motion Imaging Journal, WarnerMedia’s Ha Nguyen, Aansh Malik, and Michael Zink present their paper “Exploring Real-time Conversational Virtual Characters,” on using artificial intelligence for character interaction. They survey how advancements in AI – including modules for Speech-to-Text (STT), Natural Language Understanding (NLU), Natural Language Generation (NLG), and Text-to-Speech (TTS) – can enable relatable, multi-dimensional Virtual Characters who converse naturally while maintaining their personality within the narrative. Using a conversational framework with interchangeable and loosely coupled components, the system is designed to support granular creative detail in character performance, efficiency in the mass creation of Virtual Characters, and the flexibility to incorporate future improvements.
The authors note that humans have always used stories to make sense of our world and to share that understanding with others. While innovative technologies in natural language processing and real-time computation have changed how stories are created, storytelling still requires compelling characters around which the narrative revolves. While the integration of the right technology components can give birth to real-time conversational Virtual Characters telling their own stories, those technologies must work in synergy to create characters who can exhibit personalities and hold believable conversations.
The WarnerMedia team explored this conversational framework by creating Melodie, a Virtual Character who is fond of music and is a fan and promoter of the Eurovision Song Contest. Using this character, they completed the full cycle of conversational interaction with another speaker using components that transcribe the speaker's audio to text, interpret its meaning and intent, generate a response, and synthesize the character's voice.
The components in this framework are interchangeable and powered by existing, proven technology solutions. At each stage, the technology components furnish new possibilities as well as challenges, including ethical considerations surrounding conversational Virtual Characters.
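To make the idea of interchangeable, loosely coupled components concrete, here is a minimal sketch of such a pipeline. The interface and class names (`SpeechToText`, `ConversationPipeline`, etc.) are illustrative assumptions, not the paper's actual implementation; the stubs stand in for real STT/NLU/NLG/TTS services so the flow runs end to end.

```python
from dataclasses import dataclass
from typing import Protocol

# Hypothetical stage interfaces: any implementation satisfying the
# Protocol can be swapped in without changing the pipeline itself.
class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class NLUModule(Protocol):
    def understand(self, text: str) -> dict: ...

class NLGModule(Protocol):
    def respond(self, intent: dict) -> str: ...

class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...

@dataclass
class ConversationPipeline:
    """Chains STT -> NLU -> NLG -> TTS for one conversational turn."""
    stt: SpeechToText
    nlu: NLUModule
    nlg: NLGModule
    tts: TextToSpeech

    def converse(self, audio_in: bytes) -> bytes:
        text = self.stt.transcribe(audio_in)      # speech -> text
        intent = self.nlu.understand(text)        # text -> meaning
        reply = self.nlg.respond(intent)          # meaning -> response
        return self.tts.synthesize(reply)         # response -> speech

# Trivial stubs standing in for real cloud or in-house services.
class StubSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")

class StubNLU:
    def understand(self, text: str) -> dict:
        return {"intent": "greet"} if "hello" in text.lower() else {"intent": "unknown"}

class StubNLG:
    def respond(self, intent: dict) -> str:
        return "Hi, I'm Melodie!" if intent["intent"] == "greet" else "Sorry?"

class StubTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")

pipeline = ConversationPipeline(StubSTT(), StubNLU(), StubNLG(), StubTTS())
print(pipeline.converse(b"Hello Melodie").decode("utf-8"))
```

Because each stage depends only on an interface, a stub can later be replaced by a managed cloud service or an improved model without touching the surrounding stages, which is the flexibility the authors highlight.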
The three major cloud service providers – Amazon, Google, and Microsoft – all offer Speech-to-Text services that use deep learning neural networks and massive amounts of data to train language models for conversational and dictation scenarios. These language models can then be customized and augmented as needed for each application's unique acoustic conditions, language, pronunciation, ambient noise, and domain-specific vocabulary. The leading cloud services also offer Natural Language Understanding, allowing them to predict intent and overall meaning, and then to extract relevant detailed information from the transcribed text, including entities, keywords, sentiments, and emotions. These off-the-shelf STT and NLU services eliminate the need for in-house AI expertise and complex AI model training, while still providing a powerful toolset with scalable computing power that can be applied to Virtual Character conversations.
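As a purely illustrative stand-in for these managed NLU services, the sketch below mimics the kind of structured output (intent, sentiment, entities) the cloud APIs return, using naive keyword matching. The intent names, keyword lists, and heuristics are assumptions made up for this example; real services derive these from trained models.

```python
import re

# Toy keyword tables standing in for a trained intent classifier.
INTENT_KEYWORDS = {
    "ask_song": ["song", "track", "music"],
    "ask_contest": ["eurovision", "contest", "final"],
}
POSITIVE = {"love", "great", "amazing"}
NEGATIVE = {"hate", "boring", "awful"}

def analyze(text: str) -> dict:
    """Return intent, sentiment, and entities for one utterance."""
    words = re.findall(r"[a-z']+", text.lower())
    # First intent whose keyword list overlaps the utterance wins.
    intent = next(
        (name for name, keys in INTENT_KEYWORDS.items()
         if any(k in words for k in keys)),
        "unknown",
    )
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    sentiment = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    # Naive entity heuristic: capitalized tokens in the original text.
    entities = re.findall(r"\b[A-Z][a-z]+\b", text)
    return {"intent": intent, "sentiment": sentiment, "entities": entities}

print(analyze("I love the Eurovision contest"))
# → {'intent': 'ask_contest', 'sentiment': 'positive', 'entities': ['Eurovision']}
```

The value of the managed services is precisely that this brittle keyword logic is replaced by large trained models, exposed behind a similarly simple request/response interface.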
Testing and analyzing the implementation of Melodie brought forth several ethical considerations for the design of applications involving Virtual Characters. The team notes that, whenever AI systems are designed and deployed, they bring with them concerns including privacy, bias, and discrimination. They foresee the need for ethical frameworks to evolve alongside the advancement of the components in the technical framework.
For greater detail about this fascinating subject, read the complete article in this month’s SMPTE Motion Imaging Journal.

#virtual characters, #virtual beings, #conversational AI, #conversational characters, #artificial intelligence, #AI, #STT, #TTS, #NLU, #NLG, #deep learning, #neural networks