In the April issue of the SMPTE Motion Imaging Journal, WarnerMedia’s Ha Nguyen, Aansh Malik, and Michael Zink present their paper “Exploring Real-time Conversational Virtual Characters,” on using artificial intelligence for character interaction. They survey how advancements in AI – including modules for Speech-to-Text (STT), Natural Language Understanding (NLU), Natural Language Generation (NLG), and Text-to-Speech (TTS) – can enable relatable, multi-dimensional Virtual Characters who converse naturally while maintaining their personality within the narrative. Using a conversational framework with interchangeable and loosely coupled components, the system is designed to support granular creative detail in character performance, efficiency in the mass creation of Virtual Characters, and the flexibility to incorporate future improvements.
The authors note that humans have always used stories to make sense of our world and to share that understanding with others. While innovative technologies in natural language processing and real-time computation have changed how stories are created, storytelling still requires compelling characters around which the narrative revolves. While the integration of the right technology components can give birth to real-time conversational Virtual Characters telling their own stories, those technologies must work in synergy to create characters who can exhibit personalities and hold believable conversations.
The WarnerMedia team explored this conversational framework by creating Melodie, a Virtual Character who is fond of music and is a fan and promoter of the Eurovision Song Contest. Using this character, they completed the full cycle of conversational interaction with another speaker using components that transcribe the speaker's audio to text, interpret its meaning and intent, generate a response, and synthesize the character's voice.
The components in this framework are interchangeable and powered by existing, proven technology solutions. At each stage, the technology components furnish new possibilities as well as challenges, including ethical considerations surrounding conversational Virtual Characters.
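To make the idea of interchangeable, loosely coupled components concrete, here is a minimal sketch of such a pipeline. The interface and class names (`SpeechToText`, `ConversationPipeline`, etc.) are illustrative assumptions, not the paper's actual implementation; the stubs stand in for real STT/NLU/NLG/TTS services so the flow runs end to end.

```python
from dataclasses import dataclass
from typing import Protocol

# Hypothetical stage interfaces: any implementation satisfying the
# Protocol can be swapped in without changing the pipeline itself.
class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class NLUModule(Protocol):
    def understand(self, text: str) -> dict: ...

class NLGModule(Protocol):
    def respond(self, intent: dict) -> str: ...

class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...

@dataclass
class ConversationPipeline:
    """Chains STT -> NLU -> NLG -> TTS for one conversational turn."""
    stt: SpeechToText
    nlu: NLUModule
    nlg: NLGModule
    tts: TextToSpeech

    def converse(self, audio_in: bytes) -> bytes:
        text = self.stt.transcribe(audio_in)      # speech -> text
        intent = self.nlu.understand(text)        # text -> meaning
        reply = self.nlg.respond(intent)          # meaning -> response
        return self.tts.synthesize(reply)         # response -> speech

# Trivial stubs standing in for real cloud or in-house services.
class StubSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")

class StubNLU:
    def understand(self, text: str) -> dict:
        return {"intent": "greet"} if "hello" in text.lower() else {"intent": "unknown"}

class StubNLG:
    def respond(self, intent: dict) -> str:
        return "Hi, I'm Melodie!" if intent["intent"] == "greet" else "Sorry?"

class StubTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")

pipeline = ConversationPipeline(StubSTT(), StubNLU(), StubNLG(), StubTTS())
print(pipeline.converse(b"Hello Melodie").decode("utf-8"))
```

Because each stage depends only on an interface, a stub can later be replaced by a managed cloud service or an improved model without touching the surrounding stages, which is the flexibility the authors highlight.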
The three major cloud service providers – Amazon, Google, and Microsoft – all offer Speech-to-Text services that use deep learning neural networks and massive amounts of data to train language models for conversational and dictation scenarios. These language models can then be customized and augmented as needed for each application's unique acoustic conditions, language, pronunciation, ambient noise, and domain-specific vocabulary. The leading cloud services also offer Natural Language Understanding, allowing them to predict intent and overall meaning, and then to extract relevant detailed information from the transcribed text, including entities, keywords, sentiments, and emotions. These off-the-shelf STT and NLU services eliminate the need for in-house AI expertise and complex AI model training, while still providing a powerful toolset with scalable computing power that can be applied to Virtual Character conversations.
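As a purely illustrative stand-in for these managed NLU services, the sketch below mimics the kind of structured output (intent, sentiment, entities) the cloud APIs return, using naive keyword matching. The intent names, keyword lists, and heuristics are assumptions made up for this example; real services derive these from trained models.

```python
import re

# Toy keyword tables standing in for a trained intent classifier.
INTENT_KEYWORDS = {
    "ask_song": ["song", "track", "music"],
    "ask_contest": ["eurovision", "contest", "final"],
}
POSITIVE = {"love", "great", "amazing"}
NEGATIVE = {"hate", "boring", "awful"}

def analyze(text: str) -> dict:
    """Return intent, sentiment, and entities for one utterance."""
    words = re.findall(r"[a-z']+", text.lower())
    # First intent whose keyword list overlaps the utterance wins.
    intent = next(
        (name for name, keys in INTENT_KEYWORDS.items()
         if any(k in words for k in keys)),
        "unknown",
    )
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    sentiment = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    # Naive entity heuristic: capitalized tokens in the original text.
    entities = re.findall(r"\b[A-Z][a-z]+\b", text)
    return {"intent": intent, "sentiment": sentiment, "entities": entities}

print(analyze("I love the Eurovision contest"))
# → {'intent': 'ask_contest', 'sentiment': 'positive', 'entities': ['Eurovision']}
```

The value of the managed services is precisely that this brittle keyword logic is replaced by large trained models, exposed behind a similarly simple request/response interface.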
Testing and analyzing the implementation of Melodie brought forth several ethical considerations for the design of applications involving Virtual Characters. The team notes that, whenever AI systems are designed and deployed, they bring with them concerns including privacy, bias, and discrimination. They foresee the need for ethical frameworks to evolve alongside the advancement of the components in the technical framework.
For greater detail about this fascinating subject, read the complete article in this month’s SMPTE Motion Imaging Journal.

#virtual characters, #virtual beings, #conversational AI, #conversational characters, #artificial intelligence, #AI, #STT, #TTS, #NLU, #NLG, #deep learning, #neural networks