Automating QC for Language and Captions

The views and opinions expressed in this article are those of the author and do not necessarily reflect the position of the Society of Motion Picture and Television Engineers (SMPTE).

This paper was originally presented by Drew Lanham and Ken Brady at the 2013 HPA Tech Retreat, February 18 – 22.

For IP Broadcasting, Advances Are Coming in QC Methodologies for Captions, Descriptions, Languages, and More

In a digital broadcast landscape crowded with multiple channels, platforms, devices, formats, delivery options, and, most importantly, raw data, the question of how to most efficiently apply closed captions, subtitles, languages, and other metadata to different versions has grown exponentially harder in recent years. Millions of files, possibly billions, directly related to online broadcasting applications travel through servers and into consumer homes each day, and an individual asset may have dozens of mezzanine files associated with it for different versions and languages. Now the FCC has applied stricter rules regarding the captioning of Internet video, directly impacting the requirements for distributing such video over IP networks.

The Twenty-First Century Communications and Video Accessibility Act of 2010 (CVAA) has brought increased regulatory scrutiny regarding captioning of long-form content to virtually every website and content distributor, and industry pundits expect this scrutiny to broaden in the coming years. The challenge of how best to quality control (QC) all versions of all the programming distributed to different destinations has therefore led manufacturers to explore new ways of automating the captioning, subtitling, and language pieces of the QC chain. New developments in this area are under way, designed to eliminate the need to manually check millions of files at various stages, improve efficiency, save resources, and improve quality and consistency in the final product.
 
Traditionally, in the tape-based world, broadcasters essentially "threw bodies" at the problem: manpower tasked with making sure that captions, languages, and other metadata associated with long-form programming originally created for broadcast or cable television were converted correctly for IP distribution. Those broadcasters now operate in a file-based world, where too few resources combined with too much data have virtually guaranteed human error and quality compromises; over the years the process has become unwieldy. Broadcasters now have more data, more files, and stricter rules to apply to those files than ever before, along with a greater risk of being fined if they fail to apply those rules correctly. The need to automate has thus become a pressing issue.
 
As a result, the industry is now trying to attack this problem in ways that fit into existing workflows and, at the same time, complement existing third-party tools for data ingest and management at other steps in the process. While many digital tools have automated parts of the QC process for years, closed captions, video descriptions, and caption or audio language issues were not among them. The reason is that sophisticated software had to be developed that could accurately analyze captions, match and time them, do likewise for language identification, and repeatedly perform these tasks throughout the digital journey of such content: upon receipt of content, after file transfers, after editing, after transcoding, and before and after broadcast or distribution.
 
Such tools now exist, allowing systems to test media content for video descriptions as well as closed caption and language verification. Based on the results of those tests, captions can then be automatically realigned and audio tracks restacked as necessary. Manufacturers entering this new arena are striving to verify caption and language accuracy and quality at every stage of an asset's life, from creation to archiving, through a methodology of analyzing program audio and secondary audio programs (SAP), comparing them, identifying places where there are discrepancies, and automatically adjusting them.
 
The primary objective of video description verification is to automate the process of verifying that video descriptions are present in the metadata and match the original program audio. It is also important that those video descriptions comply with regulatory requirements for content distributed over IP platforms. The approach being taken is to have the software compare the main program audio track to the video description audio track, identify places where video descriptions are present, and generate a "confidence score" that represents the probability that the video description is both present and associated with the correct piece of audio.
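As a rough illustration of that comparison, the sketch below scores how consistently a description track contains narration in the gaps of the main program audio. The function names, energy-based speech detection, thresholds, and scoring heuristic are assumptions made for this example, not any vendor's actual algorithm.

```python
# Minimal sketch, assuming both tracks are decoded to mono float arrays at
# the same sample rate. All thresholds and the heuristic are illustrative.
import numpy as np

def frame_energy(samples: np.ndarray, sr: int, frame_ms: int = 20) -> np.ndarray:
    """Return RMS energy per fixed-length frame."""
    hop = int(sr * frame_ms / 1000)
    n_frames = len(samples) // hop
    frames = samples[: n_frames * hop].reshape(n_frames, hop)
    return np.sqrt(np.mean(frames ** 2, axis=1))

def description_confidence(main_audio: np.ndarray,
                           desc_audio: np.ndarray,
                           sr: int,
                           speech_thresh: float = 0.02) -> float:
    """Score (0..1) that a distinct video-description narration is present.

    Heuristic: the description track should carry speech in gaps where the
    main program audio is comparatively quiet (the narrator speaks between
    lines of dialogue).
    """
    main_e = frame_energy(main_audio, sr)
    desc_e = frame_energy(desc_audio, sr)
    n = min(len(main_e), len(desc_e))
    main_e, desc_e = main_e[:n], desc_e[:n]

    main_quiet = main_e < speech_thresh          # gaps in program dialogue
    desc_active = desc_e >= speech_thresh        # narration present
    narration_frames = main_quiet & desc_active  # narration filling the gaps

    if main_quiet.sum() == 0:
        return 0.0
    return float(narration_frames.sum() / main_quiet.sum())
```

A production system would use a proper voice activity detector and cross-correlation against the main mix, but the shape of the check is the same: locate candidate description regions and report a confidence rather than a hard pass/fail.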
 
The process is similar for closed caption verification. There, the objective is to verify that the captions correctly represent the dialogue in the audio, that the caption timing is correct, that any problems with the caption file (missing or incorrect captions, etc.) are identified and corrected, and that the resulting captions comply with regulatory requirements. The software is thus designed to compare existing captions to the words spoken on the audio track and then calculate the level of caption coverage from the intersection of voice activity and the aligned caption segments. This verification is based on user-configured parameters that define acceptable levels of caption accuracy, timing, and coverage.
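The coverage calculation itself is simple interval arithmetic. The sketch below treats detected voice activity and aligned caption cues as lists of (start, end) intervals in seconds; the interval representation and the 0.90 floor are illustrative assumptions, standing in for the user-configured parameters mentioned above.

```python
# A sketch of coverage as the fraction of voiced time that falls inside an
# aligned caption segment.
def interval_overlap(a: list[tuple[float, float]],
                     b: list[tuple[float, float]]) -> float:
    """Total duration (seconds) where intervals in a and b overlap."""
    total = 0.0
    for a_start, a_end in a:
        for b_start, b_end in b:
            total += max(0.0, min(a_end, b_end) - max(a_start, b_start))
    return total

def caption_coverage(voice_activity, caption_segments) -> float:
    """Fraction of detected speech covered by caption segments."""
    voiced = sum(end - start for start, end in voice_activity)
    if voiced == 0:
        return 1.0  # no speech, nothing to caption
    return interval_overlap(voice_activity, caption_segments) / voiced

# Example: flag the file if coverage drops below a configured floor.
voice = [(0.0, 2.5), (4.0, 7.0)]
captions = [(0.1, 2.4), (4.2, 6.0)]
if caption_coverage(voice, captions) < 0.90:
    print("Caption coverage below configured threshold; flag for review")
```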
 
With language verification, the software attempts to verify that the correct language is linked to the correct audio track, that the correct language is associated with each caption or subtitle file, and that any places in the audio or text where language metadata is missing are identified. In this case, users input parameters for the possible languages; each audio track is then analyzed and, as before, a confidence score is generated that expresses the probability of each language being present in the expected location. The expected language is compared with the detected language based on the user's parameters, and adjustments can be made when problems are detected.
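The expected-versus-detected comparison can be sketched as below. The language-identification engine itself is left as a pluggable callable, since the article does not name one; the track naming, report format, and confidence floor are assumptions for illustration only.

```python
# Sketch: compare expected language metadata against detected language per
# track, using any detector that returns a (language, confidence) result.
from dataclasses import dataclass
from typing import Callable

@dataclass
class LanguageResult:
    detected: str      # e.g. "es"
    confidence: float  # detector's probability, 0..1

def verify_languages(tracks: dict[str, str],
                     detector: Callable[[str, list[str]], LanguageResult],
                     candidates: list[str],
                     min_confidence: float = 0.80) -> list[str]:
    """Return a list of problems: mismatches or low-confidence detections."""
    problems = []
    for expected, path in tracks.items():
        result = detector(path, candidates)
        if result.detected != expected:
            problems.append(
                f"{path}: expected '{expected}', detected '{result.detected}' "
                f"(confidence {result.confidence:.2f})")
        elif result.confidence < min_confidence:
            problems.append(
                f"{path}: '{expected}' detected with low confidence "
                f"({result.confidence:.2f})")
    return problems

# Usage with a stand-in detector (a real system would call a spoken-language
# identification engine here); the second track is flagged as a mismatch.
dummy = lambda path, candidates: LanguageResult("en", 0.97)
print(verify_languages({"en": "ep101_main.wav", "es": "ep101_sap.wav"},
                       dummy, ["en", "es", "fr"]))
```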
 
One of the primary corrections these new technologies address for closed captions is realignment. Here, developers are focused on automatically identifying and correcting caption drift caused by improper frame rate conversion of the material, making sure the file has the correct caption timing delays for live-captioned content, and creating new caption files from untimed transcripts. The methodology involves matching dialogue in the caption file to words in the audio track, adjusting caption timecodes to match the timing of the audio, and then inserting those timecodes into the transcripts so that new caption segments can be created. When this is done, a new caption file is generated with correctly aligned captions.
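Drift from an improper frame rate conversion shows up as a roughly linear timing error, so once dialogue words have been located in the audio (for example by forced alignment), a simple scale-and-offset fit can remap the cue times. The sketch below assumes that matching has already been done; the cue structure, field names, and purely linear model are illustrative assumptions.

```python
# Sketch: re-fit caption cue times to word timings recovered from the audio.
import numpy as np

def fit_drift(caption_times: list[float], audio_times: list[float]):
    """Fit audio_time ≈ scale * caption_time + offset over matched anchors."""
    scale, offset = np.polyfit(caption_times, audio_times, 1)
    return scale, offset

def realign_cues(cues: list[dict], scale: float, offset: float) -> list[dict]:
    """Return new cues with start/end remapped through the fitted model."""
    return [{**cue,
             "start": cue["start"] * scale + offset,
             "end": cue["end"] * scale + offset}
            for cue in cues]

# Example: captions timed for 25 fps playing against 23.976 fps audio, so
# the caption file drifts further behind as the program runs.
anchors_caption = [10.0, 60.0, 600.0]      # cue starts in the caption file
anchors_audio   = [10.43, 62.56, 625.63]   # same words located in the audio
scale, offset = fit_drift(anchors_caption, anchors_audio)
cues = [{"start": 10.0, "end": 12.5, "text": "Hello there."}]
print(realign_cues(cues, scale, offset))
```

For untimed transcripts the same machinery applies in reverse: the word timings recovered from the audio are inserted into the transcript to cut it into new, correctly timed caption segments.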
 
Manufacturers have found it useful to build such verification tools on a RESTful API software architecture, a common standard, together with XML-based file protocols. That potentially permits their software to run behind any interface and to link with, be initiated by, or even be controlled by third-party products in the video ingest and conversion space, such as Amberfin's popular iCR product.
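The integration pattern looks something like the sketch below: an ingest or transcoding tool kicks off a verification job by POSTing an XML job description to a REST endpoint. The URL, XML schema, and response handling are entirely invented for this illustration and do not correspond to any actual product's API.

```python
# Purely hypothetical illustration of the REST + XML integration pattern.
import requests

job_xml = """<?xml version="1.0" encoding="UTF-8"?>
<verificationJob>
  <asset>s3://bucket/show_ep101_mezz.mxf</asset>
  <checks>
    <check type="captionCoverage" minCoverage="0.90"/>
    <check type="languageVerification" expected="en" candidates="en,es,fr"/>
    <check type="videoDescription" required="true"/>
  </checks>
</verificationJob>
"""

response = requests.post(
    "https://qc.example.com/api/v1/jobs",   # hypothetical endpoint
    data=job_xml.encode("utf-8"),
    headers={"Content-Type": "application/xml"},
    timeout=30,
)
response.raise_for_status()
print("Job accepted:", response.status_code)
```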
 
Such technology is just now rolling out to broadcasters, so data on the effectiveness of this approach will not be available for quite a while. It is also possible that there will be changes to the approach and available products, and new ones will evolve over time. The goals of this strategy, however, are now well established and solutions for the best ways to satisfy those goals have advanced greatly. 
 
In summary, those goals are to automate QC/compliance processes that have traditionally been manual; to address the challenge of scale so that all assets are covered in the increasingly file-based world in which broadcasters now find themselves; to avoid costly mistakes that could result in regulatory fines and setbacks; and, with increased automation, to free up resources and manpower for other high-value tasks. Most fundamentally, the initiative is aimed at improving the quality of content and the experience of watching that content in alternative, non-traditional situations, such as IP-based viewing. In a broadcast world that has left tape behind in favor of data, industry experts suggest these developments are long overdue and expect this to be one of the main growth areas in the broadcast data management arena for the foreseeable future.
 
 
 
Copyright 2014 the Society of Motion Picture and Television Engineers, Inc. (SMPTE). All rights reserved.