SMPTE Newswatch Masthead

Hot Button Discussion

LipSync Progress 
By Michael Goldman 

In the world of technical broadcast standards, 2013 showed major progress. 2014 promises to significantly extend that progress, as it relates to a problem that began vexing content creators almost from the first day that sound was added to moving images, and then became far more complicated once the digital broadcast, multiplatform era took over. That longstanding problem involves how to seamlessly synchronize audio and video program signals from the moment of acquisition to when that content reaches viewers. This challenge, of course, is better known as the "lip-sync issue."

Veteran broadcast industry consultant Paul Briscoe, Chair of the SMPTE Ad Hoc Group (24TB-01 AHG Lip-sync) tasked with finding a practical, interoperable lip-sync broadcast standard at long last, emphasizes the irony that lip sync became a significantly more important issue as digital broadcasting technology marched inexorably forward.

"With film, at the instant of capture, they solved it simply by using the slate in view of the camera," Briscoe explains. "In television, early systems captured audio and video, and they were immediate systems, very direct. In those early days [of broadcast television], sound and picture from the studio was hundreds of microseconds away from the viewer - the camera scanned the picture, the microphone picked up audio, and it all went through a simple electronic system via a transmitter to a receiver to a screen and speaker in your living room in literally no time at all. So there was essentially no opportunity for lip-sync errors to occur."

"However, technology has advanced greatly, and now we record audio and video in television, and that gives way to editing. As soon as we start editing, we have the opportunity to separate audio and video, handle them discretely, and then put them back together later. Combine that with the fact that digital broadcasting systems have become insanely complicated. We have gone from a state where audio and video from a studio or videotape recording was presented to a transmitter and then sent to the viewer almost instantly, to today, where the studio or recording device can put out video and audio totally in sync. But, it then sends it through a complicated system that involves video and audio compression and decompression, storage on media, often in compressed formats, with various types of processing, including switching, frame-rate conversion, format conversion, different codecs, and much more. All these things bring the opportunity for video and audio to rip apart in time. And that does not even include the total lack of management of lip sync on the IP infrastructures we are evolving toward today."

Nor, Briscoe adds, does it include further potential processing complications on the consumer end, from different satellite, cable, and IP pathways to a veritable orgy of set-top box technologies to virtually unlimited home-theater and media player wiring configurations that can further compromise the chances of video and audio coming out properly synchronized. This reality, he emphasizes, is not merely an annoyance - it is something that "grates on our brains," and thus, it has evolved into an issue that can interfere with getting business done efficiently from the broadcaster's point of view. Unsynchronized signals can interfere with program enjoyment, the suspension of disbelief, concentration, and perhaps most importantly, the ability to absorb advertiser messages.

"Impact on the audience can vary, but [studies have shown] that humans have a directional sensitivity," Briscoe relates. "At a ballgame, because sound travels slower than light, you see the ball hit the bat many milliseconds before you hear it. This is normal, and our brains are programmed to accept this. Therefore, we tend to accept late lip-sync errors a little better than early ones. Because our brains have no reference or model for when the sound arrives before the picture, it makes us uncomfortable. Once we notice these kinds of lip-sync errors, there is a hysteresis effect, meaning that once you have seen it, you cannot easily unsee it. Your brain will continue to remain suspicious, even if it only happened for a moment. Your ability to engage the fantasy of the program is interrupted by the conscious realization of a real-world technical problem. There are studies that indicate that lip-sync errors create loss of believability, and advertisers hate anything that makes their content less believable."

As high definition, and now UHDTV and other picture-quality improvements, have evolved, the problem has gotten worse, Briscoe adds, because now it is possible to notice such errors easily and in more detail. The problem exists at all levels, "simply because of a lack of standards," he says. "We have many measurement technologies available today, but they are proprietary, and many are done manually by human activity throughout the production and distribution chain. There needs to be a single interoperable measurement method to properly address the issue."

Briscoe states that lip sync can be measured in two basic ways. First, broadcasters can use an "out of service" approach, by which they take video or audio channels out of use entirely and examine test signals. He compares this to putting a car up on a lift. All can look well when the vehicle is on the lift - you can work on the car, but you can't drive it anywhere.

The second method involves "in-service measurement," more along the lines of [performing] diagnostics while driving the car - using proprietary devices to analyze live signals. Briscoe says there are as many as a dozen manufacturers who make technology for automated in-service measurement, "but they all do it in their own manner. They are all proprietary, and many are protected by patents, which precludes or at least discourages other manufacturers from using that technology to build compatible products."

Such hardware tends to be "task-specific box or modular solutions; usually they are big and they don't work with each other," Briscoe states. In other words, interoperability is currently not part of this paradigm. By contrast, he says, the Ad Hoc Committee decided any standard had to ensure interoperability as its foundation.

"We want something you can use in-service, and of course, if you can use it in-service, you can obviously use it out of service, as well, at any time, for any content, on air, or off," Briscoe says. "We want something medium agnostic - we don't want the solution to care how the signal is being conveyed or stored or transported. We also want a low degree of complexity - something you can put on any port of any device, and something that can be interoperable between manufacturers. We also want it to be able to make it through all kinds of signal processing without being damaged. It also has to work throughout the chain, and not just upstream in the production and presentation end - something that can make it to the home and be used there. SMPTE's domain doesn't go that far. We stop at the edge of that media ecosystem. But if we can come up with something sufficiently compelling, the consumer electronics world will see there is something that works, and will eventually write the standard into their equipment."

The working group has diligently examined two basic methods for building this foundation. The first approach is the so-called "watermarking" technique, which is basically what most of the aforementioned proprietary systems currently use in one form or another. The idea there is to include "something invisible and inaudible" in the video and audio streams that can later be extracted, Briscoe states.

"That requires complicated electronics and software, and ultimately, the very kind of complicated algorithms that are often protected by patent as we've seen [in proprietary systems]," he says. "The biggest weakness of this approach is that when you process the video, say by taking a high-definition picture and down-converting it to a standard-definition picture, the watermark can often become lost. When you downscale the picture to lower quality, bits in the watermarking can get washed away. Watermarks also do not always co-exist well with other watermarks and, in some situations, they can be removed."

Therefore, the Ad Hoc group has focused its efforts on a different methodology: "fingerprinting." This method involves analyzing the signals for uniqueness after a criterion has been established for audio and video.

"In this case, we look for changes from frame to frame in the picture, and some characteristic in the audio that is time varying," Briscoe explains. "The nice thing about fingerprinting is that it does not modify the content. It looks at picture and sound, makes decisions, and derives bits of fingerprint data. That fingerprint data is then transported along with the media downstream as metadata, much like captions."

Briscoe says fingerprints, if attached as metadata, "are particularly robust and can go through many stages of processing, up and down conversion, and other typical picture destroying [factors], even picture cropping, for example."

He elaborates that the concept begins by representing in the metadata the relationship between video and audio at a time when synchronization was known to be correct. From that point on, processes regarding how to measure delays or differences, what to do about them, and when, need not necessarily be part of some universal approach. If the standard is interoperable and built into hardware and software up and down the chain and, eventually, into consumer devices, broadcasters can develop protocols about when to search for differences, how to flag them, when to fix them, and when to simply monitor them and leave them for correction at some later stage, even someday upon delivery to the consumer's viewing device. The important thing is simply that the beginning and end points be handled correctly, and that the format and method for attaching and transporting lip-sync metadata be built upon a singular industry standard, Briscoe suggests.
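One way a downstream device might use such metadata can be sketched in a few lines of Python. This is a minimal illustration, not the Ad Hoc Group's method: the function names (best_lag, sync_error_frames), the use of one numeric fingerprint per frame for both audio and video, and the brute-force search are all assumptions made for the example. The sketch slides the fingerprints measured downstream against the reference fingerprints carried in the metadata, finds the shift at which each stream matches best, and reports the difference between the audio and video shifts as the lip-sync error.

```python
from typing import Sequence


def best_lag(reference: Sequence[float], measured: Sequence[float], max_lag: int) -> int:
    """Return the shift, in frames, at which `measured` best matches `reference`.

    A positive result means the measured stream is delayed relative to the
    reference. The match cost is the mean absolute difference over the
    overlapping region (a hypothetical choice for this sketch).
    """
    best, best_cost = 0, float("inf")
    for lag in range(-max_lag, max_lag + 1):
        pairs = [
            (reference[i], measured[i + lag])
            for i in range(len(reference))
            if 0 <= i + lag < len(measured)
        ]
        if not pairs:
            continue  # no overlap at this shift
        cost = sum(abs(a - b) for a, b in pairs) / len(pairs)
        if cost < best_cost:
            best, best_cost = lag, cost
    return best


def sync_error_frames(ref_video, meas_video, ref_audio, meas_audio, max_lag: int) -> int:
    """Audio delay minus video delay, in frames.

    Positive: audio arrives late relative to picture. Negative: audio is early.
    """
    return best_lag(ref_audio, meas_audio, max_lag) - best_lag(ref_video, meas_video, max_lag)


if __name__ == "__main__":
    reference = [0, 5, 1, 9, 2, 7, 3, 8]   # fingerprints captured while in sync
    delayed_video = [0, 0] + reference     # video delayed by two frames
    delayed_audio = [0] + reference        # audio delayed by one frame
    print(sync_error_frames(reference, delayed_video, reference, delayed_audio, max_lag=3))
```

With the video delayed by two frames and the audio by one, the sketch reports -1, meaning the audio now leads the picture by one frame. What to do with that number (flag it, correct it, or pass it along) is left open, in keeping with the approach Briscoe describes.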

The SMPTE group expects to finalize its work in 2014 to make launching the standard possible, and Briscoe says this work is well under way. The first job involved establishing algorithms to generate video and audio fingerprints, and that work has been done. An algorithm to analyze video content on a frame-to-frame basis and score differences based on established picture criteria has been developed, and similarly, so has an algorithm for comparing audio magnitude at any particular moment to long-term averages on a sample-to-sample basis. "Those two algorithms have been tested extensively," Briscoe explains. "They are particularly unbreakable."
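The general shape of those two ideas can be illustrated with a short Python sketch. It is not the group's algorithm, whose details are not given here. The assumptions are these: a frame is a flat list of luma values, the frame-to-frame score is a mean absolute difference, the "long-term average" is an exponential moving average controlled by a hypothetical smoothing parameter, and the audio fingerprint is one bit per sample.

```python
from typing import List, Sequence


def video_fingerprints(frames: Sequence[Sequence[int]]) -> List[float]:
    """One score per frame: mean absolute luma difference from the previous frame.

    Assumes all frames have the same number of pixels. The first frame has no
    predecessor, so its score is 0.0.
    """
    if not frames:
        return []
    scores = [0.0]
    for previous, current in zip(frames, frames[1:]):
        total = sum(abs(a - b) for a, b in zip(previous, current))
        scores.append(total / len(current))
    return scores


def audio_fingerprints(samples: Sequence[float], smoothing: float = 0.01) -> List[int]:
    """One bit per sample: 1 when the sample magnitude exceeds the running average.

    The running average is an exponential moving average of past magnitudes,
    a stand-in (assumed for this sketch) for the long-term average.
    """
    bits = []
    average = 0.0
    for sample in samples:
        magnitude = abs(sample)
        bits.append(1 if magnitude > average else 0)
        average += smoothing * (magnitude - average)  # update after comparing
    return bits


if __name__ == "__main__":
    frames = [[0, 0], [10, 20], [10, 20]]             # three tiny two-pixel frames
    print(video_fingerprints(frames))                  # a cut, then a static frame
    print(audio_fingerprints([0.0, 1.0, 0.0], smoothing=0.5))
```

Neither function alters the picture or the sound; both only read the content and derive data from it, which is the property Briscoe highlights as the advantage of fingerprinting over watermarking.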

Next, the group specified how metadata would be formatted by designing what Briscoe calls "little containers into which fingerprints from audio and video for each frame would go. Along with a few other helper bits, which help the system to work, this fingerprint container is associated with each video frame - it's a constant stream of fingerprint containers."
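As a rough picture of what such a per-frame container could look like in software, here is a minimal Python sketch. Everything about the layout is hypothetical: the field names, the field widths, the big-endian byte order, and the use of a frame counter as the "helper" data are assumptions made for illustration, not the format the group defined.

```python
import struct
from dataclasses import dataclass
from typing import Iterable, Iterator

# Hypothetical wire layout: 4-byte frame counter, 1-byte video fingerprint,
# 2-byte packed audio fingerprint bits, big-endian, no padding (7 bytes total).
_LAYOUT = struct.Struct(">IBH")


@dataclass(frozen=True)
class FingerprintContainer:
    frame_index: int  # helper field (assumed): which frame this container describes
    video_fp: int     # video fingerprint for the frame, 0..255 in this sketch
    audio_fp: int     # audio fingerprint bits for the frame, 0..65535 in this sketch

    def to_bytes(self) -> bytes:
        """Serialize to the fixed-size layout; raises struct.error if out of range."""
        return _LAYOUT.pack(self.frame_index, self.video_fp, self.audio_fp)

    @classmethod
    def from_bytes(cls, payload: bytes) -> "FingerprintContainer":
        """Rebuild a container from bytes produced by to_bytes()."""
        return cls(*_LAYOUT.unpack(payload))


def container_stream(video_fps: Iterable[int], audio_fps: Iterable[int]) -> Iterator[FingerprintContainer]:
    """Yield one container per video frame, pairing the two fingerprint streams."""
    for index, (video_fp, audio_fp) in enumerate(zip(video_fps, audio_fps)):
        yield FingerprintContainer(index, video_fp, audio_fp)


if __name__ == "__main__":
    for container in container_stream([12, 200], [0x00FF, 0xBEEF]):
        print(container, container.to_bytes().hex())
```

A small, fixed-size record per frame is one way to keep the metadata simple enough to ride along with the media, much as captions do, and to be read by equipment from different manufacturers.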

The group then defined how such containers could be transported. Basically, they defined three primary streaming domains in which the containers could swim - SDI, the industry standard for production; MPEG-2, widely used for compressed distribution of program content to end users; and IP, for IP networks, the Internet, or small LANs, where compressed media is increasingly being transported today.

One piece is not yet finalized - the so-called "file binding" piece on how to attach the metadata to files. This area is consuming most of the committee's current active work, and largely revolves around what file formats on a landscape of endless file formats are worthwhile for linking to lip-sync containers. Native MXF files will likely be one of them, since MXF is a clear professional standard, and many others are under examination, with the goal of making file binding as agnostic as possible. Briscoe does not have a forecast for how quickly this remaining work will be done; however, he emphasizes that the other aforementioned stages, about to be finalized and published, were the more complex ones, and so he is confident that "a solution is at hand if we spring ahead a matter of months."

Finally, Briscoe emphasizes, the Ad Hoc Committee has established liaisons with the Consumer Electronics Association (CEA) and other outside standards development organizations to "build critical mass" for the looming standard in hopes that the electronics industry will broadly get on board and work toward what he calls a future "broadcaster's Nirvana" in which an end user's device will be able to routinely and automatically synchronize signals that aren't already synchronized by the time they get to the home.

"The hope is [that] they will see what we have done and pick up our standard and feel that, together, we can solve the problem all the way to the end stream," he says. "That's what we are working toward."

Click here to view a recent SMPTE Standards Update Webcast on this subject featuring Briscoe discussing these issues, and other recent lip-sync-related developments.

News Briefs

Faster Satellite TV  
TV Technology recently reported on a new research project aimed at developing a data-coding technology for potentially doubling bandwidth for satellite Internet connections. The report says the research, conducted jointly by MIT's Lincoln Laboratory and Ireland's Hamilton Institute, revolves around developing a variant of the Transmission Control Protocol (TCP) method of transmitting data, whereby mathematical equations replace the traditional "handshake" necessary for data packets to travel to their destination correctly - an ongoing problem for satellite transmission, because packet loss and latency are greater with that method of data delivery. The article says vigorous testing of the coding technology is set to get under way in 2014, with the hope that clean video transmissions via satellite IP, with far less latency, will result. The article has links to earlier work that the research is building on. Download the researchers' paper here.

Standardized Camera Reports   
The Visual Effects Society (VES) has taken the lead in attempting to find a solution to a growing challenge as data-based production and post-production, particularly visual effects, become increasingly complex in modern movie-making. That challenge involves the issue of acquiring and maintaining 100% accurate camera data and measurements from the set all the way through the visual effects process. The VES argues there has never been a fundamental standard for an industry-wide camera report format, and so it is working to change that. Its first step is to base the standard on the Filemaker 11 database CSV format, simply because Filemaker is readily available at little or no cost as a mobile device application, and thus is widely accessible at all levels of the industry. The VES project will define a standard set of fields and format specifications for importing and exporting data in a way that would work for film, digital, 3D, and multiformat projects. For more, here is the VES website on the project, and here is some detailed analysis from FXGuide.

Bidding Film Farewell
A great irony this awards season illustrates the radical transition of the motion-picture business from film-based acquisition and distribution to an all-digital universe. The 2014 Sci-Tech Award nominations included this year, for the first time ever, a special Oscar statuette presentation slated to be made at the event's annual dinner in mid-February directed not to a particular individual, company, or specific technological breakthrough. Rather, the Academy announced it would be collectively honoring essentially all film laboratory employees throughout the generations "who have contributed extraordinary efforts to achieve filmmakers' artistic expectations for special film processing and the billions of feet of release prints per year." The group award, explained in this recent Wrap article by Steve Pond, which also lists all the other 2014 Sci-Tech nominees, is obviously a tribute to the film medium's century-plus of service to motion pictures at the very time when that technology is rapidly being phased out. The irony of that reality was never more obvious than this year when, simultaneous to the Sci-Tech honor, all of the Academy's Best Picture nominees were shot using digital camera systems, as explained in this NoFilmSchool article, which highlights an interesting set of graphics created for the Setlife Magazine Facebook page that details the technical specifications for all of this year's Oscar-nominated films in major categories.