Comparing Google’s AI Speech Recognition To Human Captioning For Television News

Source: forbes.com

Most television stations still rely on human transcription to generate the closed captioning for their live broadcasts. Yet even with the benefit of human fluency, this captioning can vary wildly in quality, even within the same broadcast, from a nearly flawless rendition to near-gibberish. Even the best human captioning often skips over words during fast speech or misspells complex or lengthy names. At the same time, automatic speech recognition has historically struggled to achieve sufficient accuracy to entirely replace human transcription. Using a week of television news from the Internet Archive’s Television News Archive, how does the primarily human-created closed captioning provided by stations compare with machine transcripts generated by Google’s Cloud Speech-to-Text API?

Automated high-quality captioning of live video represents one of the holy grails of machine speech recognition. While machine captioning systems have improved dramatically over the years, there has still been a substantial gap holding them back from fully matching human accuracy. This raises the question of whether the latest generation of video-optimized speech recognition models can finally achieve near-human fluency.

Google’s Cloud Speech-to-Text API offers several different recognition models, including one specifically tuned for video transcription. The question is how well this model performs in the chaotic, rapid-fire environment of television news, which can switch from studio news reading to on-scene reporting to large panels of experts talking over one another to fast-talking advertisements.

Using the station-provided captioning as a baseline, how different is the machine transcript?

To explore what this might look like, the full airtime of CNN, MSNBC and Fox News and the morning and evening broadcasts of San Francisco affiliates KGO (ABC), KPIX (CBS), KNTV (NBC) and KQED (PBS), from April 15 to April 22, 2019, totaling 812 hours of television news, were analyzed using Google’s Cloud Speech-to-Text transcription API with all of its features enabled.
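
As a rough illustration of what that processing involves, the sketch below submits one broadcast’s audio track to the Speech-to-Text API using its video-optimized model. The Cloud Storage path, the timeout and the particular feature flags shown here are assumptions for illustration; the article does not list the exact configuration used.

```python
from google.cloud import speech


def transcribe_broadcast(gcs_uri):
    """Transcribe one broadcast's audio track with the video-optimized model."""
    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        language_code="en-US",
        model="video",                      # model tuned for video/broadcast audio
        use_enhanced=True,                  # request the enhanced variant of the model
        enable_automatic_punctuation=True,  # insert punctuation into the transcript
        enable_word_time_offsets=True,      # per-word timestamps
    )
    audio = speech.RecognitionAudio(uri=gcs_uri)
    # Long audio must go through the asynchronous long-running operation.
    operation = client.long_running_recognize(config=config, audio=audio)
    response = operation.result(timeout=3600)
    return " ".join(result.alternatives[0].transcript for result in response.results)


# Hypothetical usage for one hour of extracted broadcast audio.
transcript = transcribe_broadcast("gs://example-bucket/kgo-2019-04-15-morning.flac")
```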

To test how human captioning versus automated transcription might affect machine analysis of the resulting text, both were processed through Google’s Natural Language API for entity extraction.
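
A minimal sketch of that step might look like the following, assuming the text for one broadcast is already available as a plain-text file. The file path and broadcast length are placeholders, counting entity mentions rather than unique entities is my interpretation of the density measure, and very long transcripts may need to be split to stay under the Natural Language API’s per-document size limit.

```python
from google.cloud import language_v1


def count_entity_mentions(text):
    """Count the entity mentions the Natural Language API finds in one text."""
    client = language_v1.LanguageServiceClient()
    document = language_v1.Document(
        content=text,
        type_=language_v1.Document.Type.PLAIN_TEXT,
        language="en",
    )
    response = client.analyze_entities(request={"document": document})
    return sum(len(entity.mentions) for entity in response.entities)


# Hypothetical usage: seconds of airtime per recognized entity mention.
transcript_text = open("broadcast_transcript.txt").read()  # placeholder path
broadcast_seconds = 3600                                   # assumed one-hour broadcast
print(broadcast_seconds / count_entity_mentions(transcript_text))
```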

Google’s Natural Language API identified an entity every 6.97 seconds in the automated transcripts, but only one entity every 11.62 seconds in the station-provided captioning.

The graph below compares the average number of seconds per entity across the seven stations between the automated transcripts and the station-provided captioning.

Immediately noticeable is that the automated transcripts consistently produce a greater density of recognized entities compared with the station-provided captioning. This ranges from 1.4 times more for Fox News to 2.2 times more for PBS.

The primary reason for this appears to be that the station-provided captioning is entirely uppercase, while the machine transcripts are correctly capitalized, using the linguistic capitalization model built into Google’s Speech-to-Text API. Google’s Natural Language API relies on proper capitalization to correctly identify entities and their textual boundaries and to distinguish proper nouns from ordinary words.
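
The effect is straightforward to demonstrate. The snippet below, using an invented example sentence, sends the same text to the entity extractor in its original mixed case and again in all caps; the all-caps version will typically yield fewer, or differently bounded, entities, though the exact output depends on the API.

```python
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()


def entity_names(text):
    """Return the entity names the Natural Language API finds in `text`."""
    document = language_v1.Document(
        content=text, type_=language_v1.Document.Type.PLAIN_TEXT, language="en")
    response = client.analyze_entities(request={"document": document})
    return [entity.name for entity in response.entities]


sentence = "CNN's Dana Bash spoke with Robert Mueller in Washington."
print(entity_names(sentence))          # mixed case, as in the machine transcript
print(entity_names(sentence.upper()))  # all caps, as in the station captioning
```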

The significant difference between the transcript and captioning entities for PBS appears to be due to a particularly high density of typographical errors in its station-provided captioning, which both affected entity mentions themselves and interrupted the grammatical flow of the text enough to impair the API’s ability to identify entity boundaries.

In fact, examining the station-provided captioning word by word for each station, the graph above reflects to some degree the level of error in the captioning: the more closely a station’s captioning matches the machine transcript, the higher its fidelity and the lower its error rate.

An even greater driving factor is that captioning typically does not include advertisements, while the machine transcript includes all spoken words. This means that stations devoting a greater proportion of their airtime to ads will show a greater difference.

One limitation of this graph is that it shows only the density of entity mentions, not how well they match up between the captioning and transcript. The two could contain a similar number of entities, yet due to human or machine error the extracted entities themselves could be completely different.

To test this, a master histogram of all extracted entities was compiled and the Pearson correlation computed for each station between its captioning entities and transcript entities, seen in the graph below. Only entities that did not include a number and appeared at least five times across the combined airtime of the seven stations were considered.
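
A simplified version of that computation is sketched below. One assumption to note: the article applies the five-mention threshold across the combined airtime of all seven stations, while for brevity this sketch applies it to a single station’s combined counts.

```python
from collections import Counter

import numpy as np


def entity_correlation(caption_entities, transcript_entities, min_count=5):
    """Pearson correlation between captioning and transcript entity histograms.

    Both arguments are lists of entity strings extracted by the Natural
    Language API. Entities containing a digit are dropped, as are entities
    appearing fewer than `min_count` times in total.
    """
    cap = Counter(e for e in caption_entities if not any(c.isdigit() for c in e))
    trans = Counter(e for e in transcript_entities if not any(c.isdigit() for c in e))
    vocab = [e for e in cap.keys() | trans.keys() if cap[e] + trans[e] >= min_count]
    x = np.array([cap[e] for e in vocab], dtype=float)
    y = np.array([trans[e] for e in vocab], dtype=float)
    return float(np.corrcoef(x, y)[0, 1])
```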

Across all seven stations the total correlation was r=0.95, ranging from 0.96 for CNN and MSNBC and 0.95 for Fox News down to 0.75 for CBS. Interestingly, the three national stations have the highest correlations and the four network stations the lowest.

One possible explanation is that since the network stations included only the morning and evening broadcasts, the advertising airtime for these stations constituted a larger portion of the total monitored volume.

Comparing the captioning and transcripts through their API-extracted entities offers a glimpse at how their differences can affect downstream machine understanding algorithms. At the same time, capitalization and typographical errors can have a profound effect on today’s textual deep learning systems, as seen in the results above. What might the same comparisons look like when applied to the text itself?

The chart below shows the total number of unique words by station, illustrating that for most stations the machine transcript and the primarily human-derived closed captioning have similar vocabularies. The only outlier is PBS, whose captioning has 1.6 times more unique words than the machine transcript. A closer inspection reveals nearly all of these to be typographical errors, again reflecting the higher error rate of its original captioning.

Looking at the total number of uttered words, the graph below shows that for all stations more words were recorded in the transcript than in the closed captioning, primarily reflecting the uncaptioned advertising airtime. This is also why PBS, which carries little traditional advertising, has nearly equal word counts for the two.

The much larger number of words on CNN, MSNBC and Fox News reflects that their entire airtime for the week was examined here, while the four network stations only included their morning and evening broadcasts.

The graph below shows the Pearson correlation of the captioning and transcript vocabularies. Only words that did not include a number and appeared at least five times across the combined airtime of the seven stations were considered, leaving a total of 27,876 distinct words.

All seven stations had correlations higher than 0.989, indicating that despite their differences, the total vocabulary use of the captioning and transcripts was extremely similar.

Despite their similar vocabularies, the real test of how closely the station-provided captioning and the machine-generated transcripts align is to perform a “diff” between the two.

For each broadcast, both the captioning and the machine transcript were converted to uppercase and all characters other than ASCII letters were converted to spaces. The resulting text was split into words on space boundaries and the two files were run through the standard Linux diff utility. The total number of words flagged as changed was divided by the total number of compared words, yielding a change density.
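
A sketch of that pipeline, shelling out to the same Linux diff utility, might look like the following. Counting changed words from diff’s ‘<’ and ‘>’ lines is my interpretation of the change-density measure described above.

```python
import os
import re
import subprocess
import tempfile


def change_density(caption_text, transcript_text):
    """Fraction of compared words flagged as changed by the Linux diff utility."""
    paths = []
    word_counts = []
    for text in (caption_text, transcript_text):
        # Uppercase, keep only ASCII letters, and write one word per line for diff.
        words = re.sub(r"[^A-Za-z]+", " ", text).upper().split()
        word_counts.append(len(words))
        tmp = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
        tmp.write("\n".join(words) + "\n")
        tmp.close()
        paths.append(tmp.name)
    # diff exits with status 1 when the files differ, so the return code is not checked.
    out = subprocess.run(["diff"] + paths, capture_output=True, text=True).stdout
    for p in paths:
        os.unlink(p)
    changed = sum(line.startswith(("<", ">")) for line in out.splitlines())
    total = sum(word_counts)
    return changed / total if total else 0.0
```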

In total, the captioning and transcripts matched for around 63% of the total words, with the stations falling in a fairly narrow band from 55% similar (CBS) to 68% similar (PBS and CNN).

These percentages seem unexpectedly low given the quality of modern speech recognition.

A closer inspection of the differences explains why: the machine transcript typically offers a more faithful and accurate rendition of what was actually said than the station-provided captioning, which is typically transcribed by a human.

For example, the station-provided captioning for this CNN broadcast introduces a set of panelists as “Dana Bash, Crime and Justice Reporter Shimon Prokupecz and Evan Perez.” In contrast, the machine-generated transcript has the actual wording as spoken on the air: “CNN’s Chief Political Correspondent Dana Bash, CNN Crime and Justice Reporter Shimon Prokupecz and CNN Senior Justice Correspondent Evan Perez,” which includes their full titles.

The very next minute of that same broadcast includes several more differences, including the captioning’s “guide post” versus the machine’s correct transcription of the plural “guideposts.” Likewise, while the captioning includes the phrase “that he told me,” the machine transcript correctly records that the panelist actually repeated herself, stating “that he that he told me.”

Neither the captioning nor the transcripts typically record speech disfluencies, with Google’s API designed to ignore fillers like “um” and “er.”

This suggests that a major driving force behind this low agreement between human and mechanized transcription may be the much higher fidelity of the machine in recording what was actually said word-for-word.

An even greater influence is the fact that the machine transcripts include advertisements, while the captioning does not.

This suggests that a better comparison would be to exclude any differences involving added text found only in the machine transcript. This still counts words from the captioning that are missing in the transcript and words that are present in both but spelled differently.
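
One way to implement that restriction, building on the diff approach sketched earlier, is to count only the captioning side of each hunk (diff’s ‘<’ lines), which covers captioned words missing from the transcript and words spelled differently in the two, while ignoring words that appear only in the transcript. Normalizing by the captioning’s word count is my own choice; the article does not spell out the exact denominator.

```python
import os
import re
import subprocess
import tempfile


def caption_side_change_density(caption_text, transcript_text):
    """Change density that ignores words appearing only in the machine transcript."""
    paths = []
    caption_word_count = 0
    for i, text in enumerate((caption_text, transcript_text)):
        words = re.sub(r"[^A-Za-z]+", " ", text).upper().split()
        if i == 0:
            caption_word_count = len(words)
        tmp = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
        tmp.write("\n".join(words) + "\n")
        tmp.close()
        paths.append(tmp.name)
    out = subprocess.run(["diff"] + paths, capture_output=True, text=True).stdout
    for p in paths:
        os.unlink(p)
    # '<' lines come from delete and change hunks on the captioning side only.
    changed = sum(line.startswith("<") for line in out.splitlines())
    return changed / caption_word_count if caption_word_count else 0.0
```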

This results in the graph below, showing an average agreement of 92%, ranging from 87% for PBS to 93% for CNN and MSNBC.

This makes it clear that the majority of differences between the two are the addition of advertising narration in the machine transcript and the higher fidelity of the machine in capturing details such as repetition and full spoken titles.

Looking more closely at the remaining differences, many are actually typographical errors in the human-produced captioning.

Some remaining differences revolve around certain newsreader and panelist names that the machine attempted to spell phonetically, as well as panelists’ mispronunciations of names, such as Mueller being rendered as “mother.”

Thus, the actual alignment between human and machine is much greater than 92%.

Most importantly, the high degree of error in the human-generated captioning means it is not technically a gold standard. Thus, the 8% disagreement rate between the human and machine does not mean the machine has an 8% error rate. A considerable portion of that error actually resides in the human captioning, rather than the machine transcript.

Google’s Speech-to-Text API supports external domain adaptation dictionaries that can provide the correct spellings of specific terminology or proper names. In the future, the full list of each station’s newsreaders and anchors, as well as the names of major figures currently in the news, could be added to these dictionaries to ensure their names are correctly recognized and spelled by the API.
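
A minimal sketch of how such hints could be supplied is below, using the API’s phrase-hint mechanism; the phrase list is drawn from the names mentioned above purely for illustration.

```python
from google.cloud import speech

client = speech.SpeechClient()

# Illustrative list of a station's anchors and currently newsworthy names.
station_phrases = ["Dana Bash", "Shimon Prokupecz", "Evan Perez", "Robert Mueller"]

config = speech.RecognitionConfig(
    language_code="en-US",
    model="video",
    enable_automatic_punctuation=True,
    # Phrase hints bias recognition toward these exact spellings.
    speech_contexts=[speech.SpeechContext(phrases=station_phrases)],
)
audio = speech.RecognitionAudio(uri="gs://example-bucket/broadcast-audio.flac")
operation = client.long_running_recognize(config=config, audio=audio)
```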

Putting this all together, automated speech recognition has improved dramatically over the last few years. Comparing the largely human-generated closed captioning of a week of television news against Google’s completely automated transcripts generated by its off-the-shelf Cloud Speech-to-Text API, the two are more than 92% similar after accounting for the inclusion of advertising and the higher fidelity of the machine transcript.

In fact, the machine actually beats the human-produced captioning along almost every dimension, from its higher fidelity to what was actually said and its lower error rate to its lack of typographical mistakes and proper capitalization. While the tests here utilized Google’s API without any customization, even a simple dictionary of the names commonly appearing on each station and the major names in the news at the moment would fix many of the remaining errors.

The machine transcripts still contain errors, but we are now at a point where fully automated transcripts can rival the accuracy of real-time human transcription for television news content. As these models continue to improve, it will only be a matter of time before machine transcription becomes more accurate and robust than human keying.

In the end, these graphs show us just how far the AI revolution has come.

I’d like to thank the Internet Archive and its Television News Archive, especially its Director Roger Macdonald. I’d like to thank Google for the use of its cloud, including its Video AI, Vision AI, Speech-to-Text and Natural Language APIs and their associated teams for their guidance.
