
Startup Aims for Real-Time “Human-Level” AI Transcripts

Echo Labs seeks a solution with a biologically inspired approach

4 min read
Illustration showing audio-to-text translation. Credit: iStock

College students Edward Aguilar and Sahan Reddy are taking on one of artificial intelligence’s most difficult problems: building an AI that can recognize and transcribe speech as well as a human can, in real time.

To achieve this goal, the duo formed the startup Echo Labs earlier this year and have already raised over US $2 million in pre-seed funding. They have also been accepted into a number of technology accelerator programs, including Transform, a new data and AI accelerator out of the University of Chicago, which announced on 12 September that it will invest a total of $250,000 in Echo Labs, consisting of cash and other resources.

For many of us, AI-powered speech recognition is already a part of daily life. It’s baked into your smart speaker and your phone’s voice assistant. Products like Otter.ai may already be transcribing your Zoom meetings in real time or jotting down ideas from an in-person brainstorming session.

Mark Hasegawa-Johnson is a professor of electrical and computer engineering at the University of Illinois Urbana-Champaign whose research looks at speech recognition through mathematical linguistic models. He says that even though speech recognition technology has come a long way in the past five years, and has even passed a number of benchmarks meant to certify its transcriptions as “human-level,” from a user standpoint it’s clear that there’s still more ground to be covered.

“Machines still do not generalize as well as humans,” Hasegawa-Johnson says. “For example, across different topics of conversation, different speakers—especially children, speakers with disabilities, and people with second-language accents—or different acoustic recording environments.”

It’s these edge cases that Echo Labs hopes to tackle. Yet despite the company’s current focus, Aguilar says Echo Labs began as a clever way to get out of redundant class lectures, built around an application he designed called BuellerBot. At its core, this bot is made of three separate pieces: speech-transcription software to join and automatically transcribe a Zoom call, a ChatGPT prompt to generate responses to questions posed in the lectures, and a speech synthesizer to mimic Aguilar’s voice.

“I wrote some extra code that glued all that together, and then it has the ability to listen to your name, unmute all that,” Aguilar says. “So you have a little version of yourself that could join automatically and then get you out of class by responding to everything. It was great.”
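BuellerBot’s code isn’t public, but the three-part design Aguilar describes can be sketched in a few lines. In the hypothetical Python outline below, the transcribe, generate_reply, and synthesize helpers stand in for the speech-to-text model, the ChatGPT prompt, and the voice synthesizer; none of these names come from the actual bot.

```python
# Hypothetical sketch of a BuellerBot-style pipeline (not the bot's real code):
# transcribe meeting audio, watch for the student's name, draft a reply with a
# language model, and speak it back in the student's voice.

def transcribe(audio_chunk: bytes) -> str:
    """Placeholder: send an audio chunk to a speech-to-text model, return text."""
    raise NotImplementedError

def generate_reply(question: str) -> str:
    """Placeholder: prompt a large language model (e.g., ChatGPT) for an answer."""
    raise NotImplementedError

def synthesize(text: str) -> bytes:
    """Placeholder: convert text to speech in the user's cloned voice."""
    raise NotImplementedError

def run_bot(audio_stream, my_name, unmute, play):
    for chunk in audio_stream:               # audio captured from the Zoom call
        text = transcribe(chunk)             # 1. speech-to-text
        if my_name.lower() in text.lower():  # the bot "listens for your name"
            reply = generate_reply(text)     # 2. generate a response
            unmute()                         # take the floor
            play(synthesize(reply))          # 3. text-to-speech in your voice
```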

Aguilar says it was his roommate, who was born deaf and now uses a cochlear implant, who first saw BuellerBot’s potential beyond a lecture-avoidance tool. In particular, his roommate realized that the neural network behind BuellerBot’s transcription service could be a powerful tool for building speech accessibility into everyday life.

“Today, our focus is entirely on accessibility compliance,” Aguilar says. “Every university in the country [...] is required under the [Americans With Disabilities Act] to transcribe all their internal and external content at the human level,” which is a not-insignificant portion of the nearly $26 billion transcription industry in the United States.

The startup originally planned to deliver live transcriptions as in-eye subtitles in augmented-reality glasses, but Aguilar says this vision has recently shifted to focus instead on the neural network itself. Now, the startup’s main goal is to develop a software application that can be directly incorporated into academic platforms, similar to how BuellerBot originally worked.

Echo Labs isn’t yet sharing exactly how it will achieve this goal, but Aguilar says the approach is a “significant departure from all existing literature.” Aguilar adds, “We’re taking a more biological approach to how to understand conversations much more holistically than anything that’s on the market.”

Exactly how novel this technology is remains to be seen. Hasegawa-Johnson’s work, for example, has also looked at language processing holistically to interpret pronunciation variability or to disentangle confusing sentences by analyzing their stresses and rhythms. Likewise, biological inspiration is no stranger to the world of speech recognition, Hasegawa-Johnson says.

“It’s pretty clearly established that some degree of biologically inspired processing can help with the front end—separating signals from background noise and reverberation, and encoding speech-related features from the audio,” Hasegawa-Johnson says. “Biologically inspired approaches to the front-end problem have been shown to have some advantages over Fourier-transform-based front ends.”

However, while a biological approach may be beneficial, Hasegawa-Johnson says that many universities and tech giants (including Google and Meta) typically skip it, focusing instead on extracting audio features from large amounts of collected data to create “learned” front ends.

“I have never seen a direct comparison of learned front ends to any well-designed biologically inspired front-end,” he says.
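Neither side has published code for such a comparison, but the distinction between the two kinds of front end can be illustrated with a toy example. The Python sketch below, which assumes SciPy is available and substitutes a synthetic tone for real speech, computes a Fourier-based front end (a log short-time spectrogram) alongside a biologically inspired one (log energies from a gammatone filterbank, a standard model of cochlear filtering). All parameter choices are illustrative, not drawn from any system mentioned here.

```python
# Toy comparison of two speech front ends: Fourier-based vs. gammatone-based.
import numpy as np
from scipy.signal import stft, gammatone, lfilter  # gammatone needs SciPy >= 1.6

fs = 16_000                                   # sample rate, Hz
t = np.arange(fs) / fs
audio = np.sin(2 * np.pi * 440 * t)           # stand-in for one second of speech

# Fourier front end: 25-ms windows, 10-ms hop, log magnitude.
_, _, spec = stft(audio, fs=fs, nperseg=400, noverlap=240)
fourier_features = np.log(np.abs(spec) + 1e-8)

# Biologically inspired front end: log energies from a bank of gammatone
# filters with roughly cochlea-like (log-spaced) center frequencies.
center_freqs = np.geomspace(100, 6_000, num=32)
gammatone_features = []
for cf in center_freqs:
    b, a = gammatone(cf, "iir", fs=fs)        # 4th-order IIR gammatone filter
    band = lfilter(b, a, audio)
    frames = band[: len(band) // 160 * 160].reshape(-1, 160)   # 10-ms frames
    gammatone_features.append(np.log(np.mean(frames ** 2, axis=1) + 1e-8))
gammatone_features = np.array(gammatone_features)

print(fourier_features.shape, gammatone_features.shape)  # features per frame
```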

Time will tell exactly what role biological inspiration plays in Echo Labs’ AI, but it’s possible that a neuromorphic computing approach modeled on the structure of a human brain may be a part of it. In particular, Hasegawa-Johnson suggests that they might be exploring the role of spiking neurons in human brain processing.

“One fact about human processing that’s well-known but is not modeled by any widely deployed deep-learning system is that human neuronal networks communicate in spikes,” he says. “One possibility is that Echo Labs might be trying to apply spiking neural networks to automatic speech recognition—but that is pure speculation.”
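For readers unfamiliar with the term, a spiking neural network passes information as discrete spikes rather than continuous activations. The NumPy sketch below of a single leaky integrate-and-fire neuron is purely illustrative, not a description of Echo Labs’ system; it shows the core idea that a stronger input drives a higher spike rate, so information is carried by when and how often a neuron fires.

```python
# Toy leaky integrate-and-fire (LIF) neuron: the basic unit of spiking
# neural networks. Purely illustrative; not Echo Labs' architecture.
import numpy as np

def lif_spikes(input_current, dt=1e-3, tau=0.02, v_thresh=1.0, v_reset=0.0):
    """Simulate one LIF neuron and return its binary spike train."""
    v = 0.0
    spikes = np.zeros_like(input_current)
    for i, current in enumerate(input_current):
        v += dt * (-v / tau + current)   # leak toward rest, integrate input
        if v >= v_thresh:                # threshold crossed: emit a spike ...
            spikes[i] = 1.0
            v = v_reset                  # ... and reset the membrane potential
    return spikes

# Over the same one-second window, the stronger input fires far more often.
quiet = lif_spikes(np.full(1000, 60.0))
loud = lif_spikes(np.full(1000, 150.0))
print(int(quiet.sum()), int(loud.sum()))  # spike counts for each input level
```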

Nevertheless, Shyama Majumdar, the director of Transform, says that if Echo Labs can pull off its mission, it could have a big impact on the future direction of transcription technology.

“There is no one single entity dominating the transcription market, and the one that does it right will lead the way,” Majumdar says. “Echo Labs is in the right place at the right time, and I am confident with their ability to take this forward in a meaningful way.”

Echo Labs plans to make its next announcement in December, which Aguilar says will include more information about new partnerships as well as more details on the nuts and bolts of the startup’s technology.

This story was updated on 29 September 2023.
