I used to work for a major telco (Verizon), and we had a number of researchers working on various voice recognition systems. I’ve also tinkered a bit with applications using text-to-speech voice synthesis software such as the AT&T Labs Text-to-Speech software, so I’m familiar with some of the issues that are commonly associated with these types of systems.
I was curious whether Google’s clever engineers had made headway on some of the problems involved in having software recognize spoken words, and I also wondered about the quality of their voice synthesis.
I’ve seen lots of problems with such systems. Back when I worked for Verizon, many of our larger facilities had automated phone directories so that you could call a central number, state the name of the individual you were seeking, and have the system connect you automatically. For simple names and straightforward caller voices, these systems worked pretty well. But I witnessed a number of occasions when they failed in frustrating ways. For instance, one of the Russian-American technical directors I worked with retained a heavy accent from his home country, and the voice-recognition system constantly misunderstood him, despite the fact that these systems were supposedly built with heuristics that let them auto-correct and improve over time. Didn’t happen.
Further, while I have no real accent at all, my attempts to reach someone with an unusual name, such as a Chinese coworker, were equally frustrating. It didn’t matter whether I used a Chinese pronunciation or an American phonetic pronunciation: for some names the system simply wouldn’t work.
I’m also very sympathetic to disabled people, for whom automated systems often offer the greatest hope for improved quality of life, but also often produce the greatest let-downs. My father, a dynamic and clever scientist in his heyday, was eventually beaten down some by various diseases, including a stroke late in his life. The stroke took away most of the use of the left side of his body, particularly his arm and leg. After he had retired, he enjoyed using his computers a lot, but the loss of the use of one of his hands made every task far more time-consuming. His solution for this was the purchase of some voice-recognition software, but the functionality was never all that great because his speech had been slurred some by the stroke. So, he often had to repeat commands and correct stuff with his good hand. When he called into company service centers, the situation always seemed a bit torturous to me as well. His attempts to vocally navigate call trees were often error-prone, resulting in an even more frustrating process than the usual, never-ending maze of call trees.
Good programming is often about handling all the exception cases and extremes that may impact any given system. Speech and understandability are complex problems, and they’re areas where fuzzy logic and adjusting recognition template tolerances may never work perfectly for all people (unless and until we can also bring AI to bear on the problem). Even so, the criterion of “how well does it function for extreme cases” should be applied to assess the quality of the system, and one would hope that it wouldn’t be easy to find places where the system fails.
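To make the idea of a recognition tolerance a bit more concrete, here is a minimal Python sketch. It is only a stand-in: it compares text spellings with the standard library's `difflib` rather than matching audio against acoustic templates, and the function name `best_match`, the directory entries, and the misheard spelling "wadanabi" are all made up for illustration. It is not how Google's service or any telco directory actually works.

```python
from difflib import SequenceMatcher

def best_match(heard, directory, tolerance):
    """Return the directory entry most similar to `heard`,
    or None if no entry reaches `tolerance` (0.0 to 1.0)."""
    scored = [
        (SequenceMatcher(None, heard.lower(), name.lower()).ratio(), name)
        for name in directory
    ]
    score, name = max(scored)  # highest similarity wins
    return name if score >= tolerance else None

# Hypothetical directory and a hypothetical misheard spelling.
directory = ["Watanabe", "Washington", "Waters"]
heard = "wadanabi"

# A strict tolerance rejects the imperfect input outright.
print(best_match(heard, directory, tolerance=0.9))  # None
# A looser tolerance accepts it.
print(best_match(heard, directory, tolerance=0.6))  # Watanabe
```

The trade-off is the one described above: set the tolerance too tight and accented or slurred speech is rejected, set it too loose and callers get connected to the wrong entry.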
To do this quick assessment, I first performed searches in Google Maps to see what results they have in their database. Google Voice Local Search is basically a voice recognition and voice synthesis interface that has been put on top of their regular Google Maps search engine and results. So, I wanted to compare the vocal search results with the browser search results.
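I did this comparison by ear, but the same check could be written down as a small script. The Python sketch below is an illustration only: the function names are hypothetical, and the listings are placeholders rather than real results from either service.

```python
def normalize(name):
    """Lowercase and collapse whitespace so trivial formatting
    differences don't count as mismatches."""
    return " ".join(name.lower().split())

def compare_results(browser_results, voice_results):
    """For each browser result, give its 1-based position in the
    voice results, or None if the voice service never read it back."""
    voice_positions = {
        normalize(name): position
        for position, name in enumerate(voice_results, start=1)
    }
    return [
        (name, voice_positions.get(normalize(name)))
        for name in browser_results
    ]

# Placeholder listings, not real search results.
browser = ["Business A", "Business B", "Business C"]
voice = ["business a", "Business C"]

for name, position in compare_results(browser, voice):
    print(name, "->", position if position is not None else "not read back")
```

A result that shows up in the browser list but comes back as `None` here is exactly the kind of gap I was listening for on the calls.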
To phone into 1-800-GOOG-411 (ignore the promotion-hyped prestige number and just use: 1-800-466-4411), I used Skype, which is a voice-over-IP internet phone service, and I loaded an extra piece of companion software, Pamela, to record the call. I was a bit nervous about this, because Skype calls can sometimes make one sound like they’re speaking from the inside of a tin can, but my connection’s sound quality seemed to be excellent. Pamela didn’t seem to start recording until the call was already underway, so the first thing you’ll hear is me stating the location for the search.
Here are the results of my test calls into GOOG-411:
1. I called in and started out with a simple search combination. I specified the locality of Bryan, Texas. I then searched for “Plumbers”. Here’s the Google Map of Plumbers in Bryan. Here’s the MP3 of my Voice Local Search. This was a good user experience. The service understood my verbal request, returned the same search results as the browser-based Map Search, and read them back to me in really very smooth voice synthesis.
2. Next, I challenged the system with a difficult query, one with a potentially hard-to-recognize city name and a challenging business name: Watanabe Yasuo (a florist) in Waipahu, Hawaii. I know from experience that Hawaii has some of the most unusual place names in the entire US, and the proper names of Hawaiian natives and foreign immigrants make Hawaii a real cultural melting pot. Here’s the Google Map result for Watanabe Yasuo in Waipahu. Here’s the audio of my Voice Local Search for Watanabe Yasuo.
This was a bad user experience. Gratifyingly, it found Waipahu just fine; perhaps including the name of the state, Hawaii, helped in this respect. But it just couldn’t find Watanabe Yasuo. After repeating that name, I then reduced it down to just one term, Watanabe, and I tried to pronounce it phonetically, as a Midwestern American might. Still no-go. Even more frustratingly, the service then said “Sorry we must have a bad connection – just call back and we can try this again. Good-bye!” It hung up on me! The connection wasn’t the problem, and being hung up on like that only made things worse.
3. I tried calling back and still couldn’t get it to recognize “Watanabe” (MP3). I then used their option for typing in just the first name, which irritatingly and strangely didn’t locate it, either. I then searched for just “Florists in Waipahu”. This time it found Watanabe Yasuo as result number five. Amusingly, the voice synthesis pronounced the name pretty much the same way that I originally did. I can see this puzzling users, since they might assume that if it can pronounce the name correctly, it should be able to understand the name when pronounced the same way. Of course, the speech-recognition and voice-synthesis systems are separate, so this sort of scenario could occur frequently.
So, how did Google do on my limited test? Not very well overall. I can tell that they worked on the voice synthesis quite a bit: they’ve smoothed it out to pronounce names, addresses, and phone numbers really nicely. Just based upon this limited test, though, I expect that it would be very easy to force the system to fail on recognizing place names, types of businesses, or business names. I don’t have an unusual accent, so it’s disappointing that the system didn’t understand my request for “Watanabe”, even though it is an unusual name. If it fails on my voice so easily, imagine how it will work for people who have accents, slurred voices or lisps, or people who tend to speak very slowly.
This experiment could seem like a cheap shot on my part, since voice recognition is such a complex thing to accomplish. But there are always high expectations when Google deploys something, even if it is in beta release. I haven’t performed a comparison here with other services that provide automated 411 services, either, but I don’t feel that’s all that necessary, since I’ve seen a number of those services “in the wild” over time, and the GOOG-411 version isn’t remarkably better as far as I can see.
I like a lot of the new services that Google has rolled out: I’m using the Calendar daily, and I like Google Trends, Google Webmaster Tools, Google Maps, and the original Google Analytics. But I have to give the 411 service a “C” or “D” grade. Why deploy something, even in beta, if it doesn’t significantly improve upon the similar services already out there? This is a free service with no ads (so far), so perhaps the one benefit to users is that it’s a 1-800, toll-free directory assistance service. Unfortunately, the service is likely to be a disappointment for people with non-average-American voices.
Local Search is hard enough in many ways, and adding voice recognition makes it even harder to produce good-quality results. So, I am sympathetic to the limitations of current technology.
With all of Google’s great work in information retrieval, I was disappointed with this. It doesn’t stand out from the crowd, and one suspects some of the other, older players out there may be doing a better job. There’s quite a bit of research going on in audio recognition and search. Since 9/11, I’m aware of quite a lot of work going on through the CIA and other US government groups to automatically convert the audio streams of various “chatter” sources to text, which can then be searched for various suspicious word sequences. In this way, automated systems can be used to monitor input from countless wireless phone calls and landline calls, and a log of those matching suspicious word sequences can be brought to the attention of human reviewers to try to identify potential terrorist activity. With so much funding support for this sort of research, one would hope that new vocal search services would really wow us with their quality.
I’d challenge Google to user-test this with more of the exception cases and extreme cases in order to get it functioning better for those with accents and voice problems. Disabled people need services like this to work dependably more than any other user group. Google has quite a lot of employees who are foreign-born and have accents, so they already have user groups that they could use as test subjects.
If you’re reading this and you have an accent, please try using Google Voice Local Search, and then write about your experience on the Goog411 Group so that they can improve the service. They’ve also very generously provided their email address so that you can contact them to relate your experiences.
I’d also be interested in hearing about other users’ experiences with this service, so be sure to tell us about it in the comments section of this post!