Speech Recognition for Learning — Coming to a School Near You Soon?

In the Beginning
If you’re like me, you may have been wondering why airlines can take your reservations over the phone via a speech recognition engine while, at the same time, people keep saying the technology has a way to go before it will be useful. The reason is that speech technology is a broad field: some areas are maturing rapidly while others lack the sophistication we would like to see.

First, the good news — the technology for task-based speech applications in learning is here. Yes, there are still a few catches but we should begin seeing some widely deployed Web-based voice learning applications in 2004.

Now for the bad news — the road to development is not as clear or clean as you might like and there is a learning and/or investment curve.

Still, if you’re excited about the possibilities of speech applications in education or you simply believe there is good financial ROI, read on!
Competing Directions
To begin with, don’t mistake me or this article for one of those over-optimistic assessments of natural language speech recognition technologies. We’re still a ways out from that type of almost full-blown-AI speech recognition (although the military is doing some interesting work these days on translation tools).
Nor am I particularly interested in continuous speech recognition as it is used for dictation. What we’re talking about here is speech application technology that uses limited grammars, with varying degrees of accuracy required, to let humans use voice input over the phone or the Web. You may have used such technology while making a plane reservation or accessing your voice mail. A friendly voice asks you for your name or account number and then responds to that information and other voice input accurately, without another human being on the line. You get what you want the way you always have, by talking, and the company saves money by hiring fewer staff to answer phone calls.
These systems are limited, of course. Just try to ramble on about how awful the weather is when the system asks for your location. “I’m sorry ma’am. I am not familiar with that city or state. Could you please repeat…?”
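To make the idea of a limited grammar concrete, here is a minimal Python sketch. It assumes the caller's speech has already been turned into text, and the city list, function names, and prompt wording are all hypothetical, chosen only for illustration. It is not how any particular airline system or speech platform works.

```python
# Hypothetical limited grammar: the only answers the "location" prompt accepts.
CITY_GRAMMAR = {"boston", "chicago", "dallas", "denver", "seattle"}


def normalize(utterance: str) -> str:
    """Lowercase the text and drop punctuation so 'Boston.' matches 'boston'."""
    kept = "".join(ch for ch in utterance.lower() if ch.isalnum() or ch.isspace())
    return kept.strip()


def handle_location(utterance: str) -> str:
    """Accept the input only if it is in the grammar; otherwise ask again."""
    city = normalize(utterance)
    if city in CITY_GRAMMAR:
        return f"Searching flights from {city.title()}."
    # Anything outside the grammar, such as small talk, is simply rejected.
    return "I'm sorry. I am not familiar with that city or state. Could you please repeat?"


print(handle_location("Boston."))
print(handle_location("The weather is just awful today"))
```

The point of the sketch is the trade-off described above: a small grammar makes matching simple and reliable, but anything the grammar does not list gets the "please repeat" treatment.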
Competing Technologies
Until recently, most efforts involving speech recognition were associated with telephony. Notice that in the example above it was assumed that you were on the phone providing input. That’s because the prevalent technology for this type of application has been VoiceXML. VoiceXML is a markup convention that allows speech input received by a telephony system to be converted into data on a server and then matched against specialized grammars, either to select an appropriate response or to store the input. The VoiceXML Forum was begun in 1999 by AT&T, Lucent, Motorola, IBM, and seventeen other companies. It released its first standard that year, and the standard has since evolved to the newly proposed VoiceXML 2.0.
Having been around longer than other standards, VoiceXML is a tested protocol with plenty of users. It scales well and handles voice input accurately and efficiently. It is also a fairly flexible standard.
The problem with VoiceXML is that it is based on telephony technology. This means you must either host such technology yourself (on your own PBX, for example) or be willing to pay for one of the VoiceXML hosting services available. And, since these services were developed with companies like American Airlines in mind, they don’t come cheap.
VoiceXML applications also require more hardcore programming than newer technologies. Once again, this works out well for companies like SpeechWorks and Nuance that have developed robust software platforms for VoiceXML application development.
Perhaps the best thing about VoiceXML for educators is that it has been successful enough with big businesses to spur interest in other standards. Forty-three percent of North American companies have either purchased interactive voice response software for their call centers or are conducting pilot studies, according to Forrester Research. Today’s $500 million market for telephone-based speech applications will grow to $3.5 billion by 2007, according to Steve McClure, a vice president in the software research group at market analysis firm IDC.
The current alternative to telephony and VoiceXML is a new technology standard being promoted by Microsoft: SALT (Speech Application Language Tags). SALT handles speech in much the same way as VoiceXML, but it runs over a Microsoft .NET server platform and bypasses the required telephony component. Also, as a true Web-based markup language (VoiceXML is a markup language that merely bridges telephony applications to Internet Protocol), SALT has a gentler learning curve and requires much less programming talent (Web developers can work with the tools to which they are accustomed). Voice Web Studio, for example, is a product designed for Macromedia Dreamweaver™ MX that enables anyone to visually build and deploy speech-enabled Web and telephone (IVR/VRU) applications based on SALT 1.0.
The other real attraction of SALT is that it is “multimodal,” meaning it can integrate speech with other media such as text, graphics, and video. This is ideal for first-generation learning applications that might mix TTS (text to speech), voice input, graphic hotspots, and text input/output.
The two problems with SALT are that it is relatively new (the SDK is in Beta 2.0) and that it is developed by Microsoft. That, and the fact that most development is being done with large enterprise customers in mind, means that it is an IE-dependent technology at present (it requires an IE plug-in). Other vendors have developed SALT-compliant browsers, but it’s a dark world out there when it comes to reaching multiple OS platforms with SALT.
Finally, the fact that telephony equipment is not required does make SALT cheaper, but it still requires a Microsoft .NET server platform, which is not free.
The Coming Year
SALT will continue evolving over the next year and is the best bet for some pretty nice learning applications. Look for products developed for elementary students as well as language learners that combine text, audio/video, and speech. For example, students might be required to give voice instructions to move from one text scenario to the next. Or students might be presented with a series of visual options and make their selections using voice.
Both VoiceXML and SALT allow the development of accurate grammars for these applications, and both let you control how sensitive the system should be when comparing the input speech to the grammars (on a scale of 0% to 100%). This makes it easy to tune how strict an application is, depending on its purpose: a pronunciation drill might demand a close match, while a simple menu choice can be more forgiving.
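The sensitivity setting can be pictured with a short Python sketch. This is only an illustration under stated assumptions: it compares text rather than audio, it uses the standard library's `difflib.SequenceMatcher` similarity as a stand-in for a recognizer's confidence score, and the command list and function names are hypothetical. Neither VoiceXML nor SALT computes confidence this way.

```python
from difflib import SequenceMatcher

# Hypothetical grammar of commands for a simple learning application.
GRAMMAR = ["next page", "previous page", "repeat", "help"]


def best_match(heard, grammar):
    """Return (score, phrase) for the grammar phrase most similar to the input.

    The score is a 0-100 text similarity, used here as a stand-in for the
    confidence value a real speech recognizer would report.
    """
    scored = [
        (SequenceMatcher(None, heard.lower(), phrase).ratio() * 100, phrase)
        for phrase in grammar
    ]
    return max(scored)


def recognize(heard, threshold, grammar=GRAMMAR):
    """Accept the best match only if its score meets the threshold (0-100)."""
    score, phrase = best_match(heard, grammar)
    return phrase if score >= threshold else None


# The same slightly garbled input passes a lenient filter and fails a strict one.
print(recognize("nex page", threshold=90))
print(recognize("nex page", threshold=95))
```

Raising the threshold trades convenience for accuracy: fewer wrong matches get through, but more valid attempts are rejected and the learner has to repeat the input.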
These may seem like trivial applications to some, especially with the expense and learning curve involved. But these voice-input applications focus on a need that the Web has difficulty addressing — oral learning styles. Speech applications, or at least speech alternatives for existing applications and pages, allow us to add functionality to our Web sites that makes them work for everyone. That’s a promise I’ve been waiting for.
Looks like I won’t have to wait much longer.


1 Response to “Speech Recognition for Learning — Coming to a School Near You Soon?”


  1. 1 Rob Reynolds

    This is a fantastic article, Rob, with very helpful sources. Great information. It shows there is hope within the font-trap… I had the opportunity to meet one of the developers of Magic Gooddy, a Russian-English / English-Russian voice-recognition / translation software program that was, when I first saw it in Uzbekistan, pretty amazing. That was in November 1998; later, I met the developer in St. Petersburg, Russia, where he was part of a large software development firm. I’m not sure if it collapsed along with others in the dot-com collapse. One of the survivors is PROMT (http://www.e-promt.com/indexe.shtml), a legacy of Cold War research initiatives.
