Speech Synthesis & Speech Recognition
Brian Long (www.blong.com)
This article looks at support for speech in Microsoft Windows and at what is involved in incorporating speech technology into Windows applications. In particular we examine the Microsoft Speech API (SAPI) to see what it offers developers in terms of letting applications speak to users and also understand what users say. Since there is a lot of information (text and code) here, the article has been split over a number of pages. This page is an introduction to the subject whilst the other pages (each of which has enough information to be an individual article) look in detail at using SAPI 4 and SAPI 5.1 within Delphi applications.
Once upon a time the task of making your application speak or understand its user's commands was science fiction or at the very least involved lots of computing power. We will see that today's computers and speech technology enable any application to use speech technology with relative ease and good performance.
Microsoft have been researching and implementing speech technology for some years and they have an area of their Web site dedicated to the matter at http://www.microsoft.com/speech.
The speech capabilities that can be added to an application are text-to-speech synthesis (TTS) and speech recognition (SR).
Text-to-speech synthesis involves turning a string into spoken language that is played through the computer speakers. The complexities of turning words into phonemes, adding appropriate emphasis and translating the result into digital audio are beyond the scope of this paper and are catered for by a TTS engine installed on your machine.
The end result is that the computer talks to the user to save the user having to read some text on the screen.
Speech recognition involves the computer taking the user's speech and interpreting what has been said. This allows the user to control the computer (or certain aspects of it) by voice, rather than having to use the mouse and keyboard, or alternatively just to dictate the contents of a document.
The complex nature of translating the raw audio into phonemes involves a lot of signal processing and is not focused on here. These details are taken care of by an SR engine that will be installed on your machine. SR engines are often called recognisers and these days typically implement continuous speech recognition (older recognisers implemented isolated or discrete speech recognition, where pauses were required between words).
Speech recognition usually means one of two things. The application can understand and follow simple commands that it has been educated about in advance. This is known as command and control (sometimes seen abbreviated as CnC, or simply SR).
Alternatively an application can support dictation (sometimes abbreviated to DSR). Dictation is more complex as the engine has to try and identify arbitrary spoken words, and will need to decide which spelling of similar-sounding words is required. It develops context information based on the preceding and following words to try and help decide. Because this context analysis is not required with command and control recognition, CnC is sometimes referred to as context-free recognition.
Dictation speech recognition is speaker-dependent, meaning that because of different people's enunciation, accent, pitch and so on, recognisers require a speaker profile to be set up for decent results. This profile results from training sessions that educate the recogniser about the nuances of the speaker's voice.
On the other hand, command and control speech recognition is usually speaker-independent.
As the incorporation of speech technology became more realistic more vendors released TTS and SR engines. Unfortunately each engine had its own API and so interchanging them was not possible. Programming for multiple engines meant a lot of recoding and the whole situation was very similar to the database API programming problem before the advent of the BDE and ADO.
In late 1995 the Microsoft Speech API (SAPI) was introduced as part of the Windows Open Services Architecture (WOSA) services. This was intended to simplify matters and has done a good job of doing so, at least relatively speaking. Depending on what you wish to do it can involve some tricky coding, but that's the same story with the basic Windows API.
SAPI is currently (at the time of writing) at version 5.1 and now offers a single API, made up of a set of interfaces, that you program against to get TTS and/or SR in your application. However, up until version 4 an alternative API was in use. In fact there were two APIs defined, but both have now been superseded and Microsoft recommends the new API.
This means that TTS and SR engines will have to be labelled as either SAPI 4 compliant (if they use the old interfaces) or SAPI 5 compliant (if they use the new interfaces). However, given the widespread use of the older interfaces you shouldn't expect Microsoft to stop them being available any time soon.
The Microsoft Speech SDK can be obtained via Microsoft's Web site and when installed provides documentation on the APIs. Because of the complete differences between SAPI 4 and SAPI 5.1 you can install both SDKs on a single machine and take advantage of any of the available APIs.
SAPI application programmers call the interfaces defined in the API, and SAPI-compliant TTS and SR engines implement those interfaces. SAPI supports text-to-speech (TTS), speech recognition (SR), dictation speech recognition (DSR) and also telephony (TEL). We will explore TTS and also see what's involved with both types of SR in this paper.
The SAPI 4 SDK is available in two flavours from http://www.microsoft.com/speech/download/old. You can download the SDK itself (the download file is called SAPI SDK 4.exe and is around 8MB) or the SDK Suite (SAPI SDK 4 Suite.exe, around 40MB). The SDK contains the runtime binaries and documentation, but no speech engines. The SDK Suite also contains Microsoft's TTS and SR engines, as well as a couple of useful applications (Microsoft Voice and Microsoft Dictation). You would be advised to download and install the SDK Suite.
In order to get anywhere we need some Pascal representation of the various interfaces, constants and structures defined by SAPI. You can get everything you need from those helpful people in the JEDI project (http://delphi-jedi.org). A translated version of the needed SAPI files can be obtained from http://delphi-jedi.org/api/sapi.zip. This provides two Delphi import units, speech.pas and spchtel.pas, which correspond to speech.h and spchtel.h from the SDK.
Of the two, speech.pas is the key unit, as it defines all the important interfaces you will need that are not defined anywhere in type libraries.
Note: there are various issues, anomalies and bugs in SAPI 4, which are mentioned as notes in this paper where they crop up. The number of issues in the entire API was one of the reasons Microsoft decided to start from scratch with SAPI 5 (a directive from the upper echelons of the company started the SAPI project afresh with new developers at version 5).
Note: Windows 2000 has the SAPI 4 runtime binaries installed by default (in C:\WINNT\Speech) along with the SAPI 4 compliant Microsoft TTS engine (although with only the Sam voice available). Installing the SAPI 4 SDK Suite gives additional voices and also the Microsoft SR engine.
You can download the latest SAPI SDK from http://www.microsoft.com/speech/download/SDK51. There are no specific import units required to program with SAPI 5.1. Most of the key functionality is exposed through a number of rich Automation objects and the type libraries contain all the constants, types and interfaces required to implement SAPI 5.1 applications.
Note: Windows XP has the SAPI 5.1 runtime binaries installed by default (in C:\Program Files\Common Files\Microsoft Shared\Speech) along with the SAPI 5.x compliant Microsoft TTS engine (although with only the Sam voice available). The downloadable version of SAPI 5.1 is more recent than the version shipping with Windows XP.
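To give a flavour of how little code the Automation route needs, here is a minimal sketch of a console application that speaks a string. It assumes the SAPI 5.1 runtime and at least one SAPI 5 compliant voice are installed, and it uses late binding through an OleVariant rather than an imported type library, so nothing is checked at compile time. The program and procedure names are made up for this illustration.

```delphi
program SpeakHello;

{$APPTYPE CONSOLE}

uses
  ActiveX, ComObj;

// Hypothetical helper: speaks the given text with the default voice.
procedure SayText(const Text: string);
var
  Voice: OleVariant;
begin
  // Late-bound SAPI 5.1 Automation voice object
  Voice := CreateOleObject('SAPI.SpVoice');
  // A flags value of 0 requests a plain synchronous call,
  // so Speak returns once the text has been spoken
  Voice.Speak(Text, 0);
end;

begin
  // A console application has to initialise COM itself
  CoInitialize(nil);
  try
    SayText('Hello from Delphi');
  finally
    CoUninitialize;
  end;
end.
```

The voice object is kept local to the procedure so that it is released before CoUninitialize is called. In a VCL application the explicit COM initialisation is normally unnecessary.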
The older SAPI 4 interfaces are defined in two ways. There are high level interfaces, intended to make implementation easier, but which sacrifice some of the control. These are intended for quick results but can be quite effective. There are also low level interfaces, which give full control but involve more work to get going. These are intended for the serious programmer to work with.
The high level interfaces are implemented by Microsoft in COM objects to call the lower level interfaces, taking care of all the nitty-gritty. The low level interfaces themselves are implemented by the TTS and SR engines that you obtain and install.
Using the SAPI 4 high level interfaces to build speech-enabled Delphi applications is covered on the SAPI 4 High Level Interfaces page of this article.
Using the low level interfaces is covered on the SAPI 4 Low Level Interfaces page.
SAPI 5.1 consists of low level COM interfaces and rich, high level Automation interfaces. There are no Delphi translations of the COM interfaces, so we are limited to using the Automation interfaces (these were not present in the original SAPI 5.0 release but were added in the SAPI 5.1 update). Using the SAPI 5.1 Automation interfaces to build speech-enabled Delphi applications is covered on the SAPI 5.1 page.
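As a rough illustration of what working through the Automation interfaces looks like, the following sketch lists the descriptions of the installed SAPI 5 voices. It again assumes the SAPI 5.1 runtime is present and uses late binding, so the optional arguments are passed explicitly. The program and procedure names are invented for the example, and the output depends entirely on which voices are installed on the machine.

```delphi
program ListVoices;

{$APPTYPE CONSOLE}

uses
  ActiveX, ComObj;

// Hypothetical helper: writes one line per installed SAPI 5 voice.
procedure ListInstalledVoices;
var
  Voice, Voices: OleVariant;
  Count, I: Integer;
  Description: string;
begin
  Voice := CreateOleObject('SAPI.SpVoice');
  // Empty required/optional attribute strings ask for every voice
  Voices := Voice.GetVoices('', '');
  Count := Voices.Count;
  for I := 0 to Count - 1 do
  begin
    // Passing 0 asks for the description in the default locale
    Description := Voices.Item(I).GetDescription(0);
    WriteLn(Description);
  end;
end;

begin
  // A console application has to initialise COM itself
  CoInitialize(nil);
  try
    ListInstalledVoices;
  finally
    CoUninitialize;
  end;
end.
```

Late binding keeps the example free of import units, at the cost of compile-time checking. Importing the SAPI 5.1 type library instead gives typed interfaces and access to events.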
Adding various speech capabilities into a Delphi application does not take an awful lot of work, particularly if you do the background work to understand the SAPI concepts.
There is much to the Speech API that we have not looked at in these pages, but hopefully the areas covered will be enough to whet your appetite and get you exploring further on your own.
Thanks are due to Alec Bergamini of O&A Productions for help getting out of a number of holes whilst writing these articles. O&A Productions develop a set of native Delphi components that make SAPI application development much simpler - you can find more information at http://www.o2a.com.
Brian Long used to work at Borland UK, performing a number of duties including Technical Support on all the programming tools. Since leaving in 1995, Brian has spent the intervening years as a trainer, trouble-shooter and mentor focusing on the use of the C#, Delphi and C++ languages, and of the Win32 and .NET platforms. In his spare time Brian actively researches and employs strategies for the convenient identification, isolation and removal of malware.
If you need training in these areas or need solutions to problems you have with them, please get in touch or visit Brian's Web site.
Brian authored a Borland Pascal problem-solving book in 1994 and occasionally acts as a Technical Editor for Wiley (previously Sybex); he was the Technical Editor for Mastering Delphi 7 and Mastering Delphi 2005 and also contributed a chapter to Delphi for .NET Developer Guide. Brian is a regular columnist in The Delphi Magazine and has had numerous articles published in Developer's Review, Computing, Delphi Developer's Journal and EXE Magazine. He was nominated for the Spirit of Delphi 2000 award.