INTERACTIVE SYSTEM FOR TEACHING SPEECH PRONUNCIATION & READING
FIELD OF THE INVENTION
The present invention relates to speech teaching, generally, and, in particular, to a software-based system for the teaching of correct speech and reading.
BACKGROUND OF THE INVENTION
It has long been sought to provide ways of teaching correct pronunciation of a particular language, inter alia, for the purpose of learning correct pronunciation of a foreign language, or for general speech therapy. It has also been sought to provide a tool for the evaluation and guidance of one's reading of a native or foreign language.
Known art includes the following patent publications:
US Patent No. 4,636,173 which discloses a method of teaching reading by synchronizing a visual display with a soundtrack by momentarily highlighting displayed words as they are emitted by the soundtrack.
US Patent No. 5,142,657 which relates to a computerized system for providing a visual output of analyses of speech including the parameters of waveform, power, pitch and sound spectrograph, and comparing these parameters with corresponding model parameters.
US Patent No. 5,286,205, which relates to a method of teaching spoken English using mouth position characters. This method is based on a visual display of mouth positions required for different pronunciations.
US Patent No. 5,393,236 which describes a computer-based interactive speech pronunciation apparatus and method.
US Patent No. 5,487,671, which relates to a computerized system for teaching speech that evaluates accuracy of pronunciation relative to a stored database according to one or more speech parameters.
US Patent No. 5,503,560 which relates to a computerized system for speech training, in which a user is prompted in the pronunciation of keywords. The system records a first attempt at pronunciation of a keyword and compares subsequent attempts with the first attempt. An improvement in pronunciation is claimed to be correlated with a significant deviation in user's speech template. There is also provided a display which shows a required mouth shape for the sounds to be learned. A video analysis of the
user's actual mouth positions may also be provided by use of a video pick-up and analyzer.
Published PCT application no. WO 91/00582 which relates to a system which compares pronunciation of a word or sentence with a reference word or sentence, and which provides audio and video displays of the comparison.
The above patent publications are characterized by various disadvantages, as follows:
US Patent No. 4,636,173 discloses a method which is not interactive, and thus does not provide any indication to a student as to the accuracy of his pronunciation, nor does it indicate a way of achieving correct pronunciation.
US Patent No. 5,142,657 provides a computerized method which does not provide easily interpretable feedback, and does not provide an explanation of how to improve pronunciation, indicating merely which parameters of speech need to be improved. Furthermore, a display of these parameters, while it may be suitable for expert users in a language laboratory, will not be helpful to less skilled students or to children.
US Patent No. 5,286,205 relates to a teaching method which is not interactive, such that a student has to judge for himself whether or not his pronunciation is correct, there being no objective feedback thereon. Furthermore, the method teaches the use of different mouth positions, and therefore cannot be used for sounds for which the mouth position is not the only important key to correct pronunciation.
US Patent No. 5,393,236 relates to a computer-based interactive speech pronunciation method which is not self-sufficient, which requires supervision by an instructor and, moreover, does not in any way guide the user towards correct pronunciation.
US Patent No. 5,487,671 relates to a computerized system for teaching speech that evaluates accuracy of production relative to a stored database according to one or more speech parameters. It does not provide the user with an indication as to how to improve his pronunciation, nor does it point out to the user the nature of his mistakes, nor does it provide an algorithm that allows the evaluation of a given pronunciation, nor does it provide a methodology for dealing with pronunciation mistakes at various levels.
US Patent No. 5,503,560 relates to a computerized system that judges improvement in pronunciation according to a deviation in the user's own voice, but it does not directly compare the user's pronunciation to that of native speakers of the language, nor does it direct the user as to how to improve his pronunciation, nor does it point out to the user the mistakes made within the phrase or keyword.
Published PCT application no. WO 91/00582 describes a system which does not provide any indication of how to achieve correct pronunciation.
In general, known methods do not enable a student either to learn correct pronunciation of parts of speech, or to learn how to read, such as when the student is a child being taught to read in his native language, with feedback that is based on totally objective criteria and that is fully interpretable by, and thus immediately useful to, the student, without requiring interpretation or guidance by an instructor.
SUMMARY OF THE INVENTION
It is thus an aim of the present invention to provide a fully interactive, self-contained system for teaching pronunciation of language sounds. This system may also be used for teaching a person how to read in his native language. In particular, a speech recognition algorithm is provided so as to enable full interaction between a student and the system, in real time.
In particular, the software of the invention is employed in the system such that, in response to selected utterances, one or more visual stimuli, such as one or more moving images on a visual display unit, are activated in a desired manner. An utterance which is not sufficiently accurate activates the stimulus, but not in the desired manner. Instruction is provided by way of displaying the correct tongue position inside the mouth, also known as articulatory positioning.
As will be appreciated from the description hereinbelow, the system of the invention operates both at the level of the individual phoneme, and also at the level of multi-phoneme strings, such as words and phrases.
There is thus provided, in accordance with a preferred embodiment of the invention, a system for teaching speech pronunciation or reading to a student. The system includes a memory for storing a plurality of speech portions, a playback system, associated with the memory, for indicating to a student a speech portion to be practiced, and a speech portion selector for selecting a speech task to be practiced.
The system is operated via an algorithm which is associated with the memory, the playback system, and also with a sound recorder which is operative to sense and record a sound uttered by a student, and to provide the utterance for processing by use of the algorithm, in signal form. The algorithm performs a comparison of the utterance with the speech portion to be practiced, evaluates the accuracy of the utterance, and provides
output signals for operating the playback system to provide to the student, preferably in real time, an indication of the accuracy of the utterance.
Further in accordance with a preferred embodiment of the present invention, the speech portion selector includes apparatus for selecting a phoneme in a selected phoneme class, and the algorithm is also operative to determine whether or not a phoneme present in the utterance belongs to the selected phoneme class. If a phoneme in the utterance is determined to be outside the selected phoneme class, the utterance is 'rejected' as being inaccurate and, if the system is operating at the single phoneme level, the student may be instructed to try again. If the system is operating at the multi-phoneme string or word/phrase level, the student may be informed of the problematic phoneme or phonemes, referred to also herein as "subgroups," and instructed to practice them before proceeding with the more complex task.
Additionally in accordance with a preferred embodiment of the present invention, the playback system includes visual playback apparatus and audio playback apparatus, and, in response to selection of a selected speech portion by a student, the visual playback apparatus is operative to display a visual image indicating the speech portion selected, and the audio playback apparatus is operative to provide an audible indication of the speech portion selected.
Further in accordance with a preferred embodiment of the present invention, the playback system is operative to provide, preferably in real time, a dynamic visual image indicating the accuracy of the utterance.
Additionally in accordance with a preferred embodiment of the present invention, the playback system includes apparatus for displaying a movable visual image which is movable between first and second locations on the display, wherein the first location is a start location at which the visual image is located prior to sensing of a sound by the sound recorder, and wherein the second location is a target location, towards which the playback system is operative to move the movable visual image in real time as an indication of the accuracy of the utterance.
Further in accordance with a preferred embodiment of the present invention, the distance between the movable visual image and the target location is inversely proportional to the accuracy of the utterance.
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention will be more fully understood and appreciated from the following detailed description, taken in conjunction with the drawings, in which:
Fig. 1 is a block diagram representation of an interactive speech pronunciation teaching system, constructed and operative in accordance with the present invention;
Fig. 2 is a schematic representation of the sequence of events employed in the present invention to teach correct speech and pronunciation;
Fig. 3A is a diagrammatic representation of a visual display "prompt" screen provided by the system to a student, in response to selection by the student of a particular word or phoneme;
Fig. 3B is a diagrammatic representation of a prompt screen provided by the system, in response to an incorrect pronunciation of a subgroup within a multi-phoneme string, visually emphasizing the incorrectly pronounced subgroup;
Fig. 3C is a diagrammatic representation of a real time, visual feedback prompt screen; and
Fig. 4 is a flow chart of the methodology employed by the present invention to analyze the speech of a student and to provide visual feedback of the student's performance.
DETAILED DESCRIPTION OF THE INVENTION
The present invention seeks to provide a computerized system for the teaching of correct speech, particularly by use of speech recognition algorithms. The object of the system is the interactive teaching of correct speech in a language, whether foreign or native, as well as the provision of speech therapy. The system may also be used to teach reading, particularly the teaching of a child to read in his native language. As will be appreciated from the following description, the system utilizes a variety of techniques in order to achieve this task, including: speech recognition; speech evaluation; speech error detection; accent recognition; real time visual feedback; audio feedback; and articulatory guidance.
The system may be adapted for use with a variety of computer systems, and may, in a preferred embodiment of the present invention, be fully self-contained within, for example, a suitable multimedia personal computer. In accordance with other embodiments of the invention, the present teaching system may be adapted for use via any suitable multimedia-enabled computerized platform which may or may not be constructed specifically for the system of the invention, or the system may be based on a computer network, such as the Internet, intranets and the like.
Referring now to Fig. 1, the system based on a preferred embodiment of the present invention includes a computer 10 equipped with a sound card, a visual display
unit (VDU) 12, typically, a high-speed color monitor, a manual data input unit 14, which may be a keyboard and/or a pointing device, such as a mouse, glide pad, or a touch-sensitive screen forming part of VDU 12, a microphone 16, and a speaker 18. The hardware components shown and described herein may be totally conventional, and thus, no further specific description thereof is necessary.
Referring now to Fig. 2, there is shown, a schematic representation of the sequence of events in a typical speech and pronunciation teaching session with a student, according to a preferred embodiment of the present invention. Examples of "prompt" screens displayed to the student in such a session are shown in Figs. 3A through 3C.
In a system implemented according to a preferred embodiment of the present invention, a typical speech and pronunciation teaching session includes the following sequence of events, shown schematically in Fig. 2:
First (block 20): the student selects a "level" mode, typically including either a lesson containing plural phonemes, such as a word or phrase (block 22), or subgroups of such, or a single phoneme (block 24).
It will be understood that a 'lesson' is a collection of production tasks having a common denominator. For example, in a production lesson for the phoneme /l/, the lesson plan would comprise words and phrases containing /l/ in various positions.
Second (block 26): in the event that the word or phrase level has been selected, one or more words or sentences containing the lesson subject, exemplified as the word "SHELL," shown on prompt screen 60 in Fig. 3A, are presented to the student. Preferably, the word is also sounded by the system, so that the student hears the correct pronunciation, which he is then to repeat.
In accordance with an alternative embodiment of the invention, the system may also be taught how to pronounce specific words or sounds, for example, so as to adapt it to a particular regional accent. In this case, however, various default accents are retained in the system's memory so as to enable the system to be reset, if required.
Third (block 28): the student repeats the lesson word or words into the microphone 16 (Fig. 1).
Fourth (block 30): the student's speech is analyzed and evaluated for errors in pronunciation; errors are indicated by a display prompt, as seen on prompt screen 62 in Fig. 3B, in which the subgroup, in this case the phoneme /l/, is indicated as having been mispronounced.
In the event that the student pronunciation was successful, such that the objective has been completed (block 32) (by a correct pronunciation of the selected lesson subject), he may then be returned either to the level select mode (block 20), or to the lesson select mode (block 22).
In the event that the student's pronunciation was not successful, it is determined whether his mispronunciation is "segmental," namely, relating to the phrase/word/phoneme levels, or "supra-segmental." If the problem is supra-segmental, i.e. it relates to stress or intonation, he is referred to a system dealing with that particular problem; such systems are known in the art, are beyond the scope of the invention, and are thus not dealt with herein. If the problem is determined to be segmental, then the student is transferred to the phoneme visual feedback and articulation mode (block 34), where he has the option of studying the inaccurately pronounced subgroup not only by imitation, but also by being shown the correct articulation, or required tongue positioning. In this mode, the system points out to the user the nature and location of his mistake.
Accordingly, at this stage, the system provides the student with the option of replaying his own audio recording, while at the same time providing a visual display of the subject phoneme (block 36). The system may also replay a model audio recording, for purposes of comparison.
The student then attempts to repeat the subgroup (block 38), which the system analyzes and evaluates (block 40). If the student is unable to improve his performance, the system enters visual feedback mode, indicated as "(audio)visual guidance and display" in block 40, which is shown and described herein in conjunction with Fig. 3C below.
Once the student has improved performance in this mode, the system may return either to the word/phrase display level (block 26), or to the phoneme select level (block 24), if he decides that single phonemes should be practiced and acquired prior to proceeding to word or phrase (subgroup) lessons. Otherwise, he is returned to a level whereat he practices the phoneme with which he is having trouble.
By way of example, consider a case wherein the subject of the lesson is correct pronunciation of the /l/ sound. The student is shown an animation of the word "SHELL" 60 on the visual display unit 12 (Fig. 1), as well as an appropriate icon (not shown) representing a shell. The system then plays back a model recording of "SHELL", and prompts the student to repeat it into the microphone 16. The student is provided with both visual and audio prompts.
When the student repeats the word and, for example, mispronounces /l/, the system points out the error to the student, as seen at 64 in Fig. 3B. This may be done by some form of animation (not shown), as well as by an audio indication such as "you have mispronounced the /l/ in 'shell.'"
The student can choose either to try to pronounce the word again correctly, or to receive further guidance in the form of real time visual feedback and articulation.
Real time visual feedback is based on the student's control of a targeting device appearing on a prompt screen 66, seen in Fig. 3C. By use of a speech recognition algorithm, as described below, the system extracts predetermined relevant speech parameters from the student's rendition of a test phrase, word or phoneme, and transforms the student's performance into a distance from the appropriate target. In the example shown in Fig. 3C, the student is shown a prompt screen with /l/ as a target, the other target being the phoneme /r/. Different targets may also be provided, a 'default' target being the relevant 'mistake.' In other words, if a common mistake made when pronouncing /l/ is the phoneme /r/, then the default target, as seen in the drawings, will be /r/.
There exists, however, the option, particularly when the system of the invention is used within a supervised setting, of a supervisor (clinician or teacher) adding, changing or removing 'target' points of reference.
A targeting device, referenced 68, is also shown, being exemplified by a circle, which is initially positioned at a 'zero' position, over a pair of cross hairs 70 and 72.
Each time the student repeats the phoneme /l/, the targeting device 68 moves closer to or further from the target phoneme /l/, wherein the displayed distance between the targeting device 68 and the target phoneme is inversely related to the accuracy, i.e. directly related to the perceived acoustic "distance," of the pronounced phoneme. If the student pronounces the phoneme correctly, the targeting device 68 is moved into coincidence with the target phoneme. An animation or other entertaining event may also be shown by way of reward.
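By way of a non-limiting illustration, the mapping from an evaluated accuracy score to the on-screen position of the targeting device may be sketched as follows. The function name and the linear interpolation scheme are illustrative assumptions only; the invention does not prescribe a particular transform.

```python
def targeting_device_position(start, target, accuracy):
    """Map an accuracy score in [0, 1] to a screen position.

    accuracy = 0.0 leaves the targeting device at its start
    ('zero') position over the cross hairs; accuracy = 1.0 moves
    it into coincidence with the target phoneme, so the remaining
    on-screen distance to the target shrinks as the evaluated
    accuracy of the utterance grows.
    """
    a = max(0.0, min(1.0, accuracy))  # clamp to the unit interval
    x = start[0] + a * (target[0] - start[0])
    y = start[1] + a * (target[1] - start[1])
    return (x, y)
```

A correct pronunciation (accuracy 1.0) thus places the device exactly on the target, in the manner described above for the targeting device 68.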
In accordance with the present invention, a "correct" pronunciation is one whose extracted speech parameters are substantially the same as those of a database of recordings of that single sound, word or phrase (which may also be referred to as a "multiple phoneme string"), adjudged to be well pronounced by a group of experts, such as speech therapists or professional teachers of the language.
In accordance with an alternative embodiment of the invention, however, the system may be 'taught' or adjusted online so as to adopt a new definition of 'correct.' For
example, if a particular pronunciation is perceived to be good, even though the system judged it as "bad", the system may be adjusted so as to accept that particular sound as valid or correct, either for a particular user, or, in general, for a group of users.
Additional options of articulatory guidance available to the student are graphical and acoustic demonstrations. A further option allows the student to receive visual feedback in the form of spectrum and spectrographic real time display, with acoustic targets superimposed, (not shown).
In a preferred embodiment of the present invention, the student is guided towards correct pronunciation by real time visual feedback. The analysis required to convert the student's rendition of the test phoneme or word into such feedback is shown schematically in Fig. 4 and includes the following steps, all of which are performed in real time, by use of appropriate algorithms, as described below.
It will be appreciated by persons skilled in the art that the system of the present invention is operative to enable the provision of feedback to a user, in real time, due to the use of novel speech recognition algorithms, as described below in conjunction with Fig. 4.
While the speech algorithms and portions of the technique or techniques described below are known in various different fields, the use of speech recognition software in order to provide real time, objective speech pronunciation instruction, at the phoneme/word/phrase levels, such as in the present invention, is not known, per se, nor is it believed to have been considered in the art.
In particular, the following techniques are provided in the invention, and are described in detail hereinbelow, namely:
1. Primary Filtration: extraction of features enabling initial exclusion of fundamentally incorrect sounds.
2. Statistical Analysis: filtering procedures for enhancing the relevant parameters, exemplified herein as cepstral parameters, and for reducing the weight of those which are not relevant.
3. Secondary Filtration: The use of "clustering pronunciation filters," for filtering out mispronunciations.
4. Continuous Classification Network: Determining location of sounds in relevant phonetic space.
Primary Filtration
As a prerequisite to performing the analysis of the invention, a database of "correctly pronounced" phonemes and words is collected. As described above, this may be changed for the needs of a particular user.
Accordingly, when detecting a sound, the speech parameters or features thereof are extracted (block 44) by the system by using "cepstral" techniques, as described in the book entitled "Discrete Time Processing of Speech Signals," by John R. Deller, John G. Proakis, & John H. Hansen, published by Macmillan Publishing Co., NY, 1993. This includes calculation of the cepstrum, 1st and 2nd cepstral derivatives, determination of pitch, energy and zero crossings (i.e. the number of times in a given time period that the speech signal crosses a zero level so as to switch between positive and negative values and vice versa). These data are used so as to enable a primary filtration of fundamentally mispronounced sounds, i.e. those which are adjudged to be out of bounds of the defined task.
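By way of example only, the per-frame quantities named above may be computed as sketched below. This is a minimal illustration (assuming the numpy library) of energy, zero crossings and the real cepstrum; a full front end would add pitch determination and the 1st and 2nd cepstral derivatives, computed across successive frames, and the frame length and coefficient count shown are illustrative assumptions.

```python
import numpy as np

def frame_features(frame, n_ceps=12):
    """Return (cepstral coefficients, energy, zero-crossing count)
    for one windowed frame of the speech signal."""
    frame = np.asarray(frame, dtype=float)
    # Short-time energy of the frame.
    energy = float(np.sum(frame ** 2))
    # Zero crossings: sign changes between consecutive samples.
    signs = np.sign(frame)
    signs[signs == 0] = 1.0
    zero_crossings = int(np.sum(signs[:-1] != signs[1:]))
    # Real cepstrum: inverse FFT of the log magnitude spectrum.
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    cepstrum = np.fft.irfft(np.log(spectrum + 1e-10))  # avoid log(0)
    return cepstrum[:n_ceps], energy, zero_crossings
```

Features falling outside predetermined bounds for the selected task would then be rejected at the primary filtration stage.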
If the detected sounds are not rejected based on the above primary filtration, they are then subjected to a statistical analysis (block 46), prior to being passed to a secondary pronunciation filter (block 48).
Statistical Analysis
The statistical analysis includes two main steps and is used to determine the number of, and the nature of, the most relevant parameters, and to reduce the dimensionality of the system of parameters.
The first step is in applying previously determined statistical weighting functions, thereby to enhance those parameters most relevant to the particular task at hand. These parameters are those which are statistically predetermined to have greater relevance to the task at hand.
Subsequently, in a second step, all of the above parameters, regardless of weighting, are analyzed by use of Principal Component Analysis, also known in the art as the Karhunen-Loeve Transform. This analysis provides a new set of parameters, each being a linear combination of the previous, weighted parameters, such that the ranking of the new parameters is a function of their variability, and thus also of their task relevance. After obtaining the new set of ranked, weighted parameters, the parameter set can be truncated so as to reduce its dimensionality, while retaining a number of parameters which has been predetermined to be statistically representative of the task data.
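The Karhunen-Loeve step described above may be sketched, by way of example only, as follows. The statistical weighting of the first step is assumed to have been applied to the input already, and the numpy library is assumed available; the function name is illustrative.

```python
import numpy as np

def kl_transform(params, keep):
    """Rank-and-truncate a set of weighted parameter vectors.

    `params` is an (observations x parameters) matrix.  Each output
    column is a linear combination of the input parameters, ranked
    by variability, and the set is truncated to `keep` dimensions.
    """
    centered = params - params.mean(axis=0)
    cov = np.cov(centered, rowvar=False)       # parameter covariance
    eigvals, eigvecs = np.linalg.eigh(cov)     # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # rank by variance
    basis = eigvecs[:, order[:keep]]           # truncate the set
    return centered @ basis
```

The first output parameter thus carries the greatest variability, the second the next greatest, and so on, permitting the truncation described above.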
Secondary Filtration
As known in the art, phonemes can be grouped into major classes which share specific features. The secondary filtration stage includes the performance of a geometric cluster analysis, in order to filter out utterances of individual phonemes that fall outside the class to which the particular acoustic production task relates. For example, if the task were the correct utterance of various sounds in a particular fricative class, such as /s/, /f/, and so on, any mispronounced sounds which, by definition, could not be placed in the same phoneme class as these aforementioned fricatives, such as a "lateral" /s/, or a /z/, would be filtered out or rejected.
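A simple spherical centroid test, given by way of example only, illustrates the kind of geometric cluster filtering described above; the actual clustering pronunciation filters are not limited to this form, and a practical system might use, e.g., a Mahalanobis distance or several clusters per class.

```python
import numpy as np

def passes_class_filter(features, class_samples, n_std=3.0):
    """Crude geometric cluster test for the secondary filtration.

    Accepts an utterance's feature vector only if it lies within
    the cluster of known-good samples for the target phoneme
    class; a sound outside the class (e.g. a /z/ offered for a
    fricative task such as /s/) falls far from the centroid and
    is rejected.
    """
    centroid = class_samples.mean(axis=0)
    dists = np.linalg.norm(class_samples - centroid, axis=1)
    radius = dists.mean() + n_std * dists.std()
    return float(np.linalg.norm(features - centroid)) <= radius
```

Only utterances passing this filter proceed to the continuous classification network.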
Continuous Classification Network
Subsequently, a non-linear, continuous classification network (block 50), based on such methods as neural networks, radial basis function (RBF) sets, or others, is trained using the extracted parameters so that its output will continuously span the relevant "phonetic space." The continuous classification network is employed in order to determine where exactly the detected sound resides within the relevant phonetic space. Referring to the last example, the detected sound, having passed the cluster analysis based filter, may now be detected to reside anywhere between the /s/ and /f/ sounds. The targeting device will then be positioned accordingly.
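The manner in which a continuous output can span the space between two target phonemes may be illustrated, by way of example only, with a pair of Gaussian radial basis units. This is a deliberately minimal sketch; a trained RBF network would employ many units with learned centers, widths and output weights, and the function name is illustrative.

```python
import numpy as np

def phonetic_coordinate(features, center_a, center_b, width=1.0):
    """Place a detected sound on a continuous axis between two
    target phonemes using two Gaussian radial basis units.

    Returns a value in (0, 1): near 0.0 the sound coincides with
    target A, near 1.0 with target B, and intermediate values
    place it between the two in the phonetic space.
    """
    d2_a = float(np.sum((features - center_a) ** 2))
    d2_b = float(np.sum((features - center_b) ** 2))
    act_a = np.exp(-d2_a / (2.0 * width ** 2))   # unit centered on A
    act_b = np.exp(-d2_b / (2.0 * width ** 2))   # unit centered on B
    return float(act_b / (act_a + act_b))        # normalized activation
```

Such a coordinate may then be mapped onto the display, positioning the targeting device between the two targets accordingly.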
If the system is being operated in the word/phrase level mode, such that the sounds spoken by the student, and being analyzed by the system, are a multiple phoneme string, containing a number of subgroups, then video and aural indications are provided to the user, indicating the quality or correctness of his pronunciation.
If, however, the system is being operated in the phonemic level mode, a further non-linear transform is used to project the neural network output onto the visual space of the display, so as to provide real time visual feedback, as described above in conjunction with Fig. 3C.
It will be appreciated by persons skilled in the art that the scope of the present invention is not limited by what has been shown and described above, merely by way of example. The scope of the invention is limited, rather, solely by the claims, which follow.