I have finished my speech recognition program in C++, and now I am building the dictionary. I have a problem with this and I need some help. Training is very tiring: I record each word one by one and train on it, so I would like to automate the training. In other words, I want to extract phonemes from an audio file. Has anyone done this before? Please point me in the right direction. (I want to implement it myself, without using any other software.) I don't think I can work directly on the values read by the sndfile.h library...
I mean I want to extract every word from an audio file. That is, I want to filter out the noise and split my audio file into several audio files, each containing only one word from the sentence I spoke.
P.S. Sorry for my poor English :(
Word or phoneme?
For words, you could simply detect silence.
For phonemes, I've used the second LSP coefficient to approximate f0. So detect the vowel and take a little time before it starts.
(I had a consonant-vowel scheme; it was having issues with `m' and `v'.)
I did this exact exercise a couple of years ago. I wish I still had the source that I wrote. Here's what I started with:
1. A script containing a list of words. Each word was on its own line.
2. A WAVE file in which my wife read each word (and separated each one with some space)
3. A squelch level that would define the "noise" level. The noise level varied from day-to-day depending on the traffic outside.
4. An estimate of the minimum time between words.
Then the steps were:
1. Read in each word from the script and store it in a queue.
2. Load the wave file into memory.
3. Create a new wave file and use the first name in the queue as its name before the .wav.
4. Start going through the wave file.
5. When the sound level exceeds the "squelch" limit, mark the start of the word (you may want to mark the start of the word a few samples earlier).
6. When the sound level is below the "squelch" limit for the pre-defined minimum time between words, mark the end of the word at the current time minus the time between words.
7. Copy all of the samples between your two marks into your new wave file and save it.
8. Repeat at step 3.
If there were 50 words in your original script, you will (hopefully) have 50 .wav files in your output directory. You may need to play with the squelch and the delay between words to get a good match. You also need to make sure that no word gets cut off, so you may need to play with offsets on the start and end of each word.
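Since I no longer have the original source, here is a rough sketch of how steps 4 to 6 could look. It is not the code I wrote back then. The names `Segment`, `find_words`, `min_gap` and `pre_roll` are made up for this example, and it assumes the recording is already in memory as mono float samples. The squelch test here is on the absolute value of each sample, which is only one way to define "sound level".

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One detected word: [start, end) sample indices into the recording.
struct Segment {
    std::size_t start;
    std::size_t end;
};

// Hypothetical helper: scan mono samples and return one Segment per word.
//   squelch  : absolute amplitude above which a sample counts as sound
//   min_gap  : number of quiet samples in a row that ends a word
//   pre_roll : samples to keep before the first loud sample of a word
std::vector<Segment> find_words(const std::vector<float>& samples,
                                float squelch,
                                std::size_t min_gap,
                                std::size_t pre_roll)
{
    assert(min_gap > 0);
    std::vector<Segment> words;
    bool in_word = false;
    std::size_t start = 0;
    std::size_t last_loud = 0;

    for (std::size_t i = 0; i < samples.size(); ++i) {
        const bool loud = std::fabs(samples[i]) > squelch;
        if (loud) {
            if (!in_word) {
                // Step 5: mark the start a little early, but never
                // before the end of the previous word.
                const std::size_t wanted = (i > pre_roll) ? i - pre_roll : 0;
                const std::size_t floor = words.empty() ? 0 : words.back().end;
                start = std::max(wanted, floor);
                in_word = true;
            }
            last_loud = i;
        } else if (in_word && i - last_loud >= min_gap) {
            // Step 6: quiet for long enough, so the word ended
            // right after the last loud sample.
            words.push_back({start, last_loud + 1});
            in_word = false;
        }
    }
    // The recording may stop before the final gap has elapsed.
    if (in_word) {
        words.push_back({start, last_loud + 1});
    }
    return words;
}
```

Both `min_gap` and `pre_roll` are in samples, so a gap in seconds has to be multiplied by the sample rate first. A slow starter like the "S" in "Six" would probably need a larger `pre_roll` or a lower squelch.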
At one point I found it was useful to make all of the bookmarks first, check if the number of words found in the script matched the number of words found in the wave, and THEN do the copying.
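A minimal sketch of that check, assuming the bookmarks are stored as pairs of sample indices. `Mark`, `Clip` and `plan_clips` are hypothetical names for this example. It only pairs each script word with its bookmark; the actual copying into .wav files would come afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// A bookmark: [start, end) sample indices of one word in the recording.
using Mark = std::pair<std::size_t, std::size_t>;

// One planned output file: its name and the sample range to copy into it.
struct Clip {
    std::string filename;
    Mark range;
};

// Hypothetical helper: pair script words with bookmarks, in order.
// Returns an empty plan when the counts differ, so nothing gets
// copied until the squelch and gap settings have been fixed.
std::vector<Clip> plan_clips(const std::vector<std::string>& script_words,
                             const std::vector<Mark>& marks)
{
    std::vector<Clip> plan;
    if (script_words.size() != marks.size()) {
        return plan;  // mismatch: retune and detect again
    }
    for (std::size_t i = 0; i < marks.size(); ++i) {
        assert(marks[i].first <= marks[i].second);  // a word cannot end before it starts
        plan.push_back({script_words[i] + ".wav", marks[i]});
    }
    return plan;
}
```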
The best test I had for doing this was to count from one to ten. A number like "Ten" will have a very good start and end detection, however "Six" has a very slow start and so the "S" often got cut off, hence you should check this. I think "Four" was a very quiet one, so sometimes it didn't get recognized at all if the squelch was too high.
@ne555: But right now I don't have a digital microphone (I used the laptop's built-in mic), so it's hard to estimate the silence level.
@Stewbond: How do you determine the "squelch" limit? I read the wave file with sndfile.h, then wrote all the values to a text file to find the squelch limit, but I couldn't see anything :(. I have no idea what I am doing :(. Can you explain it to me?
Audacity is a great tool you can download to analyze waveforms. Record your wave file, then open it with Audacity. This will show you what your voice looks like as a waveform and give you an idea of the noise level of your setup. When you are not talking, the noise level should not really change.
I think visualizing the waveform will really help you.
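If you also want a number out of your own program, one option is to look at the level of short frames rather than at individual samples. Raw sample values swing back and forth around zero, which is probably why a text dump of them is hard to read. Below is a rough sketch that computes the RMS of consecutive frames. `frame_rms` is a made-up helper, and it assumes mono float samples already in memory (for example as read with libsndfile's `sf_readf_float`, which typically gives values in the range -1 to 1).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: RMS level of each consecutive, non-overlapping
// frame of frame_len samples. A trailing partial frame is ignored.
std::vector<double> frame_rms(const std::vector<float>& samples,
                              std::size_t frame_len)
{
    assert(frame_len > 0);
    std::vector<double> levels;
    for (std::size_t pos = 0; pos + frame_len <= samples.size(); pos += frame_len) {
        double sum_sq = 0.0;
        for (std::size_t j = 0; j < frame_len; ++j) {
            const double s = samples[pos + j];
            sum_sq += s * s;
        }
        levels.push_back(std::sqrt(sum_sq / static_cast<double>(frame_len)));
    }
    return levels;
}
```

Print one value per frame (something like 10 to 20 ms worth of samples per frame is a reasonable starting guess). The frames where you are not talking should sit at a fairly steady low value, and a squelch somewhat above that value is a place to start tuning from.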