Figure out how to make the computer talk
Text-to-Speech (TTS) is a well-established field, so finding libraries that do just this was easy. I’m going with Microsoft’s Speech Platform, the same platform used by Windows as well as Kinect. It’s easy to implement and doesn’t sound too much like a computer. NOTE: I am using the Server API for speech synthesis; the desktop API’s voices aren’t that great.
Links: Microsoft Speech Platform
Figure out how to make the computer listen
Speech recognition, also called speech-to-text, dictation, etc. This is a bit harder to get working correctly. Here is the challenge: someone could say anything at any moment, and the computer needs to recognise it and convert the speech to text. I’m also going with the Microsoft Speech Platform for this, mainly because of the work Microsoft has done on it. It will most likely make it into Windows soon.
Humans always have some context to understand speech in: because of all our senses, we always know who is talking to us, and whether someone is indeed talking to us at all. For a computer, 5 people in a room talking to each other is effectively 5 people talking to the computer at the same time. To fix this, I’m adding a keyword to initialise free-speech (dictation) mode, the way Tony Stark does it: “Jarvis, something something something”. Jarvis is his keyword (which he doesn’t always use); mine will be the name of the system, which is user-selected (and changeable).
At this point, doing about 30 minutes of speech recognition training on your Windows computer will be needed. A small trade for better results – until the new speech recognition engines become available. – Scrap that. After some experiments I found that dictation is not yet accurate enough, and it requires higher-quality (headset-grade) audio, which won’t be possible here. I’m switching to the Speech Server/Kinect API, which works with microphones in a room or even ‘phone call’ quality audio. That should do. It also makes my life much more complicated, because it doesn’t truly support free speech or dictation: it only matches against words and phrases you tell it to…
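To illustrate the keyword idea (this is a plain-Python sketch, not the actual Speech API code), here’s how recognised text could be gated behind the system’s name. The name “jarvis” is just a stand-in; in Limitless it would be the user-selected, changeable name:

```python
# Sketch: only act on speech that starts with the system's keyword.
# SYSTEM_NAME is a hypothetical default; user-selected in practice.
SYSTEM_NAME = "jarvis"

def extract_command(utterance):
    """Return the command portion if the utterance is addressed to the
    system (starts with the keyword), otherwise None (ignore it)."""
    words = utterance.strip().lower().split()
    if words and words[0].rstrip(",") == SYSTEM_NAME:
        return " ".join(words[1:])
    return None  # background chatter; not addressed to the system
```

So `extract_command("Jarvis, turn on the lights")` yields the command, while random room conversation yields `None` and is dropped before any further processing.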
Figure out the intent of what a person said
I want this system to work on free speech, or at least seem that way to the user. No predefined keywords or word order will be required, which will make it much more natural. The Speech API allows me to build a rule graph that should be able to handle just that.
To be clearer:
- “I would like a coffee”
- “I’d like coffee”
- “Get me a coffee”
- “Coffee please”
Each phrase is different, but the intent is the same: the user wants coffee. It makes no difference which phrase in the rule is actually spoken; if the spoken phrase is defined within that rule, the rule is considered successfully matched by Limitless.
Once the phrases have been matched to words and converted to text, I’ll need something more powerful to actually figure out intent. Some research I’m looking into now comes from Stanford University:
The Stanford Parser: A statistical parser
Stanford Named Entity Recognizer (NER)
Figuring out intent is probably the second most important part of Limitless. That’s where I am now.
Figure out who is talking (or authentication)
At first I thought each user of the system would simply have a different name to call the system, but that doesn’t really cover anything that requires some form of authentication, and the system still doesn’t truly know who is talking.
That led me to voice print analysis and all kinds of crazy maths and research. I found a library that claims it can do this, but I need it to happen within 0.2 seconds… anything longer will start delaying the response. I could overcome this with threads and some other tricks, but that seems like too much of a workaround.
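For what that threading workaround would look like: a sketch that kicks off the slow voice-print analysis in a background thread so the spoken reply isn’t delayed, and only blocks when the identity is actually needed. `identify_speaker` here is a dummy placeholder standing in for the real analysis (e.g. VoiceID); the class and its API are my own invention for this sketch:

```python
import threading

def identify_speaker(audio):
    # Placeholder for the slow voice-print analysis (e.g. VoiceID).
    # Returns a dummy label so the sketch is runnable.
    return "alice"

class SpeakerIdTask:
    """Run speaker identification in the background so the response
    pipeline isn't held up waiting for it."""
    def __init__(self, audio):
        self.result = None
        self._thread = threading.Thread(target=self._run, args=(audio,))
        self._thread.start()

    def _run(self, audio):
        self.result = identify_speaker(audio)

    def who(self, timeout=None):
        """Block (up to timeout) until identification finishes."""
        self._thread.join(timeout)
        return self.result
```

The system would start a `SpeakerIdTask` as soon as audio arrives, answer the user immediately, and call `who()` only at the point where an authenticated identity is required.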
But I need to know who is talking, because this system will be indistinguishable from magic.
VoiceID Python Library
Here is what a spectrogram, or voice print, looks like for a male saying ’19th Century’ (from Wikipedia)