Virtual House Screenshot

In the last two weeks I’ve put in a lot of time to get the core of the platform together. I still have a very long way to go, but as it stands right now, the following modules are working:

  1. Speech Recognition
  2. Speech Synthesis (Text-To-Speech)
  3. Intent Recognition
  4. Natural Language Processing
  5. Plugin Infrastructure
  6. Data Storage API
  7. Multi-user support

Next I needed to test it out in the real world, mostly to check stability and whether it can handle arbitrary input. What better way than to just throw it on the web in a live demo?

The Virtual House Demo

I decided to have people control a ‘virtual house’ with their voices (or text) using Chrome’s built-in Speech Recognition API. I wasn’t testing speech-to-text, so this worked fine. To get it all working I started the platform on an Ubuntu server, built a plugin to handle the state of every user’s house and connected the browser and the platform using Node.js and It works nicely, especially when you speak clearly.
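To give an idea of what "a plugin to handle the state of every user's house" could look like, here is a minimal Python sketch. It is not the actual plugin code: the `House` and `HousePlugin` names, the device list and the method names are all made up for illustration. The only idea taken from the demo is that each user gets their own independent house state.

```python
from dataclasses import dataclass, field


def default_devices():
    # Hypothetical device list; every device starts switched off.
    return {"lights": False, "tv": False, "heater": False}


@dataclass
class House:
    """State of one user's virtual house: device name -> on/off."""
    devices: dict = field(default_factory=default_devices)


class HousePlugin:
    """Hypothetical plugin that keeps a separate House per user id."""

    def __init__(self):
        self._houses = {}

    def house_for(self, user_id):
        # Create the house lazily the first time a user shows up.
        if user_id not in self._houses:
            self._houses[user_id] = House()
        return self._houses[user_id]

    def set_device(self, user_id, device, on):
        house = self.house_for(user_id)
        if device not in house.devices:
            raise KeyError(f"unknown device: {device}")
        house.devices[device] = bool(on)
        # Return a copy so callers cannot modify the stored state by accident.
        return dict(house.devices)
```

With something like this, `plugin.set_device("alice", "lights", True)` only changes Alice's house, and any other visitor still starts with everything switched off.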

There is still a lot to come; this demo is simply to show you where I’m heading with this project. Check out the demo video below of me using it, then head on over to and give it a try yourself!

Figure out how to make the computer talk

Text-to-Speech (TTS) is a pretty well-known field, so finding some libraries to do just this was easy. I’m going with Microsoft’s Speech Platform, the same platform used by Windows as well as Kinect. It is easy to implement and doesn’t sound too much like a computer. NOTE: I am using the Server API for Speech Synthesis, because the desktop API’s voices aren’t that great.

Links: Microsoft Speech Platform

Figure out how to make the computer listen

Speech recognition, also called speech-to-text or dictation, is a bit harder to get working correctly. Here is the challenge: someone could say anything at any moment, and the computer needs to recognise it and turn the speech into text. I’m also going with the Microsoft Speech Platform for this, mainly because of the work done by Microsoft. This will most likely make it into Windows soon.

Humans always have some context to understand speech in. In other words, because of all our senses, we always know who is talking to us, and whether someone is indeed talking to us. For a computer, 5 people in a room talking to each other are actually 5 people talking to the computer at the same time. To fix this, I’m adding a keyword to initialize free-speech (dictation) mode. As Tony Stark does it: “Jarvis, something something something”. Jarvis is his keyword (which he doesn’t always use); mine will be the name of the system, which is user selected (and changeable).
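The keyword idea can be sketched in a few lines of Python. This is only an illustration that works on text that has already been recognised, not on audio, and `extract_command` is a made-up helper name, not part of the Speech Platform. It assumes the system name is a plain word like "Jarvis".

```python
import re


def extract_command(transcript, system_name):
    """Return the text that follows the keyword, or None if the
    utterance does not start with the system's name."""
    # The name must be the first word; \b stops "Jarvisson" from matching.
    pattern = r"^\s*" + re.escape(system_name) + r"\b[\s,:]*(.*)$"
    match = re.match(pattern, transcript, flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        return None  # nobody addressed the system, so ignore the speech
    return match.group(1).strip()
```

So `extract_command("Jarvis, turn on the lights", "Jarvis")` gives back `"turn on the lights"`, while a sentence that merely mentions the name somewhere in the middle is ignored. Because the name is passed in as a parameter, it stays user selected and changeable.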

At this point, about 30 minutes of speech recognition training on your Windows computer will be needed. A small trade for better results, until the new speech recognition engines become available. Scrap that. After some experiments I found that dictation is not yet accurate enough and requires higher quality audio (headset type), which won’t be possible. I’m switching to the Speech Server/Kinect API, which works with microphones in a room or even ‘phonecall’ quality audio. That should do. It also makes my life much more complicated, because it doesn’t ‘truly’ support free-speech, or dictation speech. It only matches against words and phrases you tell it to…

Figure out the intent of what a person said

I want this system to work on free speech, or at least seem that way to the user. No predefined keywords or word order will be required, which will make it much more natural. The Speech API allows me to build a rule graph that should be able to handle just that.

To be more clear:

  • “I would like a coffee”
  • “I’d like coffee”
  • “Get me a coffee”
  • “Coffee please”

Each phrase is different, but the intent is the same: the user wants coffee. It makes no difference which phrase in the rule is actually spoken. If the spoken phrase is defined within that rule, the rule is considered successfully matched by Limitless.
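Here is the coffee rule as a small Python sketch. It is a plain text lookup, not the Speech API's rule graph, and the names (`INTENT_RULES`, `normalize`, `match_intent`, `order_coffee`) are invented for the example. It only shows the principle that several phrases map to one intent.

```python
import re

# One rule per intent; each rule lists the phrases that trigger it.
INTENT_RULES = {
    "order_coffee": [
        "I would like a coffee",
        "I'd like coffee",
        "Get me a coffee",
        "Coffee please",
    ],
}


def normalize(text):
    """Lowercase, straighten apostrophes and drop punctuation."""
    text = text.lower().replace("\u2019", "'")
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    return " ".join(text.split())


def match_intent(utterance, rules=INTENT_RULES):
    """Return the intent whose rule contains the spoken phrase, else None."""
    spoken = normalize(utterance)
    for intent, phrases in rules.items():
        if any(spoken == normalize(phrase) for phrase in phrases):
            return intent
    return None
```

Whichever of the four phrases is spoken, `match_intent` returns the same `order_coffee` intent, and anything outside the rule returns `None`. That last part is exactly the limitation of matching only against phrases you defined up front.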

Once the phrases have been matched to words and converted to text, I’ll need something more powerful to actually figure out intent. Some research I’m looking into now comes from Stanford University:

The Stanford Parser: A statistical parser

Stanford Named Entity Recognizer (NER)

Figuring out intent is probably the second most important part of Limitless. That’s where I am now.

Figure out who is talking (or authentication)

At first I thought each user of the system would simply have a different name to call the system, but that doesn’t really cover anything that requires some form of authentication, and the system doesn’t ‘truly’ know who is talking.

That led me to voice print analysis and all kinds of crazy maths and research. I found a library that claims it can do this, but I need this to happen within 0.2 seconds… anything longer will start delaying the response. I can overcome this with threads and some other ways, but that seems like too much of a workaround.
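For what it’s worth, the thread workaround could look roughly like the Python sketch below. `identify_speaker` is a placeholder I made up that just returns a fixed name; it does not call VoiceID or any real voice print library. The only real part is the standard library’s `concurrent.futures`, used to give the identification a 0.2 second budget.

```python
import concurrent.futures
import time

# Shared worker pool so identification runs off the main thread.
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=2)


def identify_speaker(audio):
    """Placeholder for a real voice print lookup (assumption, not a real API)."""
    time.sleep(0.01)  # pretend to do some work
    return "alice"


def identify_with_budget(audio, budget_seconds=0.2, identify=identify_speaker):
    """Return the speaker if identification finishes within the budget,
    otherwise None so the response is not delayed."""
    future = _executor.submit(identify, audio)
    try:
        return future.result(timeout=budget_seconds)
    except concurrent.futures.TimeoutError:
        # Too slow: carry on without knowing who is talking.
        return None
```

The catch is visible right away: when the budget runs out, the system has to answer without knowing who spoke, which is why this feels like a workaround rather than a fix.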

But I need to know who is talking, because this system will be indistinguishable from magic.

VoiceID Python Library

Here is what a spectrogram, or voice print, looks like for a male saying ‘19th Century’ (from Wikipedia):

voice print