Facial Recognition Screenshot

Facial recognition is one of those things that can be extremely cool, but also really creepy (privacy issues aside).

For Project Limitless, my aim with facial recognition and tracking is to enable the system to ‘see’ you: where you are, what you are doing, and so on. It isn’t really meant to be taken out into the world, although it could be.

For Mark 0.6 I plan to have streaming audio and video from any device to the core system ready. This will lay the groundwork for having everything in your home centered around a single ‘brain’ that can listen and talk to you wherever you are. However, for this demo video, I am just doing it from an Android device to test the idea.

The Demo

In the past two weeks I learned Android development with these simple goals in mind:

  1. Have the app work in VR (Google Cardboard)
  2. Have the display show the camera feed (augmented reality)
  3. Overlay tracked faces on top
  4. Show the name under each recognized face

Easy enough, right? Well, kinda… Since this was my first Android app it was a jump into the deep end, but once you get used to it, it’s kinda cool. After the two weeks I have a mashup app for you to see! The demo works on photos, but it works on live people as well, provided enough training photos (3+) are supplied.

Notice the display is cloned on the right; this allows it to work in a VR headset.

Until next time!

Virtual House Screenshot

In the last two weeks I’ve put in a lot of time to get the core of the platform together. I still have a very long way to go, but as it stands right now, the following modules are working:

  1. Speech Recognition
  2. Speech Synthesis (Text-To-Speech)
  3. Intent Recognition
  4. Natural Language Processing
  5. Plugin Infrastructure
  6. Data Storage API
  7. Multi-user support

Next I needed to test it out in the real world, mostly to check stability and whether it can handle arbitrary input. What better way than to just throw it on the web in a live demo?

The Virtual House Demo

I decided to have people control a ‘virtual house’ with their voices (or text) using Chrome’s built-in Speech Recognition API. I wasn’t testing speech-to-text, so this worked fine. To get it all working I started the platform on an Ubuntu server, built a plugin to handle the state of every user’s house and connected the browser and the platform using Node.js and Socket.io. It works nicely, especially when you speak clearly.
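The post doesn’t show the plugin code, but the idea of a plugin that tracks every user’s house state can be sketched roughly like this. All the names here (`HousePlugin`, `handle`) are my own assumptions; the platform’s actual plugin API isn’t described in the post.

```python
# Toy sketch of a per-user 'virtual house' state plugin.
# Class and method names are hypothetical, not the platform's real API.

class HousePlugin:
    def __init__(self):
        # each user gets an independent house state
        self.houses = {}

    def _house(self, user_id):
        # lazily create a default house for new users
        return self.houses.setdefault(
            user_id, {"kitchen light": "off", "front door": "locked"}
        )

    def handle(self, user_id, device, state):
        """Apply a recognized command to one user's house state."""
        house = self._house(user_id)
        if device not in house:
            return f"I don't know about the {device}."
        house[device] = state
        return f"The {device} is now {state}."

plugin = HousePlugin()
print(plugin.handle("alice", "kitchen light", "on"))
```

Each browser session would map to one `user_id`, so visitors to the live demo can’t flip each other’s lights.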

There is still a lot to come; this demo simply shows where I’m heading with this project. Check out the demo video below of me using it, then head on over to projectlimitless.io and give it a try yourself!

Back in 2013 I played around with the idea of putting together my own J.A.R.V.I.S, but at that time I was also involved with my own start-up, Cirqls, and I never really got very far…

Fast-forward to 2016 and I have some spare time again (mainly between 10pm and 1am) as well as some daily tasks that I want automated. So I decided to revisit my old ideas and see what new technologies could support such a system. Turns out my old ideas were exactly that, old. After a couple of days thinking about the architecture I finally started working on it.

The main goal of what I call ‘Project Limitless’ is to build a platform for naturally controlling all the technology around you.

For now, enjoy the introduction video, the rest will follow soon…

After a lot of research, I think I have a reliable enough method, using Aubio, for identifying speaker gender. This is not crucial to the system, but it’s a nice touch.

The fundamental frequency of a voice can be used to estimate gender: men typically fall within the 75–150 Hz range and women within 150–300 Hz. The ranges can overlap, so any voice in the 140–160 Hz band will be treated as ‘undetermined’.

This approach is language-neutral, meaning it works for any language.

My initial test results (from random YouTube clips):

Male 01: 131 Hz fundamental frequency
Male 02: 117 Hz fundamental frequency

Female 01: 191 Hz fundamental frequency
Female 02: 194 Hz fundamental frequency

Very low female voice: 147 Hz fundamental frequency

The last sample will fall in the ‘undetermined’ range.

Now that I have this working, the next step is to get proper speaker identification working.

I’m digging into some pretty serious research around speaker identification; really interesting articles.

My best bet so far is fingerprinting the training data. Each user that registers will be required to say the name of the system 3 times. Each attempt will be fingerprinted and matched back after training. This should provide good enough recognition of the speaker.

Using Echoprint seems viable, but no print is created for the short samples I need. I’ve duplicated the sample and it now generates a print, but whether those prints will be matchable on the server remains to be seen…
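To make the matching step concrete, here is a toy illustration, not Echoprint’s actual matcher: treat each fingerprint as a set of hash codes and score each enrolled user by Jaccard similarity, accepting the best match above a threshold. The function names and the threshold value are my own assumptions.

```python
# Toy fingerprint matcher: sets of hash codes compared by
# Jaccard similarity (|intersection| / |union|).

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def identify(print_codes, enrolled, threshold=0.3):
    """Return the best-matching enrolled user, or None below the threshold."""
    best_user, best_score = None, 0.0
    for user, codes in enrolled.items():
        score = jaccard(print_codes, codes)
        if score > best_score:
            best_user, best_score = user, score
    return best_user if best_score >= threshold else None

# hypothetical code sets from each user's three enrollment attempts
enrolled = {"alice": {1, 4, 7, 9}, "bob": {2, 3, 8, 11}}
print(identify({1, 4, 7, 12}, enrolled))
```

A new utterance sharing most of its codes with one user’s enrollment prints would identify that user; a stranger’s print falls below the threshold and returns no match.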

Figure out how to make the computer talk

Text-to-Speech (TTS) is a pretty well-known field, so finding libraries to do just this was easy. I’m going with Microsoft’s Speech Platform, the same platform used by Windows as well as Kinect. It’s easy to implement and doesn’t sound too much like a computer. NOTE: I am using the Server API for speech synthesis; the desktop API’s voices aren’t that great.

Links: Microsoft Speech Platform

Figure out how to make the computer listen

Speech recognition, also called speech-to-text, dictation, etc. This is a bit harder to get working correctly. Here’s the challenge: someone could say anything at any moment, and the computer needs to recognize it and convert the speech to text. I’m also going with the Microsoft Speech Platform for this, mainly because of the work done by Microsoft. This will most likely make it into Windows soon.

Humans always have some context to understand speech in: because of all our senses, we always know who is talking to us, and whether someone is indeed talking to us. For a computer, 5 people in a room talking to each other is actually 5 people talking to the computer at the same time. To fix this, I’m adding a keyword to initiate free-speech (dictation) mode, just as Tony Stark does it: “Jarvis, something something something”. Jarvis is his keyword (which he doesn’t always use); mine will be the name of the system, which is user-selected (and changeable).
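The keyword-gating idea can be sketched simply: ignore everything until the system’s name is heard, then treat the rest of the utterance as the command. Since the name is user-selected, it’s a parameter here; the function name is my own.

```python
# Minimal wake-word gating sketch: return the command following the
# keyword, or None if the utterance wasn't addressed to the system.

def extract_command(transcript, keyword):
    # normalize case and strip trailing punctuation from each word
    words = [w.strip(",.!?") for w in transcript.lower().split()]
    key = keyword.lower()
    if key not in words:
        return None                        # not talking to the system
    idx = words.index(key)
    command = " ".join(words[idx + 1:])
    return command or None                 # keyword alone is not a command

print(extract_command("Jarvis, turn on the lights", "jarvis"))
```

Anything spoken without the keyword, such as the 5 people chatting among themselves, simply returns `None` and is discarded.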

At this point, doing about 30 minutes of speech recognition training on your Windows computer will be needed; a small trade for better results, until the new speech recognition engines become available. Scrap that. After some experiments I found that dictation is not yet accurate enough, and it requires higher-quality (headset-type) audio, which won’t be possible. I’m switching to the Speech Server/Kinect API, which works with microphones in a room or even ‘phone-call’ quality audio. That should do. It also makes my life much more complicated, because it doesn’t truly support free speech, or dictation; it only matches against words and phrases you tell it to…

Figure out the intent of what a person said

I want this system to work on free speech, or at least to seem that way to the user. No fixed keywords or word order will be required. This will make it much more natural. The Speech API allows me to build a rule graph that should be able to handle just that.

To be more clear:

  • “I would like a coffee”
  • “I’d like coffee”
  • “Get me a coffee”
  • “Coffee please”

Each phrase is different, but the intent is the same: the user wants coffee. It makes no difference which phrase in the rule is actually spoken. If the spoken phrase is defined within that rule, the rule is considered successfully matched by Limitless.
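A rule like the coffee example can be sketched as a mapping from one intent to its surface phrases. The dictionary structure below is my own illustration; the real system expresses these rules through the Microsoft Speech API’s grammar, not Python.

```python
# Sketch of a phrase rule: several spoken phrases map to one intent.

RULES = {
    "get_coffee": [
        "i would like a coffee",
        "i'd like coffee",
        "get me a coffee",
        "coffee please",
    ],
}

def match_intent(utterance):
    """Return the intent whose rule contains the spoken phrase, else None."""
    phrase = utterance.lower().strip()
    for intent, phrases in RULES.items():
        if phrase in phrases:
            return intent
    return None

print(match_intent("Get me a coffee"))
```

Any of the four phrasings resolves to the same `get_coffee` intent, which is exactly the matching behavior described above.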

Once the phrases have been matched and converted to text, I’ll need something more powerful to actually figure out intent. Some research I’m looking into now comes from Stanford University:

The Stanford Parser: A statistical parser

Stanford Named Entity Recognizer (NER)

Figuring out intent is probably the second most important part of Limitless. That’s where I am now.

Figure out who is talking (or authentication)

At first I thought each user of the system would simply use a different name for the system, but that doesn’t really cover anything that requires some form of authentication, and the system doesn’t truly know who is talking.

That led me to voice print analysis and all kinds of crazy maths and research. I found a library that claims it can do this, but I need it to happen within 0.2 seconds; anything longer will start delaying the response. I could overcome this with threads and some other tricks, but that seems like too much of a workaround.

But I need to know who is talking, because this system will be indistinguishable from magic.

VoiceID Python Library

Here is what a spectrogram, or voice print, looks like for a male saying ‘19th Century’ (from Wikipedia):

voice print

Hermes is my personal project to build a J.A.R.V.I.S (Tony Stark’s home computing system in Iron Man) type system.

The aim of this project is not to see if it can be done, but to see if I can do it.

Why is it called Limitless?

Purely because this project has no limits. I plan to make this so incredible that it becomes unbelievable.

Follow my progress

You can follow my progress on this project using the ‘Project Limitless’ link at the top-right of this page.