10 000 lines of open source code for Project Limitless sums up my personal efforts for 2016.

Project Limitless contains multiple projects written in C#, Go, PHP, HTML and JavaScript, with more than a thousand lines spent just on documentation and guides for these projects.

The rest of my personal work was aimed at proof-of-concepts, a Project Limitless demo, learning Angular and React, and Silicon Valley programming challenges – which I passed with great feedback.

In my professional work I had more web work than I would have liked – which resulted in a lot of PHP being written using Yii. Most of the systems-related work was again done in Go, which has grown a lot since we first switched two years ago. I’m really looking forward to the 1.8 release next month. Python was used for working with serverless AWS Lambda functions.

You can find all the open source code on my GitHub.

I’ve combined all my code from 2016 into an infographic again. Below is an image; the full page can be found here.

My Year in Code 2016


Facial Recognition Screenshot

Facial Recognition is one of those things that can be extremely cool, but also one of those things that can be really creepy (aside from the privacy issues).

For Project Limitless, my aim in introducing facial recognition and tracking is to enable the system to ‘see’ you: where you are, what you are doing and so on. It’s not really meant to be taken out into the world, although it can be.

For Mark 0.6 I plan to have streaming audio and video from any device to the core system ready. This will lay the groundwork for having everything in your home centered around a single ‘brain’ that can listen and talk to you wherever you are. However, for this demo video, I am just doing it from an Android device to test the idea.

The Demo

In the past two weeks I learned Android development with these simple goals in mind:

  1. Have the app work in VR (Google Cardboard)
  2. But have the display show the camera (Augmented Reality)
  3. and on top show tracked faces
  4. when recognized, show the name under the face

Easy enough right? Well kinda… Since this was my first Android app, it was the deep end, but once you get used to it, it’s kinda cool. After the two weeks I have a mashup app for you to see! This demo works on photos, but it actually works on live people as well, provided enough training photos (3+) are supplied.

Notice that the display is cloned on the right; this allows it to work in a VR headset.

Until next time!

Virtual House Screenshot

In the last two weeks I’ve put in a lot of time to get the core of the platform together. I still have a very long way to go, but as it stands right now, the following modules are working:

  1. Speech Recognition
  2. Speech Synthesis (Text-To-Speech)
  3. Intent Recognition
  4. Natural Language Processing
  5. Plugin Infrastructure
  6. Data Storage API
  7. Multi-user support

Next I needed to test it out in the real world, mostly to check stability and whether it can handle arbitrary input. What better way than to just throw it on the web in a live demo?

The Virtual House Demo

I decided to have people control a ‘virtual house’ with their voices (or text) using Chrome’s built-in Speech Recognition API. I wasn’t testing speech-to-text, so this worked fine. To get it all working I started the platform on an Ubuntu server, built a plugin to handle the state of every user’s house and connected the browser and the platform using Node.js and Socket.io. It works nicely, especially when you speak clearly.
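To give a feel for what ‘handling the state of every user’s house’ means, here is a minimal sketch in Python. It is not the actual plugin or the platform’s plugin API; the class name, the device names and the `handle` method are all made up for illustration. The only idea taken from the demo is that each user gets their own independent house.

```python
from collections import defaultdict

# Hypothetical device names; the demo's real device list is not given in the post.
DEFAULT_HOUSE = {"lights": False, "tv": False, "heater": False}


class VirtualHousePlugin:
    """Keeps one independent house state per connected user (illustrative only)."""

    def __init__(self):
        # A new user gets a fresh copy of the default house on first access.
        self._houses = defaultdict(lambda: dict(DEFAULT_HOUSE))

    def handle(self, user_id, device, turn_on):
        """Apply a switch command to one user's house and return a copy of its state."""
        house = self._houses[user_id]
        if device not in house:
            raise KeyError(f"unknown device: {device}")
        house[device] = turn_on
        return dict(house)


plugin = VirtualHousePlugin()
alice_house = plugin.handle("alice", "lights", True)
bob_house = plugin.handle("bob", "tv", True)
```

Switching on Alice’s lights leaves Bob’s house untouched, which is what lets many visitors use the live demo at the same time.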

There is still a lot to come; this demo is simply to show you where I’m heading with this project. Check out the demo video below of me using it, then head on over to projectlimitless.io and give it a try yourself!

Back in 2013 I played around with the idea of putting together my own J.A.R.V.I.S, but at that time I was also involved with my own start-up, Cirqls, and I never really got very far…

Fast-forward to 2016 and I have some spare time again (mainly between 10pm and 1am) as well as some daily tasks that I want automated. So I decided to revisit my old ideas and see what new technologies could support such a system. Turns out my old ideas were exactly that, old. After a couple of days thinking about the architecture I finally started working on it.

The main goal of what I call ‘Project Limitless’ is to build a platform for naturally controlling all the technology around you.

For now, enjoy the introduction video, the rest will follow soon…

It was a busy 2015. I got into many new front-end technologies and was really impressed with how much (and how fast) things have changed since 2014.

Personally, the highlight for me, from a systems view, was digging deeper into Google’s Go language. Most of my professional time was spent in the language and it is now my go-to language, which in 2014 was Python.

On the web side of things, PHP was still my go-to language, using the Yii Framework. Yii still allows me to work really fast and get the proof-of-concepts and projects done on time. For my personal site, I updated it using Google’s Polymer to get a feel for web components. Web components are, for me, one of those technologies that I can’t believe took so long to be developed. They just make perfect sense – at least to most object-oriented programmers.

I’ve combined all my code from 2015 into an infographic, just a nice summary of where I spent my time. Below is an image, but the full page can be found here.

My year in code 2015


After a lot of research I (think I) have a reliable enough method, using Aubio, of identifying speaker gender. This is not crucial to the system, but it’s a nice touch.

The fundamental frequency of audio can be used to determine gender. Men typically fall within the 75–150 Hz range and women within the 150–300 Hz range. Overlap is possible, so any voice in the 140–160 Hz range will be treated as ‘undetermined’.

This is language neutral – meaning that it can be used in any language.

My initial test results (from random YouTube clips):

Male 01: 131 Hz fundamental frequency
Male 02: 117 Hz fundamental frequency

Female 01: 191 Hz fundamental frequency
Female 02: 194 Hz fundamental frequency

Very low female voice: 147 Hz fundamental frequency

The last sample will fall in the ‘undetermined’ range.
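The thresholds and test samples above can be sketched as a small Python function. This only covers the classification step, it assumes the fundamental frequency has already been estimated (for example by Aubio), and the function name is made up. Treating anything outside 75–300 Hz as ‘undetermined’ is my own assumption for the sketch; the ranges above don’t say what happens there.

```python
def classify_gender(f0_hz):
    """Classify a speaker from an already-estimated fundamental frequency in Hz."""
    # Assumption: values outside the typical 75-300 Hz voice range are not classified.
    if f0_hz < 75 or f0_hz > 300:
        return "undetermined"
    # Overlap band from the post: 140-160 Hz is treated as undetermined.
    if 140 <= f0_hz <= 160:
        return "undetermined"
    return "male" if f0_hz < 140 else "female"


# The initial test results from the random YouTube clips.
samples = {
    "Male 01": 131,
    "Male 02": 117,
    "Female 01": 191,
    "Female 02": 194,
    "Very low female voice": 147,
}
results = {name: classify_gender(f0) for name, f0 in samples.items()}
```

With these numbers the two male and two female clips are classified as expected, and the very low female voice lands in the ‘undetermined’ band.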

Now that I have this working, the next step is to get proper speaker identification working.

I’m digging into some pretty serious research around speaker identification, really really interesting articles.

My best bet so far is fingerprinting the training data. Each user that registers will be required to say the name of the system 3 times. Each attempt will be fingerprinted and matched back after training. This should provide good enough recognition of the speaker.

Using echoprint seems viable, but no print is created for the short samples I need. I’ve now duplicated the sample so that it generates a print, but whether the prints can be matched on the server remains to be seen…

Figure out how to make the computer talk

Text-to-Speech (TTS) is a pretty well-known field, so finding some libraries to do just this was easy. I’m going with Microsoft’s Speech Platform, the same platform used by Windows as well as Kinect. It is easy to implement and doesn’t sound too much like a computer. NOTE: I am using the Server API for Speech Synthesis; the desktop API’s voices aren’t that great.

Links: Microsoft Speech Platform

Figure out how to make the computer listen

Speech recognition, also called speech-to-text, dictation, etc., is a bit harder to get working correctly. Here is the challenge: someone could say anything at any moment, and the computer needs to recognise it and turn the speech into text. I’m also going with the Microsoft Speech Platform for this, mainly because of the work done by Microsoft. This will most likely make it into Windows soon.

Humans always have some context to understand speech in, in other words, because of all our senses, we always know who is talking to us, and if someone is indeed talking to us. For a computer, 5 people in a room talking to each other, is actually 5 people talking to the computer at the same time. To fix this, I’m adding a keyword to initialize free-speech (dictation) mode. As Tony Stark does it “Jarvis, something something something”. Jarvis is his keyword (which he doesn’t always use), mine will be the name of the system, which is user selected (and changeable).
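The keyword idea can be sketched in a few lines of Python. This works on text that has already been recognised, not on audio, and it is not how the Speech Platform does it; the function name is made up. It simply checks whether an utterance starts with the user-selected system name and, if so, returns whatever follows as the free-speech part.

```python
import re


def extract_command(transcript, system_name):
    """Return the text after the system name, or None if the speech is not addressed to the system."""
    # The name must be the first word; punctuation after it ("Jarvis, ...") is skipped.
    pattern = r"^\s*" + re.escape(system_name) + r"\b[\s,:.!]*(.*)$"
    match = re.match(pattern, transcript, flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        return None
    return match.group(1).strip()


addressed = extract_command("Jarvis, turn on the lights", "Jarvis")
ignored = extract_command("Did you turn on the lights?", "Jarvis")
```

Because the name is passed in as a parameter, changing the system’s name only changes the value handed to the function, which fits the ‘user selected (and changeable)’ requirement.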

At this point, doing about 30 minutes of speech recognition training on your Windows computer will be needed. A small trade for better results – until the new speech recognition engines become available. – Scrap that. After some experiments I found that dictation is not yet accurate enough, and requires higher quality audio (headset type) which won’t be possible. I’m switching to the Speech Server/Kinect API, this works with microphones in a room or even ‘phonecall’ quality audio. That should do. Also makes my life much much more complicated because it doesn’t ‘truly’ support free-speech, or dictation speech. It only matches against words and phrases you tell it to…

Figure out the intent of what a person said

I want this system to work on free-speech, or at least seem that way to the user. No predefined keywords or word order will be required. This will make it much more natural. The Speech API allows me to build a rule graph that should be able to handle just that.

To be more clear:

  • “I would like a coffee”
  • “I’d like coffee”
  • “Get me a coffee”
  • “Coffee please”

Each phrase is different, but the intent is the same: the user wants coffee. It makes no difference which phrase in the rule is actually spoken. If the spoken phrase is defined within that rule, the rule is considered successfully matched by Limitless.
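As a rough illustration of one rule with several phrases, here is a plain-Python stand-in. It is not the Speech API’s grammar format, and the intent name `get_coffee` and the function names are made up; the point is only that every phrase in a rule maps to the same intent.

```python
import re

# One rule per intent; each rule lists the phrases that satisfy it.
# "get_coffee" is a hypothetical intent name used only for this sketch.
RULES = {
    "get_coffee": [
        "I would like a coffee",
        "I'd like coffee",
        "Get me a coffee",
        "Coffee please",
    ],
}


def normalize(phrase):
    """Lower-case, unify apostrophes and drop punctuation so spoken variants compare equal."""
    phrase = phrase.lower().replace("\u2019", "'")
    phrase = re.sub(r"[^a-z' ]+", " ", phrase)
    return " ".join(phrase.split())


# Reverse index: normalized phrase -> intent.
INDEX = {normalize(p): intent for intent, phrases in RULES.items() for p in phrases}


def match_intent(spoken):
    """Return the intent whose rule contains the spoken phrase, or None if no rule matches."""
    return INDEX.get(normalize(spoken))


intent = match_intent("Coffee, please!")
```

A phrase that is not listed in any rule returns `None`, which is exactly the limitation that makes a more powerful intent step necessary later on.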

Once the phrases have been matched to words/converted to text, I’ll need something more powerful to actually figure out intent. Some research I’m looking into now comes from Stanford University:

The Stanford Parser: A statistical parser

Stanford Named Entity Recognizer (NER)

Figuring out intent is probably the second most important part of Limitless. That’s where I am now.

Figure out who is talking (or authentication)

At first I thought each user of the system would simply have a different name to call the system, but that doesn’t really cover anything that requires some form of authentication, and the system doesn’t ‘truly’ know who is talking.

That led me to voice print analysis and all kinds of crazy maths and research. I found a library that claims it can do this, but I need this to happen within 0.2 seconds… anything longer will start delaying the response. I can overcome this with threads and some other ways, but that seems like too much of a workaround.

But I need to know who is talking, because this system will be indistinguishable from magic.

VoiceID Python Library

Here is what a spectrogram, or voice print, looks like for a male saying ‘19th Century’ (from Wikipedia):

voice print

Hermes is my personal project to build a J.A.R.V.I.S (Tony Stark’s home computing system in Iron Man) type system.

The aim of this project is not to see if it can be done, but to see if I can do it.

Why is it called Limitless?

Purely because this project has no limits. I plan to make this so incredible that it becomes unbelievable.

Follow my progress

You can follow my progress on this project using the ‘Project Limitless’ link at the top-right of this page.