After a lot of research I (think) I have a reliable enough method, using Aubio, of identifying speaker gender. This is not crucial to the system, but it’s a nice touch.

The fundamental frequency of a voice can be used to estimate gender. Male voices typically fall in the 75–150 Hz range and female voices in the 150–300 Hz range. Overlap is possible, so any voice in the 140–160 Hz range will be treated as ‘undetermined’.

This is language-neutral, meaning it can be used with any spoken language.
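A minimal sketch of that check, using aubio’s Python bindings (the file name and the exact cut-offs below are just my current assumptions):

    import numpy as np
    import aubio

    HOP_SIZE = 512

    def estimate_gender(wav_path):
        """Classify a speaker as male/female/undetermined from the median pitch."""
        src = aubio.source(wav_path, 0, HOP_SIZE)            # 0 = keep the file's samplerate
        pitch_o = aubio.pitch("yin", 2048, HOP_SIZE, src.samplerate)
        pitch_o.set_unit("Hz")
        pitch_o.set_silence(-40)                             # skip near-silent frames

        pitches = []
        while True:
            samples, read = src()
            f0 = pitch_o(samples)[0]
            if f0 > 0:                                       # 0 Hz means no pitch was found
                pitches.append(f0)
            if read < HOP_SIZE:
                break

        if not pitches:
            return "undetermined", 0.0

        f0 = float(np.median(pitches))
        if f0 < 140:
            return "male", f0
        if f0 > 160:
            return "female", f0
        return "undetermined", f0

    print(estimate_gender("sample.wav"))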

My initial test results (from random YouTube clips):


Male 01: 131 Hz fundamental frequency
Male 02: 117 Hz fundamental frequency

Female 01: 191 Hz fundamental frequency
Female 02: 194 Hz fundamental frequency

Very low female voice: 147 Hz fundamental frequency

The last sample will fall in the ‘undetermined’ range.

Now that I have this working, the next step is to get proper speaker identification working.

I’m digging into some pretty serious research on speaker identification; the articles are really, really interesting.

My best bet so far is fingerprinting the training data. Each user that registers will be required to say the name of the system 3 times. Each attempt will be fingerprinted, and later speech will be matched back against those prints after training. This should provide good enough recognition of who is speaking.

Using Echoprint seems viable, but it doesn’t create a print for the short samples I need. I’ve now tried duplicating the sample, which does generate a print, but whether those prints will be matchable on the server remains to be seen…
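To make the enrol-and-match idea concrete, here’s a rough sketch of the flow. The fingerprint() and similarity() functions are hypothetical placeholders, not Echoprint calls; whatever fingerprinting library ends up working would slot in there.

    # Conceptual sketch of the enrolment/matching flow only.
    enrolled = {}  # user name -> list of fingerprints

    def fingerprint(audio):
        raise NotImplementedError      # hypothetical: would wrap a real fingerprinting library

    def similarity(fp_a, fp_b):
        raise NotImplementedError      # hypothetical: would compare two prints, 0.0 to 1.0

    def enroll(user, samples):
        """Each user says the system name three times; store a print per attempt."""
        enrolled[user] = [fingerprint(s) for s in samples]

    def identify(audio, threshold=0.7):
        """Return the best-matching enrolled user, or None if nobody is close enough."""
        fp = fingerprint(audio)
        best_user, best_score = None, 0.0
        for user, prints in enrolled.items():
            score = max(similarity(fp, p) for p in prints)
            if score > best_score:
                best_user, best_score = user, score
        return best_user if best_score >= threshold else None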

Figure out how to make the computer talk

Text-to-Speech (TTS) is a pretty well-known field, so finding libraries that do just this was easy. I’m going with Microsoft’s Speech Platform, the same platform used by Windows as well as Kinect. It’s easy to implement and doesn’t sound too much like a computer. NOTE: I am using the Server API for speech synthesis; the desktop API’s voices aren’t that great.

Links: Microsoft Speech Platform
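The Speech Platform itself is a .NET API, so this isn’t the code I’m actually running; as a rough Python stand-in, pyttsx3 drives the local Windows SAPI voices, but the shape is the same: pick a voice, hand it text, speak.

    # pyttsx3 wraps the local SAPI voices on Windows; this is a stand-in for
    # illustration, not the Speech Platform Server API.
    import pyttsx3

    engine = pyttsx3.init()
    engine.setProperty("rate", 170)              # speaking rate, roughly words per minute

    for voice in engine.getProperty("voices"):   # list the installed voices
        print(voice.id)

    engine.say("Good evening. All systems are online.")
    engine.runAndWait()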

Figure out how to make the computer listen

Speech recognition, also called speech-to-text, dictation, etc. This is a bit harder to get working correctly. Here’s the challenge: someone could say anything at any moment, and the computer needs to recognise it and turn the speech into text. I’m also going with the Microsoft Speech Platform for this, mainly because of the work Microsoft has done on it. It will most likely make it into Windows soon.

Humans always have some context to understand speech in: thanks to all our senses, we always know who is talking to us, and whether someone is actually talking to us at all. For a computer, 5 people in a room talking to each other is effectively 5 people talking to the computer at the same time. To fix this, I’m adding a keyword that initializes free-speech (dictation) mode, the way Tony Stark does it: “Jarvis, something something something”. Jarvis is his keyword (which he doesn’t always use); mine will be the name of the system, which is user-selected (and changeable).
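The keyword gate itself is simple. Here’s the gist as a sketch, where transcribe_next_utterance() is a hypothetical placeholder for whatever recognition call ends up doing the actual listening:

    SYSTEM_NAME = "jarvis"                        # user-selected and changeable in the real system

    def transcribe_next_utterance():
        """Hypothetical placeholder for the speech-to-text call."""
        raise NotImplementedError

    def listen_loop(handle_command):
        while True:
            text = transcribe_next_utterance().lower().strip()
            if not text.startswith(SYSTEM_NAME):
                continue                          # not addressed to the system, ignore it
            command = text[len(SYSTEM_NAME):].strip(" ,")
            handle_command(command)               # e.g. "get me a coffee"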

At this point, about 30 minutes of speech recognition training on your Windows computer will be needed. A small trade-off for better results – until the new speech recognition engines become available. Scrap that. After some experiments I found that dictation is not yet accurate enough, and it requires higher-quality audio (headset type) which won’t be possible here. I’m switching to the Speech Server/Kinect API, which works with microphones in a room or even ‘phone call’ quality audio. That should do. It also makes my life much, much more complicated, because it doesn’t ‘truly’ support free speech or dictation; it only matches against words and phrases you tell it to…

Figure out the intent of what a person said

I want this system to work on free speech, or at least seem that way to the user. No predefined keywords or word order will be required, which makes it much more natural. The Speech API allows me to build a rule graph that should be able to handle just that.

To be more clear:

  • “I would like a coffee”
  • “I’d like coffee”
  • “Get me a coffee”
  • “Coffee please”

Each phrase is different, but the intent is the same: the user wants coffee. It makes no difference which phrase in the rule is actually spoken. If the spoken phrase is defined within that rule, the rule is considered successfully matched by Limitless.
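Conceptually, a rule is nothing more than a set of phrasings that all map to one intent. The real rule graph is built with the Speech Platform’s grammar support, but a plain Python sketch of the idea looks like this:

    # One rule: several phrasings, one intent.
    RULES = {
        "get_coffee": {
            "i would like a coffee",
            "i'd like coffee",
            "get me a coffee",
            "coffee please",
        },
    }

    def match_intent(spoken_text):
        phrase = spoken_text.lower().strip()
        for intent, phrases in RULES.items():
            if phrase in phrases:
                return intent
        return None

    print(match_intent("Coffee please"))   # -> "get_coffee"
    print(match_intent("Tea please"))      # -> None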

Once the phrases have been matched to words and converted to text, I’ll need something more powerful to actually figure out intent. Some research I’m looking into now comes from Stanford University:

The Stanford Parser: A statistical parser

Stanford Named Entity Recognizer (NER)
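I haven’t built anything with these yet, but to give an idea of what the NER side produces, here’s a sketch using NLTK’s wrapper around the Stanford NER. The jar and model paths are assumptions pointing at wherever the Stanford download is unpacked, and Java needs to be installed:

    from nltk.tag import StanfordNERTagger

    # Paths are placeholders for a local copy of the Stanford NER distribution.
    tagger = StanfordNERTagger(
        "stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz",
        "stanford-ner/stanford-ner.jar",
    )

    tokens = "Remind Sarah that the meeting with Microsoft is in Cape Town".split()
    print(tagger.tag(tokens))
    # Roughly: [('Remind', 'O'), ('Sarah', 'PERSON'), ..., ('Microsoft', 'ORGANIZATION'),
    #           ..., ('Cape', 'LOCATION'), ('Town', 'LOCATION')]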

Figuring out intent is probably the second most important part of Limitless. That’s where I am now.

Figure out who is talking (or authentication)

At first I thought each user of the system would simply have a different name to call the system by, but that doesn’t really cover anything that requires some form of authentication, and the system still doesn’t ‘truly’ know who is talking.

That led me to voice print analysis and all kinds of crazy maths and research. I found a library that claims it can do this, but I need it to happen within 0.2 seconds… anything longer will start delaying the response. I could get around this with threads and a few other tricks, but that seems like too much of a workaround.

But I need to know who is talking, because this system will be indistinguishable from magic.

VoiceID Python Library

Here is what a spectrogram, or voice print, looks like for a male saying ’19th Century’ (from Wikipedia)

voice print
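Generating that kind of plot for your own samples only takes a few lines with scipy and matplotlib (the WAV file name below is a placeholder):

    import matplotlib.pyplot as plt
    from scipy.io import wavfile

    rate, data = wavfile.read("sample.wav")
    if data.ndim > 1:
        data = data[:, 0]                      # keep one channel if the file is stereo

    plt.specgram(data, NFFT=1024, Fs=rate, noverlap=512)
    plt.xlabel("Time (s)")
    plt.ylabel("Frequency (Hz)")
    plt.title("Voice print")
    plt.show()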