The recent announcement from Amazon that it would be reducing staff and budget for its Alexa division has led some to deem the voice assistant "a colossal failure." In its wake, there has been discussion that voice as an industry is stagnating (or even worse, on the decline).

I have to say, I disagree. 

While it is true that voice has hit its use-case ceiling, that doesn't equal stagnation. It simply means the current state of the technology has a few limitations that are important to understand if we want it to evolve.

Simply put, today's technologies do not perform to the human standard. Meeting that standard requires three capabilities:

  1. Superior natural language understanding (NLU): Plenty of good companies have conquered this aspect. Their technology can pick up what you're saying and recognize the usual ways people phrase what they want. For example, if you say, "I'd like a hamburger with onions," it knows you want the onions on the hamburger, not in a separate bag.
  2. Voice metadata extraction: Voice technology needs to pick up whether a speaker is happy or frustrated, how far they are from the mic, and who they are (so it can tie them to their identities and accounts). It needs to recognize voices well enough to know when you, rather than somebody else, are talking.
  3. Robustness to crosstalk and untethered noise: The ability to understand speech even when other people are talking over it and when there are noises (traffic, music, babble) that are not independently accessible to noise-cancellation algorithms.
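To make the hamburger example in the first point concrete, here is a minimal, hypothetical sketch (not any vendor's actual system; all function names are invented for illustration) contrasting flat keyword spotting, which loses the relationship between words, with a simple slot-filling parse that attaches "with onions" to the hamburger as a modifier:

```python
# Toy illustration of the NLU point above. A naive keyword matcher
# treats "hamburger" and "onions" as two unrelated items; a slightly
# smarter parser attaches the "with ..." phrase to the item it
# modifies, which is closer to what the speaker meant.
import re

def naive_parse(utterance: str) -> list[str]:
    # Flat keyword spotting: every known food word becomes its own item.
    vocab = {"hamburger", "onions", "fries"}
    return [w for w in re.findall(r"[a-z']+", utterance.lower()) if w in vocab]

def slot_parse(utterance: str) -> dict:
    # Attach a trailing "with X" phrase to the preceding item as a
    # modifier instead of treating it as a separate order.
    m = re.search(r"(hamburger|fries)(?: with ([a-z]+))?", utterance.lower())
    if not m:
        return {}
    order = {"item": m.group(1), "modifiers": []}
    if m.group(2):
        order["modifiers"].append(m.group(2))
    return order

print(naive_parse("I'd like a hamburger with onions"))
print(slot_parse("I'd like a hamburger with onions"))
```

Real NLU systems use statistical models rather than regexes, of course, but the structural difference is the same: understanding an utterance means recovering the relationships between words, not just spotting them.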

There are companies that