Thursday, February 23, 2017

Voice UX: Ideas For A Voice App Quality Metrics Framework


In the midst of the ongoing hype around emerging voice-based Virtual Assistants (including Virtual Advisors, Virtual Companions, etc.), a crucial aspect has been pushed to the back burner and largely overlooked in recent media analysis: the overall voice UX quality, and particularly the usability of the thousands of skills, actions, and voice apps already released or being offered to the public. Compared to more traditional user interface paradigms such as the CLI and GUI, this lack of attention to qualitative aspects oddly suggests the naive belief that a voice user interface, by itself, could and should be considered enough to automagically maximize the quality of the user experience of any given application or product.

We have all had our share of bitter experiences with clumsy IVR (Interactive Voice Response) systems. Some of us will recall that the frustration began to ease with the steady improvement and growing adoption of NLP (Natural Language Processing) by service providers. With Apple's Siri first, and then with the new breed of wannabe CUI (Conversational User Interface) solutions from Google, Amazon, Samsung, Microsoft, and others, we discovered the actual and potential virtues of the overall voice UX. However, the predictable hype of the marketers on the one hand, and the genuine curiosity and easy enthusiasm of early adopters on the other, have generated a thick smokescreen that has largely masked a creeping consumer dissatisfaction.

There are still few, if any, trustworthy analytics released to the public and openly supported by the major market players regarding actual consumer use of the thousands of Alexa skills, Google Home conversational actions, and the like. However, both our usability tests and field interviews in the Silicon Valley area offer a number of clues that point to widespread consumer frustration.

Our Goal

The main objective of this quick report is not to compile a list of consumer grievances but to suggest an initial framework for establishing measurable criteria that allow a more detailed evaluation of skills, actions, and other similar voice apps, whether already released or soon to be released on the emerging platforms.

The core of the voice UX is making sure users find actual value in what is offered to them. Based on an adaptation of Peter Morville's renowned UX Honeycomb model, we can establish -- and propose for wider discussion -- a tentative voice app (v-app) quality metrics framework as follows:

  1. Is the v-app useful? This question tries to establish whether the content is original and satisfies a genuine user need. In other words, the v-app's design should actually present some innovation in functionality versus comparable products: it should enable the user to achieve practical goals in a better way than existing solutions allow. 
  2. Is the v-app usable? Here we try to measure the overall ease of use of a given v-app. We obviously assume that the v-app works as claimed, without malfunctions.
  3. Is the v-app desirable? This question relates to those properties of a design that are meant to trigger positive emotions and enjoyment on the user's end. In other words, a user should like the way a given v-app works in comparison to existing solutions. 
  4. Does the v-app offer easily navigable content? This is all about being intuitive and natural, that is, as close as possible to the user's spontaneous conversational expectations. 
  5. Is the v-app accessible to people with disabilities? This is also a crucial aspect of a v-app. It is about a design that enhances all the other qualities without steepening the learning curve for people with some level of physical and/or cognitive challenge. 
  6. Is the v-app credible? This question relates both to the content a given v-app offers and to the way that content is presented to the user. Credibility becomes enormously relevant when content is used to support decision making in critical domains such as health, diet, finances, legal issues, etc.
For all six of the above criteria, we propose adopting a standard rating scale from 1 (minimum) to 10 (maximum). Such a wide scale might appear to complicate the overall evaluation process. However, we think it is worth the effort because it allows us to better capture nuances that could, down the road, generate unexpected dynamics in user behavior.
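As a concrete illustration, the six criteria and the 1-10 scale proposed above could be captured in a small scoring helper. This is a hypothetical sketch: the short criterion keys, the function names, and the unweighted average are illustrative assumptions, not part of the framework itself.

```python
from statistics import mean

# The six QMF criteria proposed above; the short keys are illustrative.
CRITERIA = ("useful", "usable", "desirable", "navigable", "accessible", "credible")

def validate_scores(scores):
    """Check that every criterion carries a rating on the proposed 1-10 scale."""
    for criterion in CRITERIA:
        rating = scores.get(criterion)
        if not isinstance(rating, int) or not 1 <= rating <= 10:
            raise ValueError(f"'{criterion}' needs an integer rating from 1 to 10")

def overall_score(scores):
    """Aggregate with a simple unweighted average (an assumption; a refined
    framework might weight some criteria, e.g. credibility, more heavily)."""
    validate_scores(scores)
    return mean(scores[c] for c in CRITERIA)

# Hypothetical evaluation of an imaginary v-app.
demo = {"useful": 7, "usable": 8, "desirable": 6,
        "navigable": 5, "accessible": 4, "credible": 9}
print(overall_score(demo))  # 6.5
```

A per-criterion breakdown, rather than the single average, is what would surface the nuances mentioned above; the aggregate is only a convenient summary.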

We invite all interested parties to weigh in and help test and refine this voice UX quality metrics framework (QMF). We hope and believe that such an initiative will help users, voice UX designers, and developers better approach this emerging human-machine relationship experience as it expands into ever more domains and their related products and services.

Saturday, February 18, 2017

Voice UX: Busting The Myth To Save The Soul

The unplanned and reportedly rapid growth of the Amazon Echo line of products, and the succession of similar devices crafted by Google, Samsung, and a number of other brands over the last couple of years, have generated a crescendo of apparent public enthusiasm for "Voice First" appliances.

The particular hype around the Amazon and Google voice gadgets has offered an unexpected spotlight to marketers and a number of self-styled specialists and voice-UX advocates trying to make sense of this emerging consumer market segment. As of now, only a few usage-related surveys have surfaced, of unknown methodology and therefore of little or no value for serious analytical consideration. Even the statistics on the "actual" number of units sold by each vendor cannot be used as an objective criterion of measurement, given that their ad hoc leaks follow consumer-market arousal tactics rather than aim at better public knowledge. Additionally, the substantial absence of a generally accepted conceptual framework for this specific field makes it difficult to read and interpret the ongoing trends correctly. Nor do the promotional technology articles published almost daily add actual insight, given that they are often drafted by clueless staff writers. The end result is the lack of a comprehensive analytical perspective -- the greatly needed big picture.

While awaiting the start of more systematic field studies, we can try to organize the current fragments of information by elaborating a few temporarily useful concepts. Obviously, we assume that the reader is sufficiently familiar with basic Amazon Alexa and Google Home terminology.

In our view, a good starting point is the so-called "skills," as they are called in Amazon's platform jargon. According to official sources, as of this writing, skills can be defined as the platform's expandable, task-oriented capabilities that allow users to interact with Alexa-enabled devices in a more intuitive way by using voice. Currently, Alexa's feedback is mainly audio, only partially supplemented by visual cards displayed inside its companion mobile app.

If we run an extensive analysis of the skills, we can observe essentially three high-level categories of voice-based user experience elaborations that in a number of cases overlap:

  1. Conventional Voice Command (CVC) skills with added improvement through Natural Language Processing (NLP) 
  2. Interactive Custom Radio (ICR) skills that include audio pointcast of entertainment, news, economy, culture, education, sports, games, and other similar topics. 
  3. Ambient Intelligence Gateway (AIG) skills that include all those capabilities that allow users to interact with their home physical environment.
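The three categories above, including their possible overlaps, could be sketched as a simple tagging scheme. The skill names and category assignments below are made-up examples for illustration, not real catalog data:

```python
from enum import Enum

class SkillCategory(Enum):
    CVC = "Conventional Voice Command"
    ICR = "Interactive Custom Radio"
    AIG = "Ambient Intelligence Gateway"

# A skill can carry more than one tag, since the categories overlap.
catalog = {
    "set-a-timer": {SkillCategory.CVC},
    "daily-news-briefing": {SkillCategory.ICR},
    "dim-the-lights": {SkillCategory.AIG, SkillCategory.CVC},
}

def skills_in(category):
    """Return the (sorted) names of skills tagged with the given category."""
    return sorted(name for name, tags in catalog.items() if category in tags)

print(skills_in(SkillCategory.CVC))  # ['dim-the-lights', 'set-a-timer']
```

Modeling the tags as sets, rather than a single label per skill, is what lets one skill sit in the overlapping area of two categories, as the analysis below requires.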
Even considered by itself, the Conventional Voice Command (CVC) category could still maintain a debatable relevance despite its large overlap with the other two categories. In fact, it seems a good idea to keep it as a marginal container for all those skills that do not fall into the other categories. Additionally, this class of capabilities, even though in direct competition with smartphones, could probably assume some importance over time -- perhaps mainly for certain demographics such as aging adults -- as a home-based extra access point for Digital Assistants and Digital Advisors in a number of fields that would eventually include conversational e-Commerce, home-based healthcare, domestic legal advisory, personal and household calendar management, etc.

The Interactive Custom Radio (ICR) category is not only the quantitatively predominant one among the skills (it probably forms more than 90% of the current skills corpus); it is also the only one that has displayed, and is still showing, a very fast growth rate. Such a performance seems consistent with Amazon's marketing goals of flooding the market segment, feeding the media hype, and possibly wreaking havoc on any emerging competition. Moreover, we may also consider other driving factors on both the developers' and the consumers' ends. Since this is an emerging platform, many developers understandably tend to focus on skills that are easier to design and implement, both technically and in terms of available input feeds. All of that also helps to significantly shorten time to market (TTM). On the users' side, ICR skills often gain rapid acceptance because they reconnect with decades-long consumer habits of listening to traditional radio broadcasts.

The truly original and most interesting category, with tremendous future possibilities, is the Ambient Intelligence Gateway (AIG) class of skills. These skills are opening the way towards a Sentient Environment, where sensing technologies, data processing, and supporting middleware fuse to generate and maintain a representation of physical space in terms of a world model, allowing shared perception between computing devices and persons. To better contextualize the AIG category of capabilities, let's imagine, bottom-up, the four abstract layers of any Sentient Environment:

  1. Hardware (HW)  
  2. Firmware (FM)  
  3. Software (SW) 
  4. Cogniware (CW)
The fourth conceptual layer, which we may call Cogniware, is the one where the overall ambient embedded intelligence resides. With the development of the so-called "connected home" (also called "smart home," "home automation," and "domotics"), we are actually taking the very first, basic steps towards the establishment of a domestic Sentient Environment as described above. The AIG category of skills would belong specifically to the Cogniware layer.

In their current release, AIG capabilities are almost exclusively limited to voice commands, either direct or conditionally triggered (by using services such as IFTTT). Compared to the old-fashioned Interactive Voice Response (IVR) model, the present use of a Natural Language User Interface (NLUI) offers increasing linguistic flexibility. However, we are still substantially in a voice-based equivalent of the decades-old Command-Line Interface (CLI) phase of computing. Nevertheless, the most important intrinsic property of the voice user experience, that is, invisibility, helps create a context somewhat similar to what the specialized literature calls a Natural User Interface (NUI). The latter induces the feeling of an acquired "Shamanic" empowerment that would allow any user to connect to, and act upon, the surroundings through spoken words.

The AIG skills still have a long way to go before they mature and transform enough to merge into a Ubiquitous Access Layer (UAL) -- that is, a maturing Mediated Reality (MR) ecosystem -- and "dissolve" completely into the context of people's daily lives. Additionally, we have yet to see how these skills will integrate with other developing interactivity models, such as gesture control. The future of the unfolding Sentient Environment, particularly its Cogniware layer (embedded intelligence), appears definitely promising and wide open to exciting new developments. The current voice interactivity feature is only the first step in a long march on a rocky road full of hills and cliffs.

Mobility vs Ubiquity: A Twisted Confusion

In the middle of the ongoing enthusiastic debate around so-called Digital Assistants and conversational UX, it is not difficult to notice that two important words of the current technology lexicon are used, or implied, as if they were synonyms: Mobility and Ubiquity.

These words are not at all equivalent. Yet app design, investment decisions, strategic planning efforts, and end-user device engineering are being carried out as if they expressed the same concept. You may well ask: Is this only a question of exegesis and linguistic pedantry, or is there something far more critical to be revealed and discussed?

First things first. According to the American Heritage Dictionary, mobility means “the quality or state of being mobile,” while mobile is any entity "capable of moving or of being moved readily from place to place", that is, throughout the physical environment. On the other hand, ubiquity is defined as “existence or apparent existence everywhere at the same time," that is, "omnipresence.” 

Based on these definitions, the relationship between mobility and ubiquity should be considered as similar to the way an automobile relates to the blue sky: A car moves among locations one at a time while the blue heaven is present everywhere. The fact of being able to see the sky (blue sky accessibility) through the car windows (means or channel of accessibility) doesn't eliminate the distinction: An automobile is and remains a roving entity while the blue heaven is a constant property of the environment that is universally accessible by using the basic human physiological senses. 

Notwithstanding the twisted lexical confusion, reality once again exceeds the conventional wisdom. It is not actually difficult to distinguish two lines of technology development along two distinct emerging paradigms: mobility as increasingly differentiated from ubiquity.

The mobility paradigm allows for dampening, as much as possible, the disruptive constraints of time and physical space. This is about technologies designed to secure an objective context, an incubator-like condition for an attainable level of continuity across the discontinuous experiences of the real world.

The mobility paradigm is best expressed by what we still, anachronistically, keep calling the 'smartphone'. For the foreseeable future, the smartphone is in fact proving itself a key player, while the 'tablet', which just a few years ago rose as the new shining star of mobile computing, has already begun its decline, following the desktop PC and Mac. The crossbreed called "phablet" is just proof of the amazing cannibalizing power of the diehard smartphone.

The smartphone appears as the cyber-physical knight that has been cutting deep through our somnolent PC/Mac experience over the last decade or so. While laptops are most probably becoming the last sanctuary of the old PC and Mac universe, the smartphone is transforming both itself and us by turning increasingly into the gateway of our personal Body Area Network (BAN): wearable and implantable devices, the upcoming replaceable smart human organs and body parts, and the futuristic nano-devices circulating in our bloodstream will all route their bits and bytes through the smartphone.

The smartphone is here to stay, while urging our technology lexicon to invent a new name that better reflects its metamorphosis. Additionally, we should expect a new wave of talking apps in the years to come. People increasingly conversing with their devices, rather than placing calls through them, will become a constant component of our busy urban landscape. The noisy streets of Beijing and Shanghai offer a dazzling anticipation of talkative devices that are growing in the shadows while preparing to move into the limelight of the streets and alleys of San Francisco and beyond.

'Hands-free' and 'eyes-free' features are destined to become baseline expectations for an increasing number of mobility-addicted users, while the smartphone screen dissolves by expanding to our smart glasses, contact lenses, and the fantastic world of holograms. Voicebots, chatbots, or, more generally, xbots are surfacing as the "spiritual creatures" who will populate our smartphones' memory. This way, the smartphone will become the butler of the unfolding servicesphere: without the smartphone, the thriving 'O2O' (Online-To-Offline) business model will never reach full adulthood.

The ubiquity paradigm is still in quest of its most genuine expression in the physical environment. That may partially explain the current lexical confusion between mobility and ubiquity.

The unplanned and surprisingly rapid growth of the Amazon Echo line of products, and the succession of similar devices crafted by Google, Samsung, and a number of other brands, confirm the deep roots of a flourishing consumer demand for sentient environments. No matter what the Apple CEO thinks, a smartphone is not the right answer to users' ubiquitous sentience requirements. The smartphone satisfies, in fact, the need for a personal butler, that is, a smart companion who remains always the same and shadows an individual everywhere to meet her service wishes: the smartphone is about us and our personal lifestyle while moving through physical spaces.

Ubiquity relates to an ambient property that doesn't follow a user but is everywhere, all the time, as a baseline and embedded characteristic, regardless of the (mobile) devices a user might carry around. Here are a few examples from the household environment: a doormat that recognizes the homeowner; window blinds that spontaneously adjust the daylight by detecting a human presence; a kitchen appliance able to identify and converse with its user; a speaker volume that complies with a simple gesture command; a bathroom mirror that can recognize a user and stream interactive, personalized news; and, finally, toilet bowls and urinals able to warn of potential health conditions.

The current success of products such as Amazon Echo or Google Home seems to arise essentially from their almost "Shamanic" property of acting as a new access layer (very similar to what in computing is called an "abstraction layer") between the user and the complexity of the surrounding environment. That's why it would be wrong to consider them a replacement for the smartphone. As mentioned above, the latter facilitates mobility across different locations and timeframes, while these new products are destined to make physical locations such as homes, offices, and gradually entire cities friendlier and more intuitive.

However, we have to keep in mind that the 'access layer' property will inevitably evolve over time and could gradually be reabsorbed by the components of the built environment. In other words, 'embodied technologies' (separate devices or gadgets) will increasingly transform into 'embedded intelligence' by disappearing into our surroundings until only the (virtual and/or fluid) user interface remains perceivable. From this point of view, all the current AI-based voice devices (Echo, Google Home, and similar pieces of hardware) should be considered simply transient stages of a higher-level development that will eventually make them superfluous -- at least in their current form.

We still have to look forward to seeing the ubiquity paradigm finally usher in full-blown Ambient Intelligence. Very recently a friend reminded me of an almost forgotten meme: "Facebook is the only place where it is acceptable to talk to a wall." Well, with Ambient Intelligence, that won't be true anymore.

* Originally published on '' on November 27, 2016. Since then it has also been published on '' on February 9, 2017.