Introduction

While a Star Trek-like future of interacting verbally with pervasive computers is not yet here, recent years have brought many strides toward that vision. Intelligent assistants, such as Alexa, Siri, and Google Assistant, are becoming increasingly common.

Much of the implicit value proposition of intelligent, conversational assistants is that they let us use natural language. But language is an essentially social interface — it evolved to help us communicate with other people. How do the social aspects of language shape our interactions with language-based user interfaces? This is one of the questions that emerged in our research on conversational assistants.

To better understand what challenges these assistants pose today and where they help users, we ran two usability studies (one in New York City and one in the San Francisco Bay Area). A total of 17 participants — 5 in New York, 12 in California — who were frequent users of at least one of the major intelligent agents (Alexa, Google Assistant, and Siri) were invited into the lab for individual sessions. Each session consisted of a combination of usability testing (in which participants completed facilitator-assigned tasks using Alexa, Google Assistant, or Siri) and an interview. 

This article summarizes our main findings about user perceptions, mental models, and the social dimension of using these agents, while another article discusses findings related to usability and interactions.

We saw that users are highly cognizant of the fact that their supposedly intelligent assistants are not fully intelligent. While people don’t necessarily have a fully correct understanding of the assistants’ limitations, their views range from thinking of the assistant as somewhat creepy or childish to simply seeing it as an alternative computer tool. We’re far from the potential future state in which users will trust a computerized intelligent assistant as much as they would trust a good human administrative assistant.

Anthropomorphic Qualities

The social nature of language made people project anthropomorphic qualities onto the computer. Most of our participants referred to the assistant using a gendered pronoun (“she,” or “he” when they had selected a male voice). Some also inserted politeness markers such as “please” or “thank you”, or started questions with “Do you think…” or “Can you…”.

They often applied human metaphors when talking about the assistant. For example, one user said “Her mind goes blank. She’s like ‘What does this lady want from me? Leave me alone!’” when Alexa beeped, presumably to signal lack of understanding.

Another participant recounted, “I swear at her [Siri] when she doesn’t understand, and then she says something funny back — and we have fun.” And yet another rationalized inappropriate answers by saying, “These are [complex] questions that I normally would not ask, so I guess it takes more thinking. So it cannot just answer something like that [right away]. The questions I ask usually aren’t that deep; they don’t take that much thought.”

Our study participants were aware of this anthropomorphism and many laughed at it or commented on it. One Google Assistant user purposefully tried to refer to the assistant as “it”, saying “I just want the answer. I don’t want to talk to it like to a person [so I prefer not to say ‘OK Google’]. AI stuff already creeps me out enough. ”

People thought that the assistant was bad at detecting emotions and would not pick up on their frustration from their tone of voice or choice of words. They had mixed attitudes when it came to using slang or idiomatic language: some felt that slang might not be understood by the assistant and purposefully avoided it; others said that they had tried it and that the assistant worked just fine.

Users did not expect agents to pick up on fine distinctions in meaning. For example, a user who asked “How much is a one-bedroom apartment in Mountain View?” commented that her question was really too vague for Alexa, because, to her, the word “apartment” implies “rent” — if she had been interested in a sale price, she would have used the word “condo.” However, she did not expect Alexa to pick up on that difference.

When People Use Assistants

We usually have no compunction about carrying on a conversation with a real person in a public space. Yet this behavior did not extend to intelligent assistants. Users in our study reported a strong preference for using voice assistants only when they were at home or by themselves. Most people said they would not interact with a phone-based agent like Siri or Google Now in a public setting. Some, however, were willing to ask for directions while walking.

One participant phrased this plainly, noting “when I'm in public, I don't usually use [Siri] — I just think it looks a little bit awkward. It also feels awkward to use her in public. I also don't see other people using it in public.”

People usually reported using the assistants for simple, black-and-white queries, often in situations when their hands were busy, but another common use was entertainment: many told us that at least occasionally they or someone in their family (usually their child) enjoyed hearing a joke or playing a game with the assistant. Several parents reported using Alexa as a way of entertaining their children and keeping them away from a screen-based device such as a tablet or a smartphone.

When asking fun, silly questions (for example, about the agent’s preferences, such as a favorite food), users in our study understood that they weren’t getting authentic artificial-intelligence responses, but only a series of preprogrammed jokes written by the engineering team. 

Language Used with Assistants

When it came to how they spoke with the assistants, participants could be divided into two categories:

  1. Those who used the same language structure as with humans. These participants were often polite in their query phrasing: they ended their interactions with “Thank you!” and often formulated questions that started with “Please…”, “Do you think…”, or “Can you tell me…”. These participants usually had a positive attitude towards the agents.
  2. Those who tried to make the language more efficient in order to increase the chance of being understood. In this category, we saw a continuum of behaviors — from participants who changed sentence word order to make sure the query started with a keyword, to instances where they eliminated articles such as “a” or “the”, and, finally, to examples where participants simply used the assistant as a voice interface for a search engine and compressed their queries to a few keywords devoid of grammatical structure (such as  “OK Google, events in London last week of July”).

For example, one participant noted that a query like “Restaurants near Storm King” might retrieve different results than “Storm King Restaurants”, but “a person would get what I meant either way.”  

Several participants kept their queries short, consisting of just a few keywords. One said, “I don’t speak to it as if it were a real person — unless I have one of 5W1H (Who, What, Where, When, Why, How) questions. […] I would speak to a person like this only if they were ESL [English-as-a-second-language speaker] and I knew they may be overwhelmed by too many words in a sentence.”

While some users did refer to the assistant with gendered pronouns, most did not use pronouns such as “it” or “this” in their actual queries — they preferred to refer to the object of their query explicitly, even if the name was long and complicated. People did not expect the assistant to be able to understand what a pronoun like “this” or “it” might refer to, especially when the antecedent of the pronoun was part of a previous query. (This is otherwise one of the key advantages of true natural-language understanding.) Although assistants are getting better at follow-up queries (and Google recently announced a “conversation” feature for its Google Assistant), most participants had learned not to expect them to preserve context from one query to another. One participant said, “[Siri] doesn’t usually save stuff like that [e.g. text message drafts]. When she is done with something, she’s done with it.”

What Queries Are Expected to Work

Even though people used language with the assistants and projected human-like qualities onto them, they had well-defined expectations about what tasks the agents could complete. Some said that the assistant was like a young child who could not understand complicated things; others compared it with an old person who did not hear very well. One noted that you cannot “speak for too long, because [the assistant] gets distracted,” while a different participant said that queries should be shorter than 10 words. But many said that the assistant is even better than a human because it “knows everything” and is objective — it has no emotions or feelings.

(In a detour into science fiction’s coverage of future user interfaces, we should note that the lack of emotions has often been described as a downside, with Lt. Data of Star Trek as a prime exhibit. However, objectivity and not having to worry about the computer’s feelings are certainly advantages.)

Complex and Multipart Questions Considered Difficult or Impossible

Complex questions were considered unlikely to get good results, and perceived complexity was often tied to whether questions or tasks first needed to be broken down into multiple parts. One user noted how she had had poor luck in the past asking Siri and Alexa to find out when Adele’s next concert in NYC was. Another user noted, “You can talk to it as if it were a child. No complex questions; weather, math, sports, calendar, internet [will work]; it won’t do your taxes and might not even be able to order you pizza.” (Note that the “child” analogy is yet another example of anthropomorphizing the assistant.) And yet another participant said that the most complicated thing one could ask an assistant was a place-based reminder.

However, there was a sense that even complex questions could be answered by the assistants if one “learned how to ask the question.” One participant compared the assistant with a file system with many files: the trick is to find the right questions with which to access the information in this database.

And yet another complained that she did not want to have to think about how to ask the right question. Many participants mentioned efficiency as being an important consideration in whether they were likely to bother asking a question — if it was faster “to do it themselves”, then they felt the interaction with the assistant was not worth the cost. That phrasing is a key indicator of our participants’ mental models — and reflects the belief that using an assistant should be easy, rather than requiring extensive interaction.

For example, several participants described how it’s faster to set a timer with an assistant than by using another device, whereas figuring out weekend traffic to Montauk would be quicker with an application such as Waze or with Google Maps on a computer. This decision was based on an implicit cost–benefit analysis, weighing the anticipated interaction cost of completing the task themselves against their expectation of whether the assistant would be able to do it.
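
One way to make that implicit tradeoff concrete is to compare expected time costs. The following is a minimal, hypothetical Python sketch of such a calculation; the function name, the numbers, and the success-probability estimates are our own illustrative assumptions, not data from the study.

    def worth_asking_assistant(t_assistant: float, t_manual: float,
                               p_success: float) -> bool:
        """Rough formalization of the implicit cost-benefit analysis: ask the
        assistant only if its expected time cost, including falling back to
        doing the task manually when it fails, beats doing it yourself."""
        expected_assistant_cost = t_assistant + (1 - p_success) * t_manual
        return expected_assistant_cost < t_manual

    # Setting a timer: quick to ask and very likely to work (times in seconds, illustrative).
    print(worth_asking_assistant(t_assistant=3, t_manual=10, p_success=0.95))   # True
    # Checking weekend traffic to Montauk: slower to phrase and less likely to succeed.
    print(worth_asking_assistant(t_assistant=20, t_manual=30, p_success=0.3))   # False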

Interpretation, Judgement, and Opinion Weren’t Trusted  

All users in the study noted that they weren’t interested in using agents for opinion-based information, such as recommendations. Any task involving judgement or opinion was regarded with skepticism by our participants. “I would never ask her that” was a common response to tasks involving personal or highly subjective information like figuring out how long one should spend in Prague for vacation. One user said, “I would look up a forum about Prague for locals; I don’t want to do the tourist things that everybody does.”

However, it was not only subjective information that participants scoffed at: one user thought it unlikely that Alexa would be able to tell him who the star player was in the previous night’s Boston Celtics game, because that involved interpretation. This participant initially formulated his query to Alexa as “Who scored the most points in the last Celtics game” and, when Alexa failed to answer, changed it to “Provide me player statistics for the most recent Celtics game.” (Both are questions with an objectively true answer, rather than questions of judgement.)

Another key distinction between intelligent agents and humans was noted by one participant: voice assistants don’t ask you clarifying questions if you give them an ambiguous request. Humans will typically respond to an ambiguous statement with follow-up questions, but intelligent assistants attempt to carry out the request without getting additional information.

Fact-Based Tasks Perceived as Likely to Be Successful

Participants in our study frequently noted that certain kinds of simple tasks were a good fit for a voice assistant. Questions that were considered to work well were typically fact-based, such as checking the weather, finding out a celebrity’s age, getting directions to a destination with a known name, and reviewing sports scores. 

Mental Models

One of the questions that we asked our participants was what their assistant had learned about them. The responses were invariably “not much.” They felt that the assistant may keep track of some of their searches and use them, but they did not think that the assistant tailored its behavior significantly in order to serve them better.

Some described the assistants as “just” doing an internet search and acting as a voice interface to Google. Others mentioned a list of preprogrammed things that agents could respond to, and said that anything outside of that preprogrammed menu of choices would not work and would default to a search. And some believed that these conversational assistants are not truly “smart,” or even really a form of artificial intelligence that understands queries meaningfully or can solve new problems creatively.

One participant went as far as saying “Stop calling it AI (artificial intelligence) until it’s real. The things it can do are not complicated. It’s not learning. People form expectations based on the word ‘AI,’ but it’s misleading. Driverless cars — that’s AI. That’s closer.”

Another person described a similar belief when talking about Alexa: “There’s a difference between knowing what you’re saying and understanding what you’re saying.  You might say, ‘a tomato is a fruit’ and it wouldn’t know not to put it in a fruit salad. I don’t think she necessarily understands; maybe one day computers will get there, where they can teach or learn.  Right now, I think it’s more like a trigger word: this word, with this word, combined with this word gives this answer. If you took this word out and combined it with this other word, it’ll give this other answer.”
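
To make this reported mental model concrete, below is a minimal, hypothetical Python sketch of the kind of trigger-word lookup the participant is describing: a fixed mapping from keyword combinations to canned answers, with no understanding of meaning. It illustrates the participant’s belief, not how any real assistant is actually implemented.

    # Hypothetical sketch of the "trigger word" mental model described above:
    # a fixed lookup from keyword combinations to canned answers.
    RULES = {
        frozenset({"tomato", "fruit"}): "Yes, a tomato is botanically a fruit.",
        frozenset({"weather", "today"}): "It's 72 degrees and sunny.",
        frozenset({"timer", "minutes"}): "Timer set.",
    }

    def keyword_assistant(query: str) -> str:
        words = set(query.lower().rstrip("?.!").split())
        for trigger_words, canned_answer in RULES.items():
            # A rule fires whenever all of its trigger words appear in the query.
            if trigger_words <= words:
                return canned_answer
        # Anything outside the preprogrammed menu falls back to a web search,
        # matching what several participants assumed happens.
        return "Here is what I found on the web for: " + query

    print(keyword_assistant("Is a tomato a fruit?"))
    # The same rule fires for a question that needs real understanding,
    # echoing the participant's fruit-salad example:
    print(keyword_assistant("Should I put a tomato in a fruit salad?"))

In this caricature, swapping one word in or out simply selects a different canned answer, which is exactly the “this word, with this word, combined with this word” behavior the participant describes.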

Attitudes Related to Trust and Privacy

Users in our study had issues with trusting intelligent agents, with a range of concerns:

  • Privacy and social awkwardness
  • Always recording and transmitting audio to the cloud
  • Consequences from misunderstanding what the user said
  • Contacting other people in an unauthorized way
  • Bugs that would cause smart-home features to work improperly
  • Using excessive mobile data

One of the most common concerns expressed in our study was that the conversational assistants were always listening (and transmitting audio to the cloud). Several users expressed strong skepticism (or outright distrust) toward the claim that agents only listen when triggered by their keyword. During the interview portion of the study, some participants reported seeing advertisements for things that they normally never shopped for after mentioning them in a conversation near their assistant. A few said they had even run informal tests of this hypothesis: they had mentioned an invented new hobby near their smart speaker or phone, and then saw advertisements for related products soon thereafter.

Some users also believed that the agents were recording and transmitting full audio files (rather than some sort of abstracted data version of what they said) to the cloud for interpretation. One user was very surprised that Google Assistant was able to take dictation (and did an excellent job) while the phone was not connected to the internet.

Participants reported some discomfort with using agents when an error or a misunderstanding could have consequences; the most common examples were making incorrect purchases or using the assistant for work.

One user related how he used a voice assistant to dictate a work email while walking home from the subway, and froze in panic when he noticed later on, while proofreading his email, that the agent had replaced something he said with an inappropriate word.  As he explained it, “it was like my phone had turned into a live grenade in my hand — I now had to defuse it very carefully.”

Another user mentioned that he owned a smart air conditioner but would not use it with Alexa, because he did not trust it to be bug-free and maintain a proper temperature at home. He mentioned that he had pets and worried that, if it didn’t work properly, they could suffocate in the heat: “I care for my animals — I wouldn’t trust Alexa with this.”

Future Potential for Intelligent Assistants

While science-fiction films and TV shows, from 2001: A Space Odyssey to Her, provide us with a wealth of examples of people comfortably using voice interactions with their computers for a range of complex tasks (especially tasks that involve interpretation, judgement, or opinion), the participants in our study were hesitant to even try more complicated tasks and did not trust that the assistants would perform well with such queries.

Even as these systems get better, discoverability of new features remains low: in our previous article on this topic, we noted how users accessed only a small subset of the functionality available in the intelligent assistants, and that they often even had to memorize query formulations that would work. Presenting new options that weren’t previously available is tricky, as tutorials, release notes, and tips tend to be ignored by most users.

As these systems improve their abilities, a big challenge will be to update users’ existing mental models to include those new abilities. Users’ mental models are typically much more stable than the feature set of a product that’s constantly being updated, so a catch-22 emerges: users don’t know that the systems can handle more complex queries than before, so they don’t use them for those types of tasks, which in turn reduces the amount of training data available to improve the assistants.

In other words, there’s a real risk that the early release of low-usability intelligent assistants could impede the future use of much-improved assistants.

Summary

Even as intelligent conversational assistants rapidly improve in their ability to correctly understand users’ speech, there are still major social and mental-model challenges that prevent users from interacting with these systems naturally. Trust issues and user expectations currently limit the adoption of these agents to little more than simple dictation and fact-lookup requests.