In this project, we built a prototype AI email editor that uses text generation from Natural Language Processing (NLP) and asked 156 people to write emails with it. We compared different user interface settings and examined how native and non-native English writers used the system. Our motivation was to explore AI text generation as a way to support writers, rather than to replace them with AI. This article summarises what we have learned.
Building an AI email editor
As researchers in Human-Computer Interaction and AI, we set out to build and test an interactive system for writing with AI “in the loop”. The figure below shows its user interface, realised as a web app: It offers a simple text editor view, with typical functionality (e.g. type, edit, delete, move the caret). In addition, it displays phrase suggestions generated by the system.
Users can navigate and select these suggestions via mouse and keyboard. A short video showing the prototype in use is available on our YouTube channel. For text generation, we used a pretrained model (GPT-2) and fine-tuned it on a large email dataset. We tested and iteratively improved our prototype in a pre-study with 30 people.
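To give a rough idea of what "several phrase suggestions in parallel" means, the sketch below produces a configurable number of short continuations for the last typed word. It is not the prototype's code: a toy bigram model stands in for the fine-tuned GPT-2 model, and the function names, parameters and example sentences are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count, for every word, which words follow it in the training sentences."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following


def suggest_phrases(following, last_word, n_suggestions=3, max_words=3):
    """Return up to n_suggestions phrases that could follow last_word."""
    empty = Counter()
    suggestions = []
    # Each suggestion starts with a different likely next word ...
    first_words = following.get(last_word.lower(), empty).most_common(n_suggestions)
    for first, _count in first_words:
        phrase = [first]
        # ... and is then extended greedily with the most frequent follower.
        while len(phrase) < max_words:
            candidates = following.get(phrase[-1], empty)
            if not candidates:
                break
            phrase.append(candidates.most_common(1)[0][0])
        suggestions.append(" ".join(phrase))
    return suggestions


# Hypothetical mini corpus; the study used a large email dataset instead.
corpus = [
    "thank you for your email",
    "thank you for the update",
    "thank you very much",
    "please let me know",
]
following = train_bigrams(corpus)
suggestions = suggest_phrases(following, "you", n_suggestions=2)
print(suggestions)
```

Changing `n_suggestions` corresponds to the user interface parameter examined in the study: how many suggestions are offered at once.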
Testing the prototype in a user study
We recruited 156 people to test our prototype. About 40% of them were native English speakers. Each person completed four email writing tasks (i.e. short scenarios with a business context). Across these tasks, each participant used four different versions of our system, as shown in the figure below: we compared variations that showed zero, one, three and six suggestions in parallel.
We recorded interaction data and used questionnaires to ask for opinions and subjective feedback. The study ran online and took about 22 minutes on average.
Results: Benefits and costs of writing with AI in the loop
Our analysis had two main parts: In one, we modelled and analysed users’ interactions as sequences (e.g. type, type, delete, type, pick suggestion, …). This revealed nine fundamental interaction patterns for writing with AI in the loop, in three categories: Producing text, revision and navigation. The figure gives an overview of these elementary behaviours.
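One simple way to look for recurring patterns in such interaction sequences is to count short subsequences (n-grams) of logged events. The sketch below illustrates that idea only; it is not necessarily the method used in the paper, and the event names are hypothetical labels.

```python
from collections import Counter


def count_patterns(sessions, length=3):
    """Count every contiguous subsequence of the given length across sessions."""
    counts = Counter()
    for events in sessions:
        for start in range(len(events) - length + 1):
            counts[tuple(events[start:start + length])] += 1
    return counts


# Two hypothetical writing sessions, each a list of logged interaction events.
sessions = [
    ["type", "type", "delete", "type", "pick_suggestion"],
    ["type", "type", "pick_suggestion", "pick_suggestion"],
]
patterns = count_patterns(sessions, length=2)
for pattern, count in patterns.most_common(3):
    print(pattern, count)
```

Frequent subsequences found this way are candidates for elementary behaviours, which then still need to be interpreted and grouped by hand.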
Moreover, we analysed key metrics based on the logged interactions, including: How often suggestions were accepted, how often they were modified by users afterwards and how long people took to make their decisions. This provided insights into the benefits and costs of writing emails with AI in the loop, as summarised below.
Regarding benefits, multiple suggestions support writers in finding useful phrases, as is evident from several findings:
- Showing more suggestions in parallel increases the chance of accepting one of them.
- Showing more suggestions in parallel reduces the need for users to modify suggested text manually after acceptance.
- Giving users a choice of suggestions increases use of suggested text in the email overall.
However, suggestions also had costs in terms of time and user actions, thus adversely affecting efficiency:
- As can be expected, users take longer to choose from more suggestions.
- In particular with six suggestions, people take longer to write the emails overall.
- Suggestions change users’ actions in the user interface; for example, showing more suggestions partly shifts user actions away from typing in favour of navigation of suggestions in the lists.
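The metrics behind these findings (how often suggestions are accepted, how often accepted text is modified afterwards, and how long decisions take) can be computed from logged events roughly as follows. This is a minimal sketch under assumed data: the `SuggestionEvent` record and its fields are hypothetical and do not reflect the study's actual logging format.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class SuggestionEvent:
    shown_at: float               # time (s) at which the suggestion list appeared
    decided_at: float             # time (s) at which the user accepted or moved on
    accepted: bool                # whether one of the suggestions was picked
    modified_after: bool = False  # whether accepted text was edited afterwards


def summarise(events):
    """Compute acceptance rate, modification rate and mean decision time."""
    accepted = [e for e in events if e.accepted]
    decision_times = [e.decided_at - e.shown_at for e in events]
    return {
        "acceptance_rate": len(accepted) / len(events) if events else 0.0,
        "modification_rate": (
            sum(e.modified_after for e in accepted) / len(accepted)
            if accepted else 0.0
        ),
        "mean_decision_time": mean(decision_times) if decision_times else 0.0,
    }


# Hypothetical log of four moments at which suggestions were shown.
events = [
    SuggestionEvent(0.0, 1.5, accepted=True),
    SuggestionEvent(10.0, 12.5, accepted=True, modified_after=True),
    SuggestionEvent(20.0, 21.0, accepted=False),
    SuggestionEvent(30.0, 33.0, accepted=False),
]
summary = summarise(events)
print(summary)
```

Computing such a summary separately for each interface version (zero, one, three or six suggestions) is what makes comparisons like the ones listed above possible.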
Engagement with suggestions varies
A key insight of this study is that levels of engagement with suggestions vary. From low to high, examples include: typing in bursts while ignoring suggestions; intermittently integrating suggestions and using them more densely; and forming chains of multiple suggestions. Variations can occur both between people and for a single person over the course of writing. This suggests that in continuous interaction with AI, users dynamically shift between human-driven and AI-driven email composition. Hence, this provides an opportunity for future research to examine the contributing factors in detail.
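Typing bursts and chains of suggestions can be made visible in an event log by collapsing consecutive identical events into runs. The following sketch shows this with run-length encoding; the event names are hypothetical, and no thresholds for "low" or "high" engagement are implied by the study.

```python
from itertools import groupby


def run_lengths(events):
    """Collapse consecutive identical events into (event, run length) pairs."""
    return [(kind, len(list(group))) for kind, group in groupby(events)]


def longest_run(events, kind):
    """Length of the longest uninterrupted run of one event type (0 if absent)."""
    return max((n for k, n in run_lengths(events) if k == kind), default=0)


# Hypothetical session: a typing burst, one suggestion, more typing, then a chain.
events = ["type"] * 5 + ["pick_suggestion"] + ["type"] * 2 + ["pick_suggestion"] * 3
runs = run_lengths(events)
print(runs)
print(longest_run(events, "pick_suggestion"))
```

A long run of `type` events corresponds to a typing burst that ignores suggestions, while a long run of `pick_suggestion` events corresponds to a chain of suggestions.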
Language proficiency matters
Our results further reveal insights into the impact of using text suggestions when writing in one’s native or a non-native language:
- Non-native speakers accept more suggestions and gain relatively more from seeing more suggestions in parallel.
- Time spent with multiple suggestions is less of an overhead for non-native speakers.
- Non-native speakers perceived suggestions as slightly more positive and influential, regarding wording, content and inspiration for using other phrases and words.
Overall, our findings provide three concrete takeaways for the design of digital writing tools that integrate AI for text generation:
Firstly, designers should be open-minded in exploring a larger variety of user interface parameters than previously considered for such systems. For example, exploring more than the de facto defaults of suggesting one phrase (e.g. Gmail Smart Compose) or three words (e.g. current smartphone keyboards) could prove to be of value.
Secondly, the design and development of such systems should consider user goals beyond efficiency: Current text generation predominantly aims to reduce typing and save time. However, our results also highlight opportunities to design for other goals, for example inspiration or language learning. Such goals also align well with a vision of human-AI collaboration, rather than replacement of human writers.
Finally, the needs and preferences of diverse user groups should be considered when designing and building such AI tools. For example, as shown here for language proficiency, different people (or one user in different contexts) may benefit from different settings, regarding aspects of both the user interface and the underlying AI system.
There is one more, broader takeaway here: Our results imply that designing suggestion user interfaces exclusively for efficiency may hinder some user groups in using the system as desired by them. Specifically, optimising AI writing tools for efficiency might not, for instance, be in the best interests of non-native speakers. Considering recent discussions about biases in large language models, this indicates that user interface design might be another potential source of bias to consider, with respect to the resulting interactive experience for certain user groups.
Many fundamental user interface design choices have not yet been explored for interactive uses of AI methods from Natural Language Processing. To address this, we studied one key user interface design dimension (parallel number of suggestions), plus one important user-related aspect (language proficiency), not previously investigated in this context.
We conclude that showing multiple suggestions is useful for ideation, at a cost of efficiency. How many suggestions to show in a user interface thus depends on the design goals and target users. In this regard, the observed usage differences between native and non-native speakers clearly underline the importance of designing interactive AI with a consideration of diverse backgrounds.
Our full analysis and report are available in the paper:
The Impact of Multiple Parallel Phrase Suggestions on Email Input and Composition Behaviour of Native and Non-Native English Writers. By Daniel Buschek, Martin Zürn, and Malin Eiband. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. CHI’21. Yokohama, JP: ACM, 2021. DOI: https://doi.org/10.1145/3411764.3445372
An arXiv version is also available: https://arxiv.org/abs/2101.09157