How Algorithms Know What You’ll Type Next
If you have ever typed something on a smartphone, you have likely seen it attempt to predict what you’ll write next. This article is about how text predictors work, and how crucial the input language dataset is for the resulting predictions. To see this in action, we will predict tweets by four Twitter accounts: Barack Obama, Justin Timberlake, Kim Kardashian, and Lady Gaga.
To make useful predictions, a text predictor needs as much knowledge about language as possible, which is usually acquired through machine learning. We will look at a simple yet effective algorithm called k Nearest Neighbours. This works by looking at the last few words you wrote and comparing these to all groups of words seen during the training phase. It outputs the best guess of what followed groups of similar words in the past.
Here is an example of using k Nearest Neighbours to predict tweet text. After choosing a person and an example tweet, move the slider to various positions in the text and it will automatically detect the last trigram (group of three words). It creates a database of trigrams from all tweets from that account, then searches for similar ones. The best matching trigrams will be displayed, along with the word that most often followed them.
k Nearest Neighbours
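The trigram lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the demo: the function names, the position-weighted similarity measure, the choice of k = 3, and the toy tweets are all assumptions made for the example.

```python
from collections import Counter

def build_trigram_db(tweets):
    """Collect every (three-word context, next word) pair seen in the tweets."""
    db = []
    for tweet in tweets:
        words = tweet.lower().split()
        for i in range(len(words) - 3):
            db.append((tuple(words[i:i + 3]), words[i + 3]))
    return db

def similarity(a, b):
    """Assumed measure: count matching positions, weighting the latest word most."""
    weights = (1, 2, 3)
    return sum(w for w, x, y in zip(weights, a, b) if x == y)

def predict_next(db, context, k=3):
    """Let the k stored trigrams most similar to the context vote on the next word."""
    context = tuple(w.lower() for w in context[-3:])
    ranked = sorted(db, key=lambda entry: similarity(entry[0], context), reverse=True)
    neighbours = [entry for entry in ranked[:k] if similarity(entry[0], context) > 0]
    if not neighbours:
        return None  # nothing similar was seen during training
    votes = Counter(next_word for _, next_word in neighbours)
    return votes.most_common(1)[0][0]

# Toy training data, invented for illustration only.
tweets = [
    "thank you so much for the love",
    "thank you so much for coming tonight",
    "i love you so much for real",
]
db = build_trigram_db(tweets)
print(predict_next(db, ["you", "so", "much"]))
```

Note that the sketch returns nothing at all when no stored trigram shares a word with the context, which is exactly the weakness discussed next.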
This approach is called context-sensitive text prediction. A downside of this technique is that it depends on similar groups of words being available in order to make a prediction. To go from this to a text predictor that would be good enough to be used in practice, we need two more things:
- As a backup, a list of words frequently used by the author
- A limit on the pool of probable predictions, based on what the user has typed so far. For example, it does not make sense to predict the word “the” if the user typed “ca.”
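These two additions could look roughly like the following sketch. It assumes the context-sensitive step hands over a list of candidate words, best first; the names `build_word_frequencies` and `predict_with_fallback` and the sample sentences are hypothetical.

```python
from collections import Counter

def build_word_frequencies(tweets):
    """Backup list: how often the author used each word."""
    return Counter(word for tweet in tweets for word in tweet.lower().split())

def predict_with_fallback(context_candidates, frequencies, typed_prefix):
    """Return the best candidate that is consistent with the typed prefix.

    context_candidates: words proposed by the context-sensitive step, best first.
    """
    prefix = typed_prefix.lower()
    # First choice: a context-based candidate that matches the typed letters.
    for word in context_candidates:
        if word.startswith(prefix):
            return word
    # Backup: the author's most frequent word that matches the typed letters.
    for word, _count in frequencies.most_common():
        if word.startswith(prefix):
            return word
    return None

# Toy data, invented for illustration only.
frequencies = build_word_frequencies([
    "the cat sat on the mat",
    "we can see the car",
    "can you call me",
])
print(predict_with_fallback(["the", "call"], frequencies, "ca"))
print(predict_with_fallback(["the"], frequencies, "ca"))
```

In the first call the context candidate “the” is rejected because it does not start with “ca”; in the second call no context candidate fits, so the frequency list takes over.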
The resulting predictor is a very simple statistical language model. It is built entirely of example language data—put in different data, get another model with new predictions. One way to see this is to let the models of our four celebrities predict each other’s tweets:
Language Model with Example Data
As expected, the model using the same author’s data usually makes the best prediction. This follows a general idea in machine learning: the more similar the desired output is to the data used to create the model (i.e., the training data), the better the results. Put simply, you are your own best language predictor.
Instead of eyeballing which model works better, we can measure the models by counting the number of correctly guessed characters. The percentages below indicate how accurate each Twitter account’s model is at predicting an account’s next words.
Language Model with User Data
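One plausible way to count correctly guessed characters is to simulate typing: for each word, reveal one letter at a time and, as soon as the predictor proposes the right word, count the remaining letters as guessed. The article does not spell out its exact scoring, so the sketch below is an assumption, and the `predict(previous_words, prefix)` interface and the toy predictor are hypothetical.

```python
def characters_guessed(predict, words):
    """Return (guessed, total) characters for one tweet, given as a word list."""
    guessed = total = 0
    for i, word in enumerate(words):
        total += len(word)
        for typed in range(len(word)):
            # Ask the predictor with the preceding words and the letters typed so far.
            if predict(words[:i], word[:typed]) == word:
                guessed += len(word) - typed  # the rest of the word comes for free
                break
    return guessed, total

def accuracy(predict, tweets):
    """Percentage of characters the predictor guessed across all tweets."""
    guessed = total = 0
    for tweet in tweets:
        g, t = characters_guessed(predict, tweet.lower().split())
        guessed += g
        total += t
    return 100.0 * guessed / total if total else 0.0

# Toy predictor: ignores context and returns the first vocabulary word
# (ordered by assumed frequency) that matches the typed prefix.
vocabulary = ["the", "thanks", "fans"]

def toy_predict(previous_words, prefix):
    return next((w for w in vocabulary if w.startswith(prefix)), None)

print(round(accuracy(toy_predict, ["thanks the fans"]), 1))
```

Running the same test tweets through models built from different accounts would then give one percentage per model, which is the kind of comparison shown in the figures.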
The percentages above can serve as boundaries of what is possible with this technique. Again, the more similar two Twitter accounts are, the more likely they are to correctly predict each other's tweets. Justin Timberlake and Lady Gaga are each other's best predictors. Barack Obama and Kim Kardashian, on the other hand, are each other's worst predictors.
This may not be the case for individual tweets. For example, Kim Kardashian's tweet about “working together with organizations” was predicted better by the language model of Barack Obama than her own. In other words, “you are your own best language predictor” is more of a tendency than a hard rule.
Our examples have been active Twitter accounts with thousands of tweets. But what about people with less data? Look to the language from the people around you. People who talk to each other tend to speak more alike, an idea that could be very useful in this case.
We can simulate this effect on Twitter by following @ mentions as a loose proxy for “people who talk to each other a lot.” These are the accounts that each of our four Twitter accounts mentioned most frequently (more than 10 times):
- @BarackObama: @VP (17 mentions)
- @jtimberlake: @ChrisStapleton (16 mentions), @AnnaKendrick47 (15 mentions), @jimmyfallon (15 mentions)
- @KimKardashian: @MakeupByMario (18 mentions), @khloekardashian (13 mentions)
- @ladygaga: @itstonybennett (30 mentions), @MarkRonson (16 mentions), @faspiras (15 mentions)
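A list like this can be produced by counting @ mentions across an account's tweets and keeping those above the threshold. The sketch below assumes the tweets are available as plain strings; the regular expression, the function name, and the sample handles are illustrative rather than taken from the original analysis.

```python
import re
from collections import Counter

# A handle is taken to be "@" followed by letters, digits or underscores.
MENTION = re.compile(r"@(\w+)")

def frequent_mentions(tweets, threshold=10):
    """Return (handle, count) pairs mentioned more than `threshold` times."""
    counts = Counter(
        handle.lower()  # handles are case-insensitive, so fold case before counting
        for tweet in tweets
        for handle in MENTION.findall(tweet)
    )
    return [(handle, n) for handle, n in counts.most_common() if n > threshold]

# Toy data with made-up handles, for illustration only.
tweets = ["great show with @Alice"] * 12 + ["hello @alice and @bob"] * 2
print(frequent_mentions(tweets))
```

The tweets of the returned accounts can then be pooled to train a “friend” model in the same way as a personal one.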
These Twitter “friends” are of course highly dependent on the way Twitter is used; whereas Lady Gaga and Justin Timberlake often address colleagues and other celebrities, Barack Obama almost exclusively uses Twitter to address larger groups of people.
Let’s take a look at how these “friend” models perform.
Language Model with 'Friend' Data
In terms of accuracy, “Friend” models rank directly after the personal models in almost all cases. Part of this effect is explained by overlapping topics: the models of Justin Timberlake and Lady Gaga are probably good at predicting each other’s tweets because they are both tweeting about topics like songs, concerts and fans. Something similar is likely happening with the predictions of @VP (Vice President) for the tweets of Obama. Although Obama only mentioned @VP when it was still being used by Joe Biden, the tweets of current Vice President Mike Pence are still good predictors because of their political nature. Overlapping topics, however, are not the whole story. Research has shown that even if you look at filler words, such as “the”, “but”, “and”, “is”, etc., tweets by friends are still better predictors than tweets by random people.
This technique makes it possible to predict what somebody wants to say, even if we have no previous material from that particular person. This can be the case if somebody has word-finding problems (aphasia) or cannot move their speech organs (because of, for example, paralysis). A word predictor can be of great help to such a person, and it can be trained with the language of the people around them, an idea known as “language transplantation”.
There you have it: a simple technique for language prediction, and a look at how playing with the inputs (the training material) can influence the resulting predictions. It is not only a matter of feeding the predictor with language, but also of making sure this language is similar to the language it will need to predict. Even without an understanding of language, text predictors work g