Here are 308 hip hop artists, grouped by their lyrical similarity. Each artist’s position is based on the words that are central to their lyrics and how much those words overlap with other musicians’.

For example, noise of the year “skrrt” is among the top ten words that characterize Migos, Kodak Black, and Lil Yachty.

And as you’d expect, several Wu-Tang members are close in lyrical similarity.

This is a story about how we got here.

The Language of Hip Hop

Last year, data scientist Iain Barr determined the most “metal” words using an elegant methodology and machine learning. We were blown away and eagerly waited for someone to replicate it for other genres.

One year later, we were still waiting. In 2014, we ranked rappers by the size of their vocabulary, and this felt like the perfect sequel: the words unique to hip hop.

To start, we need a dataset that represents hip hop. We decided to use 26 million words from the lyrics of the top 500 charting artists on Billboard’s Rap Chart (about 50,000 songs).

First, we’re going to rank words that are most common in hip hop lyrics. “Love,” for example, appears 21 times for every 10,000 words.

Among these few words, “love” is the one used most in hip hop. But what matters for our analysis is usage in hip hop compared to usage in all music lyrics. We want words unique to the genre: high frequency in hip hop, low everywhere else.

Let’s compare hip hop usage to other genres. We used lyrics from 275,905 songs (about 47 million words) spanning all music genres, except hip hop. In this non-hip-hop dataset, words such as “love” are even more common: 71 times per 10,000 words (vs. 21 in hip hop).

Here are all the words we measured. Dots farther to the right are more popular in hip hop. Dots farther up are more popular in other genres.

This means that words common in hip hop but rare in other genres appear in the blue triangle. “Game,” for example, appears in hip hop far more often than it does in other genres.

In the red triangle are words, such as “sorrow,” that are common in genres other than hip hop. Words near the grey line are no more common in hip hop than anywhere else.

At the extremes, some words have incredibly high odds of only appearing in hip hop. And some rarely appear in hip hop.

In hip hop, you’re far less likely to hear these words. For example, “heart” was the 206th most popular word in hip hop. In other genres it was ranked #28.

Which brings us to our top 50 “Most Hip Hop” list.

Many of these words are slang, words that may not have existed pre-hip-hop (or arguably they exist because of hip hop).

There are some interesting non-slang entries, such as “clique” at #15. There’s also a handful of proper nouns, such as “Biggie” at #39 and “Nike” at #41.

What Words Are “Most Hip Hop”?
Occurrences per 10,000 lyrics

If that’s what’s popular in hip hop vs. other genres, what about specific artists? For example, what words are disproportionately used by Migos?

We need to change approaches. First, let’s start with something intuitive: N.W.A’s usage of the word “police.”

About 37% of N.W.A’s tracks use the word “police,” well above the genre’s average of 5%.

That seems impressive. But averages can be really misleading. What’s going on with the other 499 artists in our hip hop dataset?

Just about every artist uses the word “police”; N.W.A just says it more often. Let’s see if we can define “central” more specifically with words that N.W.A basically owns.

Instead of “police,” let’s examine usage of “Compton.”

About 75% of the artists in our hip hop dataset never use this word – it’s rare. That’s a great signal: for the artists who do say “Compton,” it very much characterizes their lyrics.
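The two percentages in this section can be computed with very little code. Here is a minimal sketch, assuming each track is already split into a list of lowercase words. The function names, the artists, and the tiny catalogues are all made up for illustration; they are not from our dataset.

```python
def track_share(tracks, word):
    """Fraction of an artist's tracks that contain `word` at least once."""
    if not tracks:
        return 0.0
    return sum(word in track for track in tracks) / len(tracks)


def share_of_artists_never_using(catalogues, word):
    """Fraction of artists whose catalogue never contains `word`."""
    never = sum(
        1 for tracks in catalogues.values() if track_share(tracks, word) == 0
    )
    return never / len(catalogues)


# Hypothetical toy catalogues: artist -> list of tracks -> list of words.
catalogues = {
    "Artist A": [["police", "compton"], ["police"], ["money"], ["compton", "money"]],
    "Artist B": [["police", "love"], ["love"]],
    "Artist C": [["money"], ["police"]],
    "Artist D": [["love"], ["money", "police"]],
}

print(track_share(catalogues["Artist A"], "police"))
print(share_of_artists_never_using(catalogues, "compton"))
print(share_of_artists_never_using(catalogues, "police"))
```

In this toy data everyone says “police” at least once, but only one artist ever says “compton,” which is the same pattern that makes “Compton” such a strong signal.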

What Makes a Word Central to an Artist?
% of Tracks That Contain the Word: Police | Compton

So using that approach, let’s look at which words are unique to the rest of the artists in our dataset. We’ll use an analysis (called tf-idf) which basically calls out words such as “Compton,” even over “police.” We’ll take 1. words that an artist says more than the genre average (N.W.A’s use of “police”) and 2. rare words in hip hop – if an artist says it, there’s a good chance you know who it was (N.W.A’s use of “Compton”).

What Words Are Central to an Artist (e.g., Eminem)?

The Top 10 Words Central to an Artist’s Lyrics (using tf-idf)

We started this essay with a map of lyrically similar rappers, and now you can see how it was drawn: using the “central words” ranking.

If we take this ranking of words and line it up against other artists, we’ll see overlap. For example, “Compton” is a top ten word for N.W.A, Eazy-E, Kendrick, and The Game.

To identify lyrically similar artists, we compare their full ranking of words to every other artist’s and assign a value to how close it was (we use a process called cosine similarity). Each artist’s “most similar” lyricist is depicted by hovering over their face.

Positioning each rapper next to his or her lyrically similar family is a difficult task. We’re comparing 308 artists to one another. Migos is most similar to Gucci Mane. But Gucci Mane is most similar to Lil Wayne. The madness!

Luckily, there’s a technique called t-SNE, where a computer considers all these relationships and tries to place similar artists closer together. It’s not perfect, but it allows for visual exploration in a way that a table of data never could. We can also see broader groupings, such as region or era.

Mapping the Lyrical Similarity of Rappers

Using t-SNE methodology


This essay covers fairly advanced statistical concepts (including machine learning!), which you probably didn’t expect going in.

Which means we’re anticipating experts with machine learning and data science PhDs to email us with, “That was the most complete piece of bullshit ever. You didn’t even consider...,” followed by a 10-point response on how they would have done this differently.

But if you’re not one of those cantankerous experts, you might be intrigued by this mathematical wizardry. Which means you get:

Certificate of Completion

I unintentionally read about fancy math and stats

Things I may now have an opinion on:

Natural language processing, tf-idf, machine learning algorithms (t-SNE), and cosine similarity.

The Pudding

This was an outcome of our project on rappers’ vocabularies, which introduced many folks to the wonders of natural language processing.

To read more about each concept, check out the links below:

Unique Words to Hip Hop and Artists

Mapping Lyrically Similar Artists

Methodology Notes

The general music corpus was formed using data from LyricFind. We filtered hip-hop artists by cross-referencing their primary genre on MusixMatch.

For consistency, the hip hop data was cleaned using the same script as the LyricFind corpus. This included efforts to standardize spelling, remove capitalization, and apply light lemmatization.

Most Hip Hop: To find the words most “characteristic” of hip-hop, we computed the odds that a word appeared in the hip hop corpus vs. the general corpus. That is, we divide the number of appearances of a word in the hip hop corpus by the total words in the hip hop corpus, then compare that to the same ratio for the general corpus.

Some words that indexed high in hip hop vs. the general corpus were still rather rare, so we filtered them from this list. These all had fewer than 1,000 occurrences in the hip hop corpus. For example, “lowrider” had a 255:1 ratio in hip hop vs. other genres, but was only used 116 times in 26 million words.
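The ratio and the rare-word filter can be sketched in a few lines of Python. This is only an illustration, not the script we ran: the function name, the toy word counts, and the choice to skip words that never appear in the general corpus are assumptions. It computes a ratio of relative frequencies, as described above.

```python
from collections import Counter


def most_hip_hop(hip_hop, general, min_count=1000):
    """Rank words by (rate in hip hop) / (rate in the general corpus)."""
    hip_hop_total = sum(hip_hop.values())
    general_total = sum(general.values())
    ratios = {}
    for word, count in hip_hop.items():
        if count < min_count:
            continue  # too rare in hip hop to be meaningful (the "lowrider" case)
        if general[word] == 0:
            continue  # assumption: skip words with no general-corpus count
        hip_hop_rate = count / hip_hop_total
        general_rate = general[word] / general_total
        ratios[word] = hip_hop_rate / general_rate
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


# Toy counts, invented for illustration; each corpus totals 10,000 words.
hip_hop = Counter({"love": 21, "game": 30, "lowrider": 1, "the": 9948})
general = Counter({"love": 71, "game": 5, "lowrider": 1, "the": 9923})

# min_count is lowered to fit the toy corpus; we used 1,000 on 26 million words.
ranked = most_hip_hop(hip_hop, general, min_count=5)
for word, ratio in ranked:
    print(f"{word}: {ratio:.2f}")
```

A ratio above 1 means the word leans hip hop, and a ratio below 1 means it leans toward other genres.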

TF-IDF: To determine the words that characterize each hip-hop artist, we used a technique called term frequency-inverse document frequency (tf-idf). Each rapper gets assigned a tf-idf score for every word in the hip-hop corpus. For a given word, we count the number of times it occurs in one rapper’s catalogue (its term frequency) and divide by the number of artists that use it across the hip-hop corpus (its document frequency). The words with the ten highest tf-idf scores for each artist were deemed the words “most unique” to him or her.

We made two slight modifications to the traditional formula. 1) We used sublinear scaling on the term frequencies, giving us a little more variation across our lists. You can read more about why you might want to do sublinear scaling here. 2) We also set a “cut-off” for document frequency of 0.1. That means, to be considered in our tf-idf computation, a term had to be used at least once by 10% of the artists in our dataset. This rules out words that are repeated over and over by one or a few artists (think “controlla” for Drake).
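To make the scoring concrete, here is a small pure-Python sketch of tf-idf with both modifications. It assumes the common textbook form, with sublinear term frequency as 1 + log(count) and inverse document frequency as log(N / document frequency); the exact variant in our pipeline may differ. The function name, the artists other than N.W.A, and all word counts are invented for illustration.

```python
import math
from collections import Counter


def tfidf_top_words(lyrics_by_artist, min_df=0.1, top_n=10):
    """Return each artist's top_n words by tf-idf score."""
    counts = {artist: Counter(words) for artist, words in lyrics_by_artist.items()}
    n_artists = len(counts)

    # Document frequency: how many artists use each word at least once.
    doc_freq = Counter()
    for word_counts in counts.values():
        doc_freq.update(word_counts.keys())

    # Cut-off: keep only words used by at least min_df of all artists.
    vocab = {w for w, df in doc_freq.items() if df / n_artists >= min_df}

    top = {}
    for artist, word_counts in counts.items():
        scores = {}
        for word, count in word_counts.items():
            if word not in vocab:
                continue
            tf = 1 + math.log(count)                 # sublinear scaling
            idf = math.log(n_artists / doc_freq[word])
            scores[word] = tf * idf
        top[artist] = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return top


# Toy catalogues with made-up counts.
lyrics = {
    "N.W.A": ["police"] * 6 + ["compton"] * 4 + ["money"] * 2,
    "Artist B": ["police"] + ["money"] * 5 + ["love"] * 3,
    "Artist C": ["police"] + ["money"] * 3 + ["love"] * 4,
    "Artist D": ["police"] * 2 + ["money"] + ["party"] * 3,
}

top_words = tfidf_top_words(lyrics)
print(top_words["N.W.A"])
```

Because every toy artist says “police,” its inverse document frequency is log(1) = 0, so “compton” outranks it even though “police” is said more often.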

Cosine Similarity: Cosine similarity is a common way of calculating the similarity between two vectors by taking the cosine of the angle between them. In our case, that means taking the tf-idf vector for an artist and comparing it to that of another. Higher cosine values imply more similarity, with an upper bound of 1 when the vectors are perfectly similar.
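As a rough sketch, cosine similarity over sparse word-weight vectors looks like this in plain Python. The helper names, the three artists, and their weights are hypothetical, chosen so that the “most similar” relationship is not mutual, like the Migos and Gucci Mane example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an empty vector is similar to nothing
    return dot / (norm_u * norm_v)


def most_similar(artist, vectors):
    """Name of the other artist with the highest cosine similarity."""
    others = [name for name in vectors if name != artist]
    return max(others, key=lambda name: cosine_similarity(vectors[artist], vectors[name]))


# Made-up tf-idf weights for three hypothetical artists.
vectors = {
    "Artist A": {"skrrt": 1.0},
    "Artist B": {"skrrt": 0.866, "trap": 0.5},
    "Artist C": {"skrrt": 1.0, "trap": 1.0},
}

for name in vectors:
    print(name, "->", most_similar(name, vectors))
```

Artist A’s closest match is Artist B, but Artist B’s closest match is Artist C, which is why a simple list of nearest neighbors is not enough to draw the map.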

t-SNE: To create our map of rappers, we used a dimensionality reduction technique called t-SNE. We took the tf-idf matrix and first reduced it to 50 dimensions using truncated singular value decomposition (SVD). We then took the resulting matrix and fed it into t-SNE with a perplexity parameter set to 40. The output of the t-SNE algorithm mapped rappers to a two-dimensional space based on the similarity of their lyrics.
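One way to reproduce this two-step pipeline is with scikit-learn, sketched below. Using scikit-learn is an assumption for this example, as are the random stand-in matrix, the 400-word vocabulary size, and the random seeds; only the 50 dimensions, the perplexity of 40, and the 308 artists come from the description above.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# Stand-in for the real tf-idf matrix: 308 artists x 400 words of random values.
rng = np.random.default_rng(0)
tfidf = rng.random((308, 400))

# Step 1: reduce the tf-idf matrix to 50 dimensions with truncated SVD.
svd = TruncatedSVD(n_components=50, random_state=0)
reduced = svd.fit_transform(tfidf)

# Step 2: map the 50-dimensional rows to 2D with t-SNE at perplexity 40.
tsne = TSNE(n_components=2, perplexity=40, random_state=0)
coords = tsne.fit_transform(reduced)

print(reduced.shape, coords.shape)
```

Each row of `coords` is an (x, y) position for one artist. With random input the layout is meaningless; real tf-idf vectors are what produce recognizable clusters.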

Special thanks to Josh Upton for edits.