The Language of Hip Hop
Last year, data scientist Iain Barr determined the most “metal” words using an elegant methodology and machine learning. We were blown away and eagerly waited for someone to replicate it for other genres.
One year later, we were still waiting. In 2014, we ranked rappers by the size of their vocabulary, and this felt like the perfect sequel: the words unique to hip hop.
To start, we need a dataset that represents hip hop. We decided to use 26 million words from the lyrics of the top 500 charting artists on Billboard’s Rap Chart (about 50,000 songs).
If that’s what’s popular in hip hop vs. other genres, what about specific artists? For example, what words are disproportionately used by Migos?
We need to change approaches. First, let’s start with something intuitive: N.W.A.’s usage of the word “police.”
Using that approach, let’s look at which words are unique to the rest of the artists in our dataset. We’ll use an analysis called tf-idf, which surfaces words such as “Compton,” even over “police.” It favors two kinds of words: (1) words that an artist says more than the genre average (N.W.A.’s use of “police”) and (2) words that are rare in hip hop, so that if an artist says one, there’s a good chance you know who it was (N.W.A.’s use of “Compton”).
What Words Are Central to an Artist?
The Top 10 Words Central to an Artist’s Lyrics (using tf-idf)
We started this essay with a map of lyrically similar rappers, and now you can see how it was drawn: using the “central words” ranking.
If we take this ranking of words and line it up against other artists, we’ll see overlap. For example, “Compton” is a top ten word for N.W.A, Eazy-E, Kendrick, and The Game.
To identify lyrically similar artists, we compare their full ranking of words to every other artist’s and assign a value to how close it was (we use a process called cosine similarity). Each artist’s “most similar” lyricist is depicted by hovering over their face.
Positioning each rapper next to his or her lyrically similar family is a difficult task. We’re comparing 308 artists to one another. Migos is most similar to Gucci Mane. But Gucci Mane is most similar to Lil Wayne. The madness!
Luckily, there's a technique called t-SNE, where a computer considers all these relationships and tries to place similar artists closer together. It’s not perfect, but it allows for visual exploration in a way that a table of data never could. We can also see broader groupings, such as region or era.
Mapping the Lyrical Similarity of Rappers
Using t-SNE methodology
This essay covers fairly advanced statistical concepts (including machine learning!), which you probably didn’t expect going in.
Which means we’re anticipating experts with machine learning and data science PhDs to email us with, “that was the most complete piece of bullshit ever. You didn't even consider...,” followed by a 10-point response on how they would have done this differently.
But if you’re not one of those cantankerous experts, you might be intrigued by this mathematical wizardry. Which means you get:
Certificate of Completion
I unintentionally read about fancy math and stats
Things I may now have an opinion on:
Natural language processing, tf-idf, machine learning algorithms (t-SNE), and cosine similarity.
The Pudding
This was an outcome of our project on rappers’ vocabularies, which introduced many folks to the wonders of natural language processing.
To read more about each concept, check out the links below:
Unique Words to Hip Hop and Artists
Mapping Lyrically Similar Artists
Methodology Notes
The general music corpus was formed using data from LyricFind. We filtered hip-hop artists by cross-referencing their primary genre on MusixMatch.
For consistency, the hip hop data was cleaned using the same script as the LyricFind corpus. This included efforts to standardize spelling, remove capitalization, and apply light lemmatization.
Most Hip Hop: To find the words most “characteristic” of hip-hop, we compared how often a word appears in the hip hop corpus vs. the general corpus. For each corpus, this is the number of appearances of the word divided by the total number of words in that corpus. We then take the ratio of the hip hop rate to the general rate.
Some words were filtered from this list that, while indexing high in hip hop vs. the general corpus, were still rather rare words. These all had fewer than 1,000 occurrences in the hip hop corpus. For example, “lowrider” had a 255:1 ratio in hip hop vs. other genres, but was only used 116 times in 26 million words.
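As a rough illustration of the ratio and the rarity filter described above, here is a minimal Python sketch. The function name `characteristic_words`, the token lists, and the `min_count` parameter are our own assumptions for illustration, not the script used for this essay.

```python
from collections import Counter

def characteristic_words(genre_tokens, general_tokens, min_count=1000):
    """Rank words by (rate in genre corpus) / (rate in general corpus).

    Hypothetical helper: assumes both inputs are non-empty lists of
    already-cleaned word tokens.
    """
    genre_counts = Counter(genre_tokens)
    general_counts = Counter(general_tokens)
    genre_total = sum(genre_counts.values())
    general_total = sum(general_counts.values())

    ratios = {}
    for word, count in genre_counts.items():
        if count < min_count:
            continue  # too rare in the genre corpus, like "lowrider"
        genre_rate = count / genre_total
        general_rate = general_counts[word] / general_total
        # A word that never appears in the general corpus gets an infinite ratio.
        ratios[word] = genre_rate / general_rate if general_rate else float("inf")

    # Highest ratio first: the words most characteristic of the genre.
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)
```

On the real corpora, `min_count=1000` would mirror the cut-off mentioned above; on a toy example you would pass a much smaller value.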
TF-IDF: to determine the words that characterize each hip-hop artist, we used a technique called term frequency-inverse document frequency (tf-idf). Each rapper gets assigned a tf-idf score for every word in the hip-hop corpus. For a given word, we count the number of times it occurs in one rapper’s catalogue (its term frequency) and divide by the number of artists that use it across the hip-hop corpus (its document frequency). The words with the ten highest tf-idf scores for each artist were deemed the words “most unique” to him or her.
We made two slight modifications to the traditional formula. 1) We used sublinear scaling on the term frequencies, giving us a little more variation across our lists. You can read more about why you might want to do sublinear scaling here. 2) We also set a “cut-off” for document frequency of 0.1. That means, to be considered in our tf-idf computation, a term had to be used at least once by at least 10% of the artists in our dataset. This rules out words that are repeated over and over by one or a few artists (think “controlla” for Drake).
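To make the scoring concrete, here is a minimal sketch in plain Python, under a few assumptions of ours. It follows the description above literally (term frequency divided by the number of artists using the word), whereas many tf-idf implementations instead multiply by a logarithm of the inverse document frequency. It takes “sublinear scaling” to mean replacing a raw count tf with 1 + log(tf), which is a common convention but not something spelled out here. The names `tfidf_scores`, `top_words`, and `min_df` are hypothetical.

```python
import math
from collections import Counter

def tfidf_scores(catalogues, min_df=0.1):
    """catalogues maps each artist to a list of word tokens (assumed cleaned).

    Returns {artist: {word: score}}.
    """
    counts = {artist: Counter(tokens) for artist, tokens in catalogues.items()}
    n_artists = len(counts)

    # Document frequency: how many artists use each word at least once.
    doc_freq = Counter()
    for artist_counts in counts.values():
        doc_freq.update(artist_counts.keys())

    scores = {}
    for artist, artist_counts in counts.items():
        scores[artist] = {
            # Sublinear term frequency (assumed 1 + log(tf)) over document frequency.
            word: (1 + math.log(tf)) / doc_freq[word]
            for word, tf in artist_counts.items()
            # Cut-off: keep only words used by at least min_df of all artists.
            if doc_freq[word] / n_artists >= min_df
        }
    return scores

def top_words(scores, artist, k=10):
    """Return the k highest-scoring words for one artist."""
    ranked = sorted(scores[artist].items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:k]]
```

Calling `top_words(tfidf_scores(catalogues), artist)` would then give a top-ten list in the spirit of the ones shown in this essay, though the exact rankings depend on the formula details assumed here.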
Cosine Similarity: Cosine similarity is a common way of calculating the similarity between two vectors by taking the cosine of the angle between them. In our case, that means taking the tf-idf vector for an artist and comparing it to that of another. Higher cosine values imply more similarity, with an upper bound of 1 when the two vectors point in exactly the same direction.
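For readers who want to see the arithmetic, here is a small sketch of cosine similarity on sparse tf-idf vectors, stored as word-to-score dictionaries. The function names and the dictionary representation are assumptions made for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors given as {word: score}."""
    # Only words present in both vectors contribute to the dot product.
    dot = sum(weight * v[word] for word, weight in u.items() if word in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an empty vector as similar to nothing
    return dot / (norm_u * norm_v)

def most_similar(artist, vectors):
    """Return the other artist whose vector has the highest cosine similarity."""
    others = (name for name in vectors if name != artist)
    return max(others, key=lambda name: cosine_similarity(vectors[artist], vectors[name]))
```

Because the measure depends only on the angle, an artist with a much larger catalogue is not automatically “closer” to everyone else. Note also that “most similar” is not symmetric, which is how Migos can point to Gucci Mane while Gucci Mane points to Lil Wayne.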
t-SNE: To create our map of rappers, we used a dimensionality reduction technique called t-SNE. We took the tf-idf matrix and first reduced it to 50 dimensions using truncated singular value decomposition (SVD). We then took the resulting matrix and fed it into t-SNE with a perplexity parameter set to 40. The output of the t-SNE algorithm mapped rappers to a two-dimensional space based on the similarity of their lyrics.
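The two-step reduction can be sketched with scikit-learn, which is an assumption on our part since the notes above don't name a library. The matrix below is random stand-in data, not real lyrics, and the `random_state` and `init` settings are illustrative choices. Only the 50 dimensions and the perplexity of 40 come from the description above.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# Stand-in for the artist-by-word tf-idf matrix: random values only,
# 120 hypothetical artists by 300 hypothetical words.
rng = np.random.default_rng(0)
tfidf_matrix = rng.random((120, 300))

# Step 1: compress each artist's vector down to 50 dimensions.
svd = TruncatedSVD(n_components=50, random_state=0)
reduced = svd.fit_transform(tfidf_matrix)

# Step 2: let t-SNE place artists in 2-D, keeping similar ones close together.
tsne = TSNE(n_components=2, perplexity=40, init="random", random_state=0)
coords = tsne.fit_transform(reduced)  # one (x, y) pair per artist
```

Each row of `coords` could then be plotted as one face on the map. t-SNE layouts change with the random seed and perplexity, so distances between far-apart clusters should be read loosely.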
Special thanks to Josh Upton for edits.