Ever since coming across Matt Daniels' Rapper Vocabulary Chart, I've been curious where one of my favorite rappers -- Buck 65 -- would place on it. To find out, I'll pull as many lyrics as I can from LyricsGenius, aiming for the 35,000 words used in the original methodology:
35,000 words covers 3-5 studio albums and EPs. I included mixtapes if the artist was just short of the 35,000 words. Quite a few rappers don’t have enough official material to be included (e.g., Biggie, Kendrick Lamar). As a benchmark, I included data points for Shakespeare and Herman Melville, using the same approach (35,000 words across several plays for Shakespeare, first 35,000 of Moby Dick).
I used a research methodology called token analysis to determine each artist’s vocabulary. Each word is counted once, so pimps, pimp, pimping, and pimpin are four unique words. To avoid issues with apostrophes (e.g., pimpin’ vs. pimpin), they’re removed from the dataset. It still isn’t perfect. Hip hop is full of slang that is hard to transcribe (e.g., shorty vs. shawty), compound words (e.g., king shit), featured vocalists, and repetitive choruses.
With those lyrics in hand, I'll clean the data to remove apostrophes and (possibly) other special characters, then use NLTK to break the lyrics into tokens and count the unique words.
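As a toy illustration of that counting scheme -- apostrophes stripped first, then everything lowercased and split into tokens -- here's a sketch on a made-up lyric fragment, using a plain whitespace split as a stand-in for NLTK's tokenizer:

```python
import re

# made-up lyric fragment with apostrophes and mixed case
lyrics = "Pimpin' ain't easy, pimpin aint easy"

# strip apostrophes first, as in the original methodology
no_apostrophes = lyrics.replace("'", "")
# lowercase, drop remaining punctuation, split on whitespace
tokens = re.sub(r"[^a-z0-9 ]", "", no_apostrophes.lower()).split()
unique_words = set(tokens)
print(tokens)             # ['pimpin', 'aint', 'easy', 'pimpin', 'aint', 'easy']
print(len(unique_words))  # 3 -- once apostrophes go, pimpin' and pimpin collapse
```

Note how removing apostrophes makes "pimpin'" and "pimpin" count as one word rather than two, which is exactly the behavior the methodology wants.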
import lyricsgenius
import json
import pandas as pd
with open('secrets.json') as secrets_file:
    secrets = json.load(secrets_file)
def make_initial_dataframe(json_file):
    with open(json_file) as f:
        buck_json = json.load(f)
    songs = pd.DataFrame(buck_json['songs'])
    # we only need these four columns, so we drop the rest
    unneeded_cols = list(songs.columns.values)
    unneeded_cols.remove('lyrics')
    unneeded_cols.remove('title')
    unneeded_cols.remove('release_date')
    unneeded_cols.remove('album')
    songs = songs.drop(unneeded_cols, axis=1)
    return songs
def find_album(album):
    return album['name']

def format_albums(songs):
    # each 'album' entry is a dict (or NaN for singles); pull out just the name
    albums = songs['album']
    songs['album'] = albums.map(find_album, na_action='ignore')
    return songs
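The `na_action='ignore'` argument is what keeps `find_album` from blowing up on songs with no album: `Series.map` skips missing values instead of passing NaN into the mapper. A minimal sketch with made-up album dicts:

```python
import pandas as pd
import numpy as np

# made-up column: Genius stores each album as a dict, or NaN for singles
albums = pd.Series([{'name': 'Vertex'}, np.nan, {'name': 'Square'}])
# na_action='ignore' leaves NaN entries as NaN instead of calling the mapper
names = albums.map(lambda a: a['name'], na_action='ignore')
print(names.tolist())  # ['Vertex', nan, 'Square']
```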
def clean_lyrics(songs):
    # remove remixes and acoustic versions
    remixes = songs['title'].str.contains(r'(?:[rR]emix\)|\[Acoustic Version\])', na=False)
    songs = songs[~remixes].copy()
    # remove newlines and tabs
    songs['lyrics'] = songs['lyrics'].str.replace(r'[\n\t]', ' ', regex=True)
    # replace hyphens/dashes with spaces so compound words split cleanly
    songs['lyrics'] = songs['lyrics'].str.replace(r'[-–—]', ' ', regex=True)
    # remove all other punctuation
    songs['lyrics'] = songs['lyrics'].str.replace(r'[^a-zA-Z0-9 ]', '', regex=True)
    songs['lyrics'] = songs['lyrics'].str.lower()
    return songs
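To sanity-check the cleaning steps, here's the same pipeline applied to a single made-up string: tabs and newlines become spaces, dashes become spaces (so compound words split), and everything else non-alphanumeric is dropped before lowercasing.

```python
import pandas as pd

# one made-up line exercising tabs, dashes, apostrophes, and punctuation
s = pd.Series(["Can't\tStop—Won't Stop!"])
s = s.str.replace(r'[\n\t]', ' ', regex=True)
s = s.str.replace(r'[-–—]', ' ', regex=True)
s = s.str.replace(r'[^a-zA-Z0-9 ]', '', regex=True)
s = s.str.lower()
print(s[0])  # 'cant stop wont stop'
```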
# set up genius api access
genius = lyricsgenius.Genius(secrets['CLIENT_ACCESS_TOKEN'])
genius.remove_section_headers = True
# this only needs to be run if Lyrics_Buck65.json doesn't exist
# it will also take a while
# buck = genius.search_artist("Buck 65")
# buck.save_lyrics()
# turn JSON into pandas dataframe
songs = make_initial_dataframe("Lyrics_Buck65.json")
# get the album name from the JSON
songs = format_albums(songs)
songs.sort_values(by='album', inplace=True)
songs.head(20)
# clean data
songs = clean_lyrics(songs)
# save
songs.to_csv('buck65.tsv', sep="\t", index=False)
songs['lyrics'].str.len().sum()
So far, we have gathered, sorted, and cleaned all of the lyrics from Buck 65's Genius entry: 297,778 characters of lyrics across 163 songs. The next step is to narrow that down to the 35,000 words used in the original project; more specifically, we need the (chronologically) first 35,000 words. To do that, we'll need a list of his albums, which I've taken from this Wikipedia article. The most relevant part is pasted below:
Studio albums
Buck 65
Game Tight (1994)
Year Zero (1996)
Weirdo Magnet (1996)
Language Arts (1996)
Vertex (1998)
Man Overboard (Anticon, 2001)
Synesthesia (Endemik, 2001)
Square (WEA, 2002)
Talkin' Honky Blues (WEA, 2003)
Secret House Against the World (WEA, 2005)
Situation (Strange Famous, 2007)
20 Odd Years (WEA, 2011)
Laundromat Boogie (2014)
Neverlove (2014)
import nltk
nltk.download('punkt')
songs = pd.read_csv('buck65.tsv', sep="\t", keep_default_na=False)
# useful for coding next cell
songs['album'].value_counts()
# sort albums by album release
buck_albums = [
'Game Tight',
'Year Zero',
'Weirdo Magnet',
'Language Arts',
'Vertex',
'Man Overboard',
'Synesthesia',
'Square',
'Talkin’ Honky Blues',
'Secret House Against The World',
'Situation ', # space is there on purpose
'20 Odd Years',
'Laundromat Boogie',
'Neverlove (Deluxe Edition)'
]
albums_to_remove = [
'20 Odd Years: Volume 4 - Ostranenie',
'Dirtbike 3',
'Dirtywork E.P.',
'I Dream Of Love: Live And In Private',
'Boy-Girl Fight',
'Pole-Axed (More Rarities)',
'Year Of The Carnivore Soundtrack',
'Climbing Up a Mountain With a Basket Full of Fruit',
'Dirtbike',
'Skratch Bastid Presents: Cretin Hip Hop Vol. 1 (Buck 65 Mixtape)',
'Weirdo Magnet', # yes this is a studio album, but there's only one song in it and it's full of notes
'This Right Here is Buck 65', # best hits album
'Giga Single',
'Dark Was the Night',
'', # removes songs with no album
]
def chron_order_albums(songs, albums):
    # an ordered Categorical sorts by release order instead of alphabetically
    songs['ordered_album'] = pd.Categorical(
        songs['album'],
        categories=albums,
        ordered=True
    )
    return songs.sort_values(by='ordered_album')
songs = chron_order_albums(songs, buck_albums)
for alb in albums_to_remove:
remove_bool = songs['album'] == alb
songs = songs[~remove_bool]
songs.tail(30)
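The `pd.Categorical` trick in `chron_order_albums` is worth seeing in isolation: an ordered categorical sorts by the order of its `categories` list, not alphabetically. A made-up three-album frame:

```python
import pandas as pd

df = pd.DataFrame({'album': ['Square', 'Vertex', 'Year Zero']})
order = ['Year Zero', 'Vertex', 'Square']  # release order, oldest first
df['ordered_album'] = pd.Categorical(df['album'], categories=order, ordered=True)
df = df.sort_values(by='ordered_album')
print(df['album'].tolist())  # ['Year Zero', 'Vertex', 'Square']
```

Albums not listed in `categories` would become NaN, which is also why the album names in `buck_albums` have to match the Genius spellings exactly (curly apostrophes, trailing spaces, and all).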
Earlier on, we did a first round of data cleaning: removing special characters, lowercasing all words, etc. However, a second round was needed to remove whole rows. The original methodology only included studio albums unless there weren't enough for 35,000 words, in which case other material was considered. That isn't the case for Buck 65, so I've had to remove some songs. I did so based on the album column, and the removals fell into roughly three categories:
Mixtapes, singles, and unreleased material: This is most of the removal list, and while it unfortunately removes some of my favorite material (RIP Dirtbike), it was important to take it out to be consistent with the original methodology.
'Problem albums': There were two of these: Weirdo Magnet and This Right Here is Buck 65. Weirdo Magnet had to go because its lyrics were woefully incomplete (only one song was on Genius) and riddled with transcriber notes. This Right Here is Buck 65 is a best-of album, so it inflated the total word count without adding any unique words; it made more sense to remove it.
Songs with no album: A lot of these fall under mixtapes, singles, and unreleased material, but just weren't marked as such. There may have been some valuable songs in there (e.g., more of Weirdo Magnet), but going through and manually assigning albums was going to be a pain, so I decided to exclude them.
def get_unique_lyrics(tokens):
    # each distinct token counts once
    return len(set(tokens))

def tokenize_lyrics(songs):
    # concatenate every song's lyrics into one string, then tokenize;
    # sep=' ' keeps a word boundary between songs
    lyric_string = songs['lyrics'].str.cat(sep=' ')
    return nltk.word_tokenize(lyric_string)
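One subtlety with `Series.str.cat`: with no separator it glues neighboring entries together, fusing the last word of one song onto the first word of the next and manufacturing bogus "unique" words. Passing `sep=' '` preserves the boundary:

```python
import pandas as pd

s = pd.Series(['last word', 'first word'])
print(s.str.cat())         # 'last wordfirst word' -- boundary words fuse
print(s.str.cat(sep=' '))  # 'last word first word'
```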
lyric_tokens = tokenize_lyrics(songs)
print('Total Lyrics:', len(lyric_tokens))
# get unique words in the first 35,000 lyrics
limited_tokens = lyric_tokens[:35000]
print('First 35,000 Unique Lyrics:', get_unique_lyrics(limited_tokens))
The above cell gives us Buck 65's vocabulary according to Daniels' first-35,000-words methodology: 6,557 unique words. This puts him in a solid 3rd place, ahead of Jedi Mind Tricks at 6,424 but still well behind Busdriver and Aesop Rock. While this gives us our answer, just for fun I wanted to see how sensitive that result is to changing the sample.
# all lyrics
print('Total Unique Lyrics:', get_unique_lyrics(lyric_tokens))
# last words
last_tokens = lyric_tokens[-35000:]
print('Last 35,000 Unique Lyrics:', get_unique_lyrics(last_tokens))
# random samplings
from random import sample
from statistics import mean
def sample_lyrics(songs):
    # note: draws from the global lyric_tokens, not the songs argument
    # ten random 35,000-word samples, counting uniques in each
    results = []
    for _ in range(10):
        lyric_sample = sample(lyric_tokens, 35000)
        results.append(get_unique_lyrics(lyric_sample))
    return results
sample_results = sample_lyrics(songs)
print('Random 35,000 Unique Lyrics:')
print(sorted(sample_results))
print(mean(sample_results))
When using his whole corpus of 43,052 words, we find 7,521 unique ones. Using his last 35,000 words gets us 6,557 unique words, implying a slight decrease in vocabulary over time. I'd guess a lot of that is due to the inclusion of Neverlove, a far poppier, less dense-sounding album than many of his early works. Finally, a series of random samplings of 35,000 words gives results that average out in the high 6,600s, reaching down to the 6,590s and up to the low 6,700s.
While this is another good result, I have a hypothesis that these numbers will all go up noticeably if I include two (especially poetic, IMO) albums he recorded as Bike for Three!, a collaboration with DJ Greetings from Tuskan.
# this only needs to be run if Lyrics_BikeForThree.json doesn't exist
# it will also take a while
# bike = genius.search_artist("Bike for Three!")
# bike.save_lyrics()
bike = make_initial_dataframe('Lyrics_BikeForThree.json')
bike.head()
bike = format_albums(bike)
bike.head()
bike.sort_values(by='album', inplace=True)
bike.head(20)
# clean data
bike = clean_lyrics(bike)
# save
bike.to_csv('bike.tsv', sep="\t", index=False)
bike = pd.read_csv('bike.tsv', sep="\t", keep_default_na=False)
bike_tokens = tokenize_lyrics(bike)
lyric_tokens += bike_tokens
# get updated total unique lyrics
get_unique_lyrics(lyric_tokens)
# only run this once per kernel
# DataFrame.append is deprecated; concat does the same job
songs = pd.concat([songs, bike])
# buck_albums from above with Bike For Three's albums inserted
all_albums = [
'Game Tight',
'Year Zero',
'Weirdo Magnet',
'Language Arts',
'Vertex',
'Man Overboard',
'Synesthesia',
'Square',
'Talkin’ Honky Blues',
'Secret House Against The World',
'Situation ',
'More Heart than Brains',
'20 Odd Years',
'So Much Forever ',
'Laundromat Boogie',
'Neverlove (Deluxe Edition)'
]
songs = chron_order_albums(songs, all_albums)
songs.tail(50)
lyric_tokens_all = tokenize_lyrics(songs)
print('All Lyrics:', len(lyric_tokens_all))
print('All Unique Lyrics:', get_unique_lyrics(lyric_tokens_all))
# get unique words in the first 35,000 lyrics
limited_tokens_all = lyric_tokens_all[:35000]
print('First 35,000 Unique Lyrics:', get_unique_lyrics(limited_tokens_all))
last_tokens_all = lyric_tokens_all[-35000:]
print('Last 35,000 Unique Lyrics:', get_unique_lyrics(last_tokens_all))
sample_results_all = sample_lyrics(songs)
print('Random 35,000 Unique Lyrics:')
print(sorted(sample_results_all))
print(mean(sample_results_all))
The effect from adding the two Bike for Three albums was surprising to say the least. I was expecting a large increase in unique words, but other than in the total, adding these albums actually caused a slight decrease in all three samples. It's not enough of a decrease in the first 35,000 sample to knock Buck out of 3rd place, but still notable.
Using Matt Daniels' methodology, I've analyzed the number of unique words Canadian rapper Buck 65 uses in his first 35,000 lyrics: 6,557, which would put him in 3rd place on the chart. While that number changes slightly depending on the sample and on the inclusion of other albums, the changes keep him comfortably in 3rd place.