Billboard data analysis in R (1958–2019). Part 1

Music is an integral part of everyday life. As the music market grows and becomes increasingly datafied, the industry offers a lot of data to work with. In Part 1, I use this dataset to investigate various musical characteristics of Billboard tracks through explorative analysis. The goal is to give music industry players insights into the songs that made it onto Billboard’s weekly Hot 100 singles chart between 1958 and 2019, and to highlight some general trends and patterns on the Billboard Hot 100 over the years.

For the record: Billboard Hot 100 is the music industry standard record chart in the United States for songs, published weekly by Billboard magazine. Chart rankings are based on sales (physical and digital), radio play, and online streaming in the United States.

Summary

After the data is imported, extensive data transformation is carried out to prepare for the tasks performed later on. The explorative analysis focuses on the distribution of music features across Billboard songs, their variation over years and months, and the valence-arousal model applied in the Billboard context. Then the most successful Billboard artists and songs are identified. Lastly, the insights from the explorative analysis are summarized together with some limitations and recommendations for further research.

Packages

The following packages are necessary for further data analysis.

library(dplyr) 
library(ggplot2)
library(tidyverse)
library(gridExtra)
library(viridisLite)
library(cowplot)
library(ggalt)
library(viridis)
library(ggridges)
library(readxl)

Importing data

To begin with, we need to download the necessary data. For the explorative analysis we use 2 datasets: Billboard Hot 100 data (songs, performers, weeks, weeks on chart, peak position) and Spotify audio features of corresponding tracks (track popularity, danceability, energy, etc.).

spotify_audio_features <- read_excel("~/Desktop/R practice/Hot 100 Audio Features.xlsx")
billboard_hot100 <- read.csv("~/Desktop/R practice/Hot Stuff.csv")

After importing the datasets we can merge them on both song and performer, because different artists sometimes have songs with the same title.

billboard_total <- merge(spotify_audio_features, billboard_hot100, by=c("Song", "Performer"))
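
Why both columns are needed as merge keys can be seen on a toy example. The two small data frames below are made up for illustration (the energy values and chart positions are not taken from the real files); only the behaviour of merge() is the point.

```r
# Toy data (hypothetical values): two different performers share a song title.
features <- data.frame(
  Song = c("Hello", "Hello"),
  Performer = c("Adele", "Lionel Richie"),
  energy = c(0.45, 0.41)
)
charts <- data.frame(
  Song = c("Hello", "Hello", "Hello"),
  Performer = c("Adele", "Adele", "Lionel Richie"),
  Week.Position = c(1, 2, 5)
)

# Merging on the title alone pairs every "Hello" with every other "Hello".
by_song <- merge(features, charts, by = "Song")

# Merging on title and performer keeps each chart row with its own features.
by_both <- merge(features, charts, by = c("Song", "Performer"))

nrow(by_song)  # 6 rows: 2 feature rows x 3 chart rows
nrow(by_both)  # 3 rows: one per chart row
```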

Let’s view the dataset.

str(billboard_total)

In total, now we have 533,479 observations and 31 variables. Note that some songs repeatedly appear in the dataset, as they re-enter Hot 100 by Billboard from time to time, which explains a high number of observations.

Data wrangling

For further analysis we need to delete NA’s and unnecessary columns, namely “SongID.x”, “spotify_genre”, “spotify_track_id”, “spotify_track_preview_url”, “url”, “SongID.y” and “Instance”.

billboard_total <- na.omit(billboard_total)
billboard_total <- billboard_total[,-c(3, 4, 5, 6, 23, 26, 27)]
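
Positional indices such as -c(3, 4, ...) silently drop the wrong columns if the column order ever changes. A minimal sketch of dropping by name instead, assuming the column names listed in the text above; the data frame toy is a made-up stand-in for the merged data.

```r
# Toy stand-in for the merged data; only the column names matter here.
toy <- data.frame(
  Song = "A", Performer = "B", SongID.x = "AB", spotify_genre = "pop",
  danceability = 0.6, url = "u", SongID.y = "AB", Instance = 1
)

# Columns named in the text; names that are absent from the data are simply ignored.
drop_cols <- c("SongID.x", "spotify_genre", "spotify_track_id",
               "spotify_track_preview_url", "url", "SongID.y", "Instance")

toy_clean <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
names(toy_clean)
```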

In addition, we can rename some columns.

billboard_total <- billboard_total %>% 
rename(
song = Song,
performer = Performer,
track_album = spotify_track_album,
track_explicit = spotify_track_explicit,
track_duration_ms = spotify_track_duration_ms,
date = WeekID,
week_position = Week.Position,
previous_week_position = Previous.Week.Position,
peak_position = Peak.Position,
weeks_on_chart = Weeks.on.Chart
)

We can also label the levels of the variables “mode” and “key” and combine them into one variable, since a key signature is usually written with both (for example, C major).

billboard_total$mode <- factor(billboard_total$mode, levels = c(0:1), labels = c("minor", "major"))
billboard_total$key <- factor(billboard_total$key, levels = c(0:11), labels = c("C", "C#","D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"))
billboard_total <- unite(billboard_total , key_signature, key, mode, sep = " ")
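
How the numeric codes turn into labels can be checked on a few values. This is a small base R sketch with made-up input vectors; it also shows what happens to a key of -1 (no key detected), which is not among the declared levels.

```r
# Spotify encodes key as pitch class 0-11 and mode as 0 (minor) / 1 (major).
key_labels <- c("C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B")

key  <- c(0, 7, 9, -1)   # -1 means "no key detected"
mode <- c(1, 1, 0, 1)

key_f  <- factor(key, levels = 0:11, labels = key_labels)
mode_f <- factor(mode, levels = 0:1, labels = c("minor", "major"))

# Values outside the declared levels (here -1) become NA in the factor,
# so the combined label is set to NA as well.
key_signature <- ifelse(is.na(key_f), NA, paste(key_f, mode_f))
key_signature
```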

Moreover, we can convert the duration of tracks from milliseconds to seconds, to make the interpretation more straightforward.

billboard_total$track_duration_sec <-billboard_total$track_duration_ms / 1000
billboard_total <- billboard_total[,-c(5)]
billboard_total <- billboard_total %>%
relocate(track_duration_sec, .after = time_signature)

For further analysis we can also create a new variable “song_performer”.

song_performer <- billboard_total %>%
select(song, performer) %>%
unite("song_performer", song:performer, sep = " | ")
billboard_total <- cbind(billboard_total, song_performer)
billboard_total <- billboard_total %>%
relocate(song_performer, .before = song)

Finally, we create a year variable. In this case the format of the date variable has to be changed and then the year can be extracted.

billboard_total$date <-  as.Date(billboard_total$date, format = "%m/%d/%Y")

billboard_total$year <- as.numeric(format(billboard_total$date, "%Y"))
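
The date conversion can be tried on a couple of strings first. The sketch below assumes the WeekID values look like "8/2/1958" (month/day/year), which is what the format string above implies; the example dates themselves are made up.

```r
# Example strings in the assumed month/day/year layout.
week_id <- c("8/2/1958", "12/28/2019")

d <- as.Date(week_id, format = "%m/%d/%Y")
year  <- as.numeric(format(d, "%Y"))  # numeric year, e.g. for grouping
month <- format(d, "%m")              # two-digit month as character

# A string that does not match the format becomes NA instead of raising an error,
# so it is worth checking for NAs after the conversion.
bad <- as.Date("1958-08-02", format = "%m/%d/%Y")
```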

Now let’s check the final dataset.

str(billboard_total)

The final dataset contains 24 variables and 149,926 observations for the period 1958–2019. Some of the variables in the data set can be explained in the following way:

Acousticness: confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.

Energy: energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale.

Key: the estimated overall key of the track. Integers map to pitches using standard Pitch Class notation, for example, 0 = C, 1 = C#/Db, 2 = D, and so on. If no key was detected, the value is -1.

Mode: mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.

Liveness: detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.

Loudness: overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.

Speechiness: speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.

Valence: measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g., happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g., sad, depressed, angry).

Duration: duration of the song in seconds.

Explorative data analysis

After preparing the dataset we can now look more in detail at its different facets.

Distribution of variables

To check the distribution of variables, we first need to reduce the data to unique songs based on song_performer (as some artists have songs with the same name). Otherwise the analysis will be skewed, since many tracks appear on the Billboard charts numerous times.

billboard_total_distinct  <- distinct(billboard_total, song_performer, .keep_all = TRUE)
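
What distinct() keeps is easy to see on a toy table (values made up for illustration). With .keep_all = TRUE the first row of each song_performer is retained, so week-level columns such as the chart position come from whichever row happens to appear first.

```r
library(dplyr)

# Toy weekly data: "Song A | X" charts twice, "Song A | Y" once.
toy <- data.frame(
  song_performer = c("Song A | X", "Song A | X", "Song A | Y"),
  week_position = c(10, 4, 50),
  energy = c(0.8, 0.8, 0.3)
)

# One row per song_performer; all other columns are taken from the first match.
toy_distinct <- distinct(toy, song_performer, .keep_all = TRUE)
toy_distinct
```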

In this article I will use both histograms and density plots to check the distribution of variables in the dataset. Firstly, let’s have a look at the distribution of music features measured on the scale 0–1.

ggplot(billboard_total_distinct) +
# alpha is a constant, so it is set outside aes(); inside aes() it is treated as data, rescaled, and adds a legend entry
geom_density(aes(danceability, fill ="danceability"), alpha = 0.5) +
geom_density(aes(energy, fill ="energy"), alpha = 0.5) +
geom_density(aes(speechiness, fill ="speechiness"), alpha = 0.5) +
scale_x_continuous(name = "Danceability, Energy, Speechiness") +
scale_y_continuous(name = "Density") +
ggtitle("Danceability, Energy and Speechiness of Hot 100 Songs") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"),
text = element_text(size = 10)) +
theme(legend.title=element_blank()) +
scale_fill_brewer(palette="Accent")

ggplot(billboard_total_distinct) +
geom_density(aes(acousticness, fill ="acousticness"), alpha = 0.5) +
geom_density(aes(valence, fill ="valence"), alpha = 0.5) +
geom_density(aes(liveness, fill ="liveness"), alpha = 0.5) +
scale_x_continuous(name = "Acousticness, Liveness, Valence") +
scale_y_continuous(name = "Density") +
ggtitle("Acousticness, Valence and Liveness of Hot 100 Songs") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"),
text = element_text(size = 10)) +
theme(legend.title=element_blank()) +
scale_fill_brewer(palette="Accent")

Based on these plots a few things about Billboard Hot 100 songs can be noted:

  • most songs have quite a high danceability, between 0.5 and 0.75;
  • the distribution of energy across songs is almost normal, meaning that songs range from up-beat to slow and calm options;
  • the vast majority of tracks have speechiness between 0 and 0.1, meaning that they consist of music and singing rather than spoken words;
  • acousticness is strongly right-skewed, with most values close to 0. Highly acoustic songs mainly contain orchestral instruments, an unaltered voice, acoustic guitars and natural drum kits, whereas less acoustic songs contain, for example, synthesizers, electric guitars and amplified instruments;
  • liveness is also strongly right-skewed, with most values close to 0, which makes sense as most of the tracks on Billboard are studio recordings, with a low probability of an audience being present;
  • valence of most of the songs is higher than 0.5, meaning that most Billboard tracks are indeed positive.

Now we can have a look at the distribution of loudness, duration in seconds, time signature and tempo.

loudness_density <- ggplot(billboard_total_distinct) +
geom_density(aes(loudness, fill ="loudness")) +
scale_x_continuous(name = "Loudness") +
scale_y_continuous(name = "Density") +
ggtitle ("Loudness of Hot 100 Songs") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 10, face = "bold"),
text = element_text(size = 10), legend.position = 'none') +
theme(legend.title = element_blank()) +
scale_fill_brewer(palette = "Dark2")
duration_density <- ggplot(billboard_total_distinct) +
geom_density(aes(track_duration_sec, fill ="duration")) +
scale_x_continuous(name = "Duration") +
scale_y_continuous(name = "Density") +
ggtitle ("Duration of Hot 100 Songs") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 10, face = "bold"),
text = element_text(size = 10), legend.position = 'none') +
theme(legend.title = element_blank()) +
scale_fill_brewer(palette = "Paired")
time_signature_density <- ggplot(billboard_total_distinct) +
geom_density(aes(time_signature, fill ="time signature")) +
scale_x_continuous(name = "Time signature") +
scale_y_continuous(name = "Density") +
ggtitle ("Time Signature of Hot 100 Songs") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 10, face = "bold"),
text = element_text(size = 10), legend.position = 'none') +
theme(legend.title = element_blank()) +
scale_fill_brewer(palette = "RdBu")
tempo_density <- ggplot(billboard_total_distinct) +
geom_density(aes(tempo, fill ="tempo")) +
scale_x_continuous(name = "Tempo") +
scale_y_continuous(name = "Density") +
ggtitle ("Tempo of Hot 100 Songs") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 10, face = "bold"),
text = element_text(size = 10), legend.position = 'none') +
theme(legend.title = element_blank()) +
scale_fill_viridis_d() # "viridis" is not a ColorBrewer palette, so the ggplot2 viridis scale is used instead
grid.arrange(loudness_density, duration_density,time_signature_density, tempo_density, nrow = 4)

Based on these graphs the following insights can be derived:

  • most of the Billboard tracks indeed are moderately loud;
  • their duration is on average 225 seconds (3 minutes and 45 seconds);
  • the vast majority of tracks have a time signature of 4, i.e. four quarter-note beats per bar (4/4), which is very common in the music industry;
  • most of the tracks have the tempo of around 125 beats per minute (BPM).

According to the distribution of track popularity, the mean track popularity is around 40–50, with a lot of tracks having a popularity around 50, as measured by Spotify.

ggplot(billboard_total_distinct, aes(x=spotify_track_popularity)) +
geom_histogram(bins = 30, fill="#00AFBB", alpha = 0.4, color = "#00AFBB")+
geom_vline(xintercept = mean(billboard_total_distinct$spotify_track_popularity), color = "red", linetype = "dashed")+
theme_minimal() +
theme (plot.title = element_text(hjust = 0.5, size = 12, face = "bold")) +
labs(title = "Distribution of Spotify Track Popularity", x="Spotify track popularity",y="Count")

Now we also can plot the distribution of key and mode across all tracks. To see how the maximum bar can be highlighted in ggplot, please check this link.

billboard_total_distinct %>%
mutate(highlight_key = key_signature == "C major") %>% # TRUE only for the highlighted key
ggplot(aes(key_signature))+
geom_bar(aes(fill = highlight_key))+
scale_fill_manual(values = c('dark blue', 'dark red')) + #choosing colors for bars
theme_minimal() +
theme (plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), legend.position = 'none',
panel.grid = element_line(size = 0.25, linetype = 'solid',
colour = "light grey"),
axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))+
labs(title = "Distribution of Key and Mode among Hot 100 Songs", x="Key",y="Count")
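
The highlighted key is hard-coded as 'C major' above. A minimal sketch of how the most frequent key could be looked up from the data instead, using a made-up vector of key signatures; the names keys, top_key and highlight are introduced here for illustration only.

```r
# Made-up key signatures standing in for billboard_total_distinct$key_signature.
keys <- c("C major", "G major", "C major", "A minor")

# table() counts each key; which.max() returns the first key with the highest count.
top_key <- names(which.max(table(keys)))

# Logical flag that can be mapped to the fill aesthetic of the bar chart.
highlight <- keys == top_key
```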

The most popular keys among Billboard tracks are C major, G major and D major. It’s logical that major keys dominate the distribution as they are perceived to be happy, while minor keys are perceived to be sad. According to Christian Schubart’s description of such musical keys, C major is characterized as “completely pure, its character is innocence, simplicity, naïvety, children’s talk”, while for G major “every gentle and peaceful emotion of the heart is correctly expressed” and D major is associated with “the inviting symphonies, the marches, holiday songs and heaven-rejoicing choruses”.

Variation of some music features over time (1958–2019)

Next we can look at how music features varied during 1958–2019 to discover some general patterns. Firstly, let’s compute the mean danceability of Billboard songs for each year. Here we again use the data deduplicated by song_performer, to avoid duplicated records of music features.

danceability_trend <- billboard_total_distinct %>%
select(danceability, year) %>%
group_by(year) %>%
summarize(danceability_mean = mean(danceability))
ggplot(danceability_trend, aes(x = year, y = danceability_mean)) +
geom_line(color = "#00AFBB", size = 1) +
stat_smooth(
color = "#FC4E07", fill = "#FC4E07",
method = "loess" # adding the smooth line
) + theme_minimal() +
theme (plot.title = element_text(hjust = 0.5, size = 12, face = "bold")) +
labs(title = "Danceability of Hot 100 Songs (1958-2019)", x="Year",y="Mean danceability")

Interestingly, the mean danceability of Billboard songs has increased from 0.55 in 1958 to almost 0.67 in 2019. We can therefore infer that consumers’ music preferences have shifted towards more danceable music, which resulted in such songs getting onto the Billboard charts.

We can perform the same actions for other variables, for example, energy, valence, acousticness and loudness.

energy_trend <- billboard_total_distinct %>%
select(energy, year) %>%
group_by(year) %>%
summarize(energy_mean = mean(energy))
valence_trend <- billboard_total_distinct %>%
select(valence, year) %>%
group_by(year) %>%
summarize(valence_mean = mean(valence))
acousticness_trend <- billboard_total_distinct %>%
select(acousticness, year) %>%
group_by(year) %>%
summarize(acousticness_mean = mean(acousticness))
loudness_trend <- billboard_total_distinct %>%
select(loudness, year) %>%
group_by(year) %>%
summarize(loudness_mean = mean(loudness))
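
The four summaries above repeat the same pattern, so they could also be computed in one pass. A hedged sketch with across() (available in dplyr 1.0.0 and later) on a made-up toy table; feature_trends is a name introduced here and is not used elsewhere in the article.

```r
library(dplyr)

# Toy data (hypothetical values): two songs per year, two features.
toy <- data.frame(
  year = c(1958, 1958, 2019, 2019),
  energy = c(0.4, 0.6, 0.7, 0.9),
  valence = c(0.8, 0.6, 0.5, 0.3)
)

# One row per year, one "<feature>_mean" column per selected feature.
feature_trends <- toy %>%
  group_by(year) %>%
  summarize(across(c(energy, valence), mean, .names = "{.col}_mean"))

feature_trends
```

The result has the same column naming (energy_mean, valence_mean) as the separate data frames, so the plotting code would only need a different data argument.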

And here is the visualization itself.

ggplot(energy_trend, aes(x = year, y = energy_mean)) + 
geom_line(color = "#00AFBB", size = 1) +
stat_smooth(
color = "#FC4E07", fill = "#FC4E07",
method = "loess"
) + theme_minimal() +
theme (plot.title = element_text(hjust = 0.5, size = 12, face = "bold")) +
labs(title = "Energy of Hot 100 Songs (1958-2019)", x="Year",y="Mean energy")
ggplot(valence_trend, aes(x = year, y = valence_mean)) + 
geom_line(color = "#00AFBB", size = 1) +
stat_smooth(
color = "#FC4E07", fill = "#FC4E07",
method = "loess"
) + theme_minimal() +
theme (plot.title = element_text(hjust = 0.5, size = 12, face = "bold")) +
labs(title = "Valence of Hot 100 Songs (1958-2019)", x="Year",y="Mean valence")
ggplot(acousticness_trend, aes(x = year, y = acousticness_mean)) + 
geom_line(color = "#00AFBB", size = 1) +
stat_smooth(
color = "#FC4E07", fill = "#FC4E07",
method = "loess"
) + theme_minimal() +
theme (plot.title = element_text(hjust = 0.5, size = 12, face = "bold")) +
labs(title = "Acousticness of Hot 100 Songs (1958-2019)", x="Year",y="Mean acousticness")
ggplot(loudness_trend, aes(x = year, y = loudness_mean)) + 
geom_line(color = "#00AFBB", size = 1) +
stat_smooth(
color = "#FC4E07", fill = "#FC4E07",
method = "loess"
) + theme_minimal() +
theme (plot.title = element_text(hjust = 0.5, size = 12, face = "bold")) +
labs(title = "Loudness of Hot 100 Songs (1958-2019)", x="Year",y="Mean loudness")

Based on these plots, we can highlight a few striking things:

  • the energy of Billboard hits has increased substantially, meaning that songs became faster and more upbeat;
  • the valence of songs between 1958 and 2019 has decreased, which leads to the conclusion that Billboard songs became sadder;
  • the acousticness of Billboard hits also dropped significantly, which, of course, makes sense, as songs have been using more and more synthesizers and altered voices instead of original voices and orchestral instruments;
  • the loudness of Billboard songs has increased, meaning that songs got louder to some extent.

Last but not least, we can investigate if Billboard hits became shorter over time.

duration_trend <- billboard_total_distinct %>%
select(track_duration_sec, year) %>%
group_by(year) %>%
summarize(duration_mean = mean(track_duration_sec))
ggplot(duration_trend, aes(x = year, y = duration_mean)) +
geom_line(color = "#00AFBB", size = 1) +
stat_smooth(
color = "#FC4E07", fill = "#FC4E07",
method = "loess"
) + theme_minimal() +
theme (plot.title = element_text(hjust = 0.5, size = 12, face = "bold")) +
labs(title = "Duration of Hot 100 Songs (1958-2019)", x="Year",y="Mean duration")

As we see, the duration of Billboard hits was increasing until the 1990s, after which it started dropping. The finding is in line with previous research on the duration of tracks and its transformation over time. For example, according to UK record label Ostereo, the length of the average number one song has shrunk by almost a fifth over the past two decades. Ostereo suggested that streaming platform algorithms are influencing song length and encouraging artists to record shorter songs. As more people skip before a song has ended, streaming algorithms may see this as a signal of dissatisfaction. Consequently, they are less likely to recommend a longer song that has been skipped to other users, which means it is less likely to become popular.

Variation of some music features over months (1999–2019)

The purpose of this part is to see how the music features of Billboard songs vary across months during 1999–2019. Firstly, we should extract the month variable.

billboard_total$month <- format(billboard_total$date, "%m")

Then we can subset the necessary data on energy, for example.

energy_month <- billboard_total %>%
select(energy, month, year) %>%
filter(year %in% c(1999:2019)) %>%
group_by(month) %>%
arrange(month)

Now we can find the mean of energy for each month and each year.

energy_month <- energy_month %>%
group_by(month, year) %>%
summarize(mean_energy = mean(energy, na.rm = TRUE))

For the visualization in this case we can use boxplots. A boxplot is a standardized way of displaying the distribution of data based on a five number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”). It can tell us about outliers and what their values are.
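
The plotting code for the energy boxplot is not shown in the text. A minimal sketch, assuming it follows the same pattern as the danceability and valence plots in this section; the data frame energy_month_toy is filled with random numbers purely so that the example runs on its own, and in the article the summarized energy_month would be used instead.

```r
library(ggplot2)

# Synthetic stand-in for energy_month (month, year, mean_energy); values are random.
set.seed(1)
energy_month_toy <- expand.grid(
  month = sprintf("%02d", 1:12),
  year = 1999:2019
)
energy_month_toy$mean_energy <- runif(nrow(energy_month_toy), 0.55, 0.75)

# One box per month; each point is the mean energy of one month in one year.
energy_boxplot <- ggplot(energy_month_toy, aes(month, mean_energy, col = month)) +
  geom_boxplot() +
  geom_jitter(width = 0.15) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"),
        legend.position = "none") +
  labs(title = "Energy of Hot 100 Songs over Months (1999-2019)",
       x = "Month", y = "Mean energy")

energy_boxplot
```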

Here we see that the energy of Billboard hits from 1999 to 2019 is higher during April–September than in January, February and March. It makes sense, as during spring and summer songs that are more upbeat, danceable and happy tend to gain more popularity. The so-called “summer hit” phenomenon is also explained in this article by the NY Times. The relatively high energy of songs in December can be explained by the popularity of Christmas tracks at this time.

What about danceability and valence?

danceability_month <- billboard_total %>%
select(danceability, month, year) %>%
filter(year %in% c(1999:2019)) %>%
group_by(month) %>%
arrange(month)

danceability_month <- danceability_month %>%
group_by(month, year) %>%
summarize(mean_danceability = mean(danceability, na.rm = TRUE))

ggplot(danceability_month, aes(month, mean_danceability, col = month))+
geom_boxplot()+
geom_point(position=position_jitterdodge()) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), legend.position = 'none') +
labs(title = "Danceability of Hot 100 Songs over Months (1999-2019)", x="Month",y="Mean danceability") # all charting songs are included, not only number-one hits

Mean danceability of Billboard hits does not differ significantly during summer months. One reason for this could be that labels produce more and more tracks with higher danceability, as such songs tend to perform better on streaming platforms. Secondly, one can also claim that hit songs become increasingly similar over time (Thompson and the “Shazam Effect”). However, according to other research, “songs that sound too much like previous and contemporaneous productions — those that are highly typical — are less likely to succeed”.

valence_month <- billboard_total %>%
select(valence, month, year) %>%
filter(year %in% c(1999:2019)) %>%
group_by(month) %>%
arrange(month)

valence_month <- valence_month %>%
group_by(month, year) %>%
summarize(mean_valence = mean(valence, na.rm = TRUE))

ggplot(valence_month, aes(month, mean_valence, col = month))+
geom_boxplot()+
geom_point(position=position_jitterdodge()) + # jittered points on top of the boxplots
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), legend.position = 'none') +
labs(title = "Valence of Hot 100 Songs over Months (1999-2019)", x="Month",y="Mean valence") # all charting songs are included, not only number-one hits

With valence (= positiveness) we can clearly see that it is higher during summer months than in spring (in line with the previous finding on energy). In addition, valence is quite high during autumn months. In my opinion, this can be explained by the fact that people might listen to happier songs (i.e., songs with higher valence) during autumn and winter to cheer themselves up, for instance, when the weather is bad or because of the Christmas season.

Valence-arousal model

Many systems that model emotion in music use two dimensions: valence and arousal. In the model, emotions are plotted on a graph with the first dimension being how positive or negative the emotion is (valence), and the second dimension being how intense the physical arousal of the emotion is (arousal). For example, “happy” is a high-valence, high-arousal affective state, while “stressed” is a low-valence, high-arousal state.

Based on this information I decided to demonstrate the valence-arousal model in terms of Billboard hits, based on such music features as valence and energy (=arousal). For this purpose the four quadrants were determined:

  • Q1 — “happy”: valence > 0.5, arousal (energy) > 0.5;
  • Q2 — “excited”: valence <= 0.5, arousal (energy) > 0.5;
  • Q3 — “sad”: valence <= 0.5, arousal (energy) <= 0.5;
  • Q4 — “peaceful”: valence > 0.5, arousal (energy) <= 0.5.
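
The four rules above can be written as a small helper to check how boundary values are assigned. This is a base R sketch; the function name va_quadrant and its cut argument are introduced here for illustration and are not part of the original analysis.

```r
# Classify songs into the four valence-arousal quadrants defined above.
# 'cut' is the threshold on both axes (0.5 in the text).
va_quadrant <- function(valence, energy, cut = 0.5) {
  ifelse(valence > cut & energy > cut, "happy",
  ifelse(valence <= cut & energy > cut, "excited",
  ifelse(valence <= cut & energy <= cut, "sad", "peaceful")))
}

# One made-up song per quadrant.
quadrants <- va_quadrant(valence = c(0.9, 0.2, 0.2, 0.9),
                         energy  = c(0.8, 0.8, 0.3, 0.3))
quadrants

# A song exactly on both thresholds falls into the "sad" quadrant,
# because the rules use <= for the lower side.
va_quadrant(0.5, 0.5)
```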

To begin with, let’s look at the distribution of Billboard top hits (that reached position 1) according to their valence-arousal across the whole time period covered by the dataset (1958–2019). Again, as some tracks are featured on Billboard numerous times, the grouped-by-song_performer subset is used.

billboard_total_distinct %>%
filter(peak_position == 1) %>%
mutate(quadrant = case_when(energy > 0.5 & valence > 0.5 ~ "happy",
energy <= 0.5 & valence > 0.5 ~ "peaceful",
energy <= 0.5 & valence <= 0.5 ~ "sad",
TRUE ~ "excited")) %>%
ggplot(aes(x = valence, y = energy, color = quadrant)) +
geom_vline(xintercept = 0.5) + # plot vertical line
geom_hline(yintercept = 0.5) + # plot horizontal line
geom_point() +
theme_minimal() +
theme (plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), plot.subtitle = element_text(hjust = 0.5, size = 10, face = "italic")) +
labs(title = "Valence-Arousal Model (1958-2019)", x = "Valence", y = "Arousal (Energy)", subtitle = "Based on Top Hits from Hot 100") +
guides(color=guide_legend("Quadrant")) # add guide properties by aesthetic

Clearly, most Billboard top hits can be considered “happy”, with a significant share of “sad” tracks. But is the distribution the same for the last 20 years?

billboard_total_distinct %>%
filter(peak_position == 1 & year %in% c(1999:2019)) %>%
mutate(quadrant = case_when(energy > 0.5 & valence > 0.5 ~ "happy",
energy <= 0.5 & valence > 0.5 ~ "peaceful",
energy <= 0.5 & valence <= 0.5 ~ "sad",
TRUE ~ "excited")) %>%
ggplot(aes(x = valence, y = energy, color = quadrant)) +
geom_vline(xintercept = 0.5) + # plot vertical line
geom_hline(yintercept = 0.5) + # plot horizontal line
geom_point() +
theme_minimal() +
theme (plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), plot.subtitle = element_text(hjust = 0.5, size = 10, face = "italic")) +
labs(title = "Valence-Arousal Model (1999-2019)", x = "Valence", y = "Arousal (Energy)", subtitle = "Based on Top Hits from Hot 100") +
guides(color=guide_legend("Quadrant")) + # add guide properties by aesthetic
scale_y_continuous(limits=c(0, 1)) #re-scaling the y-axis

Here we can state that during 1999–2019 the prevailing majority of top Billboard hits from Hot 100 can be considered “excited” and “happy”, with very few “sad” and “peaceful” tracks, although during the whole period of comparison (1958–2019) the share of sad top hits is quite high (the previous plot).

Number of artists and tracks by years

To explore the top tracks and performers as well as their variation over the years, I decided to go back to the initial Billboard Hot 100 dataset, as the final dataset used in this article was significantly reduced due to NA’s in the Spotify features of many tracks. Of course, we could still use the reduced dataset, but then the findings on top tracks and performers from the Billboard Hot 100 would not be as representative. To start working with the Billboard Hot 100 data only, we need to transform it, following the steps described before.

billboard_hot100 <- na.omit(billboard_hot100)

billboard_hot100 <- billboard_hot100[,-c(1, 7, 6)]

In addition, we can rename and relocate some columns.

billboard_hot100 <- billboard_hot100 %>% 
rename(
song = Song,
performer = Performer,
date = WeekID,
week_position = Week.Position,
previous_week_position = Previous.Week.Position,
peak_position = Peak.Position,
weeks_on_chart = Weeks.on.Chart
)

billboard_hot100 <- billboard_hot100 %>%
relocate(date, .after = performer)

billboard_hot100 <- billboard_hot100 %>%
relocate(week_position, .after = date)

As before, we can also create a new variable “song_performer”.

song_performer <- billboard_hot100 %>%
select(song, performer) %>%
unite("song_performer", song:performer, sep = " | ")
billboard_hot100 <- cbind(billboard_hot100, song_performer)

billboard_hot100 <- billboard_hot100 %>%
relocate(song_performer, .before = song)

Finally, we can create a year variable.

billboard_hot100$date <-  as.Date(billboard_hot100$date, format = "%m/%d/%Y")

billboard_hot100$year <- as.numeric(format(billboard_hot100$date, "%Y"))

Now we can start the analysis itself. Firstly, we can discover the number of unique artists who have entered Billboard charts every year since 1958.

performers_years <- billboard_hot100 %>%
select(performer, year) %>%
group_by(year) %>%
summarise(unique_performers = n_distinct(performer))

Here we will use a lollipop plot. More information on its syntax can be found here.

ggplot(performers_years, aes(x = year, y = unique_performers)) +
geom_segment( aes(x=year, xend=year, y=0, yend=unique_performers), color="grey") +
geom_point(color="dark blue", size=3) +
theme_minimal() +
theme (plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), legend.position = 'none') + labs(title = "Number of Unique Artists on Hot 100 per Year (1958-2019)", x="Year",y="Count")

As we see, the number of unique artists on Billboard charts has slowly decreased over time. But what about tracks?

songs_years <- billboard_hot100 %>%
select(year, song_performer) %>%
group_by(year) %>%
summarise(unique_songs = n_distinct(song_performer))

ggplot(songs_years, aes(x = year, y = unique_songs)) +
geom_segment( aes(x=year, xend=year, y=0, yend=unique_songs), color="grey") +
geom_point( color="dark blue", size=3) +
theme_minimal() +
theme (plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), legend.position = 'none') + labs(title = "Number of Unique Songs on Hot 100 per Year (1958-2019)", x="Year",y="Count")

Here we see a very steep decrease in the number of unique songs that entered the Billboard chart, from almost 800 in the 1960s to between 400 and 500 in the 21st century. It also makes sense to check whether the same pattern can be seen for number-one hits.

songs_years_1 <- billboard_hot100 %>%
select(song_performer, year, peak_position) %>%
filter(peak_position == 1) %>%
group_by(year) %>%
summarise(unique_songs = n_distinct(song_performer))

ggplot(songs_years_1, aes(x = year, y = unique_songs)) +
geom_segment( aes(x=year, xend=year, y=0, yend=unique_songs), color="grey") +
geom_point( color="dark blue", size=3) +
theme_minimal() +
theme (plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), legend.position = 'none') + labs(title = "Number of Top Hot 100 Hits per Year (1958-2019)", x="Year",y="Count")

The number of top hits on Billboard has clearly decreased over time as well. Nevertheless, during certain periods the number of top Billboard hits is significantly higher. According to Ordanini and Nunes (2015), these sharp increases can be attributed to technological advancements in the recorded music market. “The first turning point occurs when the CD initially popularized digital audio in line with market statistics (RIAA) in 1986. The second turning point is anchored on the rise of MP3s and P2P file sharing (1999). The third turning point marks the growing success of legitimate digital downloads, corresponding with the advent of the most important downloading music service (i.e., iTunes) in 2004”. These three turning points can be clearly seen in the following chart.

ggplot(songs_years_1, aes(x = year, y = unique_songs)) +
geom_segment( aes(x=year, xend=year, y=0, yend=unique_songs), color="grey") +
geom_point( color="dark blue", size=3) +
scale_x_continuous(breaks = c(1960, 1970, 1980, 1990, 2000, 2010, 2020)) +
geom_vline(xintercept=1986, linetype="dashed", color = "red") +
geom_vline(xintercept=1999, linetype="dashed", color = "red") +
geom_vline(xintercept=2004, linetype="dashed", color = "red") +
theme_minimal() +
theme (plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), legend.position = 'none') + labs(title = "Number of Top Hot 100 Hits per Year (1958-2019)", x="Year",y="Count")

Finally, we can also have a look at the average number of weeks songs spent on Billboard Hot 100 over 1958–2019.

weeks_years <- billboard_hot100 %>%
select(song_performer, year, weeks_on_chart) %>%
group_by(year) %>%
summarise(weeks = mean(weeks_on_chart))

ggplot(weeks_years, aes(x = year, y = weeks)) +
geom_segment( aes(x=year, xend=year, y=0, yend=weeks), color="grey") +
geom_point( color="dark blue", size=3) +
theme_minimal() +
theme (plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), legend.position = 'none') + labs(title = "Average Number of Weeks on Hot 100 per Year (1958-2019)", x="Year",y="Weeks")

Surprisingly, the average number of weeks songs spent on the Hot 100 rose from 6.4 weeks in 1958 to 13.9 weeks in 2019. But what explains the decreasing number of unique artists, songs and number-one songs over time, while the average number of weeks spent on Billboard has risen substantially?
One explanation is the superstar effect described by Rosen (1981). Rosen (1981) describes the phenomenon of superstars in terms of a “relatively small numbers of people who earn enormous amounts of money and dominate the activities in which they engage”. The success of such superstars can be due to the exceptional talent of artists and the scale economies benefitting artists who can cater for larger audiences.
The continuation of this theory is the winner-take-all effect, whereby a few winners (songs) capture a disproportionately large share of the market and go on to become blockbusters. Consequently, the hyper-efficient digital market for music has led to greater convergence with fewer extraordinarily popular songs (blockbusters) and a smaller number of artists who perform them (superstars).
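
One simple way to make the winner-take-all idea concrete is to look at how quickly the cumulative share of chart weeks grows when performers are sorted from most to least present. The following is only a minimal sketch on made-up rows: the `toy_rows` data frame and the `chart_weeks` and `cum_share` column names are assumptions for illustration, not part of the original analysis.

```r
library(dplyr)

# Hypothetical toy data: one row per weekly chart appearance
toy_rows <- data.frame(
  performer = c(rep("Star", 6), rep("Mid", 3), "Minor")
)

top_share <- toy_rows %>%
  count(performer, name = "chart_weeks") %>%   # chart weeks per performer
  arrange(desc(chart_weeks)) %>%               # most present performers first
  mutate(cum_share = cumsum(chart_weeks) / sum(chart_weeks))

top_share
```

The faster `cum_share` approaches 1, the more the chart is dominated by a few performers. Applied to `billboard_hot100` per year, the same idea could show whether this concentration has grown over time.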

Both the superstar theory and winner-take-all effect in the Billboard context are highlighted by the plots above. The following bubble chart demonstrates the number of unique songs over 1958–2019 and the average number of weeks they spent on Hot 100. An interesting finding for further exploration!

songs_years_weeks <- billboard_hot100 %>%
select(song, weeks_on_chart, year) %>%
group_by(year) %>%
mutate(n_songs = n_distinct(song)) %>%
arrange(year)

songs_years_weeks <- songs_years_weeks %>%
group_by(year) %>%
mutate(mean_weeks = mean(weeks_on_chart)) %>%
select(-weeks_on_chart, -song) %>%
distinct(year, .keep_all = TRUE)

ggplot(songs_years_weeks, aes(x = year, y = n_songs, size = mean_weeks)) +
geom_point(shape = 21, colour = "black",
fill = "violetred1")+ #a circle that allows different colours for the outline and fill
scale_x_continuous(breaks = seq(1960, 2020, 10)) +
theme_minimal() +
theme (plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), legend.position = "bottom", legend.direction = "horizontal") +
labs(title = "Distribution of Songs and Mean Weeks on Hot 100 (1958-2019)", x="Year",y="Number of songs",size = "Mean weeks")
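
The two-step `mutate()` / `distinct()` pipeline above can probably be collapsed into a single `summarise()`. The sketch below uses a made-up `toy_chart` data frame in place of `billboard_hot100`. It also shows an alternative measure, `mean_run`, under the assumption that `weeks_on_chart` is a running counter, so that a song's chart run is its maximum value.

```r
library(dplyr)

# Hypothetical toy rows standing in for billboard_hot100
toy_chart <- data.frame(
  year = c(1958, 1958, 1958, 2019, 2019, 2019, 2019),
  song = c("A", "A", "B", "C", "C", "C", "C"),
  weeks_on_chart = c(1, 2, 1, 1, 2, 3, 4)
)

# Same quantities as in the article, in one step
songs_years_weeks_toy <- toy_chart %>%
  group_by(year) %>%
  summarise(
    n_songs = n_distinct(song),
    mean_weeks = mean(weeks_on_chart)   # averages over all weekly rows
  )

# Alternative: average of each song's longest run within the year
per_song_toy <- toy_chart %>%
  group_by(year, song) %>%
  summarise(chart_run = max(weeks_on_chart), .groups = "drop") %>%
  group_by(year) %>%
  summarise(n_songs = n(), mean_run = mean(chart_run))
```

Note that `mean_weeks` averages the counter over every weekly row, so it is a proxy for longevity rather than the mean length of a chart run. `mean_run` is closer to the latter, at the cost of an extra grouping step.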

Top artists and tracks

Next, we can analyze the Billboard artists. To find the artists with the largest number of Hot 100 tracks (excluding featured tracks), we apply some functions from the dplyr package, which is used heavily throughout this article.

performers_entries <- billboard_hot100 %>% 
select(performer, song, song_performer) %>%
group_by(performer) %>%
summarise(n_entries = n_distinct(song_performer)) %>%
arrange(-n_entries) %>%
head(10)

ggplot(performers_entries, aes(x = n_entries,
y = reorder(performer,n_entries))) +
ggalt::geom_lollipop(horizontal = TRUE,
colour = "navy") +
theme_minimal() +
theme (plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), legend.position = 'none') + labs(title = "#Songs Featured on Hot 100 per Artist (1958-2019)", x = "Count", y = "")

The clear leader is Drake (64 songs), followed by The Beatles (62 songs) and Aretha Franklin (61 songs).
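
As an aside, the `arrange()` plus `head()` pattern used for these rankings can also be written with `slice_max()` from dplyr (version 1.0.0 or later), which makes the handling of ties explicit. A minimal sketch on made-up data follows. `toy_entries` is hypothetical, and `n = 2` stands in for the top 10.

```r
library(dplyr)

# Hypothetical toy rows: one row per weekly chart appearance
toy_entries <- data.frame(
  performer = c("P1", "P1", "P1", "P2", "P2", "P3", "P3", "P3"),
  song_performer = c("s1 | P1", "s1 | P1", "s2 | P1", "s3 | P2",
                     "s3 | P2", "s4 | P3", "s5 | P3", "s6 | P3")
)

top_toy <- toy_entries %>%
  group_by(performer) %>%
  summarise(n_entries = n_distinct(song_performer)) %>%
  slice_max(n_entries, n = 2, with_ties = FALSE)  # keep exactly 2 rows

top_toy
```

With `with_ties = TRUE` (the default), performers tied at the cut-off would all be kept, so the result can contain more than `n` rows.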

The tracks with the most weekly appearances on the Billboard Hot 100 are: “Radioactive” by Imagine Dragons (85 times), “Sail” by AWOLNATION (77 times) and “I’m Yours” by Jason Mraz (75 times).

songs_entries <- billboard_hot100 %>% 
select(song_performer) %>%
group_by(song_performer) %>%
summarise(n_entries = n()) %>%
arrange(-n_entries) %>%
head(10)

ggplot(songs_entries, aes(x = n_entries,
y = reorder(song_performer,n_entries))) +
ggalt::geom_lollipop(horizontal = TRUE,
colour = "navy") +
theme_minimal() +
theme (plot.title = element_text(hjust = 1, size = 12, face = "bold"), legend.position = 'none') + labs(title = "#Times Featured on Hot 100 per Song | Artist (1958-2019)", x = "Count", y = "")

We can also look at which artists had the most number-one entries on the Hot 100 during the period (excluding featured tracks).

top_performers <- billboard_hot100 %>% 
select(performer, song, song_performer, peak_position) %>%
group_by(performer) %>%
filter(peak_position == 1) %>%
summarise(n_entries = n_distinct(song_performer)) %>%
arrange(-n_entries) %>%
head(10)

ggplot(top_performers, aes(x = n_entries,
y = reorder(performer,n_entries))) +
ggalt::geom_lollipop(horizontal = TRUE, # lollipop chart
colour = "navy") +
theme_minimal() +
theme (plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), legend.position = 'none') + labs(title = "#Times an Artist Reached 1 on Hot 100 (1958-2019)", x = "Count", y = "")

With an astounding 19 number-one tracks, The Beatles are the top artist in terms of Billboard chart-toppers. Other top performers include Mariah Carey (16 times) and Madonna (12 times).

But what about the artists with the highest maximum number of weeks spent on Billboard?

top_performers_weeks <- billboard_hot100 %>%
select(performer, weeks_on_chart) %>%
group_by(performer) %>%
filter(weeks_on_chart == max(weeks_on_chart)) %>%
arrange(desc(weeks_on_chart)) %>%
head(10)

ggplot(top_performers_weeks, aes(x = weeks_on_chart,
y = reorder(performer,weeks_on_chart))) +
ggalt::geom_lollipop(horizontal = TRUE,
colour = "navy") +
theme_minimal() +
theme (plot.title = element_text(hjust = 0.5, size = 12, face = "bold"), legend.position = 'none') + labs(title ="Maximum #Weeks on Hot 100 per Artist (1958-2019)", x = "Weeks", y = "")

Indeed, Imagine Dragons spent 87 (!) weeks on Billboard, with AWOLNATION and Jason Mraz spending 79 and 76 weeks, respectively. In the same way, we can find the tracks with the highest maximum number of weeks spent on Billboard charts.

songs_performers_weeks <- billboard_hot100 %>%
select(song_performer, weeks_on_chart) %>%
group_by(song_performer) %>%
filter(weeks_on_chart == max(weeks_on_chart)) %>%
arrange(desc(weeks_on_chart)) %>%
head(10)

ggplot(songs_performers_weeks, aes(x = weeks_on_chart,
y = reorder(song_performer,weeks_on_chart))) +
ggalt::geom_lollipop(horizontal = TRUE,
colour = "navy") +
theme_minimal() +
theme (plot.title = element_text(hjust = 1, size = 12, face = "bold"), legend.position = 'none') + labs(title ="Maximum #Weeks on Hot 100 per Song | Performer (1958-2019)", x = "Weeks", y = "")

“Radioactive” by Imagine Dragons spent the highest maximum number of weeks on Billboard: 87 weeks.

Tracks positions on Billboard

Based on the Billboard data, we can also explore how individual tracks moved on the chart within a given time frame. Here I focus on tracks that were on the chart in 2019 and have at least 40 weekly observations in that year (to keep the plot easy to interpret).

artists_positions_2019 <- billboard_hot100 %>%
select(song_performer, date, week_position, year) %>%
group_by(song_performer) %>%
filter(year == 2019) %>%
mutate(n = n())

artists_positions_2019 <- artists_positions_2019 %>%
filter(n >= 40)

ggplot(artists_positions_2019, aes(date, week_position, colour = song_performer)) +
geom_line() +
scale_y_continuous (trans = "reverse", breaks=c(1,25,50,75, 100), limits=c(100,0)) + #reverse the y-scale
scale_x_date(date_labels = "%b %Y") +
theme_minimal() +
theme (plot.title = element_text(hjust = 0.5, size = 12, face = "bold"),
legend.position="bottom",
legend.text = element_text(size=6, face = "italic"),
legend.title = element_blank(),
legend.text.align = 0,
legend.justification = "center") +
labs(title ="Tracks Positions on Hot 100 in 2019", x = "Date", y = "Position")

On this graph we can observe that “Old Town Road” by Lil Nas X Featuring Billy Ray Cyrus topped the Billboard Hot 100 for the longest period of time. The other songs also stayed on the chart for quite a long time.
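
The "keep only songs with enough weekly observations" step can also be written without a helper column, by calling `n()` directly inside a grouped `filter()`. A minimal sketch on made-up data is shown below. `toy_positions` and the threshold `min_weeks` are hypothetical, and the article itself uses 40 for 2019.

```r
library(dplyr)

# Hypothetical toy rows standing in for billboard_hot100
toy_positions <- data.frame(
  song_performer = rep(c("Long Run | X", "Short Run | Y"), times = c(5, 2)),
  year = 2019,
  week_position = c(10, 8, 5, 3, 1, 90, 95)
)

min_weeks <- 4  # assumed threshold for the toy data

long_runners <- toy_positions %>%
  filter(year == 2019) %>%         # rows from the year of interest
  group_by(song_performer) %>%
  filter(n() >= min_weeks) %>%     # keep groups with enough observations
  ungroup()

long_runners
```

Filtering on the year before grouping means the count refers to weeks within that year only, which matches what the plot displays.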

And now we can perform the same analysis based on the data from 1999.

artists_positions_1999 <- billboard_hot100 %>%
select(song_performer, date, week_position, year) %>%
group_by(song_performer) %>%
filter(year == 1999) %>%
mutate(n = n())

artists_positions_1999 <- artists_positions_1999 %>%
filter(n >= 30)

ggplot(artists_positions_1999, aes(date, week_position, colour = song_performer)) +
geom_line() +
theme_minimal() +
scale_y_continuous (trans = "reverse", breaks=c(1,25,50,75, 100), limits=c(100,0)) + #reverse the y-scale
scale_x_date(date_labels = "%b %Y") +
theme (plot.title = element_text(hjust = 0.5, size = 12, face = "bold"),
legend.position="bottom",
legend.text = element_text(size=6, face = "italic"),
legend.title = element_blank(),
legend.text.align = 0,
legend.justification = "center") +
labs(title ="Tracks Positions on Hot 100 in 1999", x = "Date", y = "Position") +
guides(colour=guide_legend(nrow=2))

We can see that even the filtered songs from 1999 (with at least 30 observations) stayed on the Hot 100 for a shorter time than songs in 2019, which is consistent with our previous finding that songs spend more and more time on the Hot 100.

Finally, it’s worth looking at some all-time favorites that return to the Billboard Hot 100 over the years. One such example is “All I Want For Christmas Is You” by Mariah Carey. The song was released in 1994; however, it was ineligible for inclusion on the Billboard Hot 100 because it was not released commercially as a single in any physical format. The song appeared on and topped the Billboard Hot 100 Re-currents chart during the following years, as recurrent singles were ineligible for the Billboard Hot 100. In 2012, after the recurrent rule was revised to allow all songs in the top 50 onto the Billboard Hot 100 chart, the single re-entered the chart at №29 and peaked at №21 for the week ending January 5, 2013. As can be seen on the following plot, the song has re-entered the Billboard Hot 100 during the holiday season every year starting from 2012, and in 2019 it even topped the chart.

mariah_positions <- billboard_hot100 %>%
select(song_performer, date, week_position, year) %>%
filter(song_performer == "All I Want For Christmas Is You | Mariah Carey")
mariah_positions
mariah_positions$year_factor <- factor(mariah_positions$year, levels= c(2012, 2013, 2014, 2015, 2016, 2017, 2018 ,2019), labels=c("2012", "2013", "2014", "2015", "2016", "2017", "2018", "2019"))

ggplot(mariah_positions, aes(date, week_position, colour = year_factor)) +
geom_line() +
scale_y_continuous (trans = "reverse", breaks=c(1,25,50,75, 100), limits=c(100,0)) + #reverse the y-scale
theme_minimal() +
scale_x_date(date_breaks = "1 year", date_labels = "Dec %Y") +
theme (plot.title = element_text(hjust = 0.5, size = 12, face = "bold"),
legend.position="bottom",
legend.text = element_text(size=6, face = "italic"),
legend.title = element_blank(),
legend.text.align = 0,
legend.justification = "center") +
labs(title ="All I Want For Christmas Is You | Mariah Carey on Hot 100", x = "Date", y = "Position") +
guides(colour=guide_legend(nrow=2))

Over the years the song has performed strongly on the Hot 100, and its positions have even improved. Apparently “All I Want For Christmas Is You”, like wine, gets better with time.
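
One caveat with colouring by calendar year is that a holiday chart run spanning December and January is split across two colours. A possible workaround, sketched here on made-up dates, is to assign January weeks to the previous year's holiday season. The `season` column and the `toy_dates` data frame are hypothetical and not part of the original analysis.

```r
library(dplyr)

# Hypothetical chart weeks around two holiday seasons
toy_dates <- data.frame(
  date = as.Date(c("2018-12-22", "2018-12-29", "2019-01-05", "2019-12-21")),
  week_position = c(7, 3, 5, 1)
) %>%
  mutate(
    year = as.integer(format(date, "%Y")),
    month = as.integer(format(date, "%m")),
    # January weeks belong to the season that started the previous December
    season = if_else(month == 1L, year - 1L, year)
  )

toy_dates
```

Using `factor(season)` instead of `year_factor` as the colour aesthetic would then draw each holiday run as one line.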

The largest increase/decrease in Billboard position over a week

Here we would like to investigate the artists with the largest position increase/decrease on Billboard within a single week.

performer_weekly_change <- billboard_hot100 %>%
select(performer, week_position, previous_week_position)

For this we need to find the difference between the current week position and the previous week position.

performer_weekly_change$difference <- performer_weekly_change$previous_week_position - performer_weekly_change$week_position

Then we can calculate the max/min difference for each performer.

performer_weekly_change <- performer_weekly_change %>%
group_by(performer) %>%
mutate(maxdifference = max(difference, na.rm = TRUE),
mindifference = min(difference, na.rm = TRUE))
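
A small caveat: for performers whose rows all have a missing `previous_week_position` (for example, only debut weeks), `max(..., na.rm = TRUE)` and `min(..., na.rm = TRUE)` return `-Inf` and `Inf` with a warning. One way around this, sketched on made-up data, is to drop the missing differences first and then summarise to one row per performer. The `toy_weekly` data frame is hypothetical.

```r
library(dplyr)

# Hypothetical toy rows: NA marks a debut week with no previous position
toy_weekly <- data.frame(
  performer = c("A", "A", "A", "B", "B", "C"),
  week_position = c(50, 20, 35, 90, 95, 40),
  previous_week_position = c(NA, 50, 20, NA, 90, NA)
)

toy_change <- toy_weekly %>%
  mutate(difference = previous_week_position - week_position) %>%
  filter(!is.na(difference)) %>%      # performer "C" drops out here
  group_by(performer) %>%
  summarise(
    maxdifference = max(difference),  # largest climb (positive = moved up)
    mindifference = min(difference)   # largest fall (negative = moved down)
  )

toy_change
```

Because `summarise()` already returns one row per performer, the later `distinct()` calls would no longer be needed with this variant.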

And display top-10 artists with the largest decrease and increase in a position.

performer_weekly_change_min <- performer_weekly_change %>%
select(performer, mindifference) %>%
distinct(performer, .keep_all = TRUE) %>%
arrange(desc(mindifference)) %>%
tail(10)

performer_weekly_change_max <- performer_weekly_change %>%
select(performer, maxdifference) %>%
distinct(performer, .keep_all = TRUE) %>%
arrange(desc(maxdifference)) %>%
head(10)

The next step is to merge these 2 dataframes. However, to display negative and positive differences in one plot, we first have to give the difference columns the same name.

names(performer_weekly_change_min)[2] <- "difference"

names(performer_weekly_change_max)[2] <- "difference"

weekly <- rbind(performer_weekly_change_max, performer_weekly_change_min)
weekly
#visualization
ggplot(data = weekly,
aes(x = reorder(performer, difference), y = difference,
fill = difference > 0))+ #to distinguish between positive and negative differences
geom_bar(stat = "identity")+
coord_flip()+
theme_minimal() +
theme (plot.title = element_text(hjust = 1, size = 12, face = "bold"), legend.position = 'none') + labs(title = "Artists with the Highest Weekly Changes on Hot 100", x = "", y = "Weekly change") +
guides(fill = "none")

Here we can see that the largest weekly increases in Billboard position belong to Taylor Swift Featuring Brendon Urie (+98 positions), Kelly Clarkson (+96 positions) and Britney Spears (+95 positions). The largest decreases belong to Javier Colon (-79 positions), Jordan Smith (-78 positions) and 5 Seconds of Summer (-77 positions).

Key insights

To sum it up, the above explorative analysis has provided insights into different aspects of Billboard Hot 100 tracks.

1. Most Billboard songs have quite a high danceability, low speechiness, above-average valence (positiveness), are moderately loud and last on average 3 min 45 seconds. In addition, the prevailing keys and modes among Hot 100 songs are C major, G major and D major, which are perceived to be positive and happy.

2. The danceability, energy and loudness of Billboard songs have substantially increased over 1958–2019, whereas valence has decreased (songs have become sadder), as has acousticness (songs now use more synthesizers and altered voices). Meanwhile, the duration of Billboard tracks peaked in the 1990s and has been dropping since, presumably due to the influence of streaming services.

3. Among top Hot 100 hits, energy and valence are higher during the summer months (consistent with the summer-hit phenomenon); however, no such pattern can be seen in the distribution of danceability over months.

4. According to the valence-arousal model, which is often used in research on musical emotion, most top hits from the Hot 100 (1958–2019) are considered to be “happy”, with high energy (=arousal) and valence; however, the share of “sad” tracks is also high. Nevertheless, the same model applied to 1999–2019 data only shows that very few top tracks are perceived as “sad” or “peaceful”.

5. In line with the superstar theory and the winner-take-all effect, the number of unique songs on the Billboard Hot 100 and of number-one hits, as well as the number of unique artists, have substantially decreased since 1958, while the average number of weeks spent by songs has significantly increased. This can be explained by the emergence of superstars dominating the Billboard charts as well as by the impact of the digital music market.

6. Drake is a leader in terms of the number of songs featured on Hot 100 (64 songs) and “Radioactive” by Imagine Dragons appeared 85 times on the chart. Furthermore, The Beatles reached the top position on Hot 100 19 times while “Radioactive” by Imagine Dragons spent 87 weeks on the chart.

7. Tracks indeed spent much more time on the Hot 100 in 2019 than in 1999, as shown by the track positions during these two years. In addition, the all-time favorite “All I Want For Christmas Is You” by Mariah Carey has re-entered the Hot 100 every year since 2012, and its chart positions keep improving.

8. Taylor Swift Featuring Brendon Urie has experienced the largest weekly increase in a Hot 100 position (+98 positions), while the largest decrease belongs to Javier Colon (-79 positions).

Overall, the explorative analysis uncovered a lot of exciting things about Billboard tracks. It could still be improved by enriching the Spotify dataset (it currently contains a lot of NAs, which significantly reduced the final combined dataset) as well as by a more detailed investigation of some music features and artist/song achievements.

Stay tuned for Part 2 of Billboard data analysis and make sure to give feedback in the comments. There is always a lot of room for improvement :)

“In God we trust, all other must bring data”

Vladyslav Kushnir
