Billboard data analysis in R (1958–2019). Part 2

How to predict a top hit on Billboard Hot 100?

Vladyslav Kushnir
15 min read · Mar 27, 2021

After performing the exploratory analysis on Billboard data in Part 1 (https://vladyslavkushnir.medium.com/billboard-data-analysis-1958-2019-part-1-3afd4371627c), I got really excited about deriving some analytical insights from it. Therefore, the merged dataset (Billboard Hot 100 and Spotify song features) will be used again for the data analysis part.

Summary

After data wrangling, different supervised learning models are employed to predict whether a song will become a top hit on Billboard Hot 100. The models used in this article are: logistic regression, stepwise logistic regression, classification tree, random forest and KNN. Finally, the accuracy of the models in predicting top hits is compared to select the most precise one.

Loading packages

As usual, we begin by installing the necessary packages.

install.packages("corrplot")
install.packages("RColorBrewer")
install.packages("RColorBrewer")
install.packages("jtools")
install.packages("rpart")
install.packages("rpart.plot")
install.packages("caret")
install.packages("rattle")
install.packages("vip")
install.packages("randomForest")
install.packages("dplyr")
install.packages("class")
install.packages("formattable")

Data wrangling

The cleaning and data wrangling was performed in Part 1, with the final dataset consisting of 23 variables and 149,926 observations for the period 1958–2019.

str(billboard_total)

For the data analysis itself we can omit some variables, such as “performer”, “track_album” and “date” (the year is already captured in a separate variable).

billboard_2 <- billboard_total[,-c(2,3,18)]

Then we can filter the dataset by unique songs only, to avoid duplicated records that can bias our findings.

library("dplyr")
billboard_2 <- distinct(billboard_2, song, .keep_all = TRUE)
str(billboard_2)

In addition, we need to check the correlation between variables.

cor <- cor(billboard_2[,-c(1, 2, 6)])

And now we can plot the correlations.

library("corrplot")
library("RColorBrewer")
corrplot(cor, type = "upper", order = "hclust",
         col = brewer.pal(n = 8, name = "RdYlBu"))

As we observe, the variables “peak_position”, “week_position” and “previous_week_position” are highly correlated, meaning that we can move forward with only one of them, namely “peak_position”. There are some notable correlations among other variables too (e.g., “spotify_track_popularity” and “year”), but they are not as high.

billboard_2 <- billboard_2[,-c(16,17)]

Data analysis

For this dataset I envisioned building supervised learning models to predict whether a song will be a top hit on Billboard Hot 100 or not. Which features contribute to a song’s success on Billboard and which do not? Which model is the most precise in determining a top Billboard hit? To answer these questions, I will build a logistic regression, a stepwise logistic regression, a classification tree, a random forest and a KNN model, and after training and testing them, I will compare their accuracy.

Firstly, I need to create a categorical variable “top_hit” with 2 levels: 1 if the song reached the top position on Billboard Hot 100, and 0 if not.

billboard_2$top_hit <- ifelse(billboard_2$peak_position <= 1, 1, 0)
billboard_2$top_hit <- as.factor(billboard_2$top_hit)

billboard_2 <- billboard_2[,-c(1,16)]

Logistic regression

The first model to predict the song’s success is the logistic regression. Logistic regression is a method used to predict a dependent variable (Y), given independent variables (X), where the dependent variable is categorical.
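Concretely, the model passes a linear combination of the predictors through the logistic (sigmoid) function, which maps any value of the linear predictor onto a probability between 0 and 1. A minimal sketch:

# The sigmoid maps the linear predictor (log-odds) onto a probability in (0, 1)
sigmoid <- function(eta) 1 / (1 + exp(-eta))
sigmoid(c(-2, 0, 2))  # 0.12 0.50 0.88

With that in mind, we fit the model on all predictors: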

fit_logit <- glm(top_hit ~ .,
                 data = billboard_2, family = binomial())

Now we can inspect the summary.

summary(fit_logit)

As the output does not include an R-squared value to assess model fit, multiple “pseudo R-squared” equivalents, similar to the one used in OLS, have been developed.

logisticPseudoR2s <- function(LogModel) {
  dev <- LogModel$deviance
  nullDev <- LogModel$null.deviance
  modelN <- length(LogModel$fitted.values)
  R.l <- 1 - dev/nullDev                       # Hosmer and Lemeshow
  R.cs <- 1 - exp(-(nullDev - dev)/modelN)     # Cox and Snell
  R.n <- R.cs/(1 - (exp(-(nullDev/modelN))))   # Nagelkerke
  cat("Pseudo R^2 for logistic regression\n")
  cat("Hosmer and Lemeshow R^2 ", round(R.l, 3), "\n")
  cat("Cox and Snell R^2 ", round(R.cs, 3), "\n")
  cat("Nagelkerke R^2 ", round(R.n, 3), "\n")
}
logisticPseudoR2s(fit_logit)

Indeed, the logit model performs quite well in terms of its power to explain the variance of the dependent variable, namely “top_hit”.

Alternatively, we can get a clear overview of the regression summary with pseudo R-squared measures via the jtools package.

library("jtools")
summ(fit_logit)

Since the coefficients of the model give the change in the log odds of the dependent variable, for interpretation purposes we can obtain odds ratios by exponentiating the coefficients.

exp(coef(fit_logit))

However, what do all these numbers tell us? Keeping in mind that exponentiated coefficients are odds ratios, i.e., they multiply the odds of success rather than the probability itself (a small worked example follows this list), the significant coefficients can be interpreted in the following way:

1) if the track is explicit, the odds that it’s going to be a top hit on Billboard Hot 100 are multiplied by 3.7;

2) if the Spotify track popularity increases by 1 point (on the scale 0–100), the odds of the song’s success are multiplied by 1.1;

3) if the danceability of the track increases by 1 point (on the scale 0–1), the odds that a song will reach the top spot on Hot 100 are multiplied by 9.3. This is in line with previous research, e.g., https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5990848/;

4) if the energy of the track increases by 1 point (on the scale 0–1), the odds that a song will reach the top spot on Hot 100 are multiplied by 3.1;

5) a 1-point change in the time signature multiplies the odds of a song becoming a top hit by 6.3;

6) if the duration of a song staying on Billboard Hot 100 increases by 1 week, the odds of the song becoming a top hit are multiplied by 1.1;

7) if the duration of the song increases by 1 second, the odds ratio is approximately 1, i.e., essentially no change;

8) if the year of a song getting on Hot 100 increases by 1, the odds of success are multiplied by 9.4.
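To make these odds ratios concrete, here is a small sketch showing how an odds ratio translates into a change in probability (the baseline probability of 0.02 is an assumed value for illustration only):

# Hypothetical illustration: an odds ratio multiplies the odds, not the probability
odds_ratio <- 3.7                               # e.g., the explicit-track odds ratio above
p_baseline <- 0.02                              # assumed baseline probability of a top hit
odds_baseline <- p_baseline / (1 - p_baseline)  # 0.0204
odds_new <- odds_baseline * odds_ratio          # 0.0755
p_new <- odds_new / (1 + odds_new)
p_new  # ~0.07, i.e., the probability rises from 2% to about 7%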

Stepwise logistic regression

Furthermore, we can employ the step() function to choose the logistic regression with the set of features that performs best. Stepwise regression (or stepwise selection) iteratively adds and removes predictors in order to find the subset of variables that results in the best performing model, i.e., the model with the lowest prediction error (by default, R’s step() compares models by AIC).

fit_logit_step <- step(fit_logit)

Now let’s print the summary of the chosen model.

summ(fit_logit_step)

Interestingly, the Pseudo R-squared measures of the selected model have decreased a bit.

exp(coef(fit_logit_step))

According to the stepwise regression, the most important variables affecting the song’s success are energy, danceability, the explicit character of the track lyrics, Spotify track popularity, loudness, time signature, weeks spent on chart, duration in seconds and year.

Classification tree

The decision tree algorithm works by repeatedly partitioning the data into multiple sub-spaces, so that the outcomes in each final sub-space are as homogeneous as possible. For the classification tree model two additional packages are required.

library("rpart")
library("rpart.plot")

As before, we are working with the dependent categorical variable “top_hit”.

fit_ct <- rpart(top_hit ~ ., data = billboard_2, method = "class")

Now let’s visualize the tree itself.

rpart.plot(fit_ct, box.palette="RdBu", shadow.col="gray", nn=TRUE)

It is clear that this plot is not informative enough for interpretation; therefore, an alternative plot can be used.

library("rattle")
fancyRpartPlot(fit_ct)

According to the classification tree model, the most important variables in Billboard hit prediction are weeks on chart, Spotify track popularity, year, key signature, track duration (sec), tempo, acousticness. The tree consists of 17 nodes.

An alternative to explicitly specifying the depth of a decision tree is to grow a very large, complex tree and then prune it back to find an optimal subtree based on the complexity parameter. The complexity parameter cp in rpart is the minimum improvement in the model needed at each node in order for a split to be performed. Any split that does not decrease the overall lack of fit by a factor of cp is not attempted.

printcp(fit_ct)

We can also plot it. A good choice of cp for pruning is often the leftmost value for which the mean relative error lies below the horizontal line. In this case the optimal cp constitutes 0.010000.

plotcp(fit_ct)

As we can see from the output, we can prune our model with complexity parameter = 0.010000.

fit_ct_pruned <- prune(fit_ct, cp = 0.010000)

fancyRpartPlot(fit_ct_pruned)

The interpretation of the pruned tree is as follows. The root node, at the top, shows that 98% of Billboard songs don’t reach the top spot, while 2% do. The number above these proportions indicates how the node votes, and the number below indicates the proportion of the dataset that resides in this node, or bucket (at the top level this is every song, i.e., 100%). Now let’s travel down the branches to the nodes below.

1) A track that has spent less than 13 weeks on the chart reaches the top spot in only 1% of all cases.

2) Songs that spent more than 13 weeks on Hot 100 and have Spotify popularity below 60 (on the scale 0–100) turn out to be top hits in 8% of cases.

3) Similarly, songs as in the 2nd point that additionally entered Hot 100 in 2009 or later tend to be top hits on Hot 100 in 12% of cases.

4) For songs which spent more than 13 weeks on Hot 100, have Spotify popularity higher than 69, entered Hot 100 before 2009 and have any of the listed key signatures, the track duration is decisive: 16% of the songs shorter than 269 sec (about 4.5 minutes) are classified as top hits on Billboard.

5) Furthermore, in line with the 4th point, songs which entered the chart in 1999 or later reached the top spot in 32% of cases.

6) For songs which spent more than 13 weeks on Hot 100, have Spotify popularity higher than 69, entered Hot 100 before 2009 and have any of the listed key signatures, the tempo is decisive, with songs having at least 140 beats per minute classified as top hits on Billboard in 16% of cases.

7) Similarly, songs whose acousticness is higher than 0.01 reach the top position on Hot 100 in 29% of cases.

Finally, we can plot the feature importance with the help of the vip package. These values are standardized so that the most important feature has a value of 100 and the remaining features are scored based on their relative reduction in the loss function.

library("vip")

The top-3 variables affecting the song success according to the classification tree model are weeks spent on Hot 100, year of entering the chart and Spotify track popularity.

Random forest

Random forests are a modification of bagged decision trees that build a large collection of de-correlated trees to further improve predictive performance. They have become a very popular “out-of-the-box” or “off-the-shelf” learning algorithm that enjoys good predictive performance with relatively little hyperparameter tuning. Firstly, we need to download the corresponding package.

library("randomForest")

To run the random forest model we need to convert the character variables to factors.

library("dplyr")
billboard_2 <- billboard_2 %>%
  mutate_if(is.character, as.factor)

Then we can build the model itself.

fit_crf <- randomForest(top_hit ~ ., data = billboard_2,
                        importance = TRUE)

Let’s check the output of the model by printing the fitted object.
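fit_crf  # prints ntree, mtry, the OOB error estimate and the confusion matrix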

The model is built from 500 trees with 4 predictors tried at each split. The output reports the out-of-bag (OOB) estimate of the error rate and shows a confusion matrix of predicted vs. actual observations, as well as the classification error for each class.

While the classification error for the songs that didn’t reach the top spot on Hot 100 is very low, it constitutes 1.0 in the opposite case, which has to be resolved. One solution is trying a different number of variables selected at each split, denoted by mtry in the randomForest function. Let’s select the mtry value with the minimum out-of-bag (OOB) error.

str(billboard_2)

mtry <- tuneRF(billboard_2[-17], billboard_2$top_hit, ntreeTry = 500,
               stepFactor = 1.5, improve = 0.01, trace = TRUE, plot = TRUE)
best.m <- mtry[mtry[, 2] == min(mtry[, 2]), 1]
print(mtry)
print(best.m)

Indeed, the lowest OOB error belongs to mtry = 6. We can tune our model accordingly.

fit_crf_pruned <- randomForest(top_hit ~ ., data = billboard_2,
                               mtry = best.m, importance = TRUE, ntree = 500)
fit_crf_pruned

As we see, the classification error for the top hit prediction decreased.

The importance of the variables is:

importance(fit_crf_pruned)

The importance can also be plotted:

varImpPlot(fit_crf_pruned)

According to this approach, the most important variables for hit prediction are year of entering Hot 100, Spotify track popularity and weeks spent on chart (based on mean decrease in accuracy), and key signature, weeks spent on chart and Spotify track popularity (based on mean decrease in Gini). The Gini coefficient shows how much each variable contributes to the homogeneity of the nodes and leaves in the resulting random forest.

KNN model

Finally, I would like to employ the KNN model. KNN, which stands for K-Nearest Neighbors, is a supervised machine learning algorithm that classifies a new data point into the target class depending on the features of its neighboring data points. The characteristics of the KNN algorithm include:

  • simple classification and regression algorithm;
  • a lazy learner because it doesn’t learn much from the training data;
  • supervised learning algorithm (similarly to previously employed methods);
  • the default metric is Euclidean distance (the shortest distance between 2 points) for continuous variables, whereas for discrete variables the overlap metric (Hamming distance) is used (see the sketch after this list).
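As a minimal illustration of the default metric, here is the Euclidean distance between two hypothetical songs described by three normalized features (the feature values are made up for the example):

# Two hypothetical songs described by three normalized features
song_a <- c(danceability = 0.8, energy = 0.6, valence = 0.5)
song_b <- c(danceability = 0.7, energy = 0.9, valence = 0.4)
sqrt(sum((song_a - song_b)^2))  # ~0.33, the Euclidean distance used by knn() for continuous variables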

Generally, k is decided based on the square root of the number of data points. In addition, before feeding the data to the KNN algorithm, the observations need to be normalized, i.e., the variables need to be transformed onto the same scale to avoid bias due to variables being measured on different scales.
To begin with, we should normalize the data by creating a special function for that purpose.

normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))  # min-max normalization onto [0, 1]
}

For KNN analysis, it’s necessary to recode factor variables into numeric ones, namely key signature and explicit character of the track.

billboard_knn <- billboard_2

billboard_knn$key_signature <- factor(billboard_knn$key_signature,
  levels = c("A major", "A minor", "A# major", "A# minor", "B major", "B minor",
             "C major", "C minor", "C# major", "C# minor", "D major", "D minor",
             "D# major", "D# minor", "E major", "E minor", "F major", "F minor",
             "F# major", "F# minor", "G major", "G minor", "G# major", "G# minor"),
  labels = c(1:24))
billboard_knn$track_explicit <- factor(billboard_knn$track_explicit,
  levels = c("FALSE", "TRUE"), labels = c(0:1))

billboard_knn$key_signature <- as.numeric(billboard_knn$key_signature)
billboard_knn$track_explicit <- as.numeric(billboard_knn$track_explicit)

Then we can normalize the data via the function we’ve created before.

billboard_knn_norm<- as.data.frame(lapply(billboard_knn[,1:16], normalize)) 
head(billboard_knn_norm)

Now all attributes have values in the range 0 to 1, i.e., the data is normalized, and the ‘top_hit’ column has been removed. To build and test the model we need to split this dataset into a training set (80% of observations) and a test set (20% of observations).

set.seed(123)  # to get the same random sample
random <- sample(1:nrow(billboard_knn_norm), size = nrow(billboard_knn_norm)*0.8, replace = FALSE)  # random selection of 80% of the data

# note: the split is taken from the normalized data, so that knn() works on comparable scales
knn_train <- billboard_knn_norm[random,]  # 80% training data
knn_test <- billboard_knn_norm[-random,]  # remaining 20% test data

Now we can create a separate dataframe for the ‘top_hit’ feature, which is our target.

knn_train_labels <- billboard_knn[random,17]
knn_test_labels <- billboard_knn[-random,17]

For the KNN model, we need to use the class package.

library("class")

To identify the optimum value of k, the square root of the total number of observations is usually taken; in this case the training set contains 9,221 observations.

NROW(knn_train_labels)  

As the square root of 9221 is approximately 96.03, we can try k=96.
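We can verify this directly:

sqrt(NROW(knn_train_labels))  # ~96.03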

knn.96 <-  knn(train=knn_train, test=knn_test, cl=knn_train_labels, k=96)

Let’s check the accuracy of the model with the help of caret package.

For the record: accuracy is the number of correctly predicted data points divided by the total number of data points.

library("caret")
confusionMatrix(knn.96,knn_test_labels)

The model accuracy is quite high and constitutes 0.98. Here, 2,265 and 0 are the correct predictions against the actual values, whereas 0 and 41 are wrong predictions, meaning the model misses every actual top hit and can still be improved.

For the KNN algorithm, the tuning parameters are the k value and the selection of features/attributes. The optimum k value can be found using an ‘elbow’ or ‘maximum % accuracy’ graph, whereas feature selection requires an understanding of the features themselves.

k.optm <- 1  # vector that will store % accuracy for each k
for (i in 1:100) {
  knn.mod <- knn(train = knn_train, test = knn_test, cl = knn_train_labels, k = i)
  k.optm[i] <- 100 * sum(knn_test_labels == knn.mod)/NROW(knn_test_labels)
  cat(i, '=', k.optm[i], '\n')  # print % accuracy for each k
}

We can also plot the accuracy of models at different levels of k.

plot(k.optm, type="b", xlab="K- Value",ylab="Accuracy level") 

As we see, the accuracy of the model in predicting top Billboard hits increases until k=13, after which it remains at the same level of 98.22%. This means we don’t have to change the model.

Choosing the best model

Now we can assess the accuracy of four models (logistic regression, stepwise logistic regression, classification tree and random forest) using the predict function. For this purpose we first build a training and a test sample. We split our dataset into a training sample (80% of observations) and a test sample (20% of observations), although other options, such as k-fold cross-validation, are also possible.

#data splicing
set.seed(12345)
train <- sample(1:nrow(billboard_2),size = ceiling(0.80*nrow(billboard_2)),replace = FALSE)
# training set
billboard_train <- billboard_2[train,]
# test set
billboard_test <- billboard_2[-train,]

Now we fit the models on the train sample.

train_logit <- glm(top_hit ~ .,
                   data = billboard_train, family = binomial())
# Note: method = "glm" is assumed here so that caret's train() fits the
# stepwise-selected logistic model (without it, train() defaults to "rf")
train_logit_step <- train(top_hit ~ track_explicit + spotify_track_popularity + danceability +
                            energy + loudness + time_signature + track_duration_sec +
                            weeks_on_chart + year,
                          data = billboard_train, method = "glm", family = binomial())
train_ct_pruned <- rpart(top_hit ~ .,
                         data = billboard_train,
                         cp = 0.010000)
train_crf_pruned <- randomForest(top_hit ~ .,
                                 data = billboard_train,
                                 mtry = best.m, importance = TRUE, ntree = 500)

We can now test our models on the remaining data. Note that for the glm-based logit model, predict() is called without type = "response", so it returns continuous link-scale values rather than class labels; this matters for the accuracy comparison below.

pred_logit <- predict(train_logit, newdata = billboard_test)

pred_logit_step <- predict(train_logit_step, newdata = billboard_test)

pred_ct_pruned <- predict(train_ct_pruned, newdata = billboard_test, type = "class")

pred_crf_pruned <- predict(train_crf_pruned, newdata = billboard_test)

Next, we examine the confusion matrices needed to look at the accuracy.

tab_logit <- table(pred_logit, billboard_test$top_hit)

tab_logit_step <- table(pred_logit_step, billboard_test$top_hit)

tab_ct_pruned <- table(pred_ct_pruned, billboard_test$top_hit)

tab_crf_pruned <- table(pred_crf_pruned, billboard_test$top_hit)

Here the accuracy (correct predictions divided by all predictions, i.e., the sum of the diagonal of the confusion matrix over its total) of the test dataset is computed for the logistic regression, stepwise logistic regression, classification tree and random forest.

acc_logit <- sum(diag(tab_logit))/sum(tab_logit)
acc_logit

acc_logit_step <- sum(diag(tab_logit_step))/sum(tab_logit_step)
acc_logit_step

acc_ct_pruned <- sum(diag(tab_ct_pruned))/sum(tab_ct_pruned)
acc_ct_pruned

acc_crf_pruned <- sum(diag(tab_crf_pruned))/sum(tab_crf_pruned)
acc_crf_pruned

Let’s look at the accuracy of all five models (logistic regression, stepwise logistic regression, pruned classification tree, pruned random forest and KNN) in a table via the formattable package.

For this a normal table needs to be created first.

a <- c("Logistic regression", "Stepwise logistic regression", "Pruned classification tree", "Pruned random forest", "KNN")
b <- c(0.0004, 0.9774, 0.9735, 0.9783, 0.9822)
accuracy <- data.frame(Model = a, Accuracy = b)
accuracy

Now we can improve the table look.

library("formattable")

customRed = "#FF6347"
customGreen2 = "#71CA97"
formattable(accuracy, align = c("l", "c"), list(
  `Model` = formatter("span", style = ~ style(color = "grey", font.weight = "bold")),
  `Accuracy` = color_tile(customRed, customGreen2)
))

Indeed, the lowest accuracy belongs to the logit model. This near-zero value is, however, largely an artifact of the comparison: predict() on a glm returns continuous link-scale values rather than class labels, so tabulating them against the actual classes yields almost no exact matches. The remaining four models perform really well (accuracy around 97–98%), with a slight lead in predicting top Hot 100 hits belonging to the KNN model (accuracy of 98.22%).

Key insights

In general, this data analysis demonstrated that different factors, such as energy, danceability, the explicit character of the track lyrics, Spotify track popularity, loudness, time signature, weeks spent on chart, duration in seconds and year of entering Hot 100, affect the probability of a track becoming a top hit on Billboard. Among the supervised learning models, the KNN model showed the highest prediction accuracy (>98%), while the logistic regression model, as compared here, appeared to perform poorly.

For further research it would be interesting to use a more enriched dataset, as well as to tune the current models further to achieve higher accuracy. In addition, other ways of testing the models’ accuracy can be employed.
