This project was for a machine learning applications course, and it was completed within a week. The goal was the same for everyone: given a dataset of real music lyrics, classify each one into one of three genres (Rock, Hip Hop, or Pop).
I needed to build a classifier, but I lacked domain knowledge in lyric composition. There may be beautiful patterns that help sort lyrics into genres, but I do not know any of them, so it was hard to extract useful features from the lyrics. Things like lyric length and number of verses vary a lot, so simple counts would not be enough. With limited time to work on this, it would be difficult to gain a deeper knowledge of lyrical composition.
The real challenge was processing the lyrics. Each lyric is one long text feature, and text, much like categorical data, is tricky to convert into the numerical features that most other learning methods expect. That made me wonder whether I really needed to do that conversion by hand. Surely there was already something out there that handled text features easily.
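To make the conversion problem concrete, here is a minimal sketch of what turning lyrics into numbers by hand could look like, using a simple bag-of-words count. This is not what the project ended up doing. The function names `tokenize` and `bag_of_words` and the sample lyrics are invented for illustration and do not come from the course dataset.

```python
import re
from collections import Counter


def tokenize(lyric):
    """Lowercase a lyric and split it into word tokens."""
    return re.findall(r"[a-z']+", lyric.lower())


def bag_of_words(lyrics):
    """Turn a list of lyric strings into (vocabulary, count_matrix)."""
    tokenized = [tokenize(lyric) for lyric in lyrics]
    # Vocabulary: every distinct word across all lyrics, in sorted order.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One row per lyric, one column per vocabulary word.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Invented sample lines (assumption: NOT taken from the project dataset).
sample_lyrics = [
    "dance all night, dance",
    "money on my mind, money money",
]
vocabulary, matrix = bag_of_words(sample_lyrics)
print(vocabulary)
print(matrix)
```

Even on two short lines, every distinct word becomes its own column. On full-length lyrics the vocabulary grows quickly, and decisions about stop words, rare words, and weighting pile up, which is the kind of manual work I was hoping to avoid.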
While exploring some ML libraries, I came across CatBoost. CatBoost is a gradient boosting library built on decision trees, and it was especially interesting for its focus on working with categorical data. After digging a little deeper into the CatBoost docs (which was also a challenge), I found that CatBoost also supports text features directly.
import pandas as pd
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from catboost import Pool, CatBoostClassifier
text_features = ['Lyric']
train_dataset = Pool(data=train_X,
                     label=train_y,
                     text_features=text_features)
model = CatBoostClassifier(iterations=100,
                           learning_rate=1,
                           depth=3,
                           loss_function='MultiClass')
model.fit(train_dataset)
After implementing CatBoost, the model reached an accuracy of about 68% at assigning the correct genre to existing lyrics.
pred = model.predict(holdout_set.drop('Genre', axis=1))
estimated_accuracy = accuracy_score(holdout_set['Genre'], pred)
print(estimated_accuracy)
pd.Series(estimated_accuracy).to_csv('ea.csv', index=False, header=False)
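A single accuracy number does not show which genres the model confuses. As a rough sketch, a per-genre breakdown could be computed as below. The helper `per_genre_accuracy` is hypothetical, and the labels are invented for illustration, so they are not results from this project. With real predictions, the array returned by `model.predict` may need flattening to a plain list of labels first.

```python
from collections import defaultdict


def per_genre_accuracy(true_genres, predicted_genres):
    """Return {genre: fraction of that genre's lyrics predicted correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, guess in zip(true_genres, predicted_genres):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
    return {genre: correct[genre] / total[genre] for genre in total}


# Invented labels (assumption: for illustration only, not project output).
y_true = ["Rock", "Rock", "Pop", "Pop", "Hip Hop", "Hip Hop"]
y_pred = ["Rock", "Pop", "Pop", "Pop", "Hip Hop", "Rock"]

breakdown = per_genre_accuracy(y_true, y_pred)
# Overall accuracy: share of all lyrics whose predicted genre matches the label.
overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(breakdown)
print(overall)
```

In this toy case the overall figure sits between a perfect score on one genre and a coin flip on the other two, which is the sort of imbalance a breakdown like this would surface.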