In this project I use topic models to model LEGO color themes. Topic models were originally designed to find themes in document corpora but have also been applied to problems in genetics and image analysis. For the LEGO sets, the colors take the role of words or terms in a topic model and the sets are treated as documents. The goal of the topic model, latent Dirichlet allocation (LDA), is to find coherent color themes in the LEGO dataset.
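To make the sets-as-documents idea concrete, here is a minimal sketch in Python (not the R code used in the project). The inventory rows, set names, and brick counts are invented for illustration; the real dataset will have a different layout. It builds the set-by-color count matrix that a topic model such as LDA would take as input.

```python
from collections import Counter

# Hypothetical inventory rows: (set_name, color, number_of_bricks).
# These values are made up for illustration only.
inventory = [
    ("Fire Station", "red", 120),
    ("Fire Station", "white", 40),
    ("Fire Station", "black", 15),
    ("Pirate Ship", "brown", 90),
    ("Pirate Ship", "black", 60),
    ("Pirate Ship", "white", 30),
]

def bag_of_colors(rows):
    """Group rows into one Counter per set: the set is the 'document', each color a 'term'."""
    docs = {}
    for set_name, color, quantity in rows:
        docs.setdefault(set_name, Counter())[color] += quantity
    return docs

docs = bag_of_colors(inventory)

# The 'vocabulary' is the sorted list of all colors seen in any set.
vocabulary = sorted({color for counts in docs.values() for color in counts})

# Dense document-term matrix: one row per set, one column per color.
# A Counter returns 0 for colors a set does not contain.
matrix = [[docs[name][color] for color in vocabulary] for name in sorted(docs)]
```

Each row of `matrix` plays the role of a document's word counts, so word order never enters the picture.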
You can also apply other text mining techniques that rely only on word frequency and not word order. For example, I use the TF-IDF score to plot uncommon and common color-set combinations.
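As a rough sketch of how such a score can be computed for color-set pairs, the Python snippet below assumes the common definition: term frequency (a color's share of the bricks in a set) times the natural log of the number of sets divided by the number of sets containing that color. The toy counts and the `tf_idf` helper are hypothetical and are not taken from the project's R code.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return {(set_name, color): score} with tf = count / set total
    and idf = ln(number of sets / number of sets containing the color)."""
    n_docs = len(docs)
    # Document frequency: in how many sets does each color appear?
    doc_freq = Counter(color for counts in docs.values() for color in counts)
    scores = {}
    for name, counts in docs.items():
        total = sum(counts.values())
        for color, n in counts.items():
            scores[(name, color)] = (n / total) * math.log(n_docs / doc_freq[color])
    return scores

# Hypothetical brick counts per set, for illustration only.
toy_sets = {
    "Fire Station": {"red": 120, "white": 40, "black": 40},
    "Pirate Ship": {"brown": 90, "black": 60, "white": 30},
}

scores = tf_idf(toy_sets)
```

In this toy data, white and black appear in every set, so their idf is ln(1) = 0 and they score zero, while red and brown each appear in only one set and score highest there. That is the sense in which a high TF-IDF marks an uncommon color-set combination.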
Part of the purpose of this project was to try out the text analysis techniques in Julia Silge and David Robinson’s book Text Mining with R. A good deal of the code is adapted from theirs.
I’ve written up the results on my blog, and I also have a walk-through of the notebook on Kaggle. The about page has instructions if you want to install the package and re-run the code.
Note: I have not tested the instructions for installing from GitHub to reproduce the analysis.
1. Exploratory plots; High and low TF-IDF
2. Training and evaluating the model