Module 9
You can download the R Markdown file (https://gedeck.github.io/DS-6030/homework/Module-9.Rmd) and use it to answer the following questions.
If not otherwise stated, use the tidyverse and tidymodels packages for the assignments.
1. Sentiment analysis using SVM
In this assignment, we will build a model to predict the sentiment expressed in Amazon reviews. In order to build a model, we need to convert the text of a review into a numeric representation. We will use the `textrecipes` package to process the text data.
This assignment is only a first glimpse into handling text data. For a detailed introduction to text analytics in tidymodels see Hvitfeldt and Silge (2022, Supervised Machine Learning for Text Analysis in R).
The data are taken from https://archive.ics.uci.edu/dataset/331/sentiment+labelled+sentences.
You can load the data from https://gedeck.github.io/DS-6030/datasets/homework/sentiment_labelled_sentences/amazon_cells_labelled.txt
You will need to install the `textrecipes` and `stopwords` packages to complete this assignment.
- Setup
(1.1) Load the data. The data has no column headers. Each line contains a review sentence separated by a tab character (`\t`) from the sentiment label (0 or 1). Create a tibble with the column names `sentence` and `sentiment`. Use the tidyverse function `read_delim` to load the data. The dataset has 1000 rows (the `read.csv` function fails to load the data correctly). Don't forget to convert `sentiment` to a factor. (2 points - coding)
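A loading sketch along these lines should work (`col_types="ci"` reads the sentence as character and the label as integer before the factor conversion):

```r
library(tidyverse)

# Tab-delimited file, no header: sentence \t label (0 or 1)
url <- "https://gedeck.github.io/DS-6030/datasets/homework/sentiment_labelled_sentences/amazon_cells_labelled.txt"
reviews <- read_delim(url, delim="\t", col_names=c("sentence", "sentiment"),
                      col_types="ci") %>%
    mutate(sentiment = factor(sentiment))
dim(reviews)  # expect 1000 rows, 2 columns
```

If fewer than 1000 rows load, stray quote characters in the sentences are the usual culprit; passing `quote=""` to `read_delim` may help.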
(1.2) Split the dataset into training (80%) and test (20%) sets. Prepare resamples from the training set for 10-fold cross-validation. (1 point - coding)
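A sketch of the split and resamples, assuming the tibble from (1.1) is called `reviews` (the seed value is an arbitrary choice for reproducibility):

```r
library(tidymodels)
set.seed(123)

data_split <- initial_split(reviews, prop=0.8, strata=sentiment)
train_data <- training(data_split)
test_data  <- testing(data_split)

# 10-fold cross-validation resamples from the training set
resamples <- vfold_cv(train_data, v=10, strata=sentiment)
```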
(1.3) Create a recipe to process the text data. The formula is `sentiment ~ sentence`. Add the following steps to the recipe: (1 point - coding)

- `step_tokenize(sentence)` to tokenize the text (split it into words).
- `step_tokenfilter(sentence, max_tokens=1000)` to remove infrequent tokens, keeping only the 1000 most frequent tokens. This will give you a term frequency matrix (for each token, how often it occurs in the sentence).
- `step_tfidf(sentence)` to create a term frequency-inverse document frequency (TF-IDF) matrix.
- `step_normalize()` to normalize the data.
- `step_pca()` to reduce the dimensionality of the data. Tune the number of components, `num_comp`, in a range of 200 to 700.
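The steps above can be sketched as follows (the stand-in `train_data` tibble is only there to make the snippet self-contained; in the assignment you would pass your training set from (1.2)):

```r
library(tidymodels)
library(textrecipes)

# Stand-in training data so the recipe can be defined; replace with your split
train_data <- tibble(sentence = c("great phone, works well", "poor battery life"),
                     sentiment = factor(c(1, 0)))

rec <- recipe(sentiment ~ sentence, data=train_data) %>%
    step_tokenize(sentence) %>%                          # split into words
    step_tokenfilter(sentence, max_tokens=1000) %>%      # keep 1000 most frequent
    step_tfidf(sentence) %>%                             # TF-IDF features
    step_normalize(all_numeric_predictors()) %>%
    step_pca(all_numeric_predictors(), num_comp=tune())  # num_comp tuned later
```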
- Model training

Create workflows with the recipe from (1.3) and tune the following models:
(1.4) Logistic regression with L1 regularization (`glmnet` engine, tuning `penalty`). (2 points - coding)
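A model specification sketch (`mixture=1` fixes the penalty to pure L1/lasso; the commented workflow line assumes the recipe from (1.3) is called `rec`):

```r
library(tidymodels)

logreg_spec <- logistic_reg(penalty=tune(), mixture=1) %>%
    set_engine("glmnet") %>%
    set_mode("classification")

# Combine with the recipe from (1.3), e.g.:
# logreg_wf <- workflow() %>% add_recipe(rec) %>% add_model(logreg_spec)
```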
(1.5) SVM with linear kernel (`kernlab` engine, tuning `cost` and `margin`). (2 points - coding)
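A specification sketch for the linear SVM (the object name is illustrative):

```r
library(tidymodels)

svm_lin_spec <- svm_linear(cost=tune(), margin=tune()) %>%
    set_engine("kernlab") %>%
    set_mode("classification")
```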
(1.6) SVM with polynomial kernel (`kernlab` engine, tuning `cost`, `margin`, and `degree`; use e.g. `degree = degree_int(range=c(2, 5))`). (2 points - coding)
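A specification sketch for the polynomial SVM; the commented lines show one way to restrict the `degree` range via a dials parameter set (the workflow name `svm_poly_wf` is illustrative):

```r
library(tidymodels)

svm_poly_spec <- svm_poly(cost=tune(), margin=tune(), degree=tune()) %>%
    set_engine("kernlab") %>%
    set_mode("classification")

# After building the workflow, restrict the degree range, e.g.:
# params <- extract_parameter_set_dials(svm_poly_wf) %>%
#     update(degree = degree_int(range=c(2, 5)))
```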
(1.7) SVM with radial basis function kernel (`kernlab` engine, tuning `cost`, `margin`, and `rbf_sigma`; use `rbf_sigma(range=c(-4, 0), trans=log10_trans())`). (2 points - coding)

Keep the default tuning ranges and only update `rbf_sigma` as mentioned above. For the PCA preprocessing step, tune `num_comp` in a range of 200 to 700.
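A specification sketch for the RBF SVM; the commented lines show how the `rbf_sigma` and `num_comp` ranges given above could be set in a dials parameter set (the workflow name `svm_rbf_wf` is illustrative):

```r
library(tidymodels)

svm_rbf_spec <- svm_rbf(cost=tune(), margin=tune(), rbf_sigma=tune()) %>%
    set_engine("kernlab") %>%
    set_mode("classification")

# After building the workflow:
# params <- extract_parameter_set_dials(svm_rbf_wf) %>%
#     update(rbf_sigma = rbf_sigma(range=c(-4, 0), trans=log10_trans()),
#            num_comp  = num_comp(range=c(200, 700)))
```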
Use Bayesian hyperparameter optimization to tune the models. What are the tuned hyperparameters for each model?
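A tuning sketch for one model; `wf`, `resamples`, and `params` stand for that model's workflow, the resamples from (1.2), and the updated parameter set, and the `initial`/`iter` values are arbitrary choices:

```r
library(tidymodels)

tuned <- tune_bayes(wf, resamples=resamples, param_info=params,
                    initial=10, iter=25,
                    metrics=metric_set(roc_auc, accuracy),
                    control=control_bayes(save_pred=TRUE, no_improve=10))
show_best(tuned, metric="roc_auc")  # inspect the tuned hyperparameters
```

`save_pred=TRUE` keeps the cross-validation predictions, which is convenient for the ROC comparison in (1.8).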
- Model performance
Once you have tuned the models, fit finalized models and assess their performance.
(1.8) Compare the cross-validation performance of the models using ROC curves (combine in one graph) and performance metrics (AUC and accuracy). Which model performs best? (2 points - discussion)
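One way to sketch the comparison, assuming each tuning result (names like `tuned_logreg` are illustrative) was run with `control_bayes(save_pred=TRUE)` and that the event level is the first factor level, `0`:

```r
library(tidymodels)

# Cross-validation predictions at each model's best parameters
cv_preds <- bind_rows(
    collect_predictions(tuned_logreg,
        parameters=select_best(tuned_logreg, metric="roc_auc")) %>%
        mutate(model="Logistic (L1)"),
    collect_predictions(tuned_svm_rbf,
        parameters=select_best(tuned_svm_rbf, metric="roc_auc")) %>%
        mutate(model="SVM (RBF)")
    # ... add the linear and polynomial SVM results the same way
)

# Overlay the ROC curves in one graph
cv_preds %>%
    group_by(model) %>%
    roc_curve(sentiment, .pred_0) %>%
    autoplot()
```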
(1.9) Compare the performance of the finalized models on the test set. Which model performs best? (2 points - discussion)
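A finalization sketch for one model (repeat per model; `wf`, `tuned`, and `data_split` stand for the workflow, its tuning result, and the initial split from (1.2)):

```r
library(tidymodels)

final_wf  <- finalize_workflow(wf, select_best(tuned, metric="roc_auc"))
final_fit <- last_fit(final_wf, data_split,
                      metrics=metric_set(roc_auc, accuracy))
collect_metrics(final_fit)  # test-set AUC and accuracy
```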