Module 9

You can download the R Markdown file (https://gedeck.github.io/DS-6030/homework/Module-9.Rmd) and use it to answer the following questions.

If not otherwise stated, use Tidyverse and Tidymodels for the assignments.

1. Sentiment analysis using SVM

In this assignment, we will build a model to predict the sentiment expressed in Amazon reviews. In order to build a model, we need to convert the text review into a numeric representation. We will use the textrecipes package to process the text data.

This assignment is only a first glimpse into handling text data. For a detailed introduction to text analytics in tidymodels see Hvitfeldt and Silge (2022, Supervised Machine Learning for Text Analysis in R).

The data are taken from https://archive.ics.uci.edu/dataset/331/sentiment+labelled+sentences.

You can load the data from https://gedeck.github.io/DS-6030/datasets/homework/sentiment_labelled_sentences/amazon_cells_labelled.txt

You will need to install the packages textrecipes and stopwords to complete this assignment.

  1. Setup

(1.1) Load the data. The data has no column headers. Each line contains a review sentence, separated by a tab character (\t) from the sentiment label (0 or 1). Create a tibble with the column names sentence and sentiment. Use the tidyverse function read_delim to load the data. The dataset has 1000 rows (the read.csv function fails to load the data correctly). Don’t forget to convert sentiment to a factor. (2 points - coding)
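A minimal sketch of the loading step (assuming the file has been downloaded to the working directory; the object name `amazon` is illustrative):

```r
library(tidyverse)

# Tab-separated file with no header: supply the column names explicitly.
# col_types = "ci": sentence is character, sentiment is integer.
amazon <- read_delim("amazon_cells_labelled.txt",
                     delim = "\t",
                     col_names = c("sentence", "sentiment"),
                     col_types = "ci") %>%
  mutate(sentiment = factor(sentiment))

nrow(amazon)  # should be 1000
```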

(1.2) Split the dataset into training (80%) and test sets (20%). Prepare resamples from the training set for 10-fold cross-validation. (1 point - coding)
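One way to set up the split and the resamples (the seed and the stratification by sentiment are illustrative choices, not requirements of the assignment):

```r
library(tidymodels)

set.seed(6030)  # any fixed seed, for reproducibility
data_split <- initial_split(amazon, prop = 0.8, strata = sentiment)
train_data <- training(data_split)
test_data  <- testing(data_split)

# 10-fold cross-validation resamples from the training set
resamples <- vfold_cv(train_data, v = 10, strata = sentiment)
```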

(1.3) Create a recipe to process the text data. The formula is sentiment ~ sentence. Add the following steps to the recipe: (1 point - coding)

  • step_tokenize(sentence) to tokenize the text (split into words).
  • step_tokenfilter(sentence, max_tokens=1000) to remove infrequent tokens, keeping only the 1000 most frequent tokens. This gives you a term frequency matrix (for each token, how often it occurs in each sentence).
  • step_tfidf(sentence) to create a term frequency-inverse document frequency (tf-idf) matrix.
  • Use the step_normalize() function to normalize the data.
  • Use the step_pca() function to reduce the dimensionality of the data. Tune the number of components, num_comp, in a range of 200 to 700.
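The steps above might be assembled like this (assuming the training data from (1.2) is stored in `train_data`; the `tune()` placeholder for num_comp is resolved later during tuning):

```r
library(tidymodels)
library(textrecipes)  # step_tokenize(), step_tokenfilter(), step_tfidf()

text_rec <- recipe(sentiment ~ sentence, data = train_data) %>%
  step_tokenize(sentence) %>%                        # split sentences into word tokens
  step_tokenfilter(sentence, max_tokens = 1000) %>%  # keep the 1000 most frequent tokens
  step_tfidf(sentence) %>%                           # tf-idf features
  step_normalize(all_numeric_predictors()) %>%       # center and scale
  step_pca(all_numeric_predictors(), num_comp = tune())  # num_comp tuned below
```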
  2. Model training

Create workflows with the recipe from (1.3) and tune the following models:

(1.4) logistic regression with L1 regularization (glmnet engine, tuning penalty) (2 points - coding)
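A sketch of the model specification and workflow (assuming the recipe from (1.3) is stored in `text_rec`):

```r
lr_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%  # mixture = 1 gives L1 (lasso)
  set_engine("glmnet")

lr_wf <- workflow() %>%
  add_recipe(text_rec) %>%
  add_model(lr_spec)
```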

(1.5) SVM with linear kernel (kernlab engine, tuning cost and margin) (2 points - coding)
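The linear-kernel specification follows the same pattern (again assuming the recipe is stored in `text_rec`):

```r
svm_linear_spec <- svm_linear(cost = tune(), margin = tune()) %>%
  set_engine("kernlab") %>%
  set_mode("classification")

svm_linear_wf <- workflow() %>%
  add_recipe(text_rec) %>%
  add_model(svm_linear_spec)
```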

(1.6) SVM with polynomial kernel (kernlab engine, tuning cost, margin, and degree; use e.g. degree = degree_int(range=c(2, 5))) (2 points - coding)
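For the polynomial kernel, the suggested degree range can be set by updating the workflow's parameter set (object names are illustrative; `text_rec` is assumed to hold the recipe from (1.3)):

```r
svm_poly_spec <- svm_poly(cost = tune(), degree = tune(), margin = tune()) %>%
  set_engine("kernlab") %>%
  set_mode("classification")

svm_poly_wf <- workflow() %>%
  add_recipe(text_rec) %>%
  add_model(svm_poly_spec)

# Restrict degree as suggested and set the PCA range for tuning
svm_poly_params <- extract_parameter_set_dials(svm_poly_wf) %>%
  update(degree   = degree_int(range = c(2, 5)),
         num_comp = num_comp(range = c(200, 700)))
```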

(1.7) SVM with radial basis function kernel (kernlab engine, tuning cost, margin, and rbf_sigma; use rbf_sigma(range=c(-4, 0), trans=log10_trans())) (2 points - coding)
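The radial-basis specification (assuming `text_rec` holds the recipe from (1.3)):

```r
svm_rbf_spec <- svm_rbf(cost = tune(), rbf_sigma = tune(), margin = tune()) %>%
  set_engine("kernlab") %>%
  set_mode("classification")

svm_rbf_wf <- workflow() %>%
  add_recipe(text_rec) %>%
  add_model(svm_rbf_spec)
```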

Keep the default tuning ranges and only update rbf_sigma as mentioned above. For the PCA preprocessing step, tune num_comp in a range of 200 to 700. Use Bayesian hyperparameter optimization to tune the models. What are the tuned hyperparameters for each model?
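The Bayesian optimization can be sketched for one workflow; the others follow the same pattern. This assumes the radial-kernel workflow is stored in `svm_rbf_wf` and the resamples from (1.2) in `resamples`; the `initial` and `iter` budgets are illustrative:

```r
library(scales)  # log10_trans()

svm_rbf_params <- extract_parameter_set_dials(svm_rbf_wf) %>%
  update(rbf_sigma = rbf_sigma(range = c(-4, 0), trans = log10_trans()),
         num_comp  = num_comp(range = c(200, 700)))

svm_rbf_res <- tune_bayes(
  svm_rbf_wf,
  resamples  = resamples,
  param_info = svm_rbf_params,
  initial    = 10, iter = 25,                  # illustrative search budget
  metrics    = metric_set(roc_auc, accuracy),
  control    = control_bayes(save_pred = TRUE) # keep predictions for ROC curves later
)

show_best(svm_rbf_res, metric = "roc_auc")     # tuned hyperparameters
```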

  3. Model performance

Once you have tuned the models, fit finalized models and assess their performance.

(1.8) Compare the cross-validation performance of the models using ROC curves (combine in one graph) and performance metrics (AUC and accuracy). Which model performs best? (2 points - discussion)
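One way to overlay the cross-validation ROC curves, assuming each model was tuned with `save_pred = TRUE` and the results are stored in objects named `*_res` (all names here are illustrative):

```r
# Predictions of the best configuration from each tuning result
best_preds <- function(res, label) {
  collect_predictions(res, parameters = select_best(res, metric = "roc_auc")) %>%
    mutate(model = label)
}

cv_preds <- bind_rows(
  best_preds(lr_res,         "logistic (L1)"),
  best_preds(svm_linear_res, "SVM linear"),
  best_preds(svm_poly_res,   "SVM polynomial"),
  best_preds(svm_rbf_res,    "SVM radial")
)

# One ROC curve per model in a single graph
cv_preds %>%
  group_by(model) %>%
  roc_curve(truth = sentiment, .pred_0) %>%
  autoplot()
```

`show_best()` on each result gives the corresponding AUC values; accuracy can be pulled from `collect_metrics()`.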

(1.9) Compare the performance of the finalized models on the test set. Which model performs best? (2 points - discussion)
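A sketch for finalizing one model and scoring it on the test set (repeat for each model; object names from the earlier parts are assumed):

```r
final_wf <- finalize_workflow(svm_rbf_wf,
                              select_best(svm_rbf_res, metric = "roc_auc"))

# last_fit() trains on the full training set and evaluates on the test set
final_res <- last_fit(final_wf, data_split,
                      metrics = metric_set(roc_auc, accuracy))

collect_metrics(final_res)
```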