DS-6030 Homework Module 9

Note

You can download the Quarto Markdown file and use it to answer the following questions.

This assignment is only a first glimpse into handling text data. For a detailed introduction to text analytics in tidymodels see Hvitfeldt and Silge (2022, Supervised Machine Learning for Text Analysis in R).

If not otherwise stated, use Tidyverse and Tidymodels for the assignments.

Warning: LLM use

You may use LLMs (ChatGPT, Claude, Copilot, etc.) on this assignment. If you do, you must:

  1. Disclose which LLM you used and roughly what for (concept clarification, code generation, prose review).
  2. Include the prompts for any output you used substantially. Paste them into an “LLM use” appendix at the end of your submission.
  3. Verify the output — LLMs frequently make small but plausible-sounding errors (wrong variable names, made-up tidymodels functions, wrong claims about dataset properties). Check anything you keep against the course material or the actual data.

The grader will spot-check disclosure. Undisclosed LLM use is treated as a citation failure.

1. Sentiment analysis using logistic regression and SVM models

In this assignment, we will build a model to predict the sentiment expressed in Amazon reviews. In order to build a model, we need to convert the text review into a numeric representation. We will use the textrecipes package to process the text data.

The data are taken from the Sentiment Labelled Sentences dataset at the UC Irvine Machine Learning Repository. We use the amazon_cells_labelled.txt data for this assignment.

You will need to install the packages textrecipes and stopwords to complete this assignment.

  1. Setup:

(1.1) Load the data. The file has no column headers. Each line contains a review sentence separated by a tab character (\t) from its sentiment label (0 or 1). Use the tidyverse function read_delim to create a tibble with the column names sentence and sentiment; the base-R read.csv function fails to load this data correctly. The dataset has 1000 rows, which you can use to verify your import. Don’t forget to convert sentiment to a factor. (2 points - coding)
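A minimal sketch of the import, assuming the file sits in your working directory (adjust the path as needed):

```r
library(tidyverse)

reviews <- read_delim(
  "amazon_cells_labelled.txt",
  delim = "\t",
  col_names = c("sentence", "sentiment"),
  col_types = "cf"   # sentence as character, sentiment as factor
)

stopifnot(nrow(reviews) == 1000)  # sanity check against the stated row count
```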

(1.2) Split the dataset into training (80%) and test sets (20%). Prepare resamples from the training set for 10-fold cross validation. (1 point - coding)
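The split and resamples can be set up as below; the object names, the seed value, and the use of stratification are assumptions for illustration:

```r
library(tidymodels)

set.seed(123)  # any fixed seed works; pick your own
review_split <- initial_split(reviews, prop = 0.8, strata = sentiment)
review_train <- training(review_split)
review_test  <- testing(review_split)

review_folds <- vfold_cv(review_train, v = 10, strata = sentiment)
```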

(1.3) Create a recipe to process the text data. The formula is sentiment ~ sentence. Add the following steps to the recipe: (1 point - coding)

  • step_tokenize(sentence) to tokenize the text (split into words).
  • step_tokenfilter(sentence, max_tokens=1000) to remove infrequent tokens, keeping only the 1000 most frequent. This produces a term-frequency matrix (for each token, how often it occurs in each sentence).
  • step_tfidf(sentence) to convert the term frequencies into a term frequency-inverse document frequency (TF-IDF) matrix.
  • Use the step_normalize() function to normalize the data.
  • Use the step_pca() function to reduce the dimensionality of the data. Tune the number of components, num_comp, in a range of 200 to 700.
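The steps above can be sketched as follows; the object names are assumptions carried over from the setup:

```r
library(textrecipes)

review_rec <- recipe(sentiment ~ sentence, data = review_train) |>
  step_tokenize(sentence) |>                             # split into word tokens
  step_tokenfilter(sentence, max_tokens = 1000) |>       # keep 1000 most frequent
  step_tfidf(sentence) |>                                # TF-IDF weighting
  step_normalize(all_numeric_predictors()) |>            # center and scale
  step_pca(all_numeric_predictors(), num_comp = tune())  # tuned in (1.4)-(1.7)
```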
  2. Model training: Create workflows with the recipe from (1.3) and tune the following models:

(1.4) logistic regression with L1 regularization (glmnet engine tuning penalty) (2 points - coding)
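A possible model specification (mixture = 1 selects the L1/lasso penalty on the glmnet engine):

```r
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet") |>
  set_mode("classification")
```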

(1.5) SVM with linear kernel (kernlab engine tuning cost) (2 points - coding)
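The linear-kernel specification follows the same pattern:

```r
svm_lin_spec <- svm_linear(cost = tune()) |>
  set_engine("kernlab") |>
  set_mode("classification")
```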

(1.6) SVM with polynomial kernel (kernlab engine tuning cost and degree; use e.g. degree = degree_int(range=c(2, 5))) (2 points - coding)
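A sketch for the polynomial kernel, with both cost and degree marked for tuning (the degree_int range is applied later when the parameter set is finalized):

```r
svm_poly_spec <- svm_poly(cost = tune(), degree = tune()) |>
  set_engine("kernlab") |>
  set_mode("classification")
```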

(1.7) SVM with radial basis function kernel (kernlab engine tuning cost and rbf_sigma; use rbf_sigma(range=c(-4, 0), trans=log10_trans())) (2 points - coding)
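And the radial basis function kernel:

```r
svm_rbf_spec <- svm_rbf(cost = tune(), rbf_sigma = tune()) |>
  set_engine("kernlab") |>
  set_mode("classification")
```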

Keep the default tuning ranges and only update rbf_sigma as mentioned above. For the PCA preprocessing step, tune num_comp in a range of 200 to 700. Use Bayesian hyperparameter optimization to tune the models. What are the tuned hyperparameters for each model?
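One way to wire this together, sketched for the lasso model; the same pattern applies to each SVM, with the rbf_sigma update added for the radial kernel. The initial and iter values are assumptions, not requirements:

```r
lasso_wf <- workflow() |>
  add_recipe(review_rec) |>
  add_model(lasso_spec)

# Update only the parameters the assignment asks for; keep the rest at defaults.
lasso_params <- lasso_wf |>
  extract_parameter_set_dials() |>
  update(num_comp = num_comp(range = c(200, 700)))
# For the RBF workflow, additionally update:
#   rbf_sigma = rbf_sigma(range = c(-4, 0), trans = log10_trans())

lasso_res <- tune_bayes(
  lasso_wf,
  resamples = review_folds,
  param_info = lasso_params,
  initial = 10,   # size of the initial design (assumed)
  iter = 20,      # Bayesian search iterations (assumed)
  metrics = metric_set(roc_auc, accuracy),
  control = control_bayes(save_pred = TRUE)  # needed for ROC curves in (1.8)
)

select_best(lasso_res, metric = "roc_auc")  # tuned hyperparameters
```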

  3. Model performance:

Once you have tuned the models, fit finalized models and assess their performance.

(1.8) Compare the cross-validation performance of the models using ROC curves (combine in one graph) and performance metrics (AUC and accuracy). Which model performs best? (2 points - discussion)
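A sketch of one way to combine the cross-validation ROC curves, assuming each tuning result was run with control_bayes(save_pred = TRUE) and that the result objects are named as in the earlier sketches:

```r
all_preds <- bind_rows(
  collect_predictions(lasso_res,
    parameters = select_best(lasso_res, metric = "roc_auc")) |>
    mutate(model = "lasso"),
  collect_predictions(svm_rbf_res,
    parameters = select_best(svm_rbf_res, metric = "roc_auc")) |>
    mutate(model = "svm_rbf")
  # ... bind the linear and polynomial SVM results the same way
)

all_preds |>
  group_by(model) |>
  roc_curve(sentiment, .pred_0) |>  # .pred_0 assumes "0" is the first factor level
  autoplot()

collect_metrics(lasso_res)  # AUC and accuracy per candidate
```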

(1.9) Compare the performance of the finalized models on the test set. Which model performs best? (2 points - discussion)
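Finalizing and evaluating on the test set can follow this pattern, shown for one model with the names assumed from the earlier sketches:

```r
final_lasso <- lasso_wf |>
  finalize_workflow(select_best(lasso_res, metric = "roc_auc")) |>
  last_fit(review_split, metrics = metric_set(roc_auc, accuracy))

collect_metrics(final_lasso)  # test-set AUC and accuracy
```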