DS-6030 Homework Module 9

Note

You can download the Quarto Markdown file and use it to answer the following questions.

This assignment is only a first glimpse into handling text data. For a detailed introduction to text analytics in tidymodels see Hvitfeldt and Silge (2022, Supervised Machine Learning for Text Analysis in R).

Unless stated otherwise, use the tidyverse and tidymodels packages for the assignments.

Warning: LLM use

You may use LLMs (ChatGPT, Claude, Copilot, etc.) on this assignment. If you do, you must:

  1. Disclose which LLM you used and roughly what for (concept clarification, code generation, prose review).
  2. Include the prompts for any output you used substantially. Paste them into an “LLM use” appendix at the end of your submission.
  3. Verify the output — LLMs frequently make small but plausible-sounding errors (wrong variable names, made-up tidymodels functions, wrong claims about dataset properties). Check anything you keep against the course material or the actual data.

The grader will spot-check disclosure. Undisclosed LLM use is treated as a citation failure.

1. Sentiment analysis using logistic regression and SVM models (19 points)

In this assignment, we will build a model to predict the sentiment expressed in Amazon reviews. To build the model, we first need to convert the review text into a numeric representation. We will use the textrecipes package to process the text data.

The data are taken from the Sentiment Labelled Sentences dataset at the UC Irvine Machine Learning Repository. We use the amazon_cells_labelled.txt data for this assignment.

You will need to install the packages textrecipes and stopwords to complete this assignment.

  1. Setup:

(1.1) Load the data. The data has no column headers. Each line contains a review sentence separated by a tab character (\t) from the sentiment label (0 or 1). Use the tidyverse function read_delim to load the data into a tibble with the column names sentence and sentiment (the read.csv function fails to load the data correctly). The loaded dataset should have 1000 rows. Convert sentiment to a factor with levels negative (0) and positive (1) so that the positive class is unambiguous in later ROC and metric reporting. (2 points - coding)
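
As an illustration of the loading step, here is a minimal sketch that reads a small tab-separated stand-in file rather than amazon_cells_labelled.txt. The temporary file, its three sentences, and the choice of quote = "" (so that quotation marks inside a sentence are treated as ordinary text) are assumptions made for this example, not part of the assignment.

```r
library(readr)
library(dplyr)

# Toy stand-in for amazon_cells_labelled.txt: sentence <TAB> label, no header.
toy_path <- tempfile(fileext = ".txt")
writeLines(c(
  "Great phone, works as advertised.\t1",
  "The \"charger\" broke after a week.\t0",
  "Battery life is excellent.\t1"
), toy_path)

reviews <- read_delim(
  toy_path,
  delim = "\t",
  col_names = c("sentence", "sentiment"),
  col_types = cols(sentence = col_character(), sentiment = col_integer()),
  quote = ""  # assumption: do not interpret quote characters as field delimiters
) |>
  mutate(
    # 0 -> negative, 1 -> positive; negative is the first factor level
    sentiment = factor(sentiment, levels = c(0, 1),
                       labels = c("negative", "positive"))
  )

nrow(reviews)  # compare against the expected row count after loading the real file
```

With the real file, checking nrow() against the expected 1000 rows is a quick way to confirm that no lines were merged or dropped during parsing.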

(1.2) Split the dataset into training (80%) and test (20%) sets using stratified sampling on sentiment. Use set.seed() so that the split is reproducible. Prepare resamples from the training set for 10-fold cross-validation. (1 point - coding)
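
The split and resampling step might look roughly like the sketch below. It uses a made-up balanced data frame in place of the reviews, and the seed value is arbitrary; stratifying the folds as well as the initial split is a choice made here, not a stated requirement.

```r
library(rsample)

set.seed(6030)  # any fixed seed works; this value is arbitrary

# Toy balanced data standing in for the review sentences.
toy <- data.frame(
  sentence  = paste("review", 1:100),
  sentiment = factor(rep(c("negative", "positive"), each = 50))
)

# 80/20 split, stratified so both sets keep the class balance.
toy_split <- initial_split(toy, prop = 0.8, strata = sentiment)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# 10-fold cross-validation built from the training set only.
toy_folds <- vfold_cv(toy_train, v = 10, strata = sentiment)
```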

(1.3) Create a recipe to process the text data. The formula is sentiment ~ sentence. Add the following steps to the recipe: (1 point - coding)

  • step_tokenize(sentence) to tokenize the text (split into words).
  • step_tokenfilter(sentence, max_tokens=1000) to keep only the 1000 most frequent tokens.
  • step_tfidf(sentence) to convert the filtered tokens into a term frequency–inverse document frequency (TF-IDF) matrix (one column per token, weighted by how often the token appears in this sentence relative to how often it appears across all sentences).
  • step_normalize() to normalize the data.
  • step_pca() to reduce the dimensionality of the data. Tune the number of components (num_comp).
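
The steps listed above can be chained as in the following sketch. The four-row toy_train tibble is a stand-in for the real training set, and applying step_normalize() and step_pca() to all_predictors() is an assumption about which columns to select. The recipe is only defined here, not prepped, because num_comp is left as a tune() placeholder.

```r
library(tidymodels)
library(textrecipes)

# Toy stand-in for the training set from (1.2); only the column names matter here.
toy_train <- tibble(
  sentence  = c("great phone", "terrible battery", "works well", "stopped working"),
  sentiment = factor(c("positive", "negative", "positive", "negative"),
                     levels = c("negative", "positive"))
)

text_rec <- recipe(sentiment ~ sentence, data = toy_train) |>
  step_tokenize(sentence) |>                        # split each sentence into word tokens
  step_tokenfilter(sentence, max_tokens = 1000) |>  # keep the 1000 most frequent tokens
  step_tfidf(sentence) |>                           # one TF-IDF column per retained token
  step_normalize(all_predictors()) |>               # assumption: normalize every TF-IDF column
  step_pca(all_predictors(), num_comp = tune())     # number of components left for tuning

text_rec
```
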

  2. Model training:

Create workflows with the recipe from (1.3) and tune the following four models. For each model, use Bayesian hyperparameter optimization with 10-fold cross-validation, tune num_comp in the PCA step over the range 200 to 700, and keep the default tuning ranges for all other parameters except where stated. Report the tuned hyperparameters for each model.

(1.4) logistic regression with L1 regularization (glmnet engine tuning penalty) (2 points - coding)

(1.5) SVM with linear kernel (kernlab engine tuning cost) (2 points - coding)

(1.6) SVM with polynomial kernel (kernlab engine tuning cost and degree; use e.g. degree = degree_int(range=c(2, 5))) (2 points - coding)

(1.7) SVM with radial basis function kernel (kernlab engine tuning cost and rbf_sigma; use rbf_sigma(range=c(-4, 0), trans=log10_trans())) (2 points - coding)
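
As a rough sketch of how one of these models could be wired up, the block below builds a workflow for the RBF SVM, sets the tuning ranges named above, and wraps the tune_bayes call in a helper. The toy data, the helper name tune_with_bayes, and the values of initial, iter, and no_improve are assumptions chosen for illustration; the block defines the objects without running the (slow) tuning itself.

```r
library(tidymodels)
library(textrecipes)

# Toy stand-in data and recipe; the real ones come from (1.2) and (1.3).
toy_train <- tibble(
  sentence  = c("great phone", "terrible battery", "works well", "stopped working"),
  sentiment = factor(c("positive", "negative", "positive", "negative"),
                     levels = c("negative", "positive"))
)
text_rec <- recipe(sentiment ~ sentence, data = toy_train) |>
  step_tokenize(sentence) |>
  step_tokenfilter(sentence, max_tokens = 1000) |>
  step_tfidf(sentence) |>
  step_normalize(all_predictors()) |>
  step_pca(all_predictors(), num_comp = tune())

# RBF SVM with cost and rbf_sigma marked for tuning.
rbf_spec <- svm_rbf(cost = tune(), rbf_sigma = tune()) |>
  set_engine("kernlab") |>
  set_mode("classification")

rbf_wf <- workflow() |>
  add_recipe(text_rec) |>
  add_model(rbf_spec)

# Start from the default ranges and override only the ones the task specifies.
rbf_params <- extract_parameter_set_dials(rbf_wf) |>
  update(
    num_comp  = num_comp(range = c(200L, 700L)),
    rbf_sigma = rbf_sigma(range = c(-4, 0), trans = log10_trans())
  )

# Hypothetical helper: Bayesian optimization over the given resamples.
tune_with_bayes <- function(wf, folds, params, iter = 25) {
  tune_bayes(
    wf,
    resamples  = folds,
    param_info = params,
    initial    = 10,    # assumed number of initial grid candidates
    iter       = iter,  # assumed cap on search iterations
    metrics    = metric_set(roc_auc, accuracy),
    control    = control_bayes(save_pred = TRUE, no_improve = 10)
  )
}
```

On the real data, a call such as tune_with_bayes(rbf_wf, folds, rbf_params) would return the tuning result; select_best(result, metric = "roc_auc") then gives the hyperparameters to report, and finalize_workflow() plugs them back into the workflow. The other three models follow the same pattern with their own specification and parameter updates.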

  3. Model performance:

Once you have tuned the models, fit finalized models and assess their performance.

(1.8) Compare the cross-validation performance of the models using ROC curves (combine in one graph) and performance metrics (AUC and accuracy). Which model performs best? Hint: to obtain the cross-validation predictions needed for the ROC curves, either set control = control_bayes(save_pred = TRUE) when calling tune_bayes, or refit the finalized workflow on the resamples with fit_resamples and control = control_resamples(save_pred = TRUE). (2 points - discussion)
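
The sketch below shows one way to draw several ROC curves in a single graph and tabulate AUC and accuracy once the out-of-fold predictions of all models are stacked in one data frame. The predictions here are simulated by a hypothetical helper (make_preds) with made-up model names; in the assignment they would come from collect_predictions() on each tuning or resampling result. Because negative is the first factor level, event_level = "second" is passed so that positive is treated as the event.

```r
library(dplyr)
library(yardstick)
library(ggplot2)

set.seed(1)
n <- 200
lvls <- c("negative", "positive")

# Hypothetical helper: simulated out-of-fold predictions with a given signal strength.
make_preds <- function(model, signal) {
  truth <- factor(rep(lvls, each = n / 2), levels = lvls)
  score <- rnorm(n, mean = ifelse(truth == "positive", signal, -signal))
  tibble(
    model          = model,
    sentiment      = truth,
    .pred_positive = plogis(score),
    .pred_class    = factor(ifelse(score > 0, "positive", "negative"), levels = lvls)
  )
}

cv_preds <- bind_rows(make_preds("model A", 1), make_preds("model B", 0.3))

# One ROC curve per model, computed on the stacked predictions.
roc_data <- cv_preds |>
  group_by(model) |>
  roc_curve(sentiment, .pred_positive, event_level = "second")

roc_plot <- ggplot(roc_data,
                   aes(x = 1 - specificity, y = sensitivity, colour = model)) +
  geom_path() +
  geom_abline(linetype = "dashed") +
  coord_equal()

# Metrics per model.
auc_tbl <- cv_preds |>
  group_by(model) |>
  roc_auc(sentiment, .pred_positive, event_level = "second")
acc_tbl <- cv_preds |>
  group_by(model) |>
  accuracy(sentiment, .pred_class)
```

When predictions are saved during tune_bayes, they cover every candidate that was tried, so collect_predictions(result, parameters = select_best(result, metric = "roc_auc")) restricts them to the selected configuration before a model label is added and the results are combined with bind_rows().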

(1.9) Compare the performance of the finalized models on the test set. Which model performs best? (2 points - discussion)
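
For the test-set evaluation, one option is last_fit(), which fits a finalized workflow on the training portion of the split and evaluates it on the test portion. The sketch below only demonstrates the mechanics: it uses the two_class_dat example data from the modeldata package and a plain logistic regression as stand-ins, not the review data or the tuned models.

```r
library(tidymodels)

set.seed(6030)  # arbitrary seed for a reproducible split

# Stand-in data: two numeric predictors (A, B) and a two-level outcome (Class).
data(two_class_dat, package = "modeldata")
demo_split <- initial_split(two_class_dat, prop = 0.8, strata = Class)

# Stand-in for a finalized workflow (no tune() placeholders left).
demo_wf <- workflow() |>
  add_formula(Class ~ A + B) |>
  add_model(logistic_reg() |> set_engine("glm"))

# Fit on the training set, evaluate once on the held-out test set.
demo_fit <- last_fit(demo_wf, demo_split,
                     metrics = metric_set(roc_auc, accuracy))

test_metrics <- collect_metrics(demo_fit)
test_metrics
```

Repeating this for each finalized workflow and binding the collect_metrics() outputs together, with a model label added, yields a single table for the comparison.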

(1.10) Describe your modeling pipeline to an LLM and ask it to critique the approach. A reasonable summary would be:

“I built a sentiment classifier on 1000 Amazon product-review sentences (balanced 500/500). My preprocessing tokenized into unigrams, kept the top 1000 tokens by frequency, computed TF-IDF, normalized, and applied PCA with 200–700 components (tuned by CV). I trained logistic regression with L1, linear SVM, polynomial SVM, and RBF SVM, tuned with Bayesian optimization and 10-fold CV.”

Then ask: “What are the main weaknesses of this approach for sentiment classification, and what would you change?”

Evaluate the LLM’s critique:

  • Which of its points engage with sentiment-specific issues — for example, that TF-IDF’s IDF term down-weights the very words that carry sentiment (because they appear in many documents), that PCA is unsupervised so its top components capture topic and document-length variance rather than class separability, that unigrams cannot represent negation (“not good”), or that pretrained text embeddings would likely outperform this pipeline on 1000 rows?
  • Which of its points are generic ML advice that doesn’t engage with text or sentiment specifically (e.g. “try more models”, “tune more hyperparameters”, “do more cross-validation”)?
  • Did it miss anything that your CV results in (1.8) actually surfaced — for example, which kernel won on this sparse text problem, or whether the kernel SVMs were worth their tuning cost?

Pick at least two of the LLM’s points and judge them against the results you observed in (1.8) and (1.9). Where the LLM is right, say so; where it is wrong or generic, say why.

(3 points - discussion)