Module 10

You can download the R Markdown file (https://gedeck.github.io/DS-6030/homework/Module-10.Rmd) and use it to answer the following questions.

If not otherwise stated, use Tidyverse, Tidymodels, and Tidyclust for the assignments.

1. Analyzing the ANES 2022 Pilot Study - PCA

The ANES 2022 Pilot Study is a cross-sectional survey conducted to test new questions under consideration for potential inclusion in the ANES 2024 Time Series Study and to provide data about voting and public opinion after the 2022 midterm elections in the United States. Information about this study is available at https://electionstudies.org/data-center/2022-pilot-study/.

Setup

Load the data from https://gedeck.github.io/DS-6030/datasets/anes_pilot_2022_csv_20221214/anes_pilot_2022_csv_20221214.csv

The dataset contains information about the respondents' profiles (e.g. birthyr, gender, race, educ, marstat, …) and answers to 235 questions from different categories.

Load and preprocess the data. Besides the answers to the 235 questions (see the PDF questionnaire), the dataset contains a number of variables that are not questions but record how the survey was conducted (see the user's guide and codebook).

(1.1) Identify the feeling thermometer questions. These questions ask respondents to rate their feelings toward a number of groups on a scale from 0 to 100. The questions are stored in variables whose names start with ft.... Identify the names of all feeling thermometer questions, ignoring ftblack and ftwhite: these two were only asked depending on the respondent's race and therefore contain a large number of missing values. Make sure that you exclude timing variables (e.g. ftjourn_page_timing) and order variables from your analysis. You should have 16 columns left. (1 point - coding)
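The column selection described in (1.1) might be sketched as follows. This is only an illustration on a toy tibble, not the survey file: apart from ftjourn_page_timing, ftblack and ftwhite, the column names (including the "_order" suffix) are assumptions, so check the codebook for the real naming and confirm that 16 columns remain.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the survey data. Apart from ftjourn_page_timing, ftblack
# and ftwhite, all column names here are hypothetical.
toy <- tibble(
  caseid              = 1:3,
  ftjourn             = c(50, 70, -7),
  ftfem               = c(85, 15, 60),
  ftblack             = c(-1, 90, -1),
  ftwhite             = c(70, -1, -1),
  ftjourn_page_timing = c(4.2, 3.1, 5.0),
  ftfem_order         = c(1, 2, 1)
)

ft_names <- toy |>
  select(starts_with("ft")) |>                  # keep variables named ft...
  select(-matches("timing|order")) |>           # drop timing and order variables
  select(-any_of(c("ftblack", "ftwhite"))) |>   # drop the race-dependent questions
  names()

ft_names
```

Storing the names in a character vector makes it easy to reuse the same selection later, for example with select(all_of(ft_names)).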

(1.2) If a respondent did not answer a feeling thermometer question, the value is coded as a negative number. Replace the negative values with NA and remove all rows that have an NA value for any of the selected feeling thermometer questions (see the drop_na function). You should have about 1560 data points left. (1 point - coding)
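One way to express the recoding in (1.2) is shown below on a small made-up tibble; the column names and values are placeholders, not survey data. The sketch uses across() with base replace(), which is one of several equivalent options.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Made-up ratings; negative values stand for "no answer"
toy_ft <- tibble(
  ftjourn = c(50, -7, 70, 30),
  ftfem   = c(85, 15, -1, 60)
)

ft_clean <- toy_ft |>
  mutate(across(everything(), ~ replace(.x, .x < 0, NA))) |>  # negative -> NA
  drop_na()                                                   # drop incomplete rows

nrow(ft_clean)
```

On the real data, the across() call would be restricted to the selected feeling thermometer columns rather than everything().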

Principal Component Analysis (PCA)

You should now have a data frame that is suitable for a principal component analysis of the feeling thermometer responses.

Perform a principal component analysis of the feeling thermometer responses using step_pca.

(1.3) Create a scree plot of the eigenvalues. How many components should be considered? (1 point - coding/discussion)
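A minimal sketch of a step_pca recipe and a scree plot follows, using simulated data with made-up column names (ft_a to ft_d). Normalizing the variables before the PCA is an assumption of this sketch; since all thermometers share the 0 to 100 scale, you may decide differently.

```r
library(tidymodels)

# Simulated ratings driven by one latent dimension (illustration only)
set.seed(1)
n <- 200
latent <- rnorm(n)
toy_ft <- tibble(
  ft_a = 50 + 20 * latent + rnorm(n, sd = 5),
  ft_b = 50 + 15 * latent + rnorm(n, sd = 5),
  ft_c = 50 - 20 * latent + rnorm(n, sd = 5),
  ft_d = 50 + rnorm(n, sd = 20)
)

pca_prep <- recipe(~ ., data = toy_ft) |>
  step_normalize(all_numeric_predictors()) |>  # assumption: scale before PCA
  step_pca(all_numeric_predictors()) |>
  prep()

# Step 2 is the PCA step; type = "variance" returns several variance summaries
explained <- tidy(pca_prep, number = 2, type = "variance") |>
  filter(terms == "variance")

ggplot(explained, aes(x = component, y = value)) +
  geom_line() +
  geom_point() +
  labs(x = "Principal component", y = "Eigenvalue (variance)")
```

The other values of terms ("percent variance", "cumulative percent variance", ...) can be plotted the same way if you prefer a scree plot in percent.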

(1.4) Create a biplot using the first two components. You will need to multiply the loadings by a scaling factor to get a readable visualization. (1 point - coding)
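A biplot can be assembled by hand from the scores (bake) and the loadings (tidy with type = "coef"). The sketch below again uses simulated data with made-up column names; the value of arrow_scale is arbitrary and needs to be tuned by eye for the real data.

```r
library(tidymodels)

# Simulated ratings (illustration only)
set.seed(1)
n <- 200
latent <- rnorm(n)
toy_ft <- tibble(
  ft_a = 50 + 20 * latent + rnorm(n, sd = 5),
  ft_b = 50 + 15 * latent + rnorm(n, sd = 5),
  ft_c = 50 - 20 * latent + rnorm(n, sd = 5),
  ft_d = 50 + rnorm(n, sd = 20)
)

pca_prep <- recipe(~ ., data = toy_ft) |>
  step_normalize(all_numeric_predictors()) |>
  step_pca(all_numeric_predictors()) |>
  prep()

# Scores: one row per respondent, columns PC1, PC2, ...
scores <- bake(pca_prep, new_data = NULL)

# Loadings: one row per variable, one column per component
loadings <- tidy(pca_prep, number = 2, type = "coef") |>
  filter(component %in% c("PC1", "PC2")) |>
  pivot_wider(names_from = component, values_from = value)

arrow_scale <- 3  # arbitrary factor so the arrows are visible next to the scores

ggplot(scores, aes(PC1, PC2)) +
  geom_point(alpha = 0.3) +
  geom_segment(
    data = loadings,
    aes(x = 0, y = 0, xend = arrow_scale * PC1, yend = arrow_scale * PC2),
    arrow = arrow(length = unit(0.2, "cm")), color = "red"
  ) +
  geom_text(
    data = loadings,
    aes(x = arrow_scale * PC1, y = arrow_scale * PC2, label = terms),
    color = "red", vjust = -0.5
  )
```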

(1.5) Interpret the first two components. What do they represent? Check the questionnaire for the questions that were asked. (1 point - discussion)

Explore dataset

(1.6) The ANES 2022 Pilot Study is a rich data set. We can map the respondents' profiles and their responses to other questions onto the principal component scatterplot. We start with the respondents' profiles. (2 points - coding/discussion)

Select the following profile data:

gender
educ (education level)
marstat (marital status)

Add steps to convert the columns into factors in your data processing pipeline. See the questionnaire for the meaning of the different factor levels.

Combine the data set with the transformed PCA values.

Create scatterplots of the first two components, add a geom_density2d layer, and use facet_wrap to create a separate plot for each factor level.

Interpret the results. Can you see patterns?
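The workflow of (1.6) could look roughly like the sketch below. The data are simulated, the ft_* names are made up, and gender is kept as raw numeric codes because the level meanings have to come from the questionnaire. Converting the column with step_mutate is one possible choice; because the converted column is no longer numeric, it passes through the PCA untouched and ends up next to the scores.

```r
library(tidymodels)

# Simulated data: one profile column (numeric codes) plus three ratings
set.seed(1)
n <- 300
latent <- rnorm(n)
toy <- tibble(
  gender = sample(c(1, 2), n, replace = TRUE),
  ft_a   = 50 + 20 * latent + rnorm(n, sd = 5),
  ft_b   = 50 - 20 * latent + rnorm(n, sd = 5),
  ft_c   = 50 + rnorm(n, sd = 20)
)

pca_profile <- recipe(~ ., data = toy) |>
  step_mutate(gender = factor(gender)) |>      # profile column becomes a factor
  step_normalize(all_numeric_predictors()) |>  # only the ratings are numeric now
  step_pca(all_numeric_predictors(), num_comp = 2) |>
  prep() |>
  bake(new_data = NULL)                        # gender, PC1, PC2 in one tibble

ggplot(pca_profile, aes(PC1, PC2)) +
  geom_point(alpha = 0.3) +
  geom_density2d() +
  facet_wrap(vars(gender))
```

For the real data, repeating the plot with educ and marstat in facet_wrap gives one panel per factor level for each profile variable.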

(1.7) As an extension of (1.6), we now focus on the answers to the actual questions. Select one of the question categories, formulate a hypothesis, and check whether you find a correlation with the PCA results. The categories are:

2022 Turnout and choice (6-20)
Retrospective turnout and choice 2020 (21-24)
Prospective turnout (25)
Participation (26-33)
Global emotion battery (34-40)
Presidential approval (41-43)
Party identification (44-50)
Ideology (51)
Economic performance (52-54)
Inflation (55-64)
Issue importance (65-79)
Issue ownership (80-92)
Climate change (93-94)
Trust in experts (95-99)
Political disagreement (100-101)
Abortion (102-114)
Abortion emotions (102-114)
Transgender attitudes (123-126)
Guns and crime (127-129)
Immigrant emotions (130-136)
Democratic attitudes / misinformation (137-146)
Electoral integrity (147-159)
Political efficacy (160)
Feeling thermometers (161-179)
Racism (180)
Feminist attitudes (181-185)
Racial resentment (186-190)
Political tolerance (191-193)
Racial stereotypes (194-206)
Identities (207-208)
Identity importance (210-220)
Role of schools (221-226)
Great replacement (227)
Racial privilege (228-235)

Use an approach similar to (1.6) for the analysis. (2 points - coding/discussion)

2. Analyzing the ANES 2022 Pilot Study - Clustering

We continue with the analysis of the ANES 2022 Pilot Study and cluster the respondents based on their answers to the feeling thermometer questions.

Hierarchical clustering

(2.1) Create a hierarchical clustering using the feeling thermometer data with the tidyclust package. Explore a variety of clustering methods (hier_clust). How many clusters should be considered? (2 points - coding)
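As a starting point for (2.1), the sketch below fits hier_clust with two linkage methods on simulated two-group data (made-up column names). The choice of num_clusters = 2 and of the two linkage methods is an assumption for the toy data only; accessing the underlying hclust object through the fit element to draw a dendrogram is also an assumption about the structure of the fitted object.

```r
library(tidymodels)
library(tidyclust)

# Simulated ratings with two well-separated groups (illustration only)
set.seed(1)
toy_ft <- tibble(
  ft_a = c(rnorm(20, mean = 20, sd = 5), rnorm(20, mean = 80, sd = 5)),
  ft_b = c(rnorm(20, mean = 80, sd = 5), rnorm(20, mean = 20, sd = 5))
)

hc_complete <- hier_clust(num_clusters = 2, linkage_method = "complete") |>
  fit(~ ., data = toy_ft)

hc_average <- hier_clust(num_clusters = 2, linkage_method = "average") |>
  fit(~ ., data = toy_ft)

# Dendrogram of the underlying hclust object; large jumps in merge height
# suggest how many clusters to keep
plot(hc_complete$fit, labels = FALSE)

# Cluster membership for each row, in the column .cluster
assignments <- extract_cluster_assignment(hc_complete)
count(assignments, .cluster)
```

Comparing the dendrograms across linkage methods is one way to judge how stable a chosen number of clusters is.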

k-means clustering

(2.2) Use k-means clustering to cluster the respondents based on their answers to the feeling thermometer questions. Use the tidyclust package. (2 points)

Create a k-means clustering with 5 clusters.
Combine the dataset, the results from the PCA, and the k-means clustering in a tibble. [Hint: add to result from (1.6)]
Create a scatterplot of the first two principal components and color the points by the cluster assignment. Describe your observations.
Applying the tidy command to the fitted k-means model extracts the cluster centroids. Visualize the cluster centers in a parallel coordinate plot and interpret the different clusters. It can be helpful to order the variables for the visualization (use scale_x_discrete(limits=c("fttrans", "ftfem", ...)) where the order is defined by the limits argument).
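The k-means steps above can be sketched as follows on simulated data. The sketch uses three clusters and made-up ft_* columns purely for illustration (the assignment asks for five), and it reads the centroids with extract_centroids, which is an alternative to the tidy command mentioned above.

```r
library(tidymodels)
library(tidyclust)

# Simulated ratings with three groups (illustration only)
set.seed(1)
toy_ft <- tibble(
  ft_a = c(rnorm(30, 20, 5), rnorm(30, 50, 5), rnorm(30, 80, 5)),
  ft_b = c(rnorm(30, 80, 5), rnorm(30, 50, 5), rnorm(30, 20, 5)),
  ft_c = c(rnorm(30, 30, 5), rnorm(30, 70, 5), rnorm(30, 30, 5))
)

km_fit <- k_means(num_clusters = 3) |>
  set_engine("stats") |>
  fit(~ ., data = toy_ft)

# Add the cluster assignment (.cluster) to the data; PCA scores could be
# bound in the same way before plotting PC1 against PC2 colored by .cluster
clustered <- bind_cols(toy_ft, extract_cluster_assignment(km_fit))

# Centroids in long format: one row per cluster and variable
centroids <- extract_centroids(km_fit) |>
  pivot_longer(-.cluster, names_to = "question", values_to = "mean_rating")

# Parallel coordinate plot of the cluster centers
ggplot(centroids, aes(question, mean_rating, group = .cluster, color = .cluster)) +
  geom_line() +
  geom_point() +
  scale_x_discrete(limits = c("ft_c", "ft_a", "ft_b"))  # variable order is set here
```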

Explore dataset

Characterize the different clusters.

(2.3) Use the profile data from (1.6) to characterize the different clusters. You can, for example, visualize the distribution of the factor levels in a 100% stacked bar plot (geom_bar(position="fill")). How does the distribution of the factor levels differ between the clusters? Are your observations in agreement with the analysis from (1.6)? (2 points)
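A 100% stacked bar plot of one profile variable per cluster might look like this. The tibble is randomly generated, so the cluster labels and educ codes carry no meaning; the shares table is an optional numeric companion to the plot.

```r
library(tidyverse)

# Random stand-in for the data set with cluster assignments (illustration only)
set.seed(1)
clustered <- tibble(
  .cluster = factor(sample(paste0("Cluster_", 1:3), 200, replace = TRUE)),
  educ     = factor(sample(1:4, 200, replace = TRUE))
)

# Each bar is rescaled to height 1, so the segments show within-cluster shares
ggplot(clustered, aes(x = .cluster, fill = educ)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion")

# The same information as a table
shares <- clustered |>
  count(.cluster, educ) |>
  group_by(.cluster) |>
  mutate(share = n / sum(n)) |>
  ungroup()

shares
```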

(2.4) Now use the questions from (1.7) to characterize the different clusters. How does the distribution of the different factor levels differ between the clusters? Are your observations in agreement with the analysis from (1.7)? (2 points)