Submission Instructions

Please submit your problem set via GauchoSpace. Submit both the .Rmd file and the HTML file it creates. This assignment is due by 11:55pm on Monday, July 20th. No late problem sets will be accepted. Please write your name below, and list any students you collaborated with.

Name:

Students you collaborated with:

Question 1. Traffic-related deaths and cell phones (5 points total).

These data describe traffic-related incidents and cell phone usage across US states. We’re going to use them to see what we can learn about the relationship between cell phone usage and traffic accidents. The dataset is called cellphone_data.Rdata.

The variables are:

Variable Name	Description
year	year
state	state name
state_numeric	a numeric code for each state
population	population within a state
numberofdeaths	number of traffic deaths in the state
cell_subscription	number of cell phone subscriptions in the state (in thousands)
total_miles_driven	total miles driven within a state for that year (in millions of miles)
cell_ban	state has ban on using cell phone while driving
text_ban	state has ban on texting while driving

a. (1 point) Load the data set and provide a scatterplot with cell_subscription on the horizontal axis and numberofdeaths on the vertical axis. Add the regression line to this graph. Specifically, regress number of deaths on cell subscription and place the fitted line onto the scatterplot using abline. Show the summary for the regression model and correctly interpret the coefficient estimates.
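As a reminder of the mechanics involved in part a, here is a minimal sketch on synthetic data. The vectors `x` and `y` are made up for illustration and are not the variables in cellphone_data.Rdata; only the sequence of calls (`lm`, `plot`, `abline`, `summary`) carries over.

```r
# Synthetic data only; these are NOT the cellphone_data values.
set.seed(1)
x <- runif(50, min = 0, max = 100)        # hypothetical predictor
y <- 10 + 2 * x + rnorm(50, sd = 15)      # hypothetical outcome

fit <- lm(y ~ x)                          # regress y on x

plot(x, y, xlab = "x (predictor)", ylab = "y (outcome)")
abline(fit, col = "red")                  # draw the fitted line on the existing plot

summary(fit)                              # coefficient table, p-values, R-squared
```

For the real data, `load()` restores whatever objects were saved in the .Rdata file; calling `ls()` afterwards shows their names.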

b. (1 point) State your null and alternative hypotheses for this model, and correctly interpret the p-value on your slope coefficient.

c. (1 point total) Should the result in 1a be given a causal interpretation? (0.4 points) Why or why not? (0.3 points) Describe one possible confounder. (0.3 points)

d. (1 point total) Now we will use multivariate regression to begin addressing some of the possible confounders, including population and total miles driven. In other words, we will build a more complex model.

Create a new variable, popmil, that is the population in millions (i.e. divide population by one million to create this new variable.) Store this variable in your dataset using “$”. (0.1 point)

Now regress numberofdeaths on cell_subscription, controlling for population and total miles driven by including both popmil and total_miles_driven in the model. (0.1 point)

What are your independent variables? What is your dependent variable? (0.1 point)

Produce a summary table for your model results and correctly interpret these results. That is:

Interpret all the coefficient estimates (0.2 points)
Remark on the significance of each coefficient (except the intercept) (0.1 point)
Most importantly, compared to the previous model, remark on any changes to the estimated coefficient for cell_subscription and its statistical significance. What is going on here? (0.4 points)
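The steps in part d (store a rescaled column with “$”, fit a model with more than one predictor, compare it to the simpler model) can be sketched as follows. The data frame `toy` and its columns `size`, `x`, `y`, and `size_mil` are hypothetical names on simulated data, not the assignment's dataset.

```r
# Simulated data; column names are placeholders, not the assignment's variables.
set.seed(2)
n <- 60
toy <- data.frame(size = runif(n, 5e5, 3e7))            # hypothetical raw count
toy$x <- toy$size / 1000 * 0.9 + rnorm(n, sd = 500)     # predictor that tracks size
toy$y <- toy$size / 1e4 + rnorm(n, sd = 200)            # hypothetical outcome

toy$size_mil <- toy$size / 1e6                          # rescaled column stored with $

fit_simple <- lm(y ~ x, data = toy)                     # one predictor
fit_multi  <- lm(y ~ x + size_mil, data = toy)          # same predictor plus a control

summary(fit_multi)                                      # full coefficient table
coef(fit_simple)["x"]                                   # slope on x without the control
coef(fit_multi)["x"]                                    # slope on x with the control
```

Printing the two slopes side by side is one simple way to see how an estimate changes once a control variable enters the model.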

e. (1 point) What do you conclude from all of this? Specifically, what do these results make you think about the relationships between these variables? In this model, which factor seems to be a statistically significant predictor of traffic-related deaths?

Question 2. Texting bans (4 points total).

a. (1 point) This dataset also tells us whether each state has a ban on texting while driving. Look at the variable text_ban. What values can it take? What are the mean and standard deviation of this variable? Given the mean, would you say that most states do or do not have bans on texting?

Extra credit (0.5 points) What do we call variables like this?

b. (1 point) As a first analysis, you want to know whether the mean number of deaths differs between states with and without bans. Get the difference in means, i.e. the mean number of deaths in states with a text ban minus the mean number of deaths in states without a text ban. Hint: You need to subset your data using “[ ]” or the subset command.

c. (1 point) Get this difference in means again, but by using regression. (0.4 points) (You should get the same result – if you didn’t, you made a mistake somewhere.) Interpret the coefficient estimates and their p-values (both for the intercept and the slope coefficient). (0.6 points)
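The equivalence between a difference in means and a regression on a 0/1 variable can be checked on a toy example. This is a sketch on simulated data; `toy`, `group`, and `y` are hypothetical names, not columns of the assignment's dataset.

```r
# Simulated data with a 0/1 grouping variable; names are placeholders.
set.seed(3)
toy <- data.frame(group = rep(c(0, 1), each = 20))
toy$y <- 100 + 20 * toy$group + rnorm(40, sd = 10)

# Difference in means by subsetting with [ ]
diff_means <- mean(toy$y[toy$group == 1]) - mean(toy$y[toy$group == 0])

# Same quantity from a regression of y on the 0/1 variable
fit <- lm(y ~ group, data = toy)
slope <- unname(coef(fit)["group"])

all.equal(diff_means, slope)                # the two numbers should agree
mean(toy$y[toy$group == 0])                 # group-0 mean, compare with the intercept
unname(coef(fit)["(Intercept)"])
```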

d. (1 point) Can you conclude from this model result whether bans work? State why or why not, following the rules we have discussed in class for what this requires of you to argue.

Question 3. Model predictions (1 point total).

a. (0.2 points) Sometimes our variables can be reconceptualized in ways that are more sensible. Let’s make a new variable that tells us the number of cell subscriptions per person. Rather than trying to control for population by including it as a predictor, construct the variable cellperpop equal to cell_subscription divided by population, times one thousand. (We multiply by a thousand because cell_subscription is reported in thousands of subscriptions, while population is a raw count.) Check that it has a mean of almost 1 (about 0.94) to make sure you made it correctly.

b. (0.2 points) Run a regression of numberofdeaths on text_ban, this time controlling for cell subscriptions per person (cellperpop) and show the regression table using summary.

c. (0.3 points) The actual mean of cellperpop is about 0.94. What number of deaths does the model predict for a hypothetical state with the average number of cell phone subscriptions per person (cellperpop=0.94) that has imposed a text ban?

d. (0.3 points) What number of deaths does the model predict for a hypothetical state with the average number of cell phone subscriptions per person (cellperpop=0.94) that has not imposed a text ban?
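Predictions for hypothetical observations, as in parts c and d, can be obtained either with `predict()` and a `newdata` data frame or by plugging values into the fitted equation by hand. The sketch below uses simulated data; `toy`, `flag`, `rate`, and `y` are hypothetical names chosen for illustration.

```r
# Simulated data; names are placeholders, not the assignment's variables.
set.seed(4)
toy <- data.frame(flag = rep(c(0, 1), each = 25),
                  rate = runif(50, 0.5, 1.5))
toy$y <- 500 - 50 * toy$flag + 300 * toy$rate + rnorm(50, sd = 40)

fit <- lm(y ~ flag + rate, data = toy)

# Two hypothetical observations: flag = 1 and flag = 0, both at the mean of rate
new_obs <- data.frame(flag = c(1, 0), rate = mean(toy$rate))
pred <- predict(fit, newdata = new_obs)

# Same predictions by hand: intercept + b_flag * flag + b_rate * rate
b <- coef(fit)
by_hand <- b["(Intercept)"] + b["flag"] * new_obs$flag + b["rate"] * new_obs$rate

pred
all.equal(unname(pred), unname(by_hand))    # both routes should agree
```

The column names in `newdata` must match the predictor names used in the model formula.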

OPTIONAL Question 4. Practice and extra credit (bonus 1.5 points total).

Extra credit (0.5 points) Predict which states will impose a ban on texting by regressing a dummy for text_ban on other variables. In this case, what is your dependent variable? Choose whatever variables you would like (except for cell_ban) to predict your outcome variable.

Extra credit (0.5 points) Run your regression model, show the summary, and interpret each of the coefficients correctly. Use the $R^2$ statistic reported by the model (the “multiple R-squared” is fine) to describe how well the model works in terms of the variance of $Y$ that is “explained” by the model.

Extra credit (0.5 points) Get the predicted values (probabilities) from your model for each observation. There are multiple ways of doing this, and you can use any of them you would like (google to figure out the coding options). Choose one variable that you had included in your model and use it to create a scatter plot that has that variable for the horizontal axis and the predicted probability of text_ban=1 for the vertical axis.
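One of the several ways to get predicted values from a regression with a 0/1 outcome is `fitted()`, which returns the same numbers as `predict()` called without `newdata`. The sketch below assumes a linear model fit with `lm` on simulated data; `toy`, `x`, `flag`, and `p_hat` are hypothetical names.

```r
# Simulated 0/1 outcome; names are placeholders, not the assignment's variables.
set.seed(5)
toy <- data.frame(x = runif(80, 0, 10))
toy$flag <- rbinom(80, 1, prob = toy$x / 10)

fit <- lm(flag ~ x, data = toy)             # linear model with a 0/1 outcome

toy$p_hat <- fitted(fit)                    # predicted value for each observation

plot(toy$x, toy$p_hat,
     xlab = "x", ylab = "Predicted probability that flag = 1")

summary(fit)$r.squared                      # the "multiple R-squared" from summary()
```

With a linear model, predicted values for a 0/1 outcome are not constrained to lie between 0 and 1, which is worth keeping in mind when reading the plot.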

Problem Set 4 (due Monday, July 20 at 11:55pm)