Introduction: Reminders about R and RMarkdown

Please make sure you have downloaded this file (pset2.Rmd) to your computer and opened it in RStudio. By download, we do not mean you just clicked on it in your browser – we mean you have saved the actual file to a directory on your computer (as pset2.Rmd, not as pset2.Rmd.txt!), and then opened it with RStudio. If you are working with the online RStudio server at https://pols15.lsit.ucsb.edu/, you should have uploaded the pset2.Rmd file and the accompanying data files into your “Problem Sets” directory. You should now be looking at the “raw” text of the .Rmd file.

If you need to re-orient yourself, please review the introductory material at the start of Problem Set 1, which describes how to include R code “chunks” in this .Rmd file. Remember that when you “knit” the .Rmd file, only the code written inside code “chunks” will be executed and have its results integrated into the output HTML file. For example, the code chunk below provides a summary of the built-in dataset called cars. Take a look at the R code that produces it, then click “knit” and see how it shows up in the output HTML file:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Please also remember that you will want to use the console to “try out” code to get it working. Once you get it working, copy the code that worked (not the results) over into a code chunk in your .Rmd. Remember that the code within your .Rmd file has to be self-contained and include all the steps – your .Rmd file will not “remember” what you did on your own in the console. When you click “knit”, it can only execute the code that was present in the .Rmd. Do not copy the results from your console into your .Rmd file. In addition, do not include large amounts of output in your write-up (i.e. don’t print full datasets to the screen).

Include both the code to get your answer and your answer in words.

Finally, it is best to work with small amounts of code at a time: get some code working, copy it into the .Rmd as a code chunk, write your text answer (outside the code chunk) if needed, and check that the file will still knit properly. Do not proceed to answer more questions until you get the first bit working. This will save you huge headaches.

Make sure your final .Rmd file knits correctly, and check as you work – don’t wait until the very end to try knitting your code.

Submission Instructions

Please submit your problem set via GauchoSpace. Submit both the .Rmd file and the HTML file it creates. This assignment is due by 11:55pm on Friday, July 3rd. No late problem sets will be accepted. Please write your name below, and list any students you collaborated with.

Name:

Students you collaborated with:

Question 1 (6 points total).

In this question, we are going to be working with replication data for the 2003 paper “Ethnicity, Insurgency, and Civil War” by Fearon and Laitin (https://www.jstor.org/stable/3118222). The authors argue that, contrary to conventional wisdom, the factors that explain which countries have been at risk for civil war are not their ethnic or religious characteristics but rather the conditions that favor insurgency.

Question 0 Clear your workspace using code.

This is a good programming habit to get into, even if your current environment is empty. This is done for you in the solution below.

Solution 0

rm(list = ls())

a. (0.1 points) Set your working directory. Load the Fearon and Laitin data set (fl2.RData).
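
As a reminder of how loading an .RData file works, here is a minimal sketch. It does not use fl2.RData: the data frame toy and the file toy.RData are made up for illustration, and the file is written to a temporary folder so that the example runs anywhere. In your own answer, the working directory will be whichever folder holds your data files.

```r
# Toy data frame standing in for a real dataset (hypothetical contents).
toy <- data.frame(id = 1:3, value = c(2.5, 4.0, 1.5))

# Save it to a temporary .RData file so the example is self-contained,
# then remove it from the workspace.
tmp_dir <- tempdir()
save(toy, file = file.path(tmp_dir, "toy.RData"))
rm(toy)

# Point R at the folder that holds the file, then load the file by name.
old_wd <- setwd(tmp_dir)           # setwd() returns the previous directory
loaded_names <- load("toy.RData")  # load() returns the names of restored objects
setwd(old_wd)                      # go back to where we started

loaded_names
```

Note that load() restores objects under the names they were saved with, so you do not assign its result to a new dataset name.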

b. (0.3 points total) What are the names of the variables stored in this dataset? (PLEASE DO NOT PRINT THE WHOLE DATASET IN YOUR OUTPUT!) (0.1 points) How many variables do you have? (0.1 points) What is your sample size? (0.1 points)
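
One way to inspect a dataset without printing all of it is sketched below, using the built-in cars dataset rather than the problem set data. The object names n_vars and n_obs are just illustrative.

```r
# Variable names only, not the data itself.
names(cars)

# Number of variables (columns) and observations (rows).
n_vars <- ncol(cars)
n_obs <- nrow(cars)
n_vars
n_obs

# If you want a quick look at the data, print only the first few rows.
head(cars, 3)
```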

c. (0.4 points) The variable gdpenl is lagged GDP per capita, measured in thousands of dollars (using 1985 prices). Show the sample distribution of this variable. Specifically, create (1) a density plot (0.2 points) and (2) a boxplot (0.2 points). Remember, plots need to be labelled.
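
As a sketch of the two plot types in base R, the example below uses the dist variable from the built-in cars dataset. The titles and axis labels are placeholders that you would adapt to your own variable.

```r
# Estimate the density first, then plot it with a title and axis labels.
dens <- density(cars$dist)
plot(dens,
     main = "Density of stopping distance",
     xlab = "Stopping distance (ft)",
     ylab = "Density")

# A boxplot of the same variable, also labelled.
boxplot(cars$dist,
        main = "Boxplot of stopping distance",
        ylab = "Stopping distance (ft)")
```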

d. (0.2 points) Remark on the shape of the distribution (density plot) in question c.

e. (0.4 points) Compute the median and mean of gdpenl and report their values in your code. Then add these values to your plots in c, i.e. add them as vertical lines in your density plot (0.2 points), and as horizontal lines in your boxplot (0.2 points).
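
Reference lines can be added to an existing base R plot with abline(). The sketch below again uses cars$dist as a stand-in; the colors and line types are arbitrary choices, and dist_mean and dist_median are illustrative names.

```r
dist_mean <- mean(cars$dist)
dist_median <- median(cars$dist)

# Density plot with vertical lines (v =) at the mean and median.
plot(density(cars$dist),
     main = "Density of stopping distance",
     xlab = "Stopping distance (ft)")
abline(v = dist_mean, col = "red", lty = 2)
abline(v = dist_median, col = "blue", lty = 3)
legend("topright", legend = c("Mean", "Median"),
       col = c("red", "blue"), lty = c(2, 3))

# Boxplot with horizontal lines (h =) at the same two values.
boxplot(cars$dist,
        main = "Boxplot of stopping distance",
        ylab = "Stopping distance (ft)")
abline(h = dist_mean, col = "red", lty = 2)
abline(h = dist_median, col = "blue", lty = 3)
```

Each abline() call draws on the most recently created plot, so it has to come after the plot() or boxplot() call it belongs to and inside the same code chunk.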

f. (0.2 points) Comment on whether the mean and median are the same and explain why or why not.

g. (0.2 points) Sometimes, we take the log of variables to rescale them. Create a new variable which is the (natural) log of gdpenl. Save this to your fl2 dataset using the $ operator.
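
For reference, adding a new column with the $ operator looks like the sketch below. It works on a copy of the built-in cars dataset, and the names cars_copy and log_dist are made up for the example.

```r
# Work on a copy so the built-in dataset is left untouched.
cars_copy <- cars

# log() is the natural log by default; $ creates the new column.
cars_copy$log_dist <- log(cars_copy$dist)

# Confirm that the new variable is now part of the data frame.
names(cars_copy)
```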

h. (0.5 points) Show the distribution of log(gdpenl) using a density plot. Display the mean and median of this variable on your density plot (as in question e).

i. (0.2 points) Remark on the difference in shape when using the log of the variable. Are your mean and median closer together or farther apart? Why?

In the same dataset, the variable Oil describes whether each country in the dataset is an oil exporter (Oil=1) or not (Oil=0). The variable war describes how many years from 1945 to 1999 that country had a civil war. The variable ethfrac is a measure of how fractionalized ethnic groups are in a given country – specifically, it is the probability that two people randomly drawn from a given country are from different ethnic groups. Values near 0 mean the two people are almost certainly from the same group, and values near 1 mean they are almost certainly from different groups.

j. (0.6 points) What is the mean value of war for oil exporters? (0.3 points) What is the mean value of war for non-oil exporters? (0.3 points)
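
Computing a mean separately for two groups defined by a 0/1 variable can be sketched as follows. The example uses the built-in mtcars dataset, where am is 0 for automatic and 1 for manual transmissions; it is only an analogy for the variables in this problem set.

```r
# Subset with square brackets: mean of mpg where am equals 0, then 1.
mean_automatic <- mean(mtcars$mpg[mtcars$am == 0])
mean_manual <- mean(mtcars$mpg[mtcars$am == 1])
mean_automatic
mean_manual

# tapply() computes the same group means in a single call.
group_means <- tapply(mtcars$mpg, mtcars$am, mean)
group_means
```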

k. (0.2 points) Describe the ethfrac variable: what is the minimum and maximum?

l. (0.2 points) What is the mean value of ethfrac?

Extra credit (0.5 points) Interpret the mean value of ethfrac in words.

m. (0.6 points) Which country has the highest level of ethnic fractionalization? Which one has the lowest? For full credit, use code to determine this!
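
One way to find which observation holds the largest or smallest value of a variable is which.max() and which.min(), sketched here on the built-in mtcars dataset. In mtcars the labels are stored as row names; in another dataset they may instead be a column, so treat the last two lines as an assumption about how the labels are stored.

```r
# Row positions of the largest and smallest mpg values.
i_max <- which.max(mtcars$mpg)
i_min <- which.min(mtcars$mpg)

# Use the positions to look up the matching labels.
rownames(mtcars)[i_max]
rownames(mtcars)[i_min]
```

Both functions return only the first position when several rows tie for the extreme value.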

n. (1 point total) Say you believe that increased ethnic fractionalization causes war. In this case, what are your independent and dependent variables? (0.2 points) Make a scatterplot that shows the relationship between these two variables, including a regression line. Remember to label your plot. (0.4 points) Describe the relationship you find. (0.4 points)
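
A labelled scatterplot with a regression line can be built in base R as sketched below, using speed and dist from the built-in cars dataset as stand-ins for an independent and a dependent variable.

```r
# Fit the regression first: dependent variable ~ independent variable.
fit <- lm(dist ~ speed, data = cars)

# Scatterplot with the independent variable on the x-axis.
plot(cars$speed, cars$dist,
     main = "Stopping distance by speed",
     xlab = "Speed (mph)",
     ylab = "Stopping distance (ft)")

# abline() accepts a fitted lm object and draws its regression line.
abline(fit, col = "red")
```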

o. (0.4 points) What happens to the predicted number of civil war years as you move from no ethnic fractionalization to the highest possible value of ethnic fractionalization?
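
Predicted values at two chosen points of the independent variable can be obtained with predict(). The sketch below uses the cars dataset and compares predictions at the smallest and largest observed speed; the points you choose for your own answer should follow the wording of the question.

```r
fit <- lm(dist ~ speed, data = cars)

# New data must use the same variable name as the model formula.
endpoints <- data.frame(speed = range(cars$speed))
preds <- predict(fit, newdata = endpoints)
preds

# The change in the prediction equals the slope times the change in x.
change <- unname(diff(preds))
change
```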

p. (0.5 points) Based on this scatterplot, can you say that increased ethnic fractionalization causes civil war? If your answer is yes, then provide an argument for it. If your answer is no, then argue why not in terms of comparability and confounders (see lecture).

Question 2 (2 points total).

Suppose you have a random variable \(X\) with expectation \(E[X]=\mu\), and variance given by \(\sigma^2\). You then draw multiple observations from the same distribution. That is, you draw \(X_1, X_2, \ldots, X_n\), each a random variable with expectation \(\mu\) and variance \(\sigma^2\).

a. (0.5 points) When you average these random variables together, what is this called? How do you write it mathematically?

b. (0.5 points) What is the standard deviation of these random variables? How do you write it mathematically?

c. (0.5 points) What is \(E[\overline{X}]\)? Explain with math and words. Specifically, give the name of what \(E[\overline{X}]\) is, and provide the formula in math.

Extra credit (0.5 points) What is the distribution of the sample mean, and by what theorem?

d. (0.5 points) What is \(Var(\overline{X})\)? Explain with math and words. Specifically, give the name of what \(Var(\overline{X})\) is, and provide the formula in math.

Question 3 (2 points total).

a. (1 point) In your own words, explain the difference between these terms and put them in a logical sequence: Estimate, Estimand, Estimator. Give one example of each.

b. (1 point) If you repeatedly draw a sample and take a mean, how will the distribution of the mean change if the variance of the underlying population is small versus big? If your sample size is small versus big? Explain.
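
If you would like to check your reasoning with a simulation, the sketch below repeatedly draws samples and records their means. Everything in it is an assumption made for illustration: the population is normal with mean 0, the helper sample_means() is made up for this example, and the particular values of n, sd and reps are arbitrary.

```r
set.seed(15)

# Draw `reps` samples of size n from a normal population with the given
# standard deviation, and return the mean of each sample.
sample_means <- function(n, sd, reps = 2000) {
  replicate(reps, mean(rnorm(n, mean = 0, sd = sd)))
}

# Standard deviation of the simulated sample means in four scenarios.
spread <- c(
  small_n_small_sd = sd(sample_means(n = 10, sd = 1)),
  small_n_large_sd = sd(sample_means(n = 10, sd = 5)),
  large_n_small_sd = sd(sample_means(n = 100, sd = 1)),
  large_n_large_sd = sd(sample_means(n = 100, sd = 5))
)
round(spread, 2)
```

Compare the four numbers, and consider plotting a histogram of each set of sample means, before writing your explanation in words.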