Homework 9

C5.2

Use the data in GPA2.RAW for this exercise.

(i) Using all 4,137 observations, estimate the equation

colgpa = \beta_{0} + \beta_{1} hsperc + \beta_{2} sat + u

and report the results in the standard form.

The regression results with the full data set are

\begin{array}{llll} \widehat{colgpa} = & 1.392 & -0.01352\ hsperc & +0.001476\ sat\\ & (0.0715) & (0.000549) & (6.531 \times 10^{-5}) \\ \multicolumn{4}{l}{n = 4137, R^{2} = 0.273, Adjusted R^{2} = 0.273}\\ \end{array}

(ii) Reestimate the equation in part (i), using the first 2,070 observations.

The regression results with the first 2,070 observations are

\begin{array}{llll} \widehat{colgpa} = & 1.436 & -0.01275\ hsperc & +0.001468\ sat\\ & (0.0978) & (0.000719) & (8.858 \times 10^{-5}) \\ \multicolumn{4}{l}{n = 2070, R^{2} = 0.283, Adjusted R^{2} = 0.282}\\ \end{array}

(iii) Find the ratio of the standard errors on hsperc from parts (i) and (ii). Compare this with the results from (5.10).

The ratio of the standard errors is

0.000549/0.000719 = 0.764

Equation (5.10) states that

Se(\widehat{\beta_{j}}) \approx c_{j}/\sqrt{n}

where c_{j} is a positive constant that does not depend on n. That is, for two samples of sizes n_{1} and n_{2},

Se(\widehat{\beta_{j}})_{1}\, \sqrt{n_{1}} \approx Se(\widehat{\beta_{j}})_{2}\, \sqrt{n_{2}}

so that

Se(\widehat{\beta_{j}})_{1} / Se(\widehat{\beta_{j}})_{2} \approx \sqrt{n_{2}} / \sqrt{n_{1}}

For parts (i) and (ii)

\sqrt{2070} / \sqrt{4137} = 0.707

This is within about 7.5% of the ratio of the standard errors.

For the full dataset, the constant c_{j} is 0.0353.

To examine convergence, compare this value of c_{j} with values computed from subsets of different sizes. That is, regress
the model on two sets of 2068 observations (half the dataset), then on three sets of 1379, four sets of 1034, five sets of 827,
and ten sets of 413. From each regression, multiply the standard error on \beta_{1} by the square root of the number of
observations in the subset, and plot these constants against the subset size. The plot is shown here, and the R code is listed below.
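The listing at the end of this section does this with one loop per subset size; a more compact sketch of the same computation (assuming the same gpa2.csv file, column names, and model as in that listing) is:

    # split the first k*m rows into k chunks of m rows each, for several k,
    # and compute se(beta_hsperc) * sqrt(m) for every chunk
    mydata <- read.table("gpa2.csv", sep = ",", header = TRUE, na.strings = ".")
    N <- nrow(mydata)
    results <- do.call(rbind, lapply(c(10, 5, 4, 3, 2, 1), function(k) {
      m <- N %/% k                             # observations per chunk
      t(sapply(seq_len(k), function(i) {
        rows <- ((i - 1) * m + 1):(i * m)      # rows of the i-th chunk
        fit  <- summary(lm(colgpa ~ hsperc + sat, data = mydata[rows, ]))
        c(m = m, c_j = fit$coefficients["hsperc", "Std. Error"] * sqrt(m))
      }))
    }))
    plot(results, xlab = "number of points per dataset", ylab = "stderr * sqrt(n)")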

References

Problem and data from Wooldridge Introductory Econometrics: A Modern Approach, 4e.

Listing of the R program.

# 
#  10 Oct 2012 D.S.Dixon
# 

# GPA2.DES
# 
# sat       tothrs    colgpa    athlete   verbmath  hsize     hsrank    hsperc   
# female    white     black     hsizesq   
# 
#   Obs:  4137
# 
#   1. sat                      combined SAT score
#   2. tothrs                   total hours through fall semest
#   3. colgpa                   GPA after fall semester
#   4. athlete                  =1 if athlete
#   5. verbmath                 verbal/math SAT score
#   6. hsize                    size graduating class, 100s
#   7. hsrank                   rank in graduating class
#   8. hsperc                   high school percentile, from top
#   9. female                   =1 if female
#  10. white                    =1 if white
#  11. black                    =1 if black
#  12. hsizesq                  hsize^2
# 

source("RegReportLibrary.R")

mydata <- read.table("gpa2.csv", sep=",",  header = TRUE, na.strings = ".")

myfit0 <- lm(colgpa~hsperc + sat, data=mydata)
output0 <- summary(myfit0)
print(output0)
wordpressFormat(myfit0)

nfull <- length(output0$residuals)
sefull <- output0$coefficients[2,2]

myfit1 <- lm(colgpa~hsperc + sat, data=mydata[1:2070,])
output1 <- summary(myfit1)
print(output1)
wordpressFormat(myfit1)

separt <- output1$coefficients[2,2]
npart <- length(output1$residuals)

cat("ratio of standard errors: ",(sefull/separt),"\n")
cat("ratio of square root observations ",((npart/nfull)^0.5),"\n")

N <- length(mydata$colgpa)

N1 <- as.integer(N/10)
N2 <- as.integer(N/5)
N3 <- as.integer(N/4)
N4 <- as.integer(N/3)
N5 <- as.integer(N/2)

mat <- matrix(nrow=25,ncol=2)
row <- 1

sqrtn <- sqrt(N1)
for(i in 0:9){
 start <- N1 * i + 1      # first row of the i-th chunk (R rows are 1-indexed)
 end <- N1 * (i + 1)
 output1 <- summary(lm(colgpa~hsperc + sat, data=mydata[start:end,]))
 mat[row,] <- c(N1,output1$coefficients[2,2]*sqrtn)
 row <- row + 1
}

sqrtn <- sqrt(N2)
for(i in 0:4){
 start <- N2 * i + 1
 end <- N2 * (i + 1)
 output1 <- summary(lm(colgpa~hsperc + sat, data=mydata[start:end,]))
 mat[row,] <- c(N2,output1$coefficients[2,2]*sqrtn)
 row <- row + 1
}

sqrtn <- sqrt(N3)
for(i in 0:3){
 start <- N3 * i + 1
 end <- N3 * (i + 1)
 output1 <- summary(lm(colgpa~hsperc + sat, data=mydata[start:end,]))
 mat[row,] <- c(N3,output1$coefficients[2,2]*sqrtn)
 row <- row + 1
}

sqrtn <- sqrt(N4)
for(i in 0:2){
 start <- N4 * i + 1
 end <- N4 * (i + 1)
 output1 <- summary(lm(colgpa~hsperc + sat, data=mydata[start:end,]))
 mat[row,] <- c(N4,output1$coefficients[2,2]*sqrtn)
 row <- row + 1
}

sqrtn <- sqrt(N5)
for(i in 0:1){
 start <- N5 * i + 1
 end <- N5 * (i + 1)
 output1 <- summary(lm(colgpa~hsperc + sat, data=mydata[start:end,]))
 mat[row,] <- c(N5,output1$coefficients[2,2]*sqrtn)
 row <- row + 1
}

sqrtn <- sqrt(N)
mat[row,] <- c(N,output0$coefficients[2,2]*sqrtn)

## make a plot of constants
png("f0d62b075365c15a9efcad7a9c046938.png")
plot(mat, xlab="number of points per dataset", ylab="stderr")
dev.off()

cat("c = ",output0$coefficients[2,2]*sqrtn,"\n")

Homework 8

C4.2

(i) Using the same model as Problem 3.4, state and test the null hypothesis that the rank of law schools has no ceteris paribus effect on median starting salary.

The median starting salary for new law school graduates is determined by

log(salary) = β0 + β1 LSAT + β2 GPA + β3 log(libvol) + β4 log(cost) + β5 rank + u

where LSAT is the median LSAT score for the graduating class, GPA is the median college GPA for the class, libvol is the number of volumes in the law school library, cost is the annual cost of attending law school, and rank is a law school ranking (with rank = 1 being the best).

The null hypothesis that rank has no effect on log(salary) is

H0: β5 = 0

The alternative is

H1: β5 ≠ 0

which is a two-tailed test.

The regression results are:

    Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
    (Intercept)  8.3432330  0.5325192  15.667  < 2e-16 ***
    LSAT         0.0046964  0.0040105   1.171  0.24373    
    GPA          0.2475247  0.0900370   2.749  0.00683 ** 
    llibvol      0.0949926  0.0332544   2.857  0.00499 ** 
    lcost        0.0375544  0.0321061   1.170  0.24426    
    rank        -0.0033246  0.0003485  -9.541  < 2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.1124 on 130 degrees of freedom
        (20 observations deleted due to missingness)
    Multiple R-squared: 0.8417, Adjusted R-squared: 0.8356 
    F-statistic: 138.2 on 5 and 130 DF,  p-value: < 2.2e-16

The estimated model is

log(salary) = 8.34 + 0.00470 LSAT + 0.248 GPA + 0.0950 log(libvol) + 0.0376 log(cost) – 0.00332 rank
             (0.53)  (0.0040)       (0.090)     (0.033)              (0.032)            (0.00035)
n = 136, Adj. R2 = 0.8356

Note from the regression results that the t-value on rank is -9.541, which is highly significant. The two-tailed critical value at the 1% level with 120 degrees of freedom is 2.617, and clearly |t| >> 2.617. Thus we reject the null hypothesis.

(ii) Are features of the incoming class of students — namely, LSAT and GPA — individually or jointly significant for explaining salary? (Be sure to account for missing data on LSAT and GPA.)

Based on the t-values in the regression results, the coefficient on LSAT is not individually significant, but GPA is significant at the 1% level, so we expect the two to be jointly significant as well.

unrestricted SSR = 1.6427, df = 130
restricted SSR = 1.8942, df = 132
F(2,130) = 9.95, 1% critical value for F(2,inf) = 4.61
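
Written out, the F statistic for this restriction is

F = \frac{(SSR_{r} - SSR_{ur})/q}{SSR_{ur}/(n - k - 1)} = \frac{(1.8942 - 1.6427)/2}{1.6427/130} \approx 9.95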

Thus, at the 1% level, we reject the null hypothesis that LSAT and GPA are jointly insignificant. From the R linearHypothesis test,

    Linear hypothesis test

    Hypothesis:
    LSAT = 0
    GPA = 0

    Model 1: restricted model
    Model 2: lsalary ~ LSAT + GPA + llibvol + lcost + rank

        Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
    1    132 1.8942                                  
    2    130 1.6427  2   0.25151 9.9517 9.518e-05 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Here, even at the 0.1% level, we reject the null hypothesis that LSAT and GPA are jointly insignificant.

(iii) Test whether incoming class size (clsize) or the size of the faculty (faculty) needs to be added to this equation; carry out a single test. (Be careful to account for missing data on clsize and faculty.)

To test the joint significance of these variables, first create a model that includes them. Then, do the test manually with an unrestricted regression of this model, then a restricted model that omits these two variables.

For the unrestricted model

unrestricted SSR = 1.5732, df = 123
restricted SSR = 1.5974, df = 125
F(2,123) = 0.9484, 10% critical value for F(2,120) = 2.35
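
The corresponding p-value can be computed directly in base R from the F statistic above:

    # p-value for F = 0.9484 with 2 and 123 degrees of freedom
    pf(0.9484, df1 = 2, df2 = 123, lower.tail = FALSE)   # about 0.39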

Thus, at the 10% level, we fail to reject the null hypothesis that clsize and faculty are jointly insignificant. From the R linearHypothesis test,

    Linear hypothesis test

    Hypothesis:
    clsize = 0
    faculty = 0

    Model 1: restricted model
    Model 2: lsalary ~ LSAT + GPA + llibvol + lcost + rank + clsize + faculty

        Res.Df    RSS Df Sum of Sq      F Pr(>F)
    1    125 1.5974                           
    2    123 1.5732  2  0.024259 0.9484 0.3902

Here the p-value is 0.39, so we again fail to reject the null hypothesis that clsize and faculty are jointly insignificant.

(iv) What factors might influence the rank of the law school that are not included in the salary regression?

There are salary differences based on gender and race/ethnicity across the labor market, and these may have some correlation with rank for a few law schools, but probably not over the entire data set. Individual programs are frequently ranked by the frequency and quality of publications by their faculty, so that is very likely to correlate with rank.

References

Problem and data from Wooldridge Introductory Econometrics: A Modern Approach, 4e.

Listing of the R program.

# 
#  1 Oct 2012 D.S.Dixon
# 
# 
# LAWSCH85.DES
# 
# rank      salary    cost      LSAT      GPA       libvol    faculty   age      
# clsize    north     south     east      west      lsalary   studfac   top10    
# r11_25    r26_40    r41_60    llibvol   lcost     
# 
#   Obs:   156
# 
#   1. rank                     law school ranking
#   2. salary                   median starting salary
#   3. cost                     law school cost
#   4. LSAT                     median LSAT score
#   5. GPA                      median college GPA
#   6. libvol                   no. volumes in lib., 1000s
#   7. faculty                  no. of faculty
#   8. age                      age of law sch., years
#   9. clsize                   size of entering class
#  10. north                    =1 if law sch in north
#  11. south                    =1 if law sch in south
#  12. east                     =1 if law sch in east
#  13. west                     =1 if law sch in west
#  14. lsalary                  log(salary)
#  15. studfac                  student-faculty ratio
#  16. top10                    =1 if ranked in top 10
#  17. r11_25                   =1 if ranked 11-25
#  18. r26_40                   =1 if ranked 26-40
#  19. r41_60                   =1 if ranked 41-60
#  20. llibvol                  log(libvol)
#  21. lcost                    log(cost)
# 

mydata <- read.table("LAWSCH85.csv", sep=",",  header = TRUE, na.strings = ".", )

print(summary(mydata))

# eliminate missing data in LSAT and GPA
cleandata<-mydata[!is.na(mydata$LSAT) & !is.na(mydata$GPA),]

# unrestricted model
myfit0<-lm(lsalary~LSAT+GPA+llibvol+lcost+rank, data=cleandata)
fitsum0 <- summary(myfit0)

SSR0 <- deviance(myfit0)
df0 <- df.residual(myfit0)

print(fitsum0)
print(paste("n = ",length(myfit0$residuals)))

# restricted model (LSAT=0, GPA=0)
myfit0Rest<-lm(lsalary~llibvol+lcost+rank, data=cleandata)
fitsum0Rest <- summary(myfit0Rest)

SSR0Rest <- deviance(myfit0Rest)
df0Rest <- df.residual(myfit0Rest)

print(fitsum0Rest)
print(paste("n = ",length(myfit0Rest$residuals)))

print(paste("SSR0 = ",SSR0))
print(paste("SSR0Rest = ",SSR0Rest))
q <- df0Rest - df0
nk1 <- df0
print(paste("df0 = ",df0))
print(paste("df0Rest = ",df0Rest))
print(paste("q = ",q))
print(paste("nk1 = ",nk1))
F0 <- ((SSR0Rest - SSR0)/q)/(SSR0/nk1)
print("old school")
print(paste("F = ",F0))

library(car)

# joint significance of LSAT and GPA
hypmatrix <- rbind(c(0,1,0,0,0,0),c(0,0,1,0,0,0))
rhs <- c(0,0)

myfit<-lm(lsalary~LSAT+GPA+llibvol+lcost+rank, data=mydata)
hyp <- linearHypothesis(myfit, hypmatrix, rhs)

print("new school")
print(hyp)

# now eliminate missing data from clsize and faculty
cleanerdata<-cleandata[!is.na(cleandata$clsize) & !is.na(cleandata$faculty),]

print(summary(cleanerdata))

# unrestricted model
myfit1<-lm(lsalary~LSAT+GPA+llibvol+lcost+rank+clsize+faculty, data=cleanerdata)
fitsum1 <- summary(myfit1)
SSR1 <- deviance(myfit1)
df1 <- df.residual(myfit1)

# restricted model (clsize=0, faculty=0)
myfit1Rest<-lm(lsalary~LSAT+GPA+llibvol+lcost+rank, data=cleanerdata)
fitsum1Rest <- summary(myfit1Rest)
SSR1Rest <- deviance(myfit1Rest)
df1Rest <- df.residual(myfit1Rest)

print("Testing clsize and faculty")

print(paste("n = ",length(myfit1Rest$residuals)))

print(paste("SSR1 = ",SSR1))
print(paste("SSR1Rest = ",SSR1Rest))
q <- df1Rest - df1
nk1 <- df1
print(paste("df1 = ",df1))
print(paste("df1Rest = ",df1Rest))
print(paste("q = ",q))
print(paste("nk1 = ",nk1))
F1 <- ((SSR1Rest - SSR1)/q)/(SSR1/nk1)
print("old school")
print(paste("F = ",F1))

library(car)

# joint significance of clsize and faculty
hypmatrix <- rbind(c(0,0,0,0,0,0,1,0),c(0,0,0,0,0,0,0,1))
rhs <- c(0,0)

myfit<-lm(lsalary~LSAT+GPA+llibvol+lcost+rank+clsize+faculty, data=mydata)
hyp <- linearHypothesis(myfit, hypmatrix, rhs)

print("new school")
print(hyp)

Homework 7

4.2

Consider an equation to explain salaries of CEOs in terms of annual firm sales, return on equity (roe, in percentage form), and return on the firm’s stock (ros, in percentage form):

log(salary) = β0 + β1 log(sales) + β2 roe + β3 ros + u

(i) In terms of the model parameters, state the null hypothesis that, after controlling for sales and roe, ros has no effect on CEO salary. State the alternative that better stock market performance increases a CEO’s salary.

The null hypothesis that ros has no effect is

H0: β3 = 0

The alternative that better stock market performance increases a CEO’s salary is

H1: β3 > 0

That is, a one-tailed test.

(ii) Using the data in CEOSAL1.RAW, the following equation was obtained by OLS:

log(salary) = 4.32 + .280 log(sales) + .0174 roe + .00024 ros
             (.32)   (.035)            (.0041)     (.00054)
n = 206, R2 = .283

By what percentage is salary predicted to increase if ros increases by 50 points? Does ros have a practically large effect on salary?

For an increase of ros by 50 points, the proportional effect on salary is .00024(50) = 0.012, or 1.2%. Practically, this is a very small change in salary for a very dramatic change in stock performance.

(iii) Test the null hypothesis that ros has no effect on salary against the alternative that ros has a positive effect. Carry out the test at the 10% significance level.

From Table G.2, the 10% critical value for a one-tailed test with infinite degrees of freedom is 1.282. The t-statistic of ros is .00024/.00054 = 0.44, which is much less than the critical value. Thus, we fail to reject the null hypothesis at the 10% significance level.
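
Both numbers are easy to reproduce in base R (the critical value uses the normal distribution, since the degrees of freedom are effectively infinite):

    qnorm(0.90)         # one-tailed 10% critical value, about 1.282
    0.00024 / 0.00054   # t statistic for ros, about 0.44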

(iv) Would you include ros in a final model explaining CEO compensation in terms of firm performance? Explain.

I would include it. Since the other variables are highly significant, it is unlikely that including ros is having any negative impact. Many readers will assume that ros affects CEO salary, so addressing the question, even if a bit ambiguously, is instructive.

References

Problem and data from Wooldridge Introductory Econometrics: A Modern Approach, 4e.

Homework 6

3.12

The following equation represents the effects of tax revenue mix on subsequent employment growth for a population of counties in the United States:

growth = β0 + β1 shareP + β2 shareI + β3 shareS + other factors

where growth is the percentage change in employment from 1980 to 1990, shareP is the share of property taxes in total tax revenue, shareI is the share of income tax revenues, and shareS is the share of sales tax revenues. All of these variables are measured in 1980. The omitted share, shareF, includes fees and miscellaneous taxes. By definition, the four shares add up to one. Other factors would include expenditure on education, infrastructure, and so on (all measured in 1980).

(i) Why must we omit one of the tax share variables from the equation?

Because the four shares add up to one, any one of them is an exact linear function of the other three. Including all four would therefore introduce perfect collinearity, violating the Gauss-Markov assumption that no explanatory variable is an exact linear combination of the others. Note that varying any one of the three included shares ceteris paribus means, by definition, that the omitted share shareF is also changing. Were all four included, there would be no way to vary any single share while holding the others fixed, and therefore no way to interpret the coefficients.
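
As a purely hypothetical illustration (simulated data, not the county data described in the problem), R's lm() reports an NA coefficient when all four shares are included, because one of them is an exact linear combination of the others:

    # simulated shares that sum to one for each observation
    set.seed(1)
    n   <- 200
    raw <- matrix(runif(4 * n), ncol = 4)
    shr <- as.data.frame(raw / rowSums(raw))
    names(shr) <- c("shareP", "shareI", "shareS", "shareF")
    shr$growth <- 1 + 2 * shr$shareP + rnorm(n)
    # shareF is dropped (NA): intercept + shareP + shareI + shareS already span it
    coef(lm(growth ~ shareP + shareI + shareS + shareF, data = shr))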

(ii) Give a careful interpretation of β1.

β1 is the marginal effect of the property-tax share on employment growth: the change in percentage employment growth per unit change in shareP, holding shareI, shareS, and the other factors fixed. A full unit change in a share is not realistic, however, since shares lie between zero and one. The quantity (0.01)β1 is more useful: it is the change in percentage employment growth when the property-tax share rises by one percentage point of total tax revenue. Note that a ceteris paribus increase in shareP of one percentage point necessarily means a one-percentage-point decrease in shareF, since the included shares are held fixed.

C3.6

Use the data set in WAGE2.RAW for this problem. As usual, be sure all of the following regressions contain an intercept.

(i) Run a simple regression of IQ on educ to obtain the slope coefficient, say δ1.

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  53.6872     2.6229   20.47   <2e-16 ***
educ          3.5338     0.1922   18.39   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 12.9 on 933 degrees of freedom
Multiple R-squared: 0.2659, Adjusted R-squared: 0.2652 
F-statistic:   338 on 1 and 933 DF,  p-value: < 2.2e-16

The estimated model is

IQ = 53.6872 + 3.5338 educ

so

δ1 = 3.5338

(ii) Run the simple regression of log(wage) on educ, and obtain the slope coefficient, β1.

    Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
    (Intercept) 5.973062   0.081374   73.40   <2e-16 ***
    educ        0.059839   0.005963   10.04   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.4003 on 933 degrees of freedom
    Multiple R-squared: 0.09742,    Adjusted R-squared: 0.09645 
    F-statistic: 100.7 on 1 and 933 DF,  p-value: < 2.2e-16

The estimated model is

ln(wage) = 5.9731 + 0.059839 educ

so

β1 = 0.059839

(iii) Run the multiple regression of log(wage) on educ and IQ, and obtain the slope coefficients, β1 and β2, respectively.

    Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
    (Intercept) 5.6582876  0.0962408  58.793  < 2e-16 ***
    educ        0.0391199  0.0068382   5.721 1.43e-08 ***
    IQ          0.0058631  0.0009979   5.875 5.87e-09 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.3933 on 932 degrees of freedom
    Multiple R-squared: 0.1297, Adjusted R-squared: 0.1278 
    F-statistic: 69.42 on 2 and 932 DF,  p-value: < 2.2e-16

The estimated model is

ln(wage) = 5.6582876 + 0.039120 educ + 0.0058631 IQ

so

β1 = 0.039120
β2 = 0.0058631

(iv) Verify that β1 from part (ii) equals β1 + β2δ1 from parts (iii) and (i).

β1 + β2δ1 = 0.039120 + (0.0058631)(3.5338) = 0.059839

which is identical to the slope coefficient on educ in part (ii).
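
Using the object names from the R listing at the end of this section, the same identity can be checked directly from the fitted models:

    # slope from part (ii) versus beta1 + beta2*delta1 from parts (iii) and (i)
    coef(myfit1)["educ"]
    coef(myfit2)["educ"] + coef(myfit2)["IQ"] * coef(myfit0)["educ"]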

References

Problem and data from Wooldridge Introductory Econometrics: A Modern Approach, 4e.

Listing of the R program used to answer these questions.

# 
#  1 Oct 2012 D.S.Dixon
# 
# 
# WAGE2.DES
# 
# wage      hours     IQ        KWW       educ      exper     tenure    age      
# married   black     south     urban     sibs      brthord   meduc     feduc    
# lwage     
# 
#   Obs:   935
# 
#   1. wage                     monthly earnings
#   2. hours                    average weekly hours
#   3. IQ                       IQ score
#   4. KWW                      knowledge of world work score
#   5. educ                     years of education
#   6. exper                    years of work experience
#   7. tenure                   years with current employer
#   8. age                      age in years
#   9. married                  =1 if married
#  10. black                    =1 if black
#  11. south                    =1 if live in south
#  12. urban                    =1 if live in SMSA
#  13. sibs                     number of siblings
#  14. brthord                  birth order
#  15. meduc                    mother's education
#  16. feduc                    father's education
#  17. lwage                    natural log of wage
# 

mydata <- read.table("WAGE2.csv", sep=",",  header = TRUE, na.strings = ".")

print(summary(mydata))

myfit0<-lm(IQ~educ, data=mydata)

print(summary(myfit0))

myfit1<-lm(lwage~educ, data=mydata)

print(summary(myfit1))

myfit2<-lm(lwage~educ+IQ, data=mydata)

print(summary(myfit2))

Homework 5

3.4

The median starting salary for new law school graduates is determined by

log(salary) = β0 + β1 LSAT + β2 GPA + β3 log(libvol) + β4 log(cost) + β5 rank + u

where LSAT is the median LSAT score for the graduating class, GPA is the median college GPA for the class, libvol is the number of volumes in the law school library, cost is the annual cost of attending law school, and rank is a law school ranking (with rank = 1 being the best).

(i) Explain why we expect β5 ≤ 0.

The lower the rank the higher the perceived quality, so we expect the marginal benefit of rank to be negative.

(ii) What signs do you expect for the other slope parameters? Justify your answers.

Ceteris paribus, better students get better salaries, so positive values are expected for the coefficients on LSAT and GPA (β1 and β2, respectively). Ceteris paribus, graduates from better schools get better salaries, so positive values are also expected for the coefficients on log(libvol) and log(cost) (β3 and β4, respectively), as these are proxies for overall school quality.

(iii) Using the data in LAWSCH85.RAW, the estimated equation is

log(salary) = 8.34 + .0047 LSAT + .248 GPA + .095 log(libvol) + .038 log(cost) – .0033 rank
n = 136, R2 = .842

What is the predicted ceteris paribus difference in salary for schools with a median GPA different by one point? (Report your answer as a percentage.)

The coefficient on GPA is .248, meaning that, ceteris paribus, a one point difference in GPA is predicted to result in a 24.8% change in starting salary.
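
Because 0.248 is a fairly large change in logs, the exact percentage difference is somewhat larger than this approximation:

100 [exp(0.248) - 1] ≈ 28.1%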

(iv) Interpret the coefficient on the variable log(libvol).

The coefficient on log(libvol) is the library-volume elasticity of salary. That is, ceteris paribus, a one percent change in the number of volumes in the library is predicted to result in a 0.095% change in starting salary.

(v) Would you say it is better to attend a higher ranked law school? How much is a difference in ranking of 20 worth in terms of predicted starting salary?

With “better” measured by starting salary, there is a strong relationship between rank and log(salary). An improvement of 20 places in the ranking (Δrank = -20), ceteris paribus, is predicted to result in

Δlog(salary) = -0.0033 (-20) = .066

or a 6.6% higher starting salary.

C3.4

(i) Obtain the minimum, maximum, and average values for the variables atndrte, priGPA, and ACT.

column     minimum   maximum      mean
atndrte       6.25    100.00     81.71
priGPA        0.857     3.930     2.587
ACT          13.00     32.00     22.51

(ii) Estimate the model

atndrte = β0 + β1 priGPA + β2 ACT + u

and write the results in equation form. Interpret the intercept. Does it have a useful meaning?

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   75.700      3.884   19.49   <2e-16 ***
priGPA        17.261      1.083   15.94   <2e-16 ***
ACT           -1.717      0.169  -10.16   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.38 on 677 degrees of freedom
Multiple R-squared: 0.2906, Adjusted R-squared: 0.2885 
F-statistic: 138.7 on 2 and 677 DF,  p-value: < 2.2e-16

The estimated model is

atndrte = 75.7 + 17.26 priGPA – 1.717 ACT

The intercept is the attendance rate for a student with zero GPA and zero ACT score. Neither of these is likely. Furthermore, the data set has no values anywhere near these, as shown in part (i) above.

(iii) Discuss the estimated slope coefficients. Are there any surprises?

All of the coefficients are significant at any level. The coefficient on priGPA is large and positive, indicating a strong correlation between grades and attendance. The coefficient on ACT is about the same magnitude (given the ranges of priGPA and ACT) yet negative. This indicates that a test of aptitude taken a year or two before college is negatively correlated with attendance. There are many possible interpretations, including that the students with higher aptitude don’t need to attend classes as frequently, or that talented high school students are lazy college students, or that talent goes away after high school graduation.

(iv) What is the predicted atndrte if priGPA = 3.65 and ACT = 20? What do you make of this result? Are there any students in the sample with these values of the explanatory variables?

The predicted atndrte is

atndrte = 75.7 + 17.26 (3.65) – 1.717 (20) = 104.3

Given that this represents an attendance rate greater than one hundred percent, it can be interpreted as being within the residuals for attendance of 100%. There is one student in the data set with priGPA = 3.65 and ACT = 20, and that student has atndrte = 87.5.

(v) If Student A has priGPA = 3.1 and ACT = 21 and Student B has priGPA = 2.1 and ACT = 26, what is the predicted difference in their attendance rates?

The difference is

Δatndrte = 17.26 (3.1 – 2.1) – 1.717 (21 – 26) = 25.8

References

Problems and data from Wooldridge Introductory Econometrics: A Modern Approach, 4e.

Listing of the R program used to answer these questions.

# 
#  25 Sep 2012 D.S.Dixon
# 
# 
# ATTEND.DES
# 
# attend    termGPA   priGPA    ACT       final     atndrte   hwrte     frosh    
# soph      skipped   stndfnl   
# 
# 
#   Obs:   680
# 
#   1. attend                   classes attended out of 32
#   2. termGPA                  GPA for term
#   3. priGPA                   cumulative GPA prior to term
#   4. ACT                      ACT score
#   5. final                    final exam score
#   6. atndrte                  percent classes attended
#   7. hwrte                    percent homework turned in
#   8. frosh                    =1 if freshman
#   9. soph                     =1 if sophomore
#  10. skipped                  number of classes skipped
#  11. stndfnl                  (final - mean)/sd
# 

mydata <- read.table("attend.csv", sep=",",  header = TRUE, na.strings = ".")

print("atndrte:")
print(summary(mydata$atndrte))
print("priGPA:")
print(summary(mydata$priGPA))
print("ACT:")
print(summary(mydata$ACT))

myfit<-lm(atndrte~priGPA+ACT, data=mydata)

print(summary(myfit))

print(predict(myfit,data.frame(priGPA=3.65,ACT=20)))

print(mydata[mydata$priGPA==3.65 & mydata$ACT==20,])

print(predict(myfit,data.frame(priGPA=3.1,ACT=21)))

print(predict(myfit,data.frame(priGPA=2.1,ACT=26)))

Homework 4

Problem C2.6

(i) Do you think each additional dollar spent has the same effect on the pass rate, or does a diminishing effect seem more appropriate? Explain.

With cross-sectional data covering a wide range of spending levels, ceteris paribus, an additional dollar spent at a school in an upper-middle-class neighborhood is likely to have much less impact than a dollar spent at a low-income school. This argues for a diminishing effect of dollars spent on math test pass rate.

(ii) In the population model

math10 = β0 + β1 ln(expend) + u

argue that β1/10 is the percentage point change in math10 given a 10% increase in expend.

Note that

Δmath10 = β1 Δln(expend) ≈ β1 (Δexpend/expend)

so if

Δexpend/expend = 10% = 1/10

then

Δmath10 ≈ β1 (1/10) = β1/10

(iii) Use the data in MEAP93.RAW to estimate the model from part (ii). Report the estimated equation in the usual way, including the sample size and R-squared.

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -69.341     26.530  -2.614 0.009290 ** 
lexpend       11.164      3.169   3.523 0.000475 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.35 on 406 degrees of freedom
Multiple R-squared: 0.02966,    Adjusted R-squared: 0.02727 
F-statistic: 12.41 on 1 and 406 DF,  p-value: 0.0004752

N = 408

The estimated equation is

math10 = -69.3 + 11.16 ln(expend)
n = 408, R2 = 0.0297

(iv) How big is the estimated spending effect? Namely, if spending increases by 10%, what is the estimated percentage point increase in math10?

For a 10% increase in spending, math10 is predicted to increase by about 11.16/10 ≈ 1.1 percentage points.

(v) One might worry that regression analysis can produce fitted values for math10 that are greater than 100. Why is this not much of a worry in this data set?

The summary for lexpend is

Min.   :8.111
1st Qu.:8.248
Median :8.330
Mean   :8.370
3rd Qu.:8.447
Max.   :8.912

so that even for the highest value of lexpend

math10 = -69.3 + 11.16 (8.912) = 30.2

That is, the highest predicted score is slightly more than 30%. Similarly, the lowest predicted score is

math10 = -69.3 + 11.16 (8.111) = 21.2
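
The same check can be made directly from the fitted model in the listing below:

    # smallest and largest fitted values of math10 from myfit
    range(fitted(myfit))   # roughly 21 to 30, nowhere near 100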

References

Problems and data from Wooldridge Introductory Econometrics: A Modern Approach, 4e.

Listing of the R program used to answer these questions.

# 
#  30 Sep 2012 D.S.Dixon
# 
# MEAP93.DES
# 
# lnchprg   enroll    staff     expend    salary    benefits  droprate  
# gradrate math10    sci11     totcomp   ltotcomp  lexpend   lenroll   lstaff    bensal   
# lsalary   
# 
#   Obs:   408
# 
#   1. lnchprg                  perc. of studs. in sch. lunch prog.
#   2. enroll                   school enrollment
#   3. staff                    staff per 1000 students
#   4. expend                   expend. per stud., $
#   5. salary                   avg. teacher salary, $
#   6. benefits                 avg. teacher benefits, $
#   7. droprate                 school dropout rate, perc
#   8. gradrate                 school graduation rate, perc
#   9. math10                   perc studs passing MEAP math
#  10. sci11                    perc studs passing MEAP science
#  11. totcomp                  salary + benefits
#  12. ltotcomp                 log(totcomp)
#  13. lexpend                  log of expend
#  14. lenroll                  log(enroll)
#  15. lstaff                   log(staff)
#  16. bensal                   benefits/salary
#  17. lsalary                  log(salary)
# 

mydata <- read.table("MEAP93.csv", sep=",", header = TRUE, na.strings = ".")

myfit<-lm(math10~lexpend, data=mydata)
print(summary(myfit))

print(paste("N =",length(myfit$residuals)))

print(summary(mydata))

Homework 3

Problem C2.4

(i) Find the average salary and average IQ in the sample. What is the sample standard deviation of IQ? (IQ scores are standardized so that the average in the population is 100 with a standard deviation equal to 15.)

The mean of wage is $957.90. The distribution of wage is shown in the next figure. There is no obvious censoring, or pile-up at the extrema.

The mean of IQ is 101.3, with sample standard deviation of 15.05. The distribution of IQ is shown in the next figure. There is no obvious censoring.

Note that the sampling standard deviation for IQ is 15/sqrt(935) = 0.491. The sample mean is 2.65 sampling standard deviations from the population mean.
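
A quick check of this arithmetic in R:

    15 / sqrt(935)          # standard error of the sample mean of IQ, about 0.491
    (101.3 - 100) / 0.491   # about 2.65 standard errors above the population mean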

(ii) Estimate a simple regression model where a one-point increase in IQ changes wage by a constant dollar amount. Use this model to find the predicted increase in wage for an increase in IQ of 15 points. Does IQ explain most of the variation in wage?

The regression model is

wage = β0 + β1 IQ + u

The regression results are:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 116.9916    85.6415   1.366    0.172    
IQ            8.3031     0.8364   9.927   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 384.8 on 933 degrees of freedom
Multiple R-squared: 0.09554,    Adjusted R-squared: 0.09457 
F-statistic: 98.55 on 1 and 933 DF,  p-value: < 2.2e-16

In this model, β1 is the change in monthly wage as a function of a one point change in IQ. Based on this regression, a 15 point increase in IQ results in a 15 * $8.30 = $124.50 increase in monthly wage.

The R-squared of the fit indicates that IQ explains only about 9.6% of the variation in wage, so IQ does not explain most of the variation.

(iii) Now, estimate a model where each one-point increase in IQ has the same percentage effect on wage. If IQ increases by 15 points, what is the approximate percentage increase in predicted wage?

This regression model is

ln(wage) = β0 + β1 IQ + u

The regression results are:

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 5.8869942  0.0890206   66.13   <2e-16 ***
IQ          0.0088072  0.0008694   10.13   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3999 on 933 degrees of freedom
Multiple R-squared: 0.09909,    Adjusted R-squared: 0.09813 
F-statistic: 102.6 on 1 and 933 DF,  p-value: < 2.2e-16

In this model, β1 is the proportional change in monthly wage per one point change in IQ. Based on this regression, a 15 point increase in IQ increases log(wage) by 15 * 0.00881 = 0.132, or approximately a 13.2% increase in monthly wage.
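
Because 0.132 is a nontrivial change in logs, the exact percentage increase is a bit larger than this approximation:

100 [exp(0.132) - 1] ≈ 14.1%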

For comparison with part (ii), evaluated at the mean wage this represents an increase of about $126.60.

References

Problems and data from Wooldridge Introductory Econometrics: A Modern Approach, 4e.

Listing of the R program used to answer these questions.

# 
#  17 Sep 2012 D.S.Dixon
# 
#
#
#  Obs:   935
#
#  1. wage                     monthly earnings
#  2. hours                    average weekly hours
#  3. IQ                       IQ score
#  4. KWW                      knowledge of world work score
#  5. educ                     years of education
#  6. exper                    years of work experience
#  7. tenure                   years with current employer
#  8. age                      age in years
#  9. married                  =1 if married
# 10. black                    =1 if black
# 11. south                    =1 if live in south
# 12. urban                    =1 if live in SMSA
# 13. sibs                     number of siblings
# 14. brthord                  birth order
# 15. meduc                    mother's education
# 16. feduc                    father's education
# 17. lwage                    natural log of wage
#

mydata <- read.table("wage2.csv", sep=",", header = TRUE, na.strings = ".")

print(summary(mydata))

print(sd(mydata$IQ))

## make a histogram of wage
png("682eeb7d6fc88c363ace4cc5d2ddcd3a.png")
hist(mydata$wage,breaks=50)
dev.off()

## make a histogram of IQ
png("5c6185bdce0b6fbb6191ecf199d18ea3.png")
hist(mydata$IQ,breaks=50)
dev.off()

myfit<-lm(wage~IQ, data=mydata)
print(summary(myfit))

mylfit<-lm(lwage~IQ, data=mydata)
print(summary(mylfit))

Homework 2

Problem 2.2

In the simple linear regression model y = β0 + β1 x + u, suppose that E(u) ≠ 0. Letting α0 = E(u), show that the model can always be rewritten with the same slope, but a new intercept and error, where the new error has a zero expected value.

Let

u = α0 + v

where

E(v) = 0

Now

y = (α0 + β0) + β1 x + v

which has the same slope β1, but intercept α0 + β0, and error v such that E(v) = 0.

Problem 2.4

The data set BWGHT.RAW contains data on births to women in the United States. Two variables of interest are the dependent variable, infant birth weight in ounces (bwght), and an explanatory variable, average number of cigarettes the mother smoked per day during pregnancy (cigs). The following simple regression was estimated using data on n = 1,388 births:

bwght = 119.77 – 0.514 cigs

(i) What is the predicted birth weight when cigs = 0? What about when cigs = 20 (one pack per day)? Comment on the difference.

This model predicts a birth weight of 119.77 ounces with cigs = 0, and 109.49 ounces when cigs = 20. That is, a pack a day lowers birth weight by slightly more than ten ounces.

(ii) Does this simple regression necessarily capture the causal relationship between the child’s birth weight and the mother’s smoking habits? Explain.

Birth weight is likely to be a function of many factors. Smoking may be one of them, or it may simply be correlated with other factors that matter. That is, this simple regression does not necessarily capture a causal relationship.

(iii) To predict a birth weight of 125 ounces, what would cigs have to be? Comment.

For this model to predict a birth weight of 125 ounces, cigs would have to be about -10.18. A birth weight of 125 ounces, therefore, is probably in the residual for cigs = 0.

(iv) The proportion of women in the sample who do not smoke while pregnant is about 0.85. Does this help reconcile your finding from part (iii)?

For 85% of the women in this study, smoking does not predict any of the variation in birth weight, supporting the notion that a birth weight of 125 ounces is in the residual for cigs = 0.

References

Problems and data from Wooldridge Introductory Econometrics: A Modern Approach, 4e.

Homework 1

Problem C1.2

(i) How many women are in the sample, and how many report smoking during pregnancy?

There are 1,388 women in the sample, of whom 212 report smoking during pregnancy.

(ii) What is the average number of cigarettes smoked per day? Is the average a good measure of the “typical” woman in this case? Explain.

The sample mean of cigarettes smoked per day is 2.09. This, however, is not representative of a typical woman in the survey, as illustrated by the histogram

Note that the majority of women in this sample did not smoke at all. Thus, the typical woman in this sample is a non-smoker.

(iii) Among women who smoked during pregnancy, what is the average number of cigarettes smoked per day? How does this compare with your answer from part (ii), and why?

There are 212 smokers in the sample. Considering only the smokers, the mean number of cigarettes smoked per day is 13.67. This is very different from the sample mean because 85% of the women in the sample are non-smokers.

(iv) Find the average of fatheduc in the sample. Why are there only 1,192 observations used to compute this average?

The average education level for the fathers is 13.19. There are 196 missing values in this column, meaning that 196 mothers did not disclose the education level of the father of the fetus. Possible reasons for this include not knowing who the father is, not having this information about the father, and not wanting to disclose anything about the father.

(v) Report the average family income and its standard deviation in dollars.

The average sample family income is $29,000 with a standard deviation of $18,740. This, however, may not be a good indication of the sample, as illustrated by this histogram

Note that faminc is $65k for 192 observations. This is probably the result of $65k being the highest category, with the meaning, “greater than or equal to $65k”. Similarly, the lowest faminc is $1k, probably meaning “$1k or less”.

References

Problems and data from Wooldridge Introductory Econometrics: A Modern Approach, 4e.

R Program

Here’s the R program I used to answer these questions

# 
#  28 Aug 2012 D.S.Dixon
# 
# This is the dataset for the first EC 460 homework assignment
#

# the description from is from BWGHT.DES
#  1. faminc                   1988 family income, $1000s
#  2. cigtax                   cig. tax in home state, 1988
#  3. cigprice                 cig. price in home state, 1988
#  4. bwght                    birth weight, ounces
#  5. fatheduc                 father's yrs of educ
#  6. motheduc                 mother's yrs of educ
#  7. parity                   birth order of child
#  8. male                     =1 if male child
#  9. white                    =1 if white
# 10. cigs                     cigs smked per day while preg
# 11. lbwght                   log of bwght
# 12. bwghtlbs                 birth weight, pounds
# 13. packs                    packs smked per day while preg
# 14. lfaminc                  log(faminc)
#

mydata <- read.table("BWGHT.raw",  header = FALSE, na.strings = ".", col.names=c(
        "faminc",
        "cigtax",
        "cigprice",
        "bwght",
        "fatheduc",
        "motheduc",
        "parity",
        "male",
        "white",
        "cigs",
        "lbwght",
        "bwghtlbs",
        "packs",
        "lfaminc"))

print(paste("There are ", length(mydata$bwght), " samples in the data"))

print(paste("The mean number of cigarettes per day is ", mean(mydata$cigs)))

## make a histogram of cigarettes per day
png("8990a8170515ef9b5730fe9573cf4d6c.png")
hist(mydata$cigs,breaks=50)
dev.off()

print(paste("There are ", length(mydata$cigs[mydata$cigs>0]), " smokers in the sample data"))

print(paste("Considering only smokers, the mean number  of cigarettes per day is ", mean(mydata$cigs[mydata$cigs>0])))

print(paste("The mean of fatheduc is ", mean(na.omit(mydata$fatheduc))))

print(paste("There are ", (length(mydata$bwght) - length(na.omit(mydata$fatheduc)))," samples with missing fatheduc data"))

print(paste("The mean of faminc is ", mean(mydata$faminc)))
print(paste("The standard deviation of faminc is ", sd(mydata$faminc)))

## make a histogram of family income
png("d4f18cdd2204524640311f5ab124e086.png")
hist(mydata$faminc,breaks=50)
dev.off()