Homework 2

Problem 2.2

In the simple linear regression model y = β0 + β1 x + u, suppose that E(u) ≠ 0. Letting α0 = E(u), show that the model can always be rewritten with the same slope, but a new intercept and error, where the new error has a zero expected value.

Let

u = α0 + v

where

E(v) = 0

Now

y = (α0 + β0) + β1 x + v

which has the same slope β1, but intercept α0 + β0, and error v such that E(v) = 0.
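The algebra above can also be checked numerically. What follows is a minimal R sketch, not part of the original assignment: the values of beta0, beta1, and alpha0 are made up for illustration, and the errors are drawn from a normal distribution purely for convenience. It simulates a model whose error has mean alpha0 and shows that OLS recovers the slope β1 together with the shifted intercept α0 + β0.

```r
# Illustrative values only (assumptions, not taken from the problem)
set.seed(460)
n      <- 10000
beta0  <- 2      # original intercept
beta1  <- 0.5    # slope
alpha0 <- 3      # E(u), deliberately nonzero

x <- rnorm(n, mean = 5, sd = 2)
u <- rnorm(n, mean = alpha0, sd = 1)   # error with E(u) = alpha0
y <- beta0 + beta1 * x + u

# Rewrite the error as u = alpha0 + v, so that v has mean zero
v <- u - alpha0

fit <- lm(y ~ x)
print(coef(fit))   # intercept should be near alpha0 + beta0, slope near beta1
print(mean(v))     # should be near zero
```

The regression cannot tell β0 and α0 apart; only their sum shows up as the intercept, which is why assuming E(u) = 0 costs nothing as long as the model includes an intercept.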

Problem 2.4

The data set BWGHT.RAW contains data on births to women in the United States. Two variables of interest are the dependent variable, infant birth weight in ounces (bwght), and an explanatory variable, average number of cigarettes the mother smoked per day during pregnancy (cigs). The following simple regression was estimated using data on n = 1,388 births:

bwght = 119.77 – 0.514 cigs

(i) What is the predicted birth weight when cigs = 0? What about when cigs = 20 (one pack per day)? Comment on the difference.

This model predicts a birth weight of 119.77 ounces with cigs = 0, and 109.49 ounces when cigs = 20. That is, a pack a day lowers predicted birth weight by 20 × 0.514 = 10.28 ounces, slightly more than ten ounces.
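The two predictions are simple arithmetic on the fitted line. As a small sketch, using only the reported coefficients (the function name predict_bwght is my own, for illustration):

```r
# Coefficients as reported in the estimated equation
b0 <- 119.77    # intercept, in ounces
b1 <- -0.514    # change in predicted ounces per cigarette per day

# Predicted birth weight for a given number of cigarettes per day
predict_bwght <- function(cigs) b0 + b1 * cigs

pred_0  <- predict_bwght(0)
pred_20 <- predict_bwght(20)

# Predictions at 0 and 20 cigarettes, and the difference between them
print(c(pred_0, pred_20, pred_0 - pred_20))
```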

(ii) Does this simple regression necessarily capture the causal relationship between the child’s birth weight and the mother’s smoking habits? Explain.

Birth weight is likely to be a function of many factors. Smoking may be one of those, or it may just be correlated with other factors that affect birth weight. That is, this simple regression does not necessarily capture a causal relationship.

(iii) To predict a birth weight of 125 ounces, what would cigs have to be? Comment.

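For this model to predict a birth weight of 125 ounces, cigs would have to be about -10.18, since (125 - 119.77) / (-0.514) ≈ -10.18. A negative number of cigarettes is impossible, so the fitted line never predicts 125 ounces. A birth weight of 125 ounces, therefore, is probably in the residual for cigs = 0: it lies about 5.23 ounces above the prediction of 119.77.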
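Inverting the fitted line makes the problem concrete. A short sketch, again using only the reported coefficients (the variable names are mine):

```r
b0 <- 119.77    # reported intercept, ounces
b1 <- -0.514    # reported slope, ounces per cigarette per day

target <- 125   # birth weight we would like the model to predict

# Solve target = b0 + b1 * cigs for cigs
cigs_needed <- (target - b0) / b1
print(round(cigs_needed, 2))

# For a non-smoker (cigs = 0), a 125-ounce baby sits this far above the line
residual_at_zero <- target - b0
print(residual_at_zero)
```

Because the slope is negative and the intercept is below 125, any target above 119.77 ounces requires a negative value of cigs.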

(iv) The proportion of women in the sample who do not smoke while pregnant is about 0.85. Does this help reconcile your finding from part (iii)?

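For about 85% of the women in this study, cigs = 0, so the model predicts the same birth weight of 119.77 ounces for all of them, and smoking does not predict any of the variation in birth weight among them. This supports the notion that a birth weight of 125 ounces is in the residual for cigs = 0.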

References

Problems and data from Wooldridge Introductory Econometrics: A Modern Approach, 4e.

Homework 1

Problem C1.2

(i) How many women are in the sample, and how many report smoking during pregnancy?

There are 1,388 women in the sample, of whom 212 report smoking during pregnancy.

(ii) What is the average number of cigarettes smoked per day? Is the average a good measure of the “typical” woman in this case? Explain.

The sample mean of cigarettes smoked per day is 2.09. This, however, is not representative of a typical woman in the survey, as illustrated by a histogram of cigs.

Note that the majority of women in this sample did not smoke at all. Thus, the typical woman in this sample is a non-smoker.

(iii) Among women who smoked during pregnancy, what is the average number of cigarettes smoked per day? How does this compare with your answer from part (ii), and why?

There are 212 smokers in the sample. Considering only the smokers, the mean number of cigarettes smoked per day is 13.67. This is very different from the sample mean because 85% of the women in the sample are non-smokers.
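The link between the two averages is just a weighted mean: non-smokers contribute zeros. The sketch below uses the counts reported above and the unrounded smoker mean that R reports (13.66509). The stylized vector, in which every smoker smokes exactly the smoker average, is an assumption made only so that a median can be computed without the data file.

```r
n_total      <- 1388       # women in the sample
n_smokers    <- 212        # women with cigs > 0
mean_smokers <- 13.66509   # mean cigs per day among smokers

share_smokers <- n_smokers / n_total

# Overall mean = weighted mean of the smoker mean and zero for non-smokers
overall_mean <- share_smokers * mean_smokers + (1 - share_smokers) * 0
print(round(overall_mean, 2))

# Stylized data (assumption): each smoker smokes exactly the smoker average
cigs_stylized <- c(rep(0, n_total - n_smokers), rep(mean_smokers, n_smokers))
print(median(cigs_stylized))   # the median woman is a non-smoker
```

With roughly 85% zeros, the median is 0 no matter how the smokers' values are distributed, which is why the median describes the typical woman better than the mean here.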

(iv) Find the average of fatheduc in the sample. Why are there only 1,192 observations used to compute this average?

The average education level for the fathers is 13.19 years. There are 196 missing values in this column (1,388 - 1,192 = 196), meaning that 196 mothers did not report the father's education level. Possible reasons for this include not knowing who the father is, not having this information about the father, and not wanting to disclose anything about the father.

(v) Report the average family income and its standard deviation in dollars.

The average sample family income is about $29,027 with a standard deviation of about $18,739. This, however, may not be a good indication of the sample, as illustrated by a histogram of faminc.

Note that faminc is $65k for 192 observations. This is probably the result of $65k being the highest category, with the meaning, “greater than or equal to $65k”. Similarly, the lowest faminc is $1k, probably meaning “$1k or less”.
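Since faminc is recorded in thousands of dollars, both the mean and the standard deviation have to be multiplied by 1,000 to be reported in dollars. A small sketch using the unrounded values R reports for this sample; the toy vector at the end is hypothetical and only confirms that the standard deviation scales the same way as the mean.

```r
mean_thousands <- 29.02666   # mean(faminc), in $1000s
sd_thousands   <- 18.73928   # sd(faminc), in $1000s

# Convert from thousands of dollars to dollars
mean_dollars <- 1000 * mean_thousands
sd_dollars   <- 1000 * sd_thousands
print(round(c(mean_dollars, sd_dollars)))

# Toy check (hypothetical values): sd(c * x) equals c * sd(x) for c > 0
toy <- c(1, 10, 29, 65)
scales_linearly <- isTRUE(all.equal(sd(1000 * toy), 1000 * sd(toy)))
print(scales_linearly)
```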


R Program

Here’s the R program I used to answer these questions

# 
#  28 Aug 2012 D.S.Dixon
# 
# This is the dataset for the first EC 460 homework assignment
#

# the variable descriptions below are from BWGHT.DES
#  1. faminc                   1988 family income, $1000s
#  2. cigtax                   cig. tax in home state, 1988
#  3. cigprice                 cig. price in home state, 1988
#  4. bwght                    birth weight, ounces
#  5. fatheduc                 father's yrs of educ
#  6. motheduc                 mother's yrs of educ
#  7. parity                   birth order of child
#  8. male                     =1 if male child
#  9. white                    =1 if white
# 10. cigs                     cigs smked per day while preg
# 11. lbwght                   log of bwght
# 12. bwghtlbs                 birth weight, pounds
# 13. packs                    packs smked per day while preg
# 14. lfaminc                  log(faminc)
#

mydata <- read.table("BWGHT.raw",  header = FALSE, na.strings = ".", col.names=c(
        "faminc",
        "cigtax",
        "cigprice",
        "bwght",
        "fatheduc",
        "motheduc",
        "parity",
        "male",
        "white",
        "cigs",
        "lbwght",
        "bwghtlbs",
        "packs",
        "lfaminc"))

print(paste("There are ", length(mydata$bwght), " samples in the data"))

print(paste("The mean number of cigarettes per day is ", mean(mydata$cigs)))

## make a histogram of cigarettes per day
png("8990a8170515ef9b5730fe9573cf4d6c.png")
hist(mydata$cigs,breaks=50)
dev.off()

print(paste("There are ", length(mydata$cigs[mydata$cigs>0]), " smokers in the sample data"))

print(paste("Considering only smokers, the mean number of cigarettes per day is ", mean(mydata$cigs[mydata$cigs>0])))

print(paste("The mean of fatheduc is ", mean(na.omit(mydata$fatheduc))))

print(paste("There are ", (length(mydata$bwght) - length(na.omit(mydata$fatheduc)))," samples with missing fatheduc data"))

print(paste("The mean of faminc is ", mean(mydata$faminc)))
print(paste("The standard deviation of faminc is ", sd(mydata$faminc)))

## make a histogram of family income
png("d4f18cdd2204524640311f5ab124e086.png")
hist(mydata$faminc,breaks=50)
dev.off()

Full Regression Report

Prof. Dixon’s Econometrics Journal

A Full Regression Report

Here’s a way to get R regression output in a neat table via a file

  Source |            SS            df            MS
---------+------------------------------------------
   Model |      8.733634             5      1.746727
Residual |      1.642729           130    0.01263638
---------+------------------------------------------
   Total |      10.37636           135    0.07686195

Number of obs = 136
F(5, 130)     = 138.23
Prob > F      = 1.153922e-50
R-squared     = 0.8416854
Adj R-squared = 0.8355964
Root MSE      = 0.1124117

 lsalary |         Coef.     Std. Err.             t         P>|t|        [95% Conf. Interval]
---------+------------------------------------------------------------------------------------
    LSAT |   0.004696401   0.004010493      1.171028     0.2437294   -0.00323788    0.01263068
     GPA |     0.2475247    0.09003704      2.749143   0.006826024    0.06939718     0.4256522
 llibvol |    0.09499263    0.03325435      2.856547   0.004988002    0.02920287     0.1607824
   lcost |    0.03755438    0.03210608      1.169697     0.2442631   -0.02596366     0.1010724
    rank |  -0.003324603  0.0003484612     -9.540813  1.120853e-16  -0.004013992  -0.002635214
   _cons |      8.343233     0.5325192      15.66748  9.854423e-32      7.289708      9.396759
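The code that produced this report is not reproduced here, so the following is only a sketch of one way such a report could be assembled and written to a file in base R. It is not necessarily the method used for the table above. The built-in mtcars data and the model log(mpg) ~ wt + hp are stand-ins (assumptions), and the output path is a temporary file.

```r
# Stand-in data (an assumption): the built-in mtcars data set replaces the
# data behind the lsalary regression, which are not included in this post.
fit <- lm(log(mpg) ~ wt + hp, data = mtcars)
s   <- summary(fit)

n <- nobs(fit)                          # number of observations
k <- length(coef(fit)) - 1              # regressors, excluding the constant
y <- model.response(model.frame(fit))   # dependent variable as used in the fit

rss <- sum(residuals(fit)^2)            # residual sum of squares
tss <- sum((y - mean(y))^2)             # total sum of squares
ess <- tss - rss                        # model (explained) sum of squares

# Sum-of-squares block: SS, degrees of freedom, and mean squares
ss_tab <- data.frame(
  Source = c("Model", "Residual", "Total"),
  SS     = c(ess, rss, tss),
  df     = c(k, n - k - 1, n - 1)
)
ss_tab$MS <- ss_tab$SS / ss_tab$df

# Coefficient block: estimates, std. errors, t, p-values, 95% intervals
coef_tab <- cbind(s$coefficients, confint(fit, level = 0.95))

# Send everything that would be printed to a text file instead of the console
out_file <- tempfile(fileext = ".txt")  # hypothetical output location
sink(out_file)
print(ss_tab, row.names = FALSE)
cat("Number of obs =", n, "\n")
cat("F(", k, ",", n - k - 1, ") =", round(unname(s$fstatistic["value"]), 2), "\n")
cat("R-squared =", s$r.squared, "\n")
cat("Adj R-squared =", s$adj.r.squared, "\n")
cat("Root MSE =", s$sigma, "\n")
print(coef_tab)
sink()

cat(readLines(out_file), sep = "\n")    # show the saved report
```

In this sketch the constant is labeled (Intercept) rather than _cons, since that is the row name R assigns; renaming it is a matter of editing rownames(coef_tab) before printing.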

Here’s a way to get R regression output in a neat table

  Source |            SS            df            MS
---------+------------------------------------------
   Model |      8.733634             5      1.746727
Residual |      1.642729           130    0.01263638
---------+------------------------------------------
   Total |      10.37636           135    0.07686195

Number of obs = 136
F(5, 130)     = 138.23
Prob > F      = 1.153922e-50
R-squared     = 0.8416854
Adj R-squared = 0.8355964
Root MSE      = 0.1124117

 lsalary |         Coef.     Std. Err.             t         P>|t|        [95% Conf. Interval]
---------+------------------------------------------------------------------------------------
    LSAT |   0.004696401   0.004010493      1.171028     0.2437294   -0.00323788    0.01263068
     GPA |     0.2475247    0.09003704      2.749143   0.006826024    0.06939718     0.4256522
 llibvol |    0.09499263    0.03325435      2.856547   0.004988002    0.02920287     0.1607824
   lcost |    0.03755438    0.03210608      1.169697     0.2442631   -0.02596366     0.1010724
    rank |  -0.003324603  0.0003484612     -9.540813  1.120853e-16  -0.004013992  -0.002635214
   _cons |      8.343233     0.5325192      15.66748  9.854423e-32      7.289708      9.396759
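Every summary statistic in the report follows from the sums of squares and degrees of freedom, and each t statistic and confidence interval follows from a coefficient and its standard error. As a rough consistency check, the sketch below recomputes a few of them from the numbers printed above (the variable names are mine); small discrepancies are expected because the printed values are rounded.

```r
# Values copied from the report above
ss_model <- 8.733634; df_model <- 5
ss_resid <- 1.642729; df_resid <- 130
ss_total <- ss_model + ss_resid
df_total <- df_model + df_resid

f_stat   <- (ss_model / df_model) / (ss_resid / df_resid)   # F = MS_model / MS_resid
r2       <- ss_model / ss_total                             # R-squared
adj_r2   <- 1 - (ss_resid / df_resid) / (ss_total / df_total)
root_mse <- sqrt(ss_resid / df_resid)                       # Root MSE
print(c(F = f_stat, R2 = r2, adjR2 = adj_r2, RootMSE = root_mse))

# LSAT row: t statistic, two-sided p-value, and 95% confidence interval
b_lsat  <- 0.004696401
se_lsat <- 0.004010493
t_lsat  <- b_lsat / se_lsat
p_lsat  <- 2 * pt(-abs(t_lsat), df = df_resid)
ci_lsat <- b_lsat + c(-1, 1) * qt(0.975, df = df_resid) * se_lsat
print(c(t = t_lsat, p = p_lsat, lower = ci_lsat[1], upper = ci_lsat[2]))
```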

Homework 2

Problem 2.2

In the simple linear regression model y = β0 + β1 x + u, suppose that E(u) ≠ 0. Letting α0 = E(u), show that the model can always be rewritten with the same slope, but a new intercept and error, where the new error has a zero expected value.

Let

u = α0 + v

where

E(v) = 0

Now

y = (α0 + β0) + β1 x + v

which has the same slope β1, but intercept α0 + β0, and error v such that E(v) = 0.

Problem 2.4

The data set BWGHT.RAW contains data on births to women in the United States. Two variables of interest are the dependent variable, infant birth weight in ounces (bwght), and an explanatory variable, average number of cigarettes the mother smoked per day during pregnancy (cigs). The following simple regression was estimated using data on n = 1,388 births:

bwght = 119.77 – 0.514 cigs

  1. What is the predicted birth weight when cigs = 0? What about when cigs = 20 (one pack per day)? Comment on the difference.

This model predicts a birth weight of 119.77 ounces with cigs = 0, and 109.49 ounces when cigs = 20. That is, a pack a day lowers predicted birth weight by 20 × 0.514 = 10.28 ounces, slightly more than ten ounces.

  2. Does this simple regression necessarily capture the causal relationship between the child’s birth weight and the mother’s smoking habits? Explain.

Birth weight is likely to be a function of many factors. Smoking may be one of those, or it may just accompany other factors. That is, this simple regression does not necessarily capture a causal relationship.

  3. To predict a birth weight of 125 ounces, what would cigs have to be? Comment.

For this model to predict a birth weight of 125 ounces, cigs would have to be about -10.18, which is impossible. Since the third quartile for bwght is 132 ounces, and the max is 271 ounces, 125 ounces is certainly within the range of the data set. A birth weight of 125 ounces, therefore, is probably in the residual for cigs = 0.

  4. The proportion of women in the sample who do not smoke while pregnant is about 0.85. Does this help reconcile your finding from part 3?

For about 85% of the women in this study, cigs = 0, so the model predicts the same birth weight of 119.77 ounces for all of them, and smoking does not predict any of the variation in birth weight among them. This supports the notion that a birth weight of 125 ounces is in the residual for cigs = 0.

Econometrics Blog

A Discussion of Problem C1.2

(i) How many women are in the sample, and how many report smoking during pregnancy?

The number of women in the sample is

> print(length(mydata$bwght))
[1] 1388

(ii) What is the average number of cigarettes smoked per day? Is the average a good measure of the “typical” woman in this case? Explain.

The sample mean of cigarettes smoked per day is

> print(mean(mydata$cigs))
[1] 2.087176

This, however, is not representative of a typical woman in the survey, as illustrated by the histogram

hist(mydata$cigs,breaks=50)

Note that the majority of women in this sample did not smoke at all. Thus, the typical woman in this sample is a non-smoker.

(iii) Among women who smoked during pregnancy, what is the average number of cigarettes smoked per day? How does this compare with your answer from part (ii), and why?

There are

> print(length(mydata$cigs[mydata$cigs>0]))
[1] 212

smokers in the sample. Considering only the smokers, the mean number of cigarettes smoked per day is

> print(mean(mydata$cigs[mydata$cigs>0]))
[1] 13.66509

This is very different from the sample mean because 85% of the women in the sample are non-smokers.

(iv) Find the average of fatheduc in the sample. Why are there only 1,192 observations used to compute this average?

The average education level for the fathers is

> print(mean(na.omit(mydata$fatheduc)))
[1] 13.18624

There are 196 missing values in this column (1,388 - 1,192 = 196), meaning that 196 mothers did not report the father's education level. Possible reasons for this include not knowing who the father is, not having this information about the father, and not wanting to disclose anything about the father.
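Counting the missing values directly is a useful cross-check on the 1,388 - 1,192 arithmetic. The sketch below uses a small made-up vector (fatheduc_toy is hypothetical, not the real column) to show the idea, and also that mean(x, na.rm = TRUE) gives the same answer as mean(na.omit(x)).

```r
# Hypothetical stand-in for mydata$fatheduc; NA marks a missing value
fatheduc_toy <- c(12, NA, 16, 14, NA, 12)

n_missing <- sum(is.na(fatheduc_toy))    # observations with no value
n_used    <- sum(!is.na(fatheduc_toy))   # observations that enter the mean

mean_omit <- mean(na.omit(fatheduc_toy))        # style used in this post
mean_narm <- mean(fatheduc_toy, na.rm = TRUE)   # equivalent alternative

print(c(missing = n_missing, used = n_used, mean = mean_narm))
```

Without either na.omit() or na.rm = TRUE, mean() returns NA as soon as a single value is missing.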

(v) Report the average family income and its standard deviation in dollars.

The average sample family income is

> print(mean(mydata$faminc))
[1] 29.02666

thousand dollars with standard deviation

> print(sd(mydata$faminc))
[1] 18.73928

thousand dollars. This, however, may not be a good indication of the sample, as illustrated by a histogram of faminc.

Note that faminc is $65k for 192 observations. This is probably the result of $65k being the highest category, with the meaning, “greater than or equal to $65k”. Similarly, the lowest faminc is $1k, probably meaning “$1k or less”.
