Statistics 211 Project
Due Date: May 4, 2017
1 Project Background
Zipf's law states that given some corpus of natural language utterances, the frequency of
any word is inversely proportional to its rank in the frequency table. Suppose we counted
the number of times each word was used in the written works by Shakespeare, Alexander
Hamilton, James Madison, or some other author with a substantial written record. Can we
say anything about the frequencies of the most common words?
We let $f_i$ be the rate per 1000 words of text for the $i$th most frequent word used. The
American linguist George Zipf (1902-1950) observed an interesting law-like relationship
between rate and rank,
$$E[f_i \mid i] = \frac{a}{i^b}, \qquad (1)$$
and further observed that the exponent $b \approx 1$. Taking logarithms of both sides, we have
approximately,
$$E[Y_i \mid i] = c - b \log(i), \qquad (2)$$
where $Y_i = \log(f_i)$ and $c = \log(a)$.
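The step from (1) to (2) can be checked numerically. The sketch below uses made-up values of a and b (they are assumptions for illustration, not estimates from any data set) and noise-free rates, so a regression of log(f) on log(i) should return an intercept of log(a) and a slope of -b.

```r
# Hypothetical parameter values, chosen only for illustration.
a <- 60
b <- 1
i <- 1:50

# Equation (1) with no noise: rate is inversely proportional to rank^b.
f <- a / i^b

# Equation (2): regress log(rate) on log(rank).
# The intercept estimates c = log(a); the slope estimates -b.
fit <- lm(log(f) ~ log(i))
coef(fit)
```

Note the sign: because (2) is written as c - b log(i), the fitted slope is an estimate of -b, not of b.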
Zipf's law has been applied to frequencies of many other classes of objects besides words,
such as the frequency of visits to web pages on the internet and the frequencies of species of
insects in an ecosystem.
2 The Data
The data in words.txt give the frequencies of words in works from four different sources:
the political writings of eighteenth-century American political figures Alexander Hamilton,
James Madison, and John Jay, and the book Ulysses by twentieth-century Irish writer James
Joyce. The data are from Mosteller and Wallace (1964, Table 8.1-1), and give the frequencies
of 165 very common words. Several missing values occur in the data; these are really words
that were used so infrequently that their count was not reported in Mosteller and Wallace's
table. The following table provides a description of the variables in the data set words.txt.
Word          The word
Hamilton      Rate per 1000 words of this word in the writings of Alexander Hamilton
HamiltonRank  Rank of this word in Hamilton's writings
Madison       Rate per 1000 words of this word in the writings of James Madison
MadisonRank   Rank of this word in Madison's writings
Jay           Rate per 1000 words of this word in the writings of John Jay
JayRank       Rank of this word in Jay's writings
Ulysses       Rate per 1000 words of this word in Ulysses by James Joyce
UlyssesRank   Rank of this word in Ulysses
3 Assignment
There are four parts to this assignment. You will complete this assignment in pairs and
submit one report.
Part 1: Using only the 50 most frequent words in Hamilton's work (that is, using
only rows in the data for which HamiltonRank $\le$ 50), draw the appropriate summary graph,
estimate the mean function in (2), construct a 95% confidence interval for $b$, and summarize
your results.
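A minimal sketch of the Part 1 workflow follows. Since words.txt is not reproduced here, the data frame is a simulated stand-in: the column names Hamilton and HamiltonRank come from the variable table, but the rates themselves are generated from assumed parameter values and are not real data.

```r
set.seed(211)

# Synthetic stand-in for words.txt (simulated rates, NOT the real data).
words <- data.frame(HamiltonRank = 1:165)
words$Hamilton <- exp(log(60) - 1 * log(words$HamiltonRank) +
                      rnorm(165, sd = 0.1))

# Keep only the 50 most frequent words.
top50 <- subset(words, HamiltonRank <= 50)

# Summary graph: log(rate) against log(rank), with the fitted line.
plot(log(top50$HamiltonRank), log(top50$Hamilton),
     xlab = "log(rank)", ylab = "log(rate per 1000 words)")
fit <- lm(log(Hamilton) ~ log(HamiltonRank), data = top50)
abline(fit)
summary(fit)

# The fitted slope estimates -b, so negate the interval for the slope
# and reverse its endpoints to obtain an interval for b.
ci_slope <- confint(fit, "log(HamiltonRank)", level = 0.95)
ci_b <- rev(-as.numeric(ci_slope))
ci_b
```

With the real data, the first two statements would be replaced by a call to read.table on words.txt; the rest should carry over unchanged, assuming the columns are named as in the variable table.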
Part 2: Use the following residual bootstrap method to construct a confidence interval
for $b$. Let $\hat{c}$ and $\hat{b}$ be the least squares estimators for $c$ and $b$ in equation (2) respectively.
Compute the residuals as
$$\hat{e}_i = Y_i - \hat{c} + \hat{b} \log(i), \quad 1 \le i \le 50. \qquad (3)$$
Draw 50 samples $\{e_1^*, \ldots, e_{50}^*\}$ with replacement from the residuals $\{\hat{e}_1, \ldots, \hat{e}_{50}\}$. Given
$\{e_1^*, \ldots, e_{50}^*\}$, construct the bootstrap sample through
$$Y_i^* = \hat{c} - \hat{b} \log(i) + e_i^*, \quad 1 \le i \le 50. \qquad (4)$$
Run a linear regression based on the bootstrap sample $\{Y_i^*, \log(i)\}_{i=1}^{50}$ and let $\hat{b}^*$ be the
corresponding least squares estimator for $b$. Repeat the above procedure 1000 times and sort
the bootstrap estimators for $b$ from the smallest to the largest as:
$$\hat{b}_1^* \le \hat{b}_2^* \le \cdots \le \hat{b}_{999}^* \le \hat{b}_{1000}^*.$$
Then, the 95% bootstrap confidence interval for $b$ is defined as $[\hat{b}_{25}^*, \hat{b}_{975}^*]$. Compare this
interval with the confidence interval obtained in Part 1. Make a histogram of the $\hat{b}_i^*$ values
and include it in your report.
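The residual bootstrap in (3) and (4) can be sketched as follows. The response Y here is simulated from assumed parameter values as a stand-in for the log rates of Hamilton's 50 most frequent words; the variable names (x, b_star, ci_boot, and so on) are illustrative choices, not part of the assignment.

```r
set.seed(211)

# Synthetic stand-in for the log rates of the top 50 words (NOT real data).
x <- log(1:50)
Y <- log(60) - 1 * x + rnorm(50, sd = 0.1)

# Least squares fit of equation (2); the slope of the fit is -b.
fit   <- lm(Y ~ x)
c_hat <- coef(fit)[[1]]
b_hat <- -coef(fit)[[2]]
e_hat <- resid(fit)              # equation (3): Y - c_hat + b_hat * x

B <- 1000
b_star <- numeric(B)
for (k in 1:B) {
  # Resample the residuals with replacement.
  e_star <- sample(e_hat, length(e_hat), replace = TRUE)
  # Equation (4): rebuild the response from the fitted line.
  Y_star <- c_hat - b_hat * x + e_star
  # Refit and store the bootstrap estimate of b.
  b_star[k] <- -coef(lm(Y_star ~ x))[[2]]
}

# 95% interval from the 25th and 975th sorted bootstrap estimates.
b_sorted <- sort(b_star)
ci_boot  <- c(b_sorted[25], b_sorted[975])
ci_boot

hist(b_star, main = "Bootstrap estimates of b", xlab = "b*")
```

Because the log ranks are held fixed and only the residuals are resampled, every bootstrap sample shares the same design as the original fit.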
Part 3: Repeat Part 1, but for words with rank of 75 or less, and with rank less than
100. For larger numbers of words, Zipf's law may break down. Does that seem to happen
with these data?
Part 4: Recall that $\hat{c}$ and $\hat{b}$ are the least squares estimators in Part 2. Denote by $\hat{\sigma}^2$ the
sample variance of the residuals $\{\hat{e}_1, \ldots, \hat{e}_{50}\}$. Generate simulated data from the following
model,
$$Y_i = \hat{c} - \hat{b} \log(i) + e_i, \qquad (5)$$
where $e_i$ follows the normal distribution with mean zero and variance $\hat{\sigma}^2$. Based on the
simulated data $(Y_i, \log(i))$, test the null hypothesis $H_0: b = \hat{b}$ versus the alternative
$H_a: b \ne \hat{b}$ at the 5% level. Note that the true slope is indeed equal to $\hat{b}$ (i.e., the null
hypothesis is true) according to the way the data are generated. Simulate 1000 data sets
according to (5) and count the number of times that the true null hypothesis is rejected.
Report your finding. Now if you test the null $H_0: b = 0.96$ versus the alternative
$H_a: b \ne 0.96$, what is the number of rejections among the 1000 simulated data sets?
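One possible way to organize the Part 4 simulation is sketched below. The starting data are simulated from assumed parameter values (not real data), the helper reject() is a hypothetical name introduced here, and the two-sided t test with n - 2 degrees of freedom is an assumption about which test is intended.

```r
set.seed(211)

# Synthetic stand-in for the log rates of the top 50 words (NOT real data).
x <- log(1:50)
Y <- log(60) - 1 * x + rnorm(50, sd = 0.1)

fit        <- lm(Y ~ x)
c_hat      <- coef(fit)[[1]]
b_hat      <- -coef(fit)[[2]]        # slope of the fit is -b
sigma2_hat <- var(resid(fit))        # sample variance of the residuals

# Hypothetical helper: two-sided t test of H0: b = b0 at the given level.
reject <- function(Y_sim, b0, level = 0.05) {
  s <- summary(lm(Y_sim ~ x))$coefficients
  t_stat <- (-s["x", "Estimate"] - b0) / s["x", "Std. Error"]
  abs(t_stat) > qt(1 - level / 2, df = length(x) - 2)
}

n_sim    <- 1000
rej_true <- logical(n_sim)   # tests of H0: b = b_hat (true by construction)
rej_096  <- logical(n_sim)   # tests of H0: b = 0.96
for (k in 1:n_sim) {
  # Equation (5): normal errors with variance sigma2_hat.
  Y_sim <- c_hat - b_hat * x + rnorm(length(x), sd = sqrt(sigma2_hat))
  rej_true[k] <- reject(Y_sim, b_hat)
  rej_096[k]  <- reject(Y_sim, 0.96)
}

sum(rej_true)   # number of rejections of the true null
sum(rej_096)    # number of rejections of H0: b = 0.96
```

When the null is true, the rejection count should be close to 5% of the simulations; the count for b = 0.96 depends on how far 0.96 lies from the fitted slope relative to its standard error.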
4 Submission and Grading
Your printed report (along with your R code) is due on May 4, 2017. Partners submit one
printed report. No late reports will be accepted. Both partners in the pair will receive the
same grade. The score is out of 100 points. 20 points will be assigned for each part, plus an
additional 20 points based on the presentation of results and your R code. Make sure your
report is professional with correct punctuation and spelling. Omit needless words.
5 Some Useful R Functions
1. Read the data: read.table("words.txt", header = TRUE, sep = "")
2. Linear regression: fit <- lm(Y ~ X)
3. Summarize the outputs from the least squares fit: summary(fit)
4. Construct a 95% confidence interval: confint(fit, "X", level = 0.95)
5. Sampling with replacement: sample(1:n, n, replace = TRUE)
6. Using the for loop to run a simulation: for (i in 1:1000) {…}