In this assignment you will experiment with random variation over discrete events.
It will be very helpful to use the analytical results and the experimental results to verify each other.
If they do not align, you are probably doing something wrong (cross-checking like this is a very powerful
and important thing to do whenever working with real data).
As usual, it is highly recommended that you use LaTeX for this assignment. If you do not, you may
lose points if your assignment is difficult to read or hard to follow. Find a sample form in this directory:
https://www.cs.utah.edu/˜jeffp/teaching/latex/
Consider a domain of size n = 3000.
A: (5 points) Generate random numbers in the domain [n] until two have the same value. How many
random trials did this take? We will use k to represent this value.
B: (10 points) Repeat the experiment m = 300 times, and record for each how many random trials this
took. Plot this data as a cumulative density plot where the x-axis records the number of trials required k,
and the y-axis records the fraction of experiments that succeeded (a collision) after k trials. The plot should
show a curve that starts at a y value of 0, and increases as k increases, and eventually reaches a y value of 1.
C: (5 points) Empirically estimate the expected number of random trials k needed to get a collision.
That is, add up all of the recorded values of k, and divide by m.
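For reference, a minimal sketch of parts A-C in Octave/MATLAB (using the values n = 3000 and m = 300 above; any equivalent implementation in another language is fine):

n = 3000; m = 300;
ks = zeros(1, m);               % trials until the first collision, for each experiment
for j = 1:m
  seen = false(1, n);
  k = 0;
  while true
    k = k + 1;
    x = randi(n);               % uniform draw from [n]
    if seen(x), break; end      % collision: stop this experiment
    seen(x) = true;
  end
  ks(j) = k;
end
plot(sort(ks), (1:m)/m);        % cumulative density plot for part B
mean(ks)                        % empirical estimate of E[k] for part C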
D: (10 points) Describe how you implemented this experiment and how long it took for m = 300 trials.
Show a plot of the run time as you gradually increase the parameters n and m. (For at least 3 fixed values
of m between 300 and 10,000, plot the time as a function of n.) You should be able to reach values of
n = 1,000,000 and m = 10,000.
Consider a domain [n] of size n = 100.
A: (5 points) Generate random numbers in the domain [n] until every value i ∈ [n] has had one random
number equal to i. How many random trials did this take? We will use k to represent this value.
B: (10 points) Repeat step A m = 300 times, and for each repetition record the value k of how many
random trials were required to collect all values i ∈ [n]. Make a cumulative density plot as in 1.B.
C: (5 points) Use the above results to calculate the empirical expected value of k.
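Again, a minimal sketch of this experiment in Octave/MATLAB (assuming n = 100 and m = 300 as above):

n = 100; m = 300;
ks = zeros(1, m);               % trials needed to see every value, for each experiment
for j = 1:m
  seen = false(1, n);
  k = 0;
  while ~all(seen)
    k = k + 1;
    seen(randi(n)) = true;
  end
  ks(j) = k;
end
plot(sort(ks), (1:m)/m);        % cumulative density plot for part B
mean(ks)                        % empirical expected value of k for part C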
D: (10 points) Describe how you implemented this experiment and how long it took for n = 100 and
m = 300 trials.
Show a plot of the run time as you gradually increase the parameters n and m. (For at least 3 fixed values
of m between 300 and 5,000, plot the time as a function of n.) You should be able to reach n = 20,000 and
m = 5,000.
A: (12 points) Calculate analytically (using formulas from the notes) the number of random trials needed
so there is a collision with probability at least 0.5 when the domain size is n = 3000. (Show your work.)
How does this compare to your results from Q1.C?
B: (12 points) Calculate analytically (using formulas from the notes) the expected number of random
trials before all elements are witnessed in a domain of size n = 100. (Show your work.)
How does this compare to your results from Q2.C?
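As a reminder (defer to the exact statements in the notes, which may be written in a slightly different form), the standard approximations for these two quantities are

\Pr[\text{no collision in } k \text{ trials}] \approx e^{-k(k-1)/(2n)}, \qquad
\mathbb{E}[\text{trials to collect all } n \text{ values}] = n\,H_n = n\sum_{i=1}^{n}\tfrac{1}{i} \approx n(\ln n + 0.577).

Setting the first expression equal to 1/2 and solving for k gives the threshold asked for in part A.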
Consider the setting where the only random function you have is one that chooses a bit at random. In particular, rand-bit()
returns 0 or 1 uniformly at random.
A: (6 points) How can you use this to create a uniformly random integer between 1 and n = 1024?
B: (5 points) Describe a Las Vegas randomized algorithm (“Las Vegas” means: it may run forever, but
you can bound how long you expect it to take, and when it finishes you know it is done) for when n = 1000.
C: (5 points) How many calls to rand-bit() does this take, in terms of n? (That is, if I were to increase the value of n, how does
the number of calls to rand-bit() change?)
Keep in mind that in many settings generating a random bit is much more expensive than a standard CPU
operation (like an addition, an if statement, or a memory reference), so it is critical to minimize the number of such calls.
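As one hedged illustration (not necessarily the construction intended in the notes): when n is a power of two, ceil(log2(n)) calls to rand-bit() directly form a uniform integer, and for other n rejection sampling gives a Las Vegas scheme. A sketch in Octave/MATLAB, assuming a rand_bit() function is available:

function x = rand_int(n)          % uniform integer in 1..n using only random bits
  b = ceil(log2(n));              % bits generated per attempt
  while true
    x = 0;
    for i = 1:b
      x = 2*x + rand_bit();       % uniform integer in 0..2^b - 1
    end
    if x < n
      x = x + 1;                  % accept: shift to 1..n
      return;
    end
  end                             % otherwise reject and try again (Las Vegas)
end

For n = 1024 no rejection ever occurs (exactly 10 bits per draw), while for n = 1000 each attempt is accepted with probability 1000/1024, so the expected number of rand-bit() calls stays close to 10.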
Consider a domain of size n and let k be the number of random trials run. Let f_i denote the number of trials
that have value i. Note that for each i ∈ [n] we have E[f_i] = k/n. Let µ = max_{i∈[n]} f_i / k.
Consider some parameter δ ∈ (0, 1). How large does k need to be so that Pr[|µ − 1/n| ≥ 0.01] ≤ δ? That
is, how large does k need to be for all counts to be within 1% of the average, with probability at least 1 − δ?
How does this change if we want Pr[|µ − 1/n| ≥ 0.001] ≤ δ (for 0.1% accuracy)?
(Make sure to show your work.)
In this assignment you will explore the use of k-grams, Jaccard distance, min hashing, and LSH in the
context of document similarity.
• https://www.cs.utah.edu/˜jeffp/teaching/cs5140/A2/D1.txt
• https://www.cs.utah.edu/˜jeffp/teaching/cs5140/A2/D2.txt
• https://www.cs.utah.edu/˜jeffp/teaching/cs5140/A2/D3.txt
• https://www.cs.utah.edu/˜jeffp/teaching/cs5140/A2/D4.txt
As usual, it is highly recommended that you use LaTeX for this assignment. If you do not, you may
lose points if your assignment is difficult to read or hard to follow. Find a sample form in this directory:
https://www.cs.utah.edu/˜jeffp/teaching/latex/
You will construct several types of k-grams for all documents. All documents use at most 27 distinct characters: the lower case letters and the space.
[G1] Construct 2-grams based on characters, for all documents.
[G2] Construct 3-grams based on characters, for all documents.
[G3] Construct 3-grams based on words, for all documents.
Remember that you should only store each k-gram once; duplicates are ignored.
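A sketch of one way to build these gram sets in Octave/MATLAB (the file name and exact helper calls are illustrative; any approach that produces sets of distinct grams is fine):

s = strtrim(fileread('D1.txt'));                 % assumes the document is saved locally
g2 = unique(arrayfun(@(i) s(i:i+1), 1:numel(s)-1, 'UniformOutput', false));   % [G1]
g3 = unique(arrayfun(@(i) s(i:i+2), 1:numel(s)-2, 'UniformOutput', false));   % [G2]
w  = strsplit(s, ' ');                           % words, split on spaces
gw = unique(arrayfun(@(i) strjoin(w(i:i+2), ' '), 1:numel(w)-2, 'UniformOutput', false));  % [G3]
numel(g2), numel(g3), numel(gw)                  % distinct gram counts for part A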
A: (20 points) How many distinct k-grams are there for each document with each type of k-gram? You
should report 4 × 3 = 12 different numbers.
B: (20 points) Compute the Jaccard similarity between all pairs of documents for each type of k-gram.
You should report 3 × 6 = 18 different numbers.
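Given two gram sets as cell arrays of strings (gA and gB below stand for any two of the sets built above), the Jaccard similarity is simply

JS = numel(intersect(gA, gB)) / numel(union(gA, gB));   % |A ∩ B| / |A ∪ B|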
We will consider a hash family H so that any hash function h ∈ H maps h : {k-grams} → [m], for m
large enough (I suggest m ≥ 10,000).
A: (25 points) Using grams G2, build a min-hash signature for documents D1 and D2 using t = {10, 50, 100, 250, 500}
hash functions. For each value of t report the approximate Jaccard similarity between the pair of documents
D1 and D2, estimating the Jaccard similarity as

JŜ_t(a, b) = (1/t) Σ_{i=1}^{t} [ 1 if a_i = b_i ; 0 if a_i ≠ b_i ].
You should report 5 numbers.
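One hedged sketch of building these signatures in Octave/MATLAB: map each distinct gram to an integer id and simulate each h ∈ H by a random linear map modulo a prime (a standard practical shortcut that is only approximately min-wise independent; here g3_1 and g3_2 stand for the [G2] gram sets of D1 and D2, and the prime is an illustrative choice):

U = union(g3_1, g3_2);                      % universe of grams seen in D1 or D2
[~, id1] = ismember(g3_1, U);               % integer ids of D1's grams
[~, id2] = ismember(g3_2, U);
p = 10007;                                  % a prime, so m = p >= 10,000
t = 500;                                    % longest signature needed
a = randi(p-1, t, 1);  b = randi(p, t, 1) - 1;   % parameters of t random hash functions
sig1 = zeros(t, 1);  sig2 = zeros(t, 1);
for i = 1:t
  sig1(i) = min(mod(a(i)*id1 + b(i), p));   % min-hash value of D1 under h_i
  sig2(i) = min(mod(a(i)*id2 + b(i), p));
end
for tt = [10 50 100 250 500]
  fprintf('t = %d: estimated JS = %f\n', tt, mean(sig1(1:tt) == sig2(1:tt)));
end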
B: (5 points) What seems to be a good value for t? You may run more experiments. Justify your answer in
terms of both accuracy and time.
Consider computing an LSH using t = 100 hash functions. We want to find all documents which have
Jaccard similarity above τ = .4.
A: (8 points) Use the trick mentioned in class and the notes to estimate the best value of the number of hash
functions b within each of r bands to provide the S-curve

f(s) = 1 − (1 − s^b)^r

with good separation at τ. Report these values.
(Note: the roles of r and b were mixed up in an earlier version of this assignment. You may report either convention,
but please be clear which value is the number of bands (here r) and which is the number of hashes per band (here b).
This matches the notes, but is the reverse of the book.)
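One way to explore the choice (a sketch only; the trick from class may instead point you to a closed-form target such as tau ≈ (1/r)^(1/b), so treat this brute-force search as a sanity check): enumerate the factor pairs of t = 100 and see how sharply each S-curve separates similarities just below and just above τ = 0.4.

t = 100; tau = 0.4;
f = @(s, b, r) 1 - (1 - s.^b).^r;          % S-curve with r bands of b hashes each
for b = 1:t
  if mod(t, b) == 0
    r = t / b;
    fprintf('b = %3d, r = %3d: f(0.3) = %.3f, f(tau) = %.3f, f(0.5) = %.3f\n', ...
            b, r, f(0.3, b, r), f(tau, b, r), f(0.5, b, r));
  end
end

For part B below, plugging each pair's [G2] Jaccard similarity s into f(s) with your chosen b and r gives the required probability.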
B: (24 points) Using your choice of r and b and f(·), what is the probability that each pair of the four
documents (using [G2]) is estimated to have similarity greater than τ? Report 6 numbers. (Show
your work.)
Describe a scheme like Min-Hashing for the Andberg Similarity, defined as Andb(A, B) = |A ∩ B| / (|A ∪ B| + |A △ B|). That is,
given two sets A and B and a family of hash functions H, we need Pr_{h∈H}[h(A) = h(B)] = Andb(A, B). Note the
only randomness is in the choice of hash function h from the set H, and h ∈ H represents the process of
choosing a hash function (randomly) from H. The point of this question is to design this process, and show
that it has the required property.
Or show that such a process cannot be done.
In this assignment you will explore clustering: hierarchical and point-assignment. You will also experiment
with high dimensional data.
• https://www.cs.utah.edu/˜jeffp/teaching/cs5140/A3/C1.txt
• https://www.cs.utah.edu/˜jeffp/teaching/cs5140/A3/C2.txt
• https://www.cs.utah.edu/˜jeffp/teaching/cs5140/A3/C3.txt
These data sets all have the following format. Each line is a data point. The lines have either 3 or 6 tab
separated items. The first one is an integer describing the index of the point. The next 2 (or 5 for C3) are
the coordinates of the data point. C1 and C2 are in 2 dimensions, and C3 is in 5 dimensions. C1 should have
n=20 points, C2 should have n=1004 points, and C3 should have n=1000 points. We will always measure
distance with Euclidean distance.
As usual, it is highly recommended that you use LaTeX for this assignment. If you do not, you may
lose points if your assignment is difficult to read or hard to follow. Find a sample form in this directory:
https://www.cs.utah.edu/˜jeffp/teaching/latex/
There are many variants of hierarchical clustering; here we explore 3. The key difference is how you measure
the distance d(S1, S2) between two clusters S1 and S2.
Single-Link: measures the shortest link, d(S1, S2) = min_{(s1,s2) ∈ S1×S2} ||s1 − s2||_2.
Complete-Link: measures the longest link, d(S1, S2) = max_{(s1,s2) ∈ S1×S2} ||s1 − s2||_2.
Mean-Link: measures the distance between the means. First compute a1 = (1/|S1|) Σ_{s∈S1} s and a2 = (1/|S2|) Σ_{s∈S2} s,
and then d(S1, S2) = ||a1 − a2||_2.
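A brief sketch of the generic agglomerative loop in Octave/MATLAB, written with the single-link rule (X is the n × 2 matrix of points; this naive version is plenty fast for n = 20):

k = 4;
n = size(X, 1);
D = zeros(n, n);
for i = 1:n
  D(i, :) = sqrt(sum((X - X(i,:)).^2, 2))';      % pairwise Euclidean distances
end
clusters = num2cell(1:n);                        % start with every point in its own cluster
while numel(clusters) > k
  best = inf;
  for i = 1:numel(clusters)
    for j = i+1:numel(clusters)
      d = min(min(D(clusters{i}, clusters{j})));   % single-link distance between clusters
      if d < best, best = d; bi = i; bj = j; end
    end
  end
  clusters{bi} = [clusters{bi}, clusters{bj}];     % merge the two closest clusters
  clusters(bj) = [];
end

For Complete-Link replace the inner min(min(...)) with max(max(...)); for Mean-Link compare norm(mean(X(clusters{i},:),1) - mean(X(clusters{j},:),1)) instead.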
A (25 points): Run all hierarchical clustering variants on data set C1.txt until there are k = 4 clusters,
and report the results as sets.
Which variant did the best job, and which was the easiest to compute (consider what would happen if the data were much larger)?
Explain your answers.
Point assignment clustering works by assigning every point x ∈ X to the closest of the cluster centers C. Let
φ_C : X → C be this assignment map, so that φ_C(x) = arg min_{c∈C} d(x, c). All points that map to the same
cluster center are in the same cluster.
Two good heuristics for this type of clustering are the Gonzalez (Algorithm 9.4.1) and k-Means++
(Algorithm 10.1.2) algorithms.
A: (20 points) Run Gonzalez and k-Means++ on data set C2.txt for k = 3. To avoid too much
variation in the results, choose c1 as the point with index 1.
Report the centers and the subsets for Gonzalez. Report:
• the 3-center cost max_{x∈X} d(x, φ_C(x)), and
• the 3-means cost Σ_{x∈X} (d(x, φ_C(x)))^2.
For k-Means++, the algorithm is randomized, so you will need to report the variation in this algorithm.
Run it for several trials (at least 20) and plot the cumulative density function of the 3-means cost. Also report
what fraction of the time the subsets are the same as the result from Gonzalez.
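A sketch of the two seeding procedures in Octave/MATLAB (X is the n × 2 data matrix with rows in index order; both follow the cited algorithms only at a high level, so check the exact pseudocode in the notes):

k = 3; n = size(X, 1);

% Gonzalez: repeatedly add the point farthest from the current centers
G = 1;                                          % c1 is the point with index 1
d = sqrt(sum((X - X(1,:)).^2, 2));              % distance of each point to its nearest center
for j = 2:k
  [~, far] = max(d);
  G(end+1) = far;                               % the farthest point becomes the next center
  d = min(d, sqrt(sum((X - X(far,:)).^2, 2)));
end
cost_center = max(d)                            % 3-center cost
cost_means  = sum(d.^2)                         % 3-means cost

% k-Means++ seeding: sample each new center proportional to squared distance
P = 1;
d2 = sum((X - X(1,:)).^2, 2);
for j = 2:k
  nxt = find(rand() <= cumsum(d2 / sum(d2)), 1);
  P(end+1) = nxt;
  d2 = min(d2, sum((X - X(nxt,:)).^2, 2));
end

The subsets to report follow from assigning every point to its nearest chosen center.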
B: (20 points) Recall that Lloyd's algorithm for k-means clustering starts with a set of k centers C and
runs as described in Algorithm 10.1.1.
• Run Lloyd's algorithm with C initially the points indexed {1, 2, 3}. Report the final subsets and the
3-means cost.
• Run Lloyd's algorithm with C initially the output of Gonzalez above. Report the final subsets and
the 3-means cost.
• Run Lloyd's algorithm with C initially the output of each run of k-Means++ above. Plot a cumulative
density function of the 3-means cost. Also report the fraction of the trials in which the subsets are
the same as the input.
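A sketch of a single Lloyd's iteration in Octave/MATLAB (C is the current k × 2 matrix of centers; repeat until the assignment stops changing, per Algorithm 10.1.1):

k = size(C, 1); n = size(X, 1);
D = zeros(n, k);
for j = 1:k
  D(:, j) = sum((X - C(j,:)).^2, 2);       % squared distance from every point to center j
end
[~, assign] = min(D, [], 2);               % the assignment map phi_C
for j = 1:k
  if any(assign == j)                      % guard against an empty cluster
    C(j, :) = mean(X(assign == j, :), 1);  % recenter each cluster at its mean
  end
end
cost3means = sum(min(D, [], 2));           % 3-means cost under the previous centers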
C: (10 points) Consider a set of points S ⊂ R^d and d the Euclidean distance. Prove that

arg min_{p ∈ R^d} Σ_{x∈S} (d(x, p))^2 = (1/|S|) Σ_{x∈S} x.
Here are some suggested steps to follow towards the proof (note there are also other valid ways to prove
this, but, for instance, achieving some of these steps will get you partial credit):
1. First prove the same result for S ⊂ R^1.
2. Expand each term (d(x, p))^2 = (x − p)^2 = x^2 + p^2 − 2xp.
3. Add the above terms together and take the first derivative.
4. Show the results for each dimension can be solved independently (use properties of edge lengths in a
right triangle).
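For example, carrying out steps 2 and 3 in R^1 only (the full proof still needs step 4):

\frac{d}{dp}\sum_{x\in S}(x-p)^2
  = \frac{d}{dp}\sum_{x\in S}\left(x^2 + p^2 - 2xp\right)
  = \sum_{x\in S}\left(2p - 2x\right)
  = 2|S|\,p - 2\sum_{x\in S} x,

which is zero exactly when p = \frac{1}{|S|}\sum_{x\in S} x, and the second derivative 2|S| > 0 confirms this is a minimum.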
The k-median clustering problem on a data set P is to find a set of k centers C = {c1, c2, . . . , ck} to
minimize Cost_1(P, C) = Σ_{p∈P} d(p, φ_C(p)). We did not explicitly talk much about this formulation in
class, but the techniques to solve it are all typically extensions of approaches we did talk about. This
problem will be more open-ended, and will ask you to try various approaches to solve this problem. We will
use data set C3.txt.
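Whatever approach you try, you will need to evaluate candidate center sets; a sketch of the cost computation in Octave/MATLAB (X is the 1000 × 5 matrix of points and C a candidate k × 5 matrix of centers):

D = zeros(size(X, 1), size(C, 1));
for j = 1:size(C, 1)
  D(:, j) = sqrt(sum((X - C(j,:)).^2, 2));   % distance from every point to center j
end
cost1 = sum(min(D, [], 2));                  % Cost_1(P, C): sum of distances to nearest center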
A: (20 points) Find a set of 4 centers C = {c1, c2, c3, c4} for the 4-medians problem on dataset C3.txt.
Report the set of centers, as well as Cost_1(P, C). The centers should be in the write-up you turn in, but also
in a file formatted the same way as the input so we can verify the cost you found. That is, each line should have one
center with 6 tab-separated numbers: the first being the index (e.g., 1, 2, 3 or 4), and the next 5 being the
5-dimensional coordinates of that center.
Your score will be based on how small a Cost_1(P, C) you can find. You can get 15 points for a reasonable
solution. The smallest score found in the class will get all 20 points. Other scores will obtain points in
between.
Very briefly describe how you found the centers.
B: (5 points) Run your algorithm again for the 5-medians problem on dataset C3.txt. Report the set of
5 centers and the Cost1(P, C). You do not need to turn in a file for these, just write it in your report.
Recall that the k-center problem is to find a set of k centers C to minimize

Cost_0(P, C) = max_{p∈P} min_{c∈C} d(p, c).

Let C* be the optimal choice of k centers for the k-center problem, and let V* = Cost_0(P, C*).
Prove that the Gonzalez algorithm always finds a set of k centers C such that Cost_0(P, C) ≤ 2V*.
In this assignment you will explore finding frequent items in data sets, with emphasis on techniques designed
to work at enormous scale.
• https://www.cs.utah.edu/˜jeffp/teaching/cs5140/A4/S1.txt
• https://www.cs.utah.edu/˜jeffp/teaching/cs5140/A4/S2.txt
These data sets each describe a set of 2000 characters, separated by spaces. The order of the file represents
the order of the stream.
As usual, it is highly recommended that you use LaTeX for this assignment. If you do not, you may
lose points if your assignment is difficult to read or hard to follow. Find a sample form in this directory:
https://www.cs.utah.edu/˜jeffp/teaching/latex/
A (20 points): Run the Misra-Gries Algorithm (see L12.2.2) with (k − 1) = 9 counters on streams S1
and S2. Report the output of the counters at the end of the stream.
In each stream, from just the counters, report how many objects might occur more than 20% of the time,
and which must occur more than 20% of the time.
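A sketch of the Misra-Gries update in Octave/MATLAB, using containers.Map for the counters (the tokenization line assumes the stream file is space-separated single characters, as described above):

stream = strsplit(strtrim(fileread('S1.txt')), ' ');
km1 = 9;                                    % (k - 1) = 9 counters
counters = containers.Map();
for i = 1:numel(stream)
  x = stream{i};
  if isKey(counters, x)
    counters(x) = counters(x) + 1;          % existing label: increment
  elseif counters.Count < km1
    counters(x) = 1;                        % free counter: start tracking this label
  else
    for key = keys(counters)                % otherwise decrement every counter
      c = key{1};
      counters(c) = counters(c) - 1;
      if counters(c) == 0, remove(counters, c); end
    end
  end
end
keys(counters), values(counters)            % the final counters to report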
B (20 points): Build a Count-Min Sketch (see L12.2.3) with k = 10 counters using t = 5 hash functions.
Run it on streams S1 and S2.
For both streams, report the estimated counts for objects a, b, and c. Just from the output of the sketch,
which of these objects, with probability 1 − δ = 31/32, might occur more than 20% of the time?
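A sketch of a Count-Min Sketch in Octave/MATLAB, mapping each character to an integer with double() and using random linear maps mod a prime as the t hash functions (an illustrative 2-universal choice; your family from the notes may differ):

k = 10; t = 5; p = 10007;
a = randi(p-1, t, 1);  b = randi(p, t, 1) - 1;
CM = zeros(t, k);
for i = 1:numel(stream)                           % stream as in the previous sketch
  x = double(stream{i}(1));                       % character token -> integer
  for j = 1:t
    cidx = mod(mod(a(j)*x + b(j), p), k) + 1;     % hash into one of the k counters in row j
    CM(j, cidx) = CM(j, cidx) + 1;
  end
end
q = double('a');                                  % estimated count of character 'a'
est = min(CM(sub2ind(size(CM), (1:t)', mod(mod(a*q + b, p), k) + 1)))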
C (5 points): How would these algorithms need to change (to answer the same questions) if each object
of the stream was a “word” seen on Twitter, and the stream contained all tweets?
D (5 points): Name one advantage of Count-Min Sketch over the Misra-Gries algorithm.
The exact heavy-hitter problem is as follows: return all objects that occur more than 10% of the time. It
cannot return any false positives or any false negatives. In the streaming setting, this requires Ω(min{m, n})
space if there are n objects that can occur and the stream is of length m.
A: (1 point) A 2-Pass streaming algorithm is one that is able to read all of the data in-order exactly twice,
but still only has limited memory. Describe a small space algorithm to solve the exact heavy hitter problem
(say for φ = 10% threshold).
In this assignment you will explore regression techniques on high-dimensional data.
• https://www.cs.utah.edu/˜jeffp/teaching/cs5140/A5/A.dat
• https://www.cs.utah.edu/˜jeffp/teaching/cs5140/A5/X.dat
• https://www.cs.utah.edu/˜jeffp/teaching/cs5140/A5/Y.dat
• https://www.cs.utah.edu/˜jeffp/teaching/cs5140/A5/M.dat
• https://www.cs.utah.edu/˜jeffp/teaching/cs5140/A5/W.dat
and a file stub:
• https://www.cs.utah.edu/˜jeffp/teaching/cs5140/A5/FD.m
These data sets are in matrix format and can be loaded into MATLAB or OCTAVE. By calling
load filename (for instance load X.dat)
it will put in memory the data in the file, for instance in the above example the matrix X. You can then
display this matrix by typing
X
As usual, it is highly recommended that you use LaTeX for this assignment. If you do not, you may
lose points if your assignment is difficult to read or hard to follow. Find a sample form in this directory:
https://www.cs.utah.edu/˜jeffp/teaching/latex/
First we will compute the SVD of the matrix A we have loaded
[U,S,V] = svd(A)
Then take the top k components of A for values of k = 1 through k = 10 using
Uk = U(:,1:k)
Sk = S(1:k,1:k)
Vk = V(:,1:k)
Ak = Uk*Sk*Vk'
A (10 points): Compute and report the L2 norm of the difference between A and Ak for each value of k
using
norm(A-Ak,2)
B (5 points): Find the smallest value k so that the L2 norm of A-Ak is less than 10% that of A; k might
or might not be larger than 10.
C (5 points): Treat the matrix as 1125 points in 30 dimensions. Plot the points in 2 dimensions in the way
that minimizes the sum of residuals squared.
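For part C, one standard reading of the question (hedged: confirm against the notes) is to project the points onto the top two right singular vectors, which is the rank-2 projection minimizing the sum of squared residuals:

P2 = A * V(:, 1:2);             % 2-dimensional coordinates of the 1125 points
plot(P2(:,1), P2(:,2), '.');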
Use the stub file FD.m to create a function for the Frequent Directions algorithm (Algorithm 16.2.1). We
will consider running this code on matrix A.
A (15 points): We can measure the error max_{||x||=1} | ||Ax||^2 − ||Bx||^2 | as norm(A'*A - B'*B, 2).
How large does l need to be for the above error to be at most ||A||_F^2 / 10? How does this compare to the
theoretical bound (e.g., for k = 0)?
Note: you can calculate ||A||_F^2 as norm(A, 'fro')^2.
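A sketch of the measurement loop for part A, assuming your completed stub is callable as B = FD(A, l) (the exact signature of FD.m may differ):

target = norm(A, 'fro')^2 / 10;
for l = 2:size(A, 2)
  B = FD(A, l);                            % your Frequent Directions sketch with l rows
  err = norm(A'*A - B'*B, 2);
  if err <= target
    fprintf('l = %d achieves error %g <= %g\n', l, err, target);
    break;
  end
end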
B (25 points): Frequent Directions should also satisfy another bound based on its Frobenius norm.
We can compute AΠ_{B_k} using Bk = B(1:k,:) and then calculating A * pinv(Bk) * Bk. How large does l need to be to achieve

||A − AΠ_{B_k}||_F^2 ≤ 1.1 · ||A − A_k||_F^2

for each value k ∈ {1, 2, 3, 4, 5, 6, 7}? Answer both by running your algorithm and by reporting the theoretical
bound provided in the notes. (E.g., you should report 7 pairs of values, an empirical and a theoretical bound
for each value k.)
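Likewise for part B, a sketch of checking the bound empirically for one value of k (Ak is the best rank-k approximation of A from its SVD, as in Question 1, and B is your Frequent Directions sketch):

[U, S, V] = svd(A);
k = 3;                                             % repeat for k = 1, ..., 7
Ak = U(:,1:k) * S(1:k,1:k) * V(:,1:k)';
Bk = B(1:k, :);                                    % top k rows of the sketch
lhs = norm(A - A * pinv(Bk) * Bk, 'fro')^2;        % ||A - A Pi_{B_k}||_F^2
rhs = 1.1 * norm(A - Ak, 'fro')^2;
lhs <= rhs                                         % does this l satisfy the bound?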
We will find coefficients C (these were a1, . . . , ad in the notes, but renamed to avoid confusion with the matrix A in Q1)
to estimate X*C ≈ Y, using the provided datasets X and Y. We will compare two approaches: least squares
and ridge regression.
Least Squares: Set C = inv(X'*X)*X'*Y
Ridge Regression: Set Cs = inv(X'*X + s^2*eye(12))*X'*Y
A (20 points): Solve for the coefficients C (or Cs) using Least Squares and Ridge Regression with
s = {0.1, 0.3, 0.5, 1.0, 2.0} (i.e., s will take on one of those 5 values each time you try, say obtaining C2 for
s = 2). For each set of coefficients, report the error in the estimate Ŷ of Y as norm(Y - X*C, 2).
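A short loop carrying out part A (a sketch; it simply repeats the commands given above for each s):

C = inv(X'*X) * X'*Y;
fprintf('least squares error: %f\n', norm(Y - X*C, 2));
for s = [0.1 0.3 0.5 1.0 2.0]
  Cs = inv(X'*X + s^2*eye(12)) * X'*Y;
  fprintf('ridge, s = %.1f: error %f\n', s, norm(Y - X*Cs, 2));
end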
B (20 points): Create three row-subsets of X and Y
• X1 = X(1:66,:) and Y1 = Y(1:66)
• X2 = X(34:100,:) and Y2 = Y(34:100)
• X3 = [X(1:33,:); X(67:100,:)] and Y3 = [Y(1:33); Y(67:100)]
Repeat the above procedure on these subsets and cross-validate the solution on the remainder of X and
Y. Specifically, learn the coefficients C using, say, X1 and Y1, and then measure norm(Y(67:100) -
X(67:100,:)*C, 2).
Which approach works best (averaging the results from the three subsets): Least Squares, or Ridge
Regression, and if Ridge Regression, for which value of s?
Consider a linear equation W = M*S where M is a measurement matrix filled with random values {−1, 0, +1}
(although now that they are there, they are no longer random), and W is the output of the sparse signal S
when measured by M.
Use Orthogonal Matching Pursuit (as described in the notes) to recover the non-zero entries from S.
Record the order in which you find each entry and the residual vector after each step.
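A sketch of a generic Orthogonal Matching Pursuit loop in Octave/MATLAB (follow the exact stopping rule from the notes; here it simply stops once the residual is essentially zero):

r = W;                                    % current residual
support = [];                             % indices of recovered entries, in the order found
S_hat = zeros(size(M, 2), 1);
while norm(r) > 1e-6 && numel(support) < size(M, 2)
  [~, idx] = max(abs(M' * r));            % column most correlated with the residual
  support(end+1) = idx;
  S_hat = zeros(size(M, 2), 1);
  S_hat(support) = M(:, support) \ W;     % least-squares fit on the current support
  r = W - M * S_hat;
  fprintf('picked entry %d, residual norm %g\n', idx, norm(r));
end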
In this assignment you will explore different approaches to analyzing Markov chains.
• https://www.cs.utah.edu/˜jeffp/teaching/cs5140/A6/M.dat
These data sets are in matrix format and can be loaded into MATLAB or OCTAVE. By calling
load filename (for instance load M.dat)
it will put in memory the data in the file, for instance in the above example the matrix M. You can then
display this matrix by typing M
As usual, it is highly recommended that you use LaTeX for this assignment. If you do not, you may
lose points if your assignment is difficult to read or hard to follow. Find a sample form in this directory:
https://www.cs.utah.edu/˜jeffp/teaching/latex/
We will consider four ways to find q* = M^t q0 as t → ∞.
Matrix Power: Choose some large enough value t, and create M^t. Then apply q* = (M^t) q0. There are two ways to
create M^t: first, we can just let M^{i+1} = M^i * M, repeating this process t − 1 times. Alternatively
(for simplicity assume t is a power of 2), in log_2 t steps create M^{2i} = M^i * M^i.
State Propagation: Iterate q_{i+1} = M * q_i for some large enough number t of iterations.
Random Walk: Start with a fixed state q0 = [0, 0, . . . , 1, . . . , 0, 0]^T where there is only a 1 at the ith entry, and
then transition to a new state, with only a 1 in the i'th entry, by choosing a new location proportional
to the values in the ith column of M. Iterate this for some large number t0 of steps to get state q'_0. (This
is the burn-in period.)
Now take t new steps starting at q'_0 and record the location after each step. Keep track of how many
times you have recorded each location, and estimate q* as the normalized version (recall ||q*||_1 = 1)
of the vector of these counts.
Eigen-Analysis: Compute eig(M) and take the first eigenvector after it has been normalized.
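A sketch of all four computations in Octave/MATLAB, using the parameters given in part A below (the convention assumed here is that M(:, i) holds the transition probabilities out of state i, as in the Random Walk description):

n = size(M, 1);
t = 512; t0 = 50;
q0 = zeros(n, 1); q0(1) = 1;

% Matrix Power (by repeated squaring; t is a power of 2)
Mt = M;
for i = 1:log2(t)
  Mt = Mt * Mt;
end
q_power = Mt * q0;

% State Propagation
q_prop = q0;
for i = 1:t
  q_prop = M * q_prop;
end

% Random Walk
state = 1;                                   % the entry where q0 has its 1
for i = 1:t0                                 % burn-in period
  state = find(rand() <= cumsum(M(:, state)), 1);
end
counts = zeros(n, 1);
for i = 1:t
  state = find(rand() <= cumsum(M(:, state)), 1);
  counts(state) = counts(state) + 1;
end
q_walk = counts / sum(counts);               % normalized so that ||q*||_1 = 1

% Eigen-Analysis
[V, D] = eig(M);
[~, idx] = max(real(diag(D)));               % eigenvalue closest to 1
q_eig = V(:, idx) / sum(V(:, idx));          % normalize the eigenvector to sum to 1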
A (20 points): Run each method (with t = 512, q0 = [1, 0, 0, . . . , 0]^T, and t0 = 50 when needed) and
report the answers.
B (10 points): Rerun the Matrix Power and State Propagation techniques with q0 = [0.1, 0.1, . . . , 0.1]^T.
What value of t is required to get as close to the true answer as with the initial state used in part A?
C (16 points): Explain at least one Pro and one Con of each approach. The Pro should explain a situation
when it is the best option to use. The Con should explain why another approach may be better for some
situation.
D (4 points): Is the Markov chain ergodic? Explain why or why not.
Repeat the trials in part 1.A above using taxation β = 0.85, so that at each step, with probability 1 − β, the
state jumps to a uniformly random node. It is useful to see how the outcome changes with respect to the results from
Question 1. Recall that this output is the PageRank vector of the graph represented by M.
Briefly explain (no more than 2 sentences) what you needed to do in order to alter the process in question
1 to apply this taxation.
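Briefly, a sketch of the standard way to apply this (hedged; describe whatever you actually did): replace M by its taxed version before rerunning the methods of Question 1, i.e.

beta = 0.85;
n = size(M, 1);
Mtax = beta * M + (1 - beta) * ones(n, n) / n;   % with probability 1 - beta, jump to a uniformly random state

For the random-walk method the equivalent change is: at each step, with probability 1 − β jump to a uniformly random state instead of following a column of M.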