Data 100 & 200A Principles and Techniques of Data Science
Spring 2019
INSTRUCTIONS
You have 70 minutes to complete the exam.
Midterm 1 Solutions
The exam is closed book, closed notes, closed computer, closed calculator, except one hand-written 8.5 11 crib sheet of your own creation and the official Data 100 study guide.
Mark your answers on the exam itself. We will not grade answers written on scratch paper.
Last name
First name
Student ID number
CalCentral email ([email protected])
Exam room
Name of the person to your left
Name of the person to your right
All the work on this exam is my own.
(please sign)
2
1. (12 points) Python vs. SQL
Fill in both the Python code and the SQL query to produce each result below, assuming that the following three tables are stored both as Pandas DataFrames and Sqlite tables. Only the first few rows are shown for each table. The cities table contains one row per city and its population in thousands. The names table contains one row per state. The states table contains one row per state with its population in millions. Assume that cities contains only a small subset of US cities. There may be multiple cities in each state, but every city is in a state that appears in both names and states, and every state contains at least one city.
cities names states
(a) (4 pt) Create a table t that is the same as cities but with an additional column ab containing the abbreviation (e.g. CA) of the state in which each city is located. Hint: The str.extract method of a Series called on a regular expression with one group returns a DataFrame with one column labeled 0 containing the first substring matching the group. Assume there is exactly one comma in each city value.
Python: t = cities.copy()
city
pop
Nevada City, California Carson City, Nevada Newark, New Jersey
3 55 285
abbrev
full
CA NV WA
California Nevada Washington
state
people
CA NV WA
39.5 3.0 7.4
SQL:
x = cities[city].str.extract(r, (w+))[0]
t[ab] = list(names.set_index(full).loc[x, abbrev])
CREATE TABLE t AS SELECT city, pop, abbrev AS ab FROM cities JOIN names ON
city LIKE %, || full;
(b) (4 pt) Create a two-column table u of the cities and their populations (labeled city and pop) that are in
states with a population above 5 million. Assume that t from part (a) was constructed correctly. Python: m = t.merge(states, left_on=ab, right_on=state)
m[m[people] > 5][[city, pop]]
SQL: CREATE TABLE u AS SELECT city, pop FROM t WHERE
ab IN (SELECT state FROM states WHERE people > 5)
(c) (4 pt) Create a table with one row per state that contains the states abbreviation and the fraction of cities (from the cities table) in that state that have a population above 50,000.
Python: t[pop].groupby(t[ab]).agg( lambda s: sum(s>50)/len(s))
SQL: SELECT ab, SUM(CASE WHEN pop < 50 THEN 0 ELSE 1 END)/COUNT(*) FROM t GROUP BY ab;Name: 3 2. (6 points) SamplingCircle the correct response to each question about this population of six individuals.(a) (2 pt) From the population above, you draw a simple random sample A of 2 individuals. Whats the probability that Ace and Adele both appear in sample A?1 1 1 1 1 1 1 +1 21 1 21 1 2(1 1) 2(1 +1) 66653366 66 65 33 66(b) (2 pt) You separately draw a cluster sample B from the same population based on the cluster column. (Sample A is replaced in the population before drawing sample B, so the two are independent.) Whats the probability that Ace and Adele both appear in sample B?0 1 1 1 1 2 3 Noneofthese 643234(c) (2 pt) You then combine all individuals from sample A and sample B into sample C. Thus, there may be repeated individuals in sample C. Whats the probability that Bella appears exactly once in sample C?0 1 1 1 1 2 5 Noneofthese 6432363. (4 points) Regular Expressions [a-z]+_[a-z]{2}[_r]_?[a-z]+Circle all of the strings below that match the regular expression above. Only circle a string below if the whole string matches the expression, not just a substring.bar_chart group_by_x a_box_plot visualize_firstname clusterAbdul Ace Adele Aerie Bella Buzz Blue Blue Blue Blue Gold Gold 44. (10 points) Data Visualization(a) (2 pt) Are the two histograms below displaying exactly the same data? Circle only one answer.(a) Yes (b) No (c) Impossible to tell 0 5 10 15 20 5 10 15 20 25xxThere is a location shift between the two distributions.(b) (2 pt) Are the two Gaussian kernel density plots below displaying exactly the same data? Circle only one answer.(a) Yes (b) No (c) Impossible to tell4 2 0 2 4 4 2 0 2 4.(c) (2 pt) Which of the following can be determined from looking at this pie chart? Circle only one answer. BThe density plots could be displaying the same data and look very different because of the bandwith, however, one cannot tell for sure.(a) Category B is twice as frequent as category A (b) Category A is half as frequent as category B(c) Category B is more frequent than category A (d) All of the above(e) None of the aboveA With angles/areas, it is very hard to precisely compare frequencies.CDensityFrequency0.0 0.1 0.2 0.3 0.40 10 20 30DensityFrequency0.0 0.2 0.4 0.6 0.80 10 20 30 40Name:5 (d)(2 pt) You have 30 unique observations of a real-valued quantitative variable. Which of the following visualizations effectively depicts the distribution of these values while retaining as much information as possible about the original data? Circle only one answer.(a) A pie chart (b) A strip chart(c) A box plotA strip chart displays all of the data. A boxplot would summarize the data and potentially miss issues such as bimodality. A pie chart is not applicable.(2 pt) You are given six lists, each of a few thousand numbers taking on values in the real line. Which of the following is the most effective way to visually compare the center and spread of the corresponding six distributions? Choose only one answer.(a) Side-by-side box plots (b) Side-by-side histograms(c) Side-by-side strip chartsWith a few thousand values, strip charts would be unreadable. Visually comparing several histograms is difficult.(e)5. (8 points) Dimensionality ReductionYou perform principal component analysis on a data matrix D using the following Python code from lecture: n = D.shape[0] X = (D – np.mean(D, axis=0)) / np.sqrt(n) u, s, vt = np.linalg.svd(X, full_matrices=False)The resulting value of s is np.array([3, 1, 0, 0, 0]).(a) (4 pt) To draw a histogram of the datas distribution along the first principal component of X, which ofthe following arrays would you visualize? Circle all correct expressions.X @ u.T[:,0] (u * s)[:,0] X @ vt[0,:] (X @ vt.T)[:,0](b) (2 pt) What proportion of the total variance in D is accounted for by the first principal component?9 10(c) (2 pt) What is the rank of X?2
Reviews
There are no reviews yet.