Question 1 (2 pts)
The following piece of code computes the frequencies of the words in a text file:
from operator import add
lines = sc.textFile("README.md")
counts = lines.flatMap(lambda x: x.split()) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(add)
Add one line to find the most frequent word. Output this word and its frequency.
Hint: Use sortBy(), reduce(), or max()
Answer:
Using Python version 2.7.12 (default, Dec 4 2017, 14:50:18)
SparkSession available as spark.
>>> from operator import add
>>> lines = sc.textFile("README.md")
>>> counts = lines.flatMap(lambda x: x.split()).map(lambda x: (x, 1)).reduceByKey(add)
>>> print(counts.map(lambda (k,v): (v,k)).sortByKey(False).take(1))
[(24, u'the')]
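Per the hint, sorting is not strictly necessary: max() with a key function finds the same pair in a single pass (a sketch over the same counts RDD):
>>> print(counts.max(key=lambda kv: kv[1]))
(u'the', 24)
Note that the tuple-unpacking lambda (k,v) used above works only in Python 2; under Python 3, key-style lambdas like the one here are required.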
Question 2 (2 pts)
Modify the word count example above, so that we only count the frequencies of those words consisting of 5 or more characters.
Answer:
Using Python version 2.7.12 (default, Dec 4 2017, 14:50:18)
SparkSession available as spark.
>>> from operator import add
>>> lines = sc.textFile("README.md")
>>> counts = lines.flatMap(lambda x: x.split())
>>> y = counts.filter(lambda word: len(word) >= 5)
>>> t = y.map(lambda x: (x, 1))
>>> p = t.reduceByKey(add)
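The transcript above only builds the filtered counts; an action is still needed to produce output. For example, the five most frequent long words could be shown with (a sketch using the same p):
>>> print(p.takeOrdered(5, key=lambda kv: -kv[1]))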
Question 3 (1 pt)
Consider the following piece of code:
A = sc.parallelize(xrange(1, 100))
t = 50
B = A.filter(lambda x: x < t)
print B.count()
t = 10
C = B.filter(lambda x: x > t)
print C.count()
What's its output? (Yes, you can just run it.)
Answer:
Using Python version 2.7.12 (default, Dec 4 2017, 14:50:18)
SparkSession available as spark.
>>> A = sc.parallelize(xrange(1, 100))
>>> t = 50
>>> B = A.filter(lambda x: x < t)
>>> print B.count()
49    # output
>>> t = 10
>>> C = B.filter(lambda x: x > t)
>>> print C.count()
0    # output
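The second count is 0 because RDD transformations are lazy: nothing is computed until an action runs, and the lambdas capture the driver variable t, not its value at definition time. By the time C.count() triggers evaluation, t is 10, so B's filter keeps only x < 10 while C's filter demands x > 10, and no element satisfies both.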
Question 4 (2 pts)
The intent of the code above is to get all numbers below 50 from A and put them into B, and then get all numbers above 10 from B and put them into C. Fix the code so that it produces the desired behavior by adding one line of code. You are not allowed to change the existing code.
Answer:
Using Python version 2.7.12 (default, Dec 4 2017, 14:50:18)
SparkSession available as spark.
>>> A = sc.parallelize(xrange(1, 100))
>>> t = 50
>>> B = A.filter(lambda x: x < t)
>>> B.cache()
PythonRDD[1] at RDD at PythonRDD.scala:48
>>> print B.count()
49
>>> t = 10
>>> C = B.filter(lambda x: x > t)
>>> print C.count()
39
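B.cache() fixes the problem because the first action, B.count(), now materializes B with t = 50 and keeps its 49 elements in memory; the later change to t never reaches B's filter again, so C correctly counts the 39 elements 11..49. Any persist() storage level would pin B's contents the same way.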
Question 5 (3 pts)
Modify the PMI example by sending a_dict and n_dict inside the closure. Do not use broadcast variables.
Answer:
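The PMI example itself is not reproduced in this document, so the following is only a minimal sketch of the required pattern; the sample data, N, and the pairs RDD are hypothetical stand-ins. The key point is that a_dict and n_dict are referenced as ordinary variables inside the mapped function, so Spark serializes them together with the function and ships a copy to every task, with no sc.broadcast() call:
import math

# Hypothetical stand-ins for the example's data: a_dict maps each word
# to its count, n_dict maps each word pair to its co-occurrence count,
# and N is the total number of tokens.
a_dict = {'spark': 10, 'fast': 4, 'slow': 2}
n_dict = {('spark', 'fast'): 3, ('spark', 'slow'): 1}
N = 100.0

def pmi(pair):
    # a_dict and n_dict are plain variables referenced here, so they are
    # pickled along with this function and sent to the executors.
    x, y = pair
    return (pair, math.log((n_dict[pair] / N) /
                           ((a_dict[x] / N) * (a_dict[y] / N))))

pairs = sc.parallelize(n_dict.keys())
print pairs.map(pmi).collect()
Note the trade-off: variables captured this way are re-serialized and shipped with every task, whereas a broadcast variable is sent to each worker only once.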