Introduction to Data Science and Systems
Lab 3: Introduction to Probability
You should submit this notebook on Moodle along with one PDF file (see the end of the notebook and Moodle for instructions).
University of Glasgow, JHW (amended by BSJ and NP) 2022
\newcommand{\vec}[1]{ {\bf #1}}
\newcommand{\real}{\mathbb{R}}
\DeclareMathOperator*{\argmin}{arg\,min}
\newcommand{\expect}[1]{\mathbb{E}[#1]}
Purpose of this lab
This lab should help you:
understand basic probability including probability mass functions and probability density functions
understand how to manipulate marginal, joint and conditional distributions
understand Bayes' rule
understand the multivariate normal distribution
understand how optimisation can be used to estimate parameters in a simple statistical model using maximum likelihood estimation
Lab 3 is structured as follows (with two main task sections):
Part 1: Probability with discrete random variables
Part 2: Probability with continuous random variables
We recommend you read through the lab carefully and work through the tasks.
Material and resources
It is recommended to keep the lecture notes (from lectures 3 and 4 in particular) open while doing this lab exercise, and you should, of course, be prepared to access some of the recommended material.
If you are stuck, the following resources are very helpful:
NumPy cheatsheet
NumPy API reference
NumPy user guide
Marking and Feedback
Note: This lab is marked out of 80 but accounts for the same overall percentage as the other labs.
This assessed lab is marked using three different techniques:
Autograded with feedback; you'll get immediate feedback.
Autograded without (immediate) feedback (there will always be a small demo/test so you can be confident that the format of your answer is correct).
Note: auto-graded results are always provisional and subject to change in case there are significant issues (this will usually be in favour of the student).
Help & Assistance
This lab is graded. The lab assistants/lecturer can provide guidance, but we cannot (and will not) give you the final answer or confirm your result.
Plagiarism
All submissions will be automatically compared against each other to make sure your submission represents an independent piece of work! We have provided a few checks to make sure that is indeed the case.
Before you begin
Please update the tools we use for the automated grading by running the command below (uncomment it), then restart your kernel (and comment it out again), or simply perform the installation externally in an Anaconda/Python prompt.
!pip install -U --force-reinstall --no-cache https://github.com/johnhw/jhwutils/zipball/master
# the following will allow you to download the data files you need (that are otherwise in the zip file)
# you can comment this line after the first run.
#!pip install wget
#!wget https://github.com/pugeault/IDSS2022-23/raw/main/lab3-files.zip
#!unzip lab3-files.zip
Import the basics
# Standard imports
# Make sure you run this cell!
from __future__ import print_function, division
import os
import numpy as np  # NumPy
import scipy.stats
import sys
import binascii
from unittest.mock import patch
from uuid import getnode as get_mac
from jhwutils.checkarr import array_hash, check_hash, check_scalar, check_string
import jhwutils.image_audio as ia
import jhwutils.tick as tick
tick.reset_marks()
# special hash function
def case_crc(s, verbose=True):
    h_crc = binascii.crc32(bytes(s.lower(), "ascii"))
    if verbose:
        print(h_crc)
    return h_crc
# this command generates a unique key for your system/computer/account
uuid_simple = (("%s") % get_mac())
uuid_str = ("%s\n%s\n%s\n%s\n%s\n") % (os.path,sys.path,sys.version,sys.version_info,get_mac())
uuid_system = case_crc(uuid_str,verbose=False)
# Set up Matplotlib
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
plt.rc("figure", figsize=(8.0, 4.0), dpi=140)
np.random.seed(2019)
# You can ignore this
import lzma, base64
exec(lzma.decompress(base64.b64decode(b/Td6WFoAAATm1rRGAgAhARYAAAB0L+Wj4AuQBIRdADSbSme4Ujxz95HdWLf4m39SX9b5yuqRd8UVk3DwULgMdEb8P1bLvis2Swz3hlDU+FwGvQQSXUZEv0GMy+ErARv0E4TRCmTvFyQ5JSb/G7uf1mJI1cRyyqS/p8OjxfizueERZipJkqifEM7mgPLl2o+B4OX/p+0Vu3LfTMRZvY//6V0JXZwRxDGVuddVdlsZOuNDNEzsXiiyo2fiXL31w0sXabfigUkJ4q1uo4W7C0THX7Lhb0zVk9A0/+f114CeChR9Gz25xOstbGjOodl1SDpP5rKIxkxZTzcjw65yJRKqid46CTa5ffWK7y9QtygL7IqcGB9ode3TKcTh57Edd9+wylW9BiSE8/qh/93qFavlXK1sjLdoTWEfPZ96HOR6La9g8KEEFNNMAc+7HtTv1JwoX8w+zLayLzdpIqA+HLAVFiVeFg4jPu8imDtmoOhe66WDgPWsXetR6FFCK7mw4q4Q+A4TCz6ugUoh/gIEGjIadianBQVSST062b8vSLzWmFYLKJoRPDHPyvFWKX5u27LZ9Bpswls2feqW6SBEvavwjckRwW5r4Fc6F+MEYgZmAUy8sJRXe7JHvp6LZ3o5RM92eQRoGsDeL82U2LC6sXfxy3MBj6Gd0wwWC9iJdyvs+laSdI41jk2lZUcDpVCBoSV/Zr+0rH1PsT33u2NlfDsaXrG67zKhbB+SSGz3OoN6Kq/1GwWf+GvNH3cySyrJOgN2edwh/fn87XMHk5QZCg0BtZRObATtZAoloB5jJGvjwqtxHCItkTdGoUi4TY75N3FMTPowFYUXn2tjAtngJibqtGbZ/+PaS7E134Lsvxy5o2uaBgoV+U9Mg1poz1QAl0YTKDNMZjVILDbKIRq9e2C4X6e3SWQRW4LrBujBJp7Q8AJjKIspFOLt7PzxOwkSHES90iNgMW4Sn+uTKwQEcTrtTZCDm5Bynn5taepEXp2hj2cmuZEGJCXX8HOM9RgnWyeOVDcUPRCGAGrjA3y7VZGuEjdPE4DT32dmqJabHrPtrc0tgde5UfefS8ezzGOEheOmYQEtpIZLY2TwuNbhOIvIxNfnDA7H7ug1LtCSTejYkGU9CztGzKkyoWEMSTGQSd7aEddrdDS8gsOF6r+RmhCutjGejXFHFtVEcL8FJxczfLdbWDNdBl69IrZ8vlV6Ts4FojBO0/w6HAv24jyX1r+4n3ymPeJZb2SR7HQ/4L2In4ywuUdCkI2t2UuB0fHYgA+ibCVPoXg5Da698PlcozIlD/cmP+3OAnEU+yPElHmLrfjGLFwmWN28ikbluPx0be9B7sn4qTJUY0zrOBuv+wS47A7j5XXicpakCHJcqDaEuzWCa6e1JRmIDoitnr+2kNbGDYNPgKKJE8XDvWVZTgnG1NCGhTZJlTL37hZZIuwkA5RbpnOlrEldKjGnol9D209OuritES1GvlL2H7lDtRTiMnHPHcHMnVqPg5usk3F2Zw23PtC1YDaHvqxgyqaXlRslElFtLz2k9GV2QC3bUxVVlf6jQgPkDoQhKu63JjQtoPRrn0AAR37PsnsFZ74AAaAJkRcAAHaUzMuxxGf7AgAAAAAEWVo=)).decode(ascii))
print("Everything imported OK (ignore deprecated warnings)")
# Hidden cell for utils needed when grading (you can/should not edit this)
Mini-task: provide your personal details in two variables:
student_id : a string containing your student id (e.g. 1234567x), must be 8 chars long.
student_typewritten_signature: a string with your name (e.g. ) which serves as a declaration that this is your own work (read the declaration of originality when you submit on Moodle).
student_id = ""  # your 8 char student id
student_typewritten_signature = ""  # your full name, avoid special chars if possible
# YOUR CODE HERE
raise NotImplementedError()
Part 1. Hunt the submarine: gridworld version
The USS Scorpion has been lost at sea. Your job is to model where it might be probabilistically.
In part 1 of this lab, we will assume that our world is divided into a 2D grid of squares, each of which might contain an errant submarine.
Part 1.1 Random variables, outcomes and events
We assume that the submarine's location is given by a random variable $X$ whose sample space is the space of 2D grid points $\vec{x}=[x,y]$ where $x$ and $y$ are integers in the range 0-15 (inclusive). (Be careful: the x,y order is the opposite of the row, column format used in images).
You are given a probability mass function as a 16×16 matrix submarine_pmf giving the probability of each grid square, i.e. $f_X(\vec{x}) = P(X=\vec{x})$.
show_pmf(submarine_pmf) # show the PMF on top of a map
Outcomes and events
Compute the following values, all of which are scalars. The notation [x1, y1] -> [x2, y2] indicates a box that should include x1,y1,x2,y2 (be careful!).
in_1_4 probability submarine is in square [1,4]
out_3_3 probability submarine is not in square [3,3]
west_6 submarine is in square [x<=6, y]
y_div_3 submarine has a y coordinate exactly divisible by 3
below_above submarine is in square [x, y<10] or [x, y>=13]
box_1 submarine is in the box [1, 4] -> [3, 9] (inclusive!)
box_2 submarine is in the box [0, 0] -> [15, 15] (inclusive!)
evens submarine has x>=8 and an even y coordinate
odds_not_box the odds that the submarine is not in the box [0, 5] -> [5,10]
odds_even the odds that the submarine has an x coordinate which is even
logit_3_7 the log-odds (logits) that the submarine is in square [3, 7]
dlogit_box the change in log-odds from the hypothesis that the submarine is in the box [0,5] -> [5, 10] to being in the box [1,1] -> [5,5]
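The kinds of quantity listed above can be illustrated on a small made-up grid. The sketch below is not the lab solution: toy_pmf, odds and logit are hypothetical names, the matrix is assumed to be indexed as pmf[y, x] (row = y, column = x), and the natural logarithm is assumed for log-odds, none of which the lab text confirms.

```python
import numpy as np

# Hypothetical 4x4 PMF (not the lab's submarine_pmf); assumed indexing is pmf[y, x].
rng = np.random.default_rng(0)
toy_pmf = rng.random((4, 4))
toy_pmf /= toy_pmf.sum()  # normalise so the entries sum to 1

def odds(p):
    """Odds of an event with probability p."""
    return p / (1.0 - p)

def logit(p):
    """Log-odds (natural log, an assumption) of an event with probability p."""
    return np.log(odds(p))

# P(in square [x=1, y=2]) is a single entry
p_square = toy_pmf[2, 1]
# P(not in that square) follows from the complement rule
p_not_square = 1.0 - p_square
# P(x <= 1): every row (y), first two columns (x)
p_west = toy_pmf[:, :2].sum()
# P(box [0, 1] -> [2, 3], inclusive): slice upper bounds need +1
p_box = toy_pmf[1:4, 0:3].sum()
# Change in log-odds going from the "west" event to the "box" event
delta_logit = logit(p_box) - logit(p_west)
```

Note that Python slices exclude their upper bound, so an inclusive box [x1, y1] -> [x2, y2] corresponds to the slice [y1:y2+1, x1:x2+1] under the assumed indexing.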
# YOUR CODE HERE
raise NotImplementedError()
with tick.marks(1):
    assert check_scalar(in_1_4, "0xe7d4ba8d")
with tick.marks(1):
    assert check_scalar(out_3_3, "0x99c03236")
with tick.marks(1):
    assert check_scalar(west_6, "0x59f3b9f7")
with tick.marks(1):
    assert check_scalar(y_div_3, "0xdaa43a7a")
with tick.marks(1):
    assert check_scalar(below_above, "0x8ea462bf")
with tick.marks(1):
    assert check_scalar(box_1, "0x7db96774")
with tick.marks(1):
    assert check_scalar(box_2, "0xb44c37ea")
with tick.marks(1):
    assert check_scalar(evens, "0xe44a25c7")
with tick.marks(1):
    assert check_scalar(odds_not_box, "0x88193758")
# Hidden test checking odds_even [1 mark]
with tick.marks(1):
    assert check_scalar(logit_3_7, "0x3f1ff384")
# Hidden test checking dlogit_box [2 marks]
Expected value
You need to plan for the search operation. There is a fixed search station at square x=6, y=6. Assume the Euclidean (L2) norm in measuring distances. The time in hours to search a grid square from the station is given by time = distance**2 + 4*distance + 10. Compute:
expected_location expected value of the submarine location.
expected_distance expected value of the distance of the submarine to this fixed station.
total_search_time the total search time required to search every square on the entire grid. Assume each search of a grid square has to restart from the station.
expected_search_time expected search time of a random search starting at the station.
REMEMBER, in general $$E[f(X)] \neq f(E[X]).$$
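To make the inequality concrete, here is a minimal sketch on a made-up 3×3 PMF. The names toy_pmf and station are hypothetical, and the matrix is assumed to be indexed as pmf[y, x]; it only illustrates how an expectation of a function is weighted by the PMF.

```python
import numpy as np

# Hypothetical 3x3 PMF, assumed to be indexed as pmf[y, x].
toy_pmf = np.array([[0.1, 0.0, 0.2],
                    [0.0, 0.3, 0.0],
                    [0.2, 0.0, 0.2]])
station = np.array([1.0, 1.0])  # made-up station at x=1, y=1

# Coordinate grids with the same shape as the PMF
ys, xs = np.indices(toy_pmf.shape)

# E[X]: probability-weighted average of the coordinates
expected_xy = np.array([np.sum(toy_pmf * xs), np.sum(toy_pmf * ys)])

# Euclidean distance of every square from the station
dist = np.sqrt((xs - station[0]) ** 2 + (ys - station[1]) ** 2)

# E[f(X)] weights f at each square by that square's probability ...
expected_dist = np.sum(toy_pmf * dist)
# ... which differs from f(E[X]), the distance of the mean location
dist_of_expected = np.linalg.norm(expected_xy - station)
```

Here the mean location sits almost on top of the station, so f(E[X]) is small, while most of the probability mass lies in the corners, so E[f(X)] is much larger.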
show_pmf(submarine_pmf)
plt.plot(6,6, "C2o", label="Search station", markersize=10)
plt.legend(loc="lower left");
# YOUR CODE HERE
raise NotImplementedError()
with tick.marks(2):
    assert check_scalar(expected_location[0], "0x4f9f5e7c")
    assert check_scalar(expected_location[1], "0x8805a495")
with tick.marks(2):
    assert check_scalar(expected_distance, "0x33d35673")
with tick.marks(2):
    assert check_scalar(total_search_time, "0xb1ca4255")
# Hidden test checking expected_search_time [3 marks]
Part 1.3 Conditional probability
The search strategy could be improved if one of the $x$ or $y$ coordinates were known for sure (e.g. from another source, like satellite imaging). To identify how much this would help, we can use conditional probability distributions. Compute the following:
p_x_y the conditional PMF of finding the submarine on an x coordinate given a y coordinate $P(X_x=x|X_y=y)$, as a 16×16 matrix
p_y_x the conditional PMF of finding the submarine on a y coordinate given an x coordinate $P(X_y=y|X_x=x)$, as a 16×16 matrix
p_y_6 the conditional PMF of the submarine being in squares with y=6, as a single 16 element vector.
p_even_x the PMF of the submarine being on a given y coordinate if we know the submarine is in a grid square with even x coordinate, as a single 16 element vector. $P(X_y=y|X_x = x, x \text{ even})$
p_even_y_odd_x the PMF of the submarine being on an even y coordinate if we know the submarine is in a grid square with odd x coordinate, as a single 8 element vector. $P(X_y=y|X_x = x, x \text{ odd}, y \text{ even})$
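The general recipe behind these quantities is to divide the joint PMF by the relevant marginal, or to select the squares compatible with an event and renormalise. The following is a small sketch on a made-up 2×3 joint PMF; the names are hypothetical and the indexing joint[y, x] is an assumption.

```python
import numpy as np

# Hypothetical joint PMF, assumed to be indexed as joint[y, x].
joint = np.array([[0.10, 0.20, 0.10],
                  [0.05, 0.25, 0.30]])

# Marginals: sum out the other variable
p_x = joint.sum(axis=0)  # P(x), shape (3,)
p_y = joint.sum(axis=1)  # P(y), shape (2,)

# P(x | y) = P(x, y) / P(y): each row (fixed y) sums to 1
p_x_given_y = joint / p_y[:, None]
# P(y | x) = P(x, y) / P(x): each column (fixed x) sums to 1
p_y_given_x = joint / p_x[None, :]

# Conditioning on an event (x in {0, 2}): keep those columns,
# sum over them, then renormalise by the event's probability
cols = joint[:, [0, 2]]
p_y_given_event = cols.sum(axis=1) / cols.sum()
```

If a marginal contains zeros, the division produces NaN or infinity for that row or column, so a real PMF may need those cases handled explicitly.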
# YOUR CODE HERE
raise NotImplementedError()
with tick.marks(2):
    show_pmf(p_x_y, "P(x|y)")
    assert check_hash(p_x_y, ((16, 16), 2103.534977619737))
# YOUR CODE HERE
raise NotImplementedError()
with tick.marks(2):
    show_pmf(p_y_x, "P(y|x)")
    assert check_hash(p_y_x, ((16, 16), 2031.7414411989753))
# YOUR CODE HERE
raise NotImplementedError()
with tick.marks(2):
    show_pmf(p_y_6[:,None], "p(x|y=6)")
    assert check_hash(p_y_6, ((16,), 8.36846919881838))
# YOUR CODE HERE
raise NotImplementedError()
with tick.marks(2):
    show_pmf(p_even_x[None,:], "p(y|x even)")
    assert check_hash(p_even_x, ((16,), 7.1624313253524745))
# YOUR CODE HERE
raise NotImplementedError()
# Sanity check; just to check size
with tick.marks(0):
    assert check_hash(0.0*p_even_y_odd_x, ((8,), 0.0 ))
    tmp_pmf = np.stack([p_even_y_odd_x, 0*p_even_y_odd_x]).T.reshape(16,)
    show_pmf(tmp_pmf[None,:], "p(y even|x odd)")
# Hidden test checking p_even_y_odd_x [3 marks]
The search could be guided more intelligently if we could work out how informative getting one of the coordinates could be. For example, which x coordinate, if known, would give you the most information about the y coordinate? Compute the entropy of each conditional distribution $H(Y|X=n)$ and store it in entropy_x. Use base 2 for the entropy (i.e. bits).
Store the most useful x coordinate to know in most_informative_x.
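As a reminder of the definition, the entropy of a PMF $p$ in bits is $H = -\sum_i p_i \log_2 p_i$, with terms where $p_i = 0$ contributing nothing. A minimal sketch on made-up conditional distributions follows; entropy_bits and the array are hypothetical, and reading "most informative" as "lowest remaining entropy" is one interpretation rather than something the lab text states.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a 1D PMF, treating 0*log2(0) as 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]  # drop zero entries so log2 is never evaluated at 0
    return -np.sum(nz * np.log2(nz))

# Hypothetical conditional distributions P(y | x), one column per x value
p_y_given_x = np.array([[0.5, 1.0, 0.25],
                        [0.5, 0.0, 0.25],
                        [0.0, 0.0, 0.50]])

# Entropy of each column, i.e. H(Y | X = n) for n = 0, 1, 2
h = np.array([entropy_bits(p_y_given_x[:, n]) for n in range(p_y_given_x.shape[1])])

# The column with the lowest entropy leaves the least uncertainty about y
lowest = int(np.argmin(h))
```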
# YOUR CODE HERE
raise NotImplementedError()
with tick.marks(5):
    show_pmf(entropy_x[:, None])
    assert check_hash(entropy_x, ((16,), 509.16360658756037))
# Sanity check; just to check size
with tick.marks(0):
    assert check_hash(0.0*most_informative_x, ((), 0.0 ))
# Hidden test checking most_informative_x [5 marks]
Part 1.4 Validating pmfs
submarine_pmf is one possible model of the submarine location. A number of others have been proposed by the scientific research team for this 16×16 gridworld model of searching, but some of them are implemented incorrectly and are not valid PMFs. These probability mass functions are stored in a list of 16×16 matrices, proposed_pmfs. Identify which of these PMFs are valid and store the valid ones in a list valid_pmfs.
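A PMF must assign a non-negative probability to every outcome, and the probabilities must sum to 1. One way to check this numerically is sketched below; is_valid_pmf and the tolerance tol are hypothetical choices, and the example arrays are made up.

```python
import numpy as np

def is_valid_pmf(pmf, tol=1e-8):
    """True if all entries are finite, non-negative and sum to 1 (within tol)."""
    pmf = np.asarray(pmf, dtype=float)
    finite = np.all(np.isfinite(pmf))
    non_negative = np.all(pmf >= 0)
    normalised = abs(pmf.sum() - 1.0) <= tol
    return bool(finite and non_negative and normalised)

good = np.full((2, 2), 0.25)                      # valid: sums to 1
negative = np.array([[0.5, 0.75], [-0.25, 0.0]])  # sums to 1 but has a negative entry
unnormalised = np.full((2, 2), 0.3)               # non-negative but sums to 1.2

results = [is_valid_pmf(m) for m in (good, negative, unnormalised)]
```

The sum is compared with a tolerance rather than with exact equality because floating-point rounding means a correctly normalised array rarely sums to exactly 1.0.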
# show each of the PMFs
for i,pmf in enumerate(proposed_pmfs):
    show_pmf(pmf, title=f"PMF {i}")
# YOUR CODE HERE
raise NotImplementedError()
print(valid_pmfs)
with tick.marks(3):
    check_hash(valid_pmfs, ((2, 16, 16), 511.54549199055594 ))
We can use sampling to simulate where the submarine might lie on the sea floor. Write a function sample_submarine(n) which will randomly sample submarine locations according to submarine_pmf, returning an n×2 array of sampled locations. Hint: there are 256 possible grid squares.
Note: your code may pass these tests but be incorrect; this will become apparent when you attempt the "Reconstruct the PMF" part below.
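One common way to sample from a 2D PMF is to flatten it, draw flat indices with the given probabilities, and convert them back to grid coordinates. The sketch below shows this on a made-up 2×2 PMF; sample_grid is a hypothetical name, the indexing pmf[y, x] is an assumption, and the empirical frequency table at the end is only a sanity check on the samples.

```python
import numpy as np

def sample_grid(pmf, n, rng=None):
    """Draw n (x, y) locations from a 2D PMF assumed to be indexed as pmf[y, x]."""
    rng = np.random.default_rng() if rng is None else rng
    flat = pmf.ravel()                           # one probability per grid square
    idx = rng.choice(flat.size, size=n, p=flat)  # sample flat indices
    ys, xs = np.unravel_index(idx, pmf.shape)    # back to (row, column)
    return np.stack([xs, ys], axis=1)            # n x 2 array in (x, y) order

# Hypothetical PMF: only squares (x=1, y=0) and (x=0, y=1) are possible
toy_pmf = np.array([[0.0, 0.5],
                    [0.5, 0.0]])
samples = sample_grid(toy_pmf, 1000, rng=np.random.default_rng(1))

# Sanity check: count how often each square was drawn and normalise
counts = np.zeros_like(toy_pmf)
np.add.at(counts, (samples[:, 1], samples[:, 0]), 1)
empirical = counts / len(samples)
```

If the x and y columns were swapped, the shape checks would still pass but the empirical frequencies would be the transpose of the PMF, which is the kind of error a reconstruction step exposes.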
def sample_submarine(n, pmf=submarine_pmf):
    # YOUR CODE HERE
    raise NotImplementedError()
# example of how to call your sample_submarine function
sample_submarine(100)
with tick.marks(3):
    for n in [2, 4, 10, 1000, 5000]:
        samples = sample_submarine(n)
        assert samples.shape == (n,2)
        assert np.sum(samples - np.floor(samples)) == 0.0
        mean = np.mean(samples, axis=0)
        sem = np.std(samples, axis=0)/np.sqrt(n)
        assert np.all((mean-sem < [7.93, 5.67]) | ([7.93, 5.67] < mean+sem))
Part 1.6 Reconstruct the PMF
From our samples (more realistically, samples would be coming from another source), we can reconstruct an approximation of the PMF. Write a function reconstruct_pmf(samples) that will reconstruct the PMF using the empirical distribution from a set of samples and return the reconstructed PMF as a 16×16 matrix.
def reconstruct_pmf(samples):
    # YOUR CODE HERE
    raise NotImplementedError()
def kl(a, b):
    eps = 1e-9 # small constant used to avoid division by zero
    return np.sum(np.where((b<1e-10)|(a<1e-10), 0.0, a * np.log((a+eps)/(b+eps))))
with tick.marks(2):
    kls = [4.0, 3.5, 2.5, 0.1, 0.02]
    for i,n in enumerate([1, 4, 10, 1000, 5000]):
        approx_pmf = reconstruct_pmf(sample_submarine(n))
        assert approx_pmf.shape == submarine_pmf.shape
        assert check_scalar(np.sum(approx_pmf), "0xb44c37ea")
        assert np.max(approx_pmf)<=1.0
        assert np.min(approx_pmf)>=0.0
        kl_div = abs(kls[i] - kl(approx_pmf, submarine_pmf))/kls[i]
        assert kl_div < 1.0
        sample = reconstruct_pmf(sample_submarine(n))
        show_pmf(sample, f"{n} samples")
Part 1.7 Log likelihood
Various theoretical models of oceanographic features have been used to simulate what happened to the submarine (e.g. zeeman_shift). Each of these models produces a collection of simulated submarine locations as a result.
submarine_samples provides a dictionary with samples from various named submarine position generating functions. Compute the log-likelihood of each sequence $\log \mathcal{L}(x_1, x_2,\dots)$ under the model defined by the PMF submarine_pmf.
sequence_lliks a dictionary that maps the names to the log-likelihood.
most_likely_sequence the name of the sequence that is most likely to be compatible with this model.
print(submarine_samples.keys())
def llik(seq):
    # YOUR CODE HERE
    raise NotImplementedError()
hashes = {"crater": "0x7315af59","zeeman_shift": "0x73114def","oblov_1": "0xeb8d319c","minority": "0xf8150629","inviscid": "0xfab3048e",}
with tick.marks(3):
    for key in submarine_samples.keys():
        print(key, sequence_lliks[key])
        assert check_scalar(sequence_lliks[key], hashes[key])
# Hidden test checking the string most_likely_sequence [1 mark]
Part 1.8 Bayes' rule
The function search_submarine(x,y) returns the likelihood for the submarine's location, given an x and y coordinate to search at, based on sonar returns from a search ship. It returns this as a 16×16 matrix.
Using submarine_pmf as a prior, compute the posterior PMF using the result of search_submarine(10, 7) as likelihood. Store the result in submarine_posterior.
# YOUR CODE HERE
raise NotImplementedError()
show_pmf(submarine_pmf, 'Prior')
show_pmf(search_submarine(10,7), 'Evidence')
show_pmf(submarine_posterior, 'Posterior')
with tick.marks(5):
    assert check_hash(submarine_posterior, ((16, 16), 129.2249265521699))
Part 1.9 Sequential searching
Compute the posterior distribution after searching each of the squares y=8, x=0..15 (i.e. testing each of these 16 squares in turn, and combining all of the evidence into a posterior), storing the posteriors as a list, starting with the prior and ending with the posterior after observing y=8, x=15. You should have a 17 element list of 16×16 matrices. Store this as search_strip_posterior.
# YOUR CODE HERE
raise NotImplementedError()
# Sanity check that the size is ok [0 marks]
with tick.marks(0):
    assert(len(search_strip_posterior)==17)
    assert(search_strip_posterior[0].shape==(16, 16)) # only check one of the 17 matrices
# Show the 17 individual posteriors
for i, p in enumerate(search_strip_posterior):
    show_pmf(p, f'Posterior after searching y=8, x={i}')
# Hidden test checking search_strip_posterior [6 marks]
Part 2. Hunt the submarine: continuous version
The simple discrete gri