2 Splitting Datasets into Training and Testing [10 points]
A common task in ML is to divide a dataset of independent instances into a training set and a test set. These two sets can then be used to measure how well an ML method generalizes to data it hasn't seen before: we fit the method to the training set, then evaluate the trained method on the test set.
In this problem, you'll demonstrate basic understanding of NumPy array indexing and random number generation by writing a procedure to divide an input dataset of L instances into a training set (of size M) and a test set (of size N). Each row of the original dataset should be assigned to exactly one of train or test.
How do we set the values of M and N? Your function will take a keyword argument frac_test that specifies the number of test examples (N) as a fraction of the overall dataset size. To compute N, we always want to round up to the nearest whole number: N = ceil(frac_test * L). Because every row goes to exactly one of the two sets, the training set receives the remaining rows: M = L - N.
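As a rough illustration of the size rule above, the following sketch computes M and N from L and the test fraction. The helper name compute_split_sizes and its argument names are hypothetical and are not taken from the starter code; the keyword is assumed to be spelled frac_test.

```python
import numpy as np


def compute_split_sizes(n_total, frac_test):
    """Return (M, N) for a dataset with n_total rows.

    Hypothetical helper for illustration only; the starter code
    may organize this computation differently.
    """
    # N = ceil(frac_test * L): round up to the nearest whole number.
    n_test = int(np.ceil(frac_test * n_total))
    # Every row goes to exactly one set, so M = L - N.
    n_train = n_total - n_test
    return n_train, n_test


print(compute_split_sizes(10, 0.25))  # (7, 3), since ceil(2.5) = 3
```

Note that np.ceil returns a float, so the result is cast to int before it is used as a slice boundary or array size.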
We want the test set to be a subset chosen uniformly at random, so that every row is equally likely to land in it. You should look at the NumPy API for Random Sampling. Functions like shuffle or permutation might be helpful.
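One possible way to draw such a subset is to permute the row indices and slice the result, as in the sketch below. The toy array x_LF, the fixed n_test, and the variable names are assumptions made for illustration; this version is unseeded, so its assignments change from run to run.

```python
import numpy as np

# Toy dataset (assumed for illustration): L = 6 rows, 2 features each.
x_LF = np.arange(12).reshape(6, 2)
n_test = 2

# A random ordering of the row indices 0..L-1; each ordering is equally likely.
shuffled_ids = np.random.permutation(x_LF.shape[0])

# The first N shuffled indices form the test set, the rest form the training set.
test_ids = shuffled_ids[:n_test]
train_ids = shuffled_ids[n_test:]

# Integer-array indexing selects whole rows and returns copies,
# so the original dataset is left unmodified.
x_test = x_LF[test_ids]
x_train = x_LF[train_ids]
```

Permuting indices rather than calling shuffle on the data itself avoids reordering the caller's array in place.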
We also want the test set to be reproducible by specifying a particular random seed. That is, if I run the code now to extract a train/test split and need to rerun it later, I'd like to be able to recover the exact same train/test assignments. With NumPy, the common way to do this is to accept a random_state keyword argument that can take either an integer seed (0, 42, 1337, etc.) or an instance of the RandomState class. Specifying the same random_state should deliver the same pseudo-randomness across multiple calls to a function.
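A minimal sketch of how a function might accept either form of random_state is shown below. The helper name as_random_state is hypothetical, and treating None as "seed unpredictably" is an assumption; the starter code's specification may define the behavior differently.

```python
import numpy as np


def as_random_state(random_state=None):
    """Convert an int seed, None, or RandomState into a RandomState.

    Hypothetical helper for illustration only.
    """
    if isinstance(random_state, np.random.RandomState):
        # Already a generator object: use it as given.
        return random_state
    # An int gives a fixed seed; None lets NumPy pick an unpredictable one.
    return np.random.RandomState(random_state)


# The same integer seed yields the same permutation on every call.
perm_a = as_random_state(42).permutation(5)
perm_b = as_random_state(42).permutation(5)
print(np.array_equal(perm_a, perm_b))  # True
```

One caveat: a RandomState instance advances its internal state each time it is used, so passing the same instance to two successive calls generally produces different draws, whereas passing the same integer seed reproduces the same draw.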
Use the starter code in spitDataset.py for the function definition and a detailed specification. You need to fill in the missing code. Remember to use NumPy functions exclusively: do not call any functions from sklearn or other libraries.