You may be asking yourself: what is the importance of learning about data science and machine learning in a cybersecurity class? The short answer is that data science is a useful set of tools for handling the massive amount of data that flows through IT systems, and it is used by many security teams, either explicitly or within the tools and programs they use, so it is important to have a basic understanding of how it works. This project walks through a simplified scenario where data science can be used. If it sparks your interest, there are plenty of other ML-focused classes at GaTech that you may be interested in taking, as well as a wealth of training material on YouTube, Coursera, Udacity, Udemy, DataCamp, etc. that you could use to go deeper into the field.

You are an analyst on the security team of a midsized software company that runs a messaging app (a Slack, Google Chat, or Microsoft Teams competitor). It is Monday morning, and you see an email from your manager setting up a meeting to discuss a new security feature that needs to be implemented in the product ASAP. You join the meeting and learn that there has recently been a big uptick in malicious executable files being sent over the chat app, and it is starting to generate bad press for the company. A few analysts on the team have already analyzed a set of files sent over the app and classified each as malicious or benign. They also used a Python library (pefile) to extract attributes of each executable file and created a CSV of those extracted attributes, plus a column named class, where 1 denotes a malicious file and 0 denotes a benign file. They documented their preprocessing work in a readme in the git repo (urwithajit9/ClaMP: A Malware classifier dataset built with header fields values of Portable Executable files (github.com)) and shared the repo with the software engineers so they can start writing code that will generate those features for every executable file sent over the messaging app. Your boss turns to you and says: "I would like you to help us understand a bit more about how big of a problem this is on our app, and to write a model that takes in these features and produces a propensity score from 0 to 1, where a score closer to 0 means a low likelihood of the file being malicious and a score closer to 1 means a higher likelihood. Also, since the team may want to reuse this type of work in the future for different types of files or different extracted attributes, please create functions that can be reused with minimal rework." Once you produce a model, you will share your code and the trained model file with the software engineers, who will integrate the model into the messaging app and score all files uploaded to the app.

We have a Canvas quiz that is meant to test that you have read the library documentation for the packages we use in this class. It is not meant to be tricky and can be completed before you start the project or after you finish it.

Useful Links:

Deliverables:

Let's first get familiar with some pandas basics. pandas is a library that handles data frames, which you can think of as a Python class that handles tabular data. In this section you will write a very simple function that takes in a pandas dataframe of the file attributes and classifications and returns some simple attributes: a count of rows, a count of columns, a count of rows where the classification is 1 (positive), a count of rows where the classification is 0 (negative), and the percentage of rows classified as 1 (percent positive) in the dataset's target column. See the function skeleton for the exact interface. Generally, in the real world you would also use plotting tools like Power BI, Tableau, Data Studio, Matplotlib, etc. to create graphics and other visuals to better understand the dataset you are working with; this step is generally known as Exploratory Data Analysis. Since we are using an autograder for this class, we will skip the plotting for this project. For this task we have released a test suite on Ed Discussion; if you are struggling to run things locally, please set it up and use it to debug your function.
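For illustration, here is a minimal sketch of such a summary function. The function name, the dict return type, and the assumption that the target column is named class are all illustrative; the released skeleton dictates the exact signature and return format.

```python
import pandas as pd

def dataset_summary(df: pd.DataFrame, target_col: str = "class") -> dict:
    """Return simple descriptive attributes of a labeled dataset.

    Assumes `target_col` holds 1 for malicious rows and 0 for benign rows.
    """
    n_rows, n_cols = df.shape
    n_positive = int((df[target_col] == 1).sum())
    n_negative = int((df[target_col] == 0).sum())
    # Guard against an empty dataframe before dividing.
    pct_positive = 100.0 * n_positive / n_rows if n_rows else 0.0
    return {
        "rows": n_rows,
        "columns": n_cols,
        "positive": n_positive,
        "negative": n_negative,
        "percent_positive": pct_positive,
    }

# Example usage (file name is hypothetical):
# df = pd.read_csv("ClaMP_Train.csv")
# print(dataset_summary(df))
```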
Useful Links:

Deliverables:

Now that you have a basic understanding of pandas and the dataset, it is time to dive into some more complex data processing tasks. The first subtask is splitting your dataset into features and targets (columns) and splitting your dataset into training and test sets (rows). These are basic concepts in model building, but at a high level it is important to hold out a subset of your data when you train a model, so you can see the expected performance on unseen samples and determine whether the resulting model is overfit (performs much better on training data than on test data). Preprocessing is important because most models only take in numerical values, so categorical features need to be encoded to numerical values before models can use them. Numerical scaling can be more or less useful depending on the type of model, but it is especially important for linear models. These preprocessing techniques will give you options to augment your dataset and improve model performance.

For the one hot encoding and scaling functions, you should return a dataframe with the encoded/scaled columns concatenated to the other columns you did not transform. For the PCA functions, you should return just the PCA dataframe. For the feature engineering function, you should return the feature-engineered column attached to the input dataframe. Example output for one hot encoding (where color and version are encoded) is illustrated in the one hot encoding sketch below.

Note: For these functions (and in data science/ML in general), you should train/fit on the training set only, and predict/transform on both the training and test sets.

Note: For one hot encoding, please use the format columnName_columnValue for the new encoded column names. For example, given the sklearn example of a column named `gender` with values `male` and `female`, your new column names after one hot encoding would be `gender_male` and `gender_female`. If the test set has a third gender, you would denote that value, which was unseen at training time, with 0s in each encoded column.

Note: When using sklearn.preprocessing.OneHotEncoder, check the output of the transformation; it may cause issues if it is not in a format that pandas dataframes expect. To see and debug this, try running your code locally (or in a Google Colab notebook) using the training CSV with a few categorical columns.

Note: For PCA, you should drop columns with NA values before calling the fit or transform methods. (In the real world you could also try to fill those missing values, but for this project that is out of scope.)

Note: If you see the error `Test Failed: Found unknown categories` for one hot encoding, check the initialization of the OneHotEncoder and make sure you are handling values in the test set that were not in the training set (we want those unseen values encoded as all 0s in each encoded column, as described in the second note above).

Note: For this task we have released a test suite on Ed Discussion; if you are struggling to run things locally, please set it up and use it to debug your functions.
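As a sketch of the splitting subtask (the function names, the 80/20 split, and the fixed random_state are illustrative assumptions; follow the released skeletons for the exact interface):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_features_and_target(df: pd.DataFrame, target_col: str = "class"):
    """Split columns into a feature matrix X and a target vector y."""
    X = df.drop(columns=[target_col])
    y = df[target_col]
    return X, y

def split_train_test(X, y, test_size: float = 0.2, random_state: int = 42):
    """Hold out a test set so we can estimate performance on unseen rows
    and check for overfitting (train score much higher than test score)."""
    return train_test_split(X, y, test_size=test_size, random_state=random_state)

# Example usage:
# X, y = split_features_and_target(df)
# X_train, X_test, y_train, y_test = split_train_test(X, y)
```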
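A sketch of the one hot encoding function described in the notes above, assuming two hypothetical categorical columns, color and version. handle_unknown="ignore" encodes test-set categories unseen at training time as all 0s, get_feature_names_out produces the columnName_columnValue naming, and dense output avoids the dataframe issues mentioned above. Note that the sparse_output parameter requires scikit-learn 1.2+; older versions spell it sparse.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

def one_hot_encode(train_df, test_df, cat_columns):
    """Fit a OneHotEncoder on the training set only, then transform both sets.

    Encoded columns are named columnName_columnValue and concatenated to the
    untransformed columns, per the project spec.
    """
    encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
    encoder.fit(train_df[cat_columns])

    def transform(df):
        encoded = pd.DataFrame(
            encoder.transform(df[cat_columns]),
            columns=encoder.get_feature_names_out(cat_columns),
            index=df.index,  # keep row alignment for the concat below
        )
        return pd.concat([df.drop(columns=cat_columns), encoded], axis=1)

    return transform(train_df), transform(test_df)

# Example: with cat_columns=["color", "version"], the returned frames keep all
# untransformed columns and add columns like color_red, color_blue, version_1,
# version_2, each holding 0 or 1 (all 0s for a category unseen at fit time).
```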
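And a sketch of a PCA function that drops NA columns before fitting, as the note above requires. Selecting the NA-free columns from the training set and the component_N column names are assumptions for illustration.

```python
import pandas as pd
from sklearn.decomposition import PCA

def pca_transform(train_df, test_df, n_components: int):
    """Fit PCA on the training features and transform both sets,
    returning just the PCA dataframes (per the project spec)."""
    # Per the PCA note: drop columns containing NA values before fitting.
    cols = train_df.dropna(axis=1).columns
    pca = PCA(n_components=n_components)
    pca.fit(train_df[cols])

    def transform(df):
        return pd.DataFrame(
            pca.transform(df[cols]),
            columns=[f"component_{i}" for i in range(1, n_components + 1)],
            index=df.index,
        )

    return transform(train_df), transform(test_df)
```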
Useful Links:

Deliverables:

So far we have functions to split and preprocess the data. Now we will run a basic model on the data to cluster files (rows) with similar attributes together. We will use an unsupervised model (one with no target column), KMeans, since it is simple to use and understand. Please use scikit-learn to create the model and Yellowbrick to determine the optimal value of k for our dataset.
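A minimal sketch of the elbow-method flow with Yellowbrick (the k range and random_state are illustrative choices, not part of the spec):

```python
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

def find_optimal_k(X, k_range=(2, 11)):
    """Use Yellowbrick's elbow method to pick a value of k for KMeans."""
    visualizer = KElbowVisualizer(KMeans(random_state=42), k=k_range)
    visualizer.fit(X)
    # visualizer.show() would render the elbow plot when running locally.
    return visualizer.elbow_value_

def fit_kmeans(X, k):
    """Fit KMeans with the chosen k and return the trained model."""
    model = KMeans(n_clusters=k, random_state=42)
    model.fit(X)
    return model
```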
Useful Links:

Deliverables:

Finally, we are ready to try a few different supervised classification models. We have chosen a few commonly used models for you to use here, but there are many options, and in the real world specific algorithms may fit a specific dataset better. You also won't be doing any hyperparameter tuning yet, to keep the focus on writing the code. You will train a model using the training set, predict on the training and test sets, calculate performance metrics, and return a ModelMetrics object and the trained scikit-learn model from each model function. (Note: you should use RFE to determine feature importance for the logistic regression model, but do NOT use RFE for the random forest or gradient boosting models; please use their built-in feature importance values instead.)
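A sketch of the feature importance split described in the note above: RFE for logistic regression, built-in importances for the tree ensembles (shown here for random forest; gradient boosting is analogous). The function names are illustrative, and the project's ModelMetrics object and exact signatures come from the released skeleton.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

def train_logistic_regression(X_train, y_train, n_features: int = 10):
    """Train logistic regression, using RFE (per the spec) to rank features."""
    model = LogisticRegression(max_iter=1000)
    rfe = RFE(model, n_features_to_select=n_features)
    rfe.fit(X_train, y_train)
    # Rank 1 marks the selected (most important) features.
    ranking = pd.Series(rfe.ranking_, index=X_train.columns).sort_values()
    return rfe, ranking

def train_random_forest(X_train, y_train):
    """Train a random forest, using its built-in importances (no RFE here)."""
    model = RandomForestClassifier(random_state=42)
    model.fit(X_train, y_train)
    importances = pd.Series(model.feature_importances_, index=X_train.columns)
    return model, importances.sort_values(ascending=False)
```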
Useful Links:

Deliverables:

Now that you have written functions for the different steps of the model building process, you will put it all together. You will write code that trains a model with hyperparameters you determine (do any tuning locally or in a notebook, i.e. do not tune your model in Gradescope, since the autograder will likely time out). It will take in the ClaMP training data, train a model, then predict on a test set and output a value from 0 to 1 for each row. Our autograder will compare your predictions with the correct answers, and to get credit you will need a ROC AUC score of .9 or higher on the test set (this should not require much hyperparameter tuning for this dataset). This is essentially a simulation of how your model would perform in the production system using batch inference.

Deliverables:
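A hedged end-to-end sketch of this final flow. The choice of gradient boosting and its default hyperparameters are assumptions for illustration; any model that clears the ROC AUC bar works, and tuning should happen locally as noted above.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

def train_and_score(train_df, test_df, target_col: str = "class"):
    """Train on the full training set, then emit a 0-1 propensity score for
    every test row, as the production batch-inference job would."""
    X_train = train_df.drop(columns=[target_col])
    y_train = train_df[target_col]
    model = GradientBoostingClassifier(random_state=42)  # tune offline, not in Gradescope
    model.fit(X_train, y_train)
    # predict_proba[:, 1] is the probability of class 1 (malicious),
    # i.e. the 0-to-1 propensity score the autograder expects.
    scores = model.predict_proba(test_df[X_train.columns])[:, 1]
    return model, pd.Series(scores, index=test_df.index, name="propensity")

# If test labels are available locally, check against the .9 bar:
# print(roc_auc_score(y_test, scores))
```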