Accurate knowledge of a patient's disease state is crucial to proper treatment, and we must understand a patient's phenotypes (based on their health records) to predict their disease state. There are several strategies for phenotyping, including supervised rule-based methods and unsupervised methods. In this homework, you will implement both types of phenotyping algorithms using Spark.

This homework is primarily about using Spark with Scala. We strongly recommend using our bootcamp virtual environment setup to prevent compatibility issues. However, since we use the Scala Build Tool (SBT), you should be fine running it on your local machine. Note that this homework requires Spark 1.3.1 and is not compatible with Spark 2.0 and later. Please see the build.sbt file for the full list of dependencies and versions.

Begin the homework by downloading hw3.tar.gz from Canvas, which includes the skeleton code and test cases. You should be able to immediately begin compiling and running the code with the following command (from the code/ folder):

    sbt/sbt compile run

And you can run the test cases with:

    sbt/sbt test

Phenotyping can be done using a rule-based method. The Phenotype Knowledge Base (PheKB) provides a set of rule-based methods (typically in the form of decision trees) for determining whether or not a patient fits a particular phenotype.

In this assignment, you will implement a phenotyping algorithm for type 2 diabetes based on the flowcharts below. The algorithm should classify each patient as a case, a control, or unknown. You will implement the Diabetes Mellitus Type 2 algorithm from PheKB. We have reduced the rules for simplicity; the simplified rules you should follow are shown in the figures below, but you can refer to the full description for more details if desired. These rules are based on the criteria from the PheKB phenotypes, which have been placed in the phenotyping resources/ folder.

Figure 1: Determination of cases

Figure 2: Determination of controls

You will load the input CSV files from the code/data/ folder, and you are responsible for transforming these .csv files into RDDs. Your first task is to load the files in the data folder as structured RDDs. [5 points] (A loading sketch appears at the end of this section.)

To help you verify your steps, expected counts along the different steps have been provided. Any patients not found to be in the case or control category should be placed in the unknown category; a minimal sketch of this three-way split also appears at the end of this section. Additional hints and notes are provided directly in the code comments, so please read these carefully.

At this point you have implemented a supervised, rule-based phenotyping algorithm. This type of method is great for picking out specific diseases, in our case diabetes, but it is not good for discovering new, complex phenotypes. Such phenotypes can be disease subtypes (e.g. severe hypertension, moderate hypertension, mild hypertension), or they can reflect combinations of diseases that patients may present with (e.g. a patient with both hypertension and renal failure). This is where unsupervised learning comes in.

You will need to start by constructing features out of the raw data to feed into the clustering algorithms. You will implement ETL using Spark, with functionality similar to what you did in the last homework using Pig. Since you know the diagnoses (in the form of ICD9 codes) each patient exhibits and the medications they took, you can aggregate this information to create features.
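As referenced above, here is a minimal sketch of loading one CSV into a structured RDD. The file name and the LabResult fields are hypothetical stand-ins; the actual schemas and loaders are defined in the skeleton code.

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    case class LabResult(patientID: String, testName: String, value: Double)

    def loadLabResults(sc: SparkContext): RDD[LabResult] = {
      val lines = sc.textFile("code/data/lab_results.csv") // hypothetical file name
      val header = lines.first()                           // assume one header row
      lines.filter(_ != header)
           .map(_.split(","))
           .filter(_.length >= 3)                          // skip malformed rows
           .map(f => LabResult(f(0).trim, f(1).trim, f(2).trim.toDouble))
    }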
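And here is a minimal sketch of the final case/control/unknown split. The isCase and isControl predicates stand in for the PheKB decision rules in Figures 1 and 2, and the 1/2/3 label encoding is an assumption for illustration, not necessarily the skeleton's actual contract.

    import org.apache.spark.rdd.RDD

    def phenotype(patientIds: RDD[String],
                  isCase: String => Boolean,
                  isControl: String => Boolean): RDD[(String, Int)] = {
      patientIds.map { pid =>
        if (isCase(pid)) (pid, 1)          // case
        else if (isControl(pid)) (pid, 2)  // control
        else (pid, 3)                      // everyone else is unknown
      }
    }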
Using the RDDs that you created in edu.gatech.cse8803.main.Main.loadRddRawData, you will construct features for the COUNT of medications, the COUNT of diagnoses, and the AVERAGE lab test value. You will create two types of features: one using all the available ICD9 codes, labs, and medications, and another using only features related to the phenotype. See the comments of the source code for details. (Illustrative sketches for this section's tasks, from feature construction through the NMF updates, appear at the end of the section.)

Purity is a metric for measuring the quality of clustering. It is defined as

    \text{purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} |\omega_k \cap c_j|

where N is the number of samples, k is the index of clusters, and j is the index of classes; \omega_k denotes the set of samples in the k-th cluster and c_j denotes the set of samples of class j.

Now you will perform clustering using Spark's MLlib, which contains an implementation of the k-means clustering algorithm as well as the Gaussian Mixture Model algorithm. From the clustering, we can discover groups of patients with similar characteristics. You will cluster the patients based upon diagnoses, labs, and medications. If there are d distinct diagnoses, l distinct labs, and m distinct medications, then there should be d + l + m distinct features. Implement the corresponding code in edu.gatech.cse8803.main.Main.scala (testClustering).

Table 1: Clustering with 3 centers using all features

Table 2: Clustering with 3 centers using filtered features

When data arrive in a stream, we may want to estimate clusters dynamically and update them as new data arrive. Spark's MLlib provides support for the streaming k-means clustering algorithm, which uses a generalization of the mini-batch k-means algorithm with forgetfulness.

We'll now summarize what we've observed in the preceding sections: repeat 2.4a and 2.4b using the GMM algorithm, and report the purity for different numbers of clusters in Table 3. NOTE: Please change k back to 3 in your final code deliverable!

Table 3: Purity values for different numbers of clusters

Given a feature matrix V, the objective of NMF is to minimize the Euclidean distance between the original non-negative matrix V and its non-negative decomposition W × H, which can be formulated as

    \underset{W,\, H}{\arg\min}\ \frac{1}{2} \lVert V - WH \rVert_F^2    (1)

where V \in \mathbb{R}_+^{n \times m}, W \in \mathbb{R}_+^{n \times r}, and H \in \mathbb{R}_+^{r \times m}. V can be considered as a dataset comprised of n m-dimensional data vectors, and r is generally smaller than n.

To obtain a W and H which minimize the Euclidean distance to the original non-negative matrix V, we use the Multiplicative Update (MU) method. It defines the update rules for W_{ij} and H_{ij} as

    W_{ij}^{t+1} = W_{ij}^{t} \frac{(V H^\top)_{ij}}{(W^{t} H H^\top)_{ij}}, \qquad
    H_{ij}^{t+1} = H_{ij}^{t} \frac{(W^\top V)_{ij}}{(W^\top W H^{t})_{ij}}    (2)

You will decompose your feature matrix V, from 2.1, into W and H. In this formulation, each row of V represents one patient's features, and the corresponding row of W is that patient's cluster assignment, similar to a Gaussian mixture. For example, let r = 3 to find three phenotypes (clusters); if row 1 of W is (0.23, 0.45, 0.12), you can say this patient should be grouped into the second phenotype, as 0.45 is the largest element.

W can be very large (e.g. a billion patients) and must be worked on in a distributed fashion, while H is relatively small and can fit into a single machine's memory. You will define these two types of matrices as a distributed RowMatrix and a local dense Matrix, respectively, in the skeleton code. [bonus]
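As referenced earlier, here is a rough sketch of the feature construction: each feature is keyed by (patientID, featureName), the counts come from reduceByKey, and the lab average is a running sum divided by a count. The case classes are hypothetical stand-ins for those defined in the skeleton.

    import org.apache.spark.rdd.RDD

    case class Diagnostic(patientID: String, icd9Code: String)
    case class Medication(patientID: String, medicine: String)
    case class LabResult(patientID: String, testName: String, value: Double)

    def constructFeatures(diag: RDD[Diagnostic],
                          med: RDD[Medication],
                          lab: RDD[LabResult]): RDD[((String, String), Double)] = {
      // COUNT of each diagnosis code per patient
      val diagFeatures = diag.map(d => ((d.patientID, d.icd9Code), 1.0))
                             .reduceByKey(_ + _)
      // COUNT of each medication per patient
      val medFeatures = med.map(m => ((m.patientID, m.medicine), 1.0))
                           .reduceByKey(_ + _)
      // AVERAGE value of each lab test per patient: sum / count
      val labFeatures = lab.map(l => ((l.patientID, l.testName), (l.value, 1)))
                           .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
                           .mapValues { case (total, cnt) => total / cnt }
      diagFeatures.union(medFeatures).union(labFeatures)
    }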
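A sketch of the purity formula above, assuming the cluster assignments and the true phenotype labels have already been joined into an RDD of (clusterID, classLabel) pairs, one pair per sample:

    import org.apache.spark.rdd.RDD

    def purity(assignments: RDD[(Int, Int)]): Double = {
      val n = assignments.count().toDouble
      val majoritySum = assignments
        .map(pair => (pair, 1))                         // ((cluster, class), 1)
        .reduceByKey(_ + _)                             // |w_k intersect c_j|
        .map { case ((cluster, _), cnt) => (cluster, cnt) }
        .reduceByKey((a, b) => math.max(a, b))          // max over j, per cluster k
        .map(_._2.toDouble)
        .sum()                                          // sum over clusters k
      majoritySum / n
    }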
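A minimal sketch of running both clustering algorithms with MLlib's RDD-based API as shipped in Spark 1.3.1; `features` is assumed to be an RDD of mllib Vectors built from the d + l + m features above, and k = 3 matches Tables 1 and 2:

    import org.apache.spark.mllib.clustering.{GaussianMixture, KMeans}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    def runClustering(features: RDD[Vector]): (RDD[Int], RDD[Int]) = {
      features.cache() // both algorithms take multiple passes over the data

      val kmeansModel = KMeans.train(features, 3, 20) // k = 3, 20 max iterations
      val gmmModel = new GaussianMixture().setK(3).setMaxIterations(20).run(features)

      (kmeansModel.predict(features), gmmModel.predict(features)) // cluster IDs
    }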
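A minimal sketch of MLlib's streaming k-means (available since Spark 1.2), assuming a DStream of feature vectors; the feature dimension (10) and decay factor are illustrative only:

    import org.apache.spark.mllib.clustering.StreamingKMeans
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.streaming.dstream.DStream

    def streamingCluster(training: DStream[Vector]): StreamingKMeans = {
      val skm = new StreamingKMeans()
        .setK(3)
        .setDecayFactor(1.0)       // 1.0 keeps all history; < 1.0 forgets old batches
        .setRandomCenters(10, 0.0) // 10-dimensional features, zero initial weight
      skm.trainOn(training)        // centers update as each mini-batch arrives
      skm                          // skm.latestModel() exposes the current centers
    }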
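Finally, a local Breeze illustration of the multiplicative updates in Eq. (2) (Breeze ships as an MLlib dependency). The assignment itself requires W to be a distributed RowMatrix, so treat this only as a reference for the arithmetic:

    import breeze.linalg.DenseMatrix

    def nmf(v: DenseMatrix[Double], r: Int, iters: Int):
        (DenseMatrix[Double], DenseMatrix[Double]) = {
      val eps = 1e-9                        // guard against division by zero
      var w = DenseMatrix.rand(v.rows, r)   // n x r, non-negative init
      var h = DenseMatrix.rand(r, v.cols)   // r x m, non-negative init
      for (_ <- 1 to iters) {
        // H <- H .* (W^T V) ./ (W^T W H), elementwise
        h = h :* (w.t * v) :/ ((w.t * w * h) :+ eps)
        // W <- W .* (V H^T) ./ (W H H^T), elementwise
        w = w :* (v * h.t) :/ ((w * h * h.t) :+ eps)
      }
      (w, h)
    }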
The folder structure of your submission should be as below, or your code will not be graded. You can display the folder structure using the tree command. All other unrelated files will be discarded during testing. You may add additional methods and additional dependencies, but make sure the existing method signatures do not change. It is your duty to make sure your code can be compiled with the provided SBT. Be aware that the writeup belongs in the code root.

Create a tar archive of the folder above with the following command and submit the tar file.
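A typical invocation, assuming the top-level folder is named hw3/ (adjust the names to match the required structure above), is:

    tar -czvf hw3.tar.gz hw3/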