[Solved] NCTU-CS Assignment #1 Naïve Bayes



There are two datasets that need to be analyzed. For each dataset, you have to do the following:

  1. Data Input 5%
  2. Data Visualization 5%
    • For mushroom dataset
      • Show the data distribution by value frequency of every feature.
    • For Iris dataset
      • Show the data distribution by average, standard deviation, and value frequency (binning might be needed) of every feature.
    • Split data based on their labels (targets) and show the data distribution of each feature again.
  3. Data Preprocessing 5% + (10%)
    • Drop features with any missing value.
    • Transform data format and shape so your model can process them.
    • Shuffle the data.
    • Bonus: any other transformation that boosts the final performance. (10%)
  4. Model Construction 20%
    • You must construct two Naïve Bayes classifiers for the two datasets.
    • Naïve Bayes classifier: $M(q) = \arg\max_{Y \in T} \left[ \log P(Y) + \sum_{i=1}^{m} \log P(X_i \mid Y) \right]$
      • where $X_1$ to $X_m$ are the features of $q$, and $T$ is the set of all possible labels.
    • For the mushroom dataset, whose features are all categorical, $P(X_i \mid Y) = \frac{N(X_i \mid Y)}{N(Y)}$
      • Laplace smoothing: $\tau$ is the number of all possible events of feature $X_i$.
    • For the Iris dataset, assume $P(X_i \mid Y)$ follows a 1D normal (Gaussian) distribution. 10%
      • $\mu$ and $\sigma$ are the mean and standard deviation of feature $X_i$ when $Y$ is determined.
  5. Train-Test-Split 5%
    • Two validation methods need to be implemented.
      1. Holdout validation with a fixed train/test ratio
      2. K-fold cross-validation
  6. Questions
    • For mushroom dataset
      1. Show $P(X_{\text{stalk-color-below-ring}} \mid Y = e)$ with and without Laplace smoothing by histograms 10%
    • For Iris dataset
      1. What are the values of $\mu$ and $\sigma$ of the assumed $P(X_{\text{petal\_length}} \mid Y = \text{Iris Versicolour})$ 10%
  7. Finish during class 20%
    • Submit your report and source codes to the newE3 system before class ends.
    • Finish time will be determined by the submission time.
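
The Model Construction step for the categorical case can be sketched as follows. This is a minimal illustration, not the handout's reference solution: the function names (`fit_categorical_nb`, `log_likelihood`, `predict`) are hypothetical, and the smoothing is assumed to be the common add-k form, (N(Xi|Y) + k) / (N(Y) + k·τ), with k = 1 by default. The handout's own smoothing formula is not reproduced here, so treat that form as an assumption. The toy rows are made up.

```python
import math
from collections import Counter, defaultdict


def fit_categorical_nb(rows, labels, k=1.0):
    """Collect the counts Naive Bayes needs; k is an assumed smoothing constant (k=0 disables it)."""
    n_features = len(rows[0])
    label_counts = Counter(labels)                # N(Y)
    value_counts = defaultdict(Counter)           # (i, Y) -> counts of X_i values, i.e. N(X_i | Y)
    domains = [set() for _ in range(n_features)]  # observed values per feature; tau = len(domain)
    for row, y in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, y)][value] += 1
            domains[i].add(value)
    return {"k": k, "n": len(labels), "label_counts": label_counts,
            "value_counts": value_counts, "domains": domains}


def log_likelihood(model, i, value, y):
    """log P(X_i = value | Y = y) under add-k smoothing; -inf if the estimate is zero."""
    tau = len(model["domains"][i])
    num = model["value_counts"][(i, y)][value] + model["k"]
    den = model["label_counts"][y] + model["k"] * tau
    return math.log(num / den) if num > 0 else float("-inf")


def predict(model, row):
    """Return argmax over Y of log P(Y) + sum_i log P(X_i | Y)."""
    best_label, best_score = None, float("-inf")
    for y, n_y in model["label_counts"].items():
        score = math.log(n_y / model["n"])
        score += sum(log_likelihood(model, i, v, y) for i, v in enumerate(row))
        if best_label is None or score > best_score:
            best_label, best_score = y, score
    return best_label


# Made-up two-feature samples with labels e (edible) and p (poisonous).
rows = [("a", "n"), ("a", "n"), ("f", "w"), ("f", "n")]
labels = ["e", "e", "p", "p"]
model = fit_categorical_nb(rows, labels, k=1.0)
print(predict(model, ("a", "n")), predict(model, ("f", "w")))
```

Working in log space avoids underflow when many per-feature probabilities are multiplied. With `k=0` a feature value never seen with a label makes the whole score for that label `-inf`, which is the situation smoothing is meant to prevent.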

Data

1. Mushroom dataset

  • Data can be downloaded here:
  • Please NOTE that the first column is the label (edible=e, poisonous=p)
  • Data Set Information
    • This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for Poisonous Oak and Ivy.
  • Attribute Information
    1. cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s
    2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
    3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y
    4. bruises?: bruises=t,no=f
    5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s
    6. gill-attachment: attached=a,descending=d,free=f,notched=n
    7. gill-spacing: close=c,crowded=w,distant=d
    8. gill-size: broad=b,narrow=n
    9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y
    10. stalk-shape: enlarging=e,tapering=t
    11. stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r, missing=?
    12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
    13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
    14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
    15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
    16. veil-type: partial=p,universal=u
    17. veil-color: brown=n,orange=o,white=w,yellow=y
    18. ring-number: none=n,one=o,two=t
    19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z
    20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y
    21. population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y
    22. habitat: grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d
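
Given the layout described above (label in the first column, single-letter codes, `?` for a missing stalk-root), the input, preprocessing, and value-frequency steps might look like the sketch below. The helper names are hypothetical and the three records are made up with only three features each; a real run would read the downloaded file instead.

```python
from collections import Counter

MISSING = "?"  # the attribute list marks a missing stalk-root value as '?'


def split_label(records):
    """First column is the label (e or p); the remaining columns are feature codes."""
    labels = [r[0] for r in records]
    features = [r[1:] for r in records]
    return labels, features


def drop_features_with_missing(features):
    """Drop every feature column that contains at least one missing value."""
    n_cols = len(features[0])
    keep = [j for j in range(n_cols) if all(row[j] != MISSING for row in features)]
    return [[row[j] for j in keep] for row in features], keep


def value_frequencies(features):
    """Count how often each value occurs in every feature column."""
    return [Counter(column) for column in zip(*features)]


# Made-up comma-separated records: label, then three feature codes.
raw = ["p,x,s,?", "e,x,y,c", "e,b,s,b"]
records = [line.split(",") for line in raw]
labels, features = split_label(records)
cleaned, kept = drop_features_with_missing(features)
freqs = value_frequencies(cleaned)
print(kept)   # indices of the feature columns that survived
print(freqs)  # one Counter per surviving feature
```

Splitting `cleaned` by label before calling `value_frequencies` gives the per-class distributions that the visualization step asks for.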

2. Iris dataset

  • Data can be downloaded here:
  • Data Set Information
    • This is perhaps the best known database to be found in the pattern recognition literature. Fisher's paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other. Predicted attribute: class of iris plant. This is an exceedingly simple domain.
  • Attribute Information
    1. sepal length in cm
    2. sepal width in cm
    3. petal length in cm
    4. petal width in cm
    5. class:
      • Iris Setosa
      • Iris Versicolour
      • Iris Virginica
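
Because the four Iris features are numerical, the per-class mean and standard deviation are the parameters a Gaussian Naïve Bayes model needs. The sketch below is one possible way to estimate and use them, under assumptions: the function names are hypothetical, the population (divide-by-n) variance is used, every sigma is assumed to be non-zero, and the measurements are made up rather than taken from the real data set.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate the prior and a (mu, sigma) pair per feature for every class."""
    by_class = defaultdict(list)
    for row, y in zip(rows, labels):
        by_class[y].append(row)
    model = {}
    for y, members in by_class.items():
        stats = []
        for column in zip(*members):
            mu = sum(column) / len(column)
            # Population variance (divide by n); an assumption, not the handout's choice.
            var = sum((v - mu) ** 2 for v in column) / len(column)
            stats.append((mu, math.sqrt(var)))
        model[y] = {"prior": len(members) / len(labels), "stats": stats}
    return model


def log_gaussian(x, mu, sigma):
    """Log density of a 1D normal distribution; requires sigma > 0."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def predict(model, row):
    """Pick the class with the highest log prior plus summed log densities."""
    scores = {
        y: math.log(p["prior"])
        + sum(log_gaussian(x, mu, sigma) for x, (mu, sigma) in zip(row, p["stats"]))
        for y, p in model.items()
    }
    return max(scores, key=scores.get)


# Made-up measurements in cm, two features only (not real Iris values).
rows = [(1.0, 0.2), (1.4, 0.4), (4.0, 1.2), (4.4, 1.4)]
labels = ["Iris Setosa", "Iris Setosa", "Iris Versicolour", "Iris Versicolour"]
model = fit_gaussian_nb(rows, labels)
print(model["Iris Versicolour"]["stats"])  # list of (mu, sigma), one pair per feature
print(predict(model, (1.3, 0.3)))
```

Reading the `(mu, sigma)` pair for the petal-length column of one class out of `stats` is the same lookup the Iris question on the assumed distribution calls for.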
