diff --git a/HW3-Markdown-Austin-Sampson.html b/HW3-Markdown-Austin-Sampson.html new file mode 100644 index 0000000000000000000000000000000000000000..f7b247a310f4187488d3377a358114fde996eb7d --- /dev/null +++ b/HW3-Markdown-Austin-Sampson.html @@ -0,0 +1,807 @@ + + + + + + + + + + + + + +1R Mammals Report + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + +

Name: Austin Sampson

+

eMail:

+

Course: CS 5402

+

Date: 02-14-2020

+
+

Concept Description:

+

Train a system from existing data to classify animals as either mammal or non-mammal

+
+
+

Data Collection:

+

The data was provided by Perry B. Koob (not professor or doctor). It is a modified version of the UCI Zoo data set, available on Canvas.

+
+
+

Example Description:

+

animal.name

+

Nominal attribute: the name of the animal or species

+

hair

+

Nominal boolean attribute that displays output as: True False

+

feathers

+

Nominal boolean attribute that displays output as: True False

+

eggs

+

Nominal boolean attribute that displays output as: True False

+

milk

+

Nominal boolean attribute that displays output as: True False

+

airborne

+

Nominal boolean attribute that displays output as: True False

+

aquatic

+

Nominal boolean attribute that displays output as: True False

+

predator

+

Nominal boolean attribute that displays output as: True False

+

toothed

+

Nominal boolean attribute that displays output as: True False

+

backbone

+

Nominal boolean attribute that displays output as: True False

+

breathes

+

Nominal boolean attribute that displays output as: True False

+

venomous

+

Nominal boolean attribute that displays output as: True False

+

fins

+

Nominal boolean attribute that displays output as: True False

+

legs

+

Ratio attribute giving the number of legs. A null value or 0 indicates the absence of legs.

+

tail

+

Nominal boolean attribute that displays output as: True False

+

domestic

+

Nominal boolean attribute that displays output as: True False

+

catsize

+

Nominal boolean attribute that displays output as: True False

+

gestation

+

Interval attribute giving the length of gestation for a species. Two examples had missing values for this attribute; those examples were removed.

+

type

+

Nominal. Main classification variable for this data set. Output displayed as: mammal fish arthropod bird insect amphibian reptile

+
+
+

Data Import and Wrangling:

+

Importing test and training data

+
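#load packages and import main files (stringsAsFactors keeps type a factor in R >= 4.0)
+library(dplyr); library(OneR); library(caret)
+train <- read.csv("animal-taxonomy-train.csv", stringsAsFactors = TRUE)
+test <- read.csv("animal-taxonomy-test.csv", stringsAsFactors = TRUE)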
+

Standardizing the classification attribute. We are testing whether an animal is a mammal or a non-mammal, so we reclassify every type other than mammal as non-mammal and then remove the unused levels from type.

+
levels(train$type) <- c(levels(train$type), 'non-mammal')
+train$type[train$type != 'mammal'] <- 'non-mammal'
+#drop the now-unused levels
+train$type <- droplevels(train$type)
+
+#Do the same thing for the test data
+#prepare the list of classes from the test data for evaluation
+levels(test$type) <- c(levels(test$type), "non-mammal")
+#convert all other types so only mammal and non-mammal remain
+test$type[test$type != "mammal"] <- "non-mammal"
+#drop additional levels
+test$type <- factor(test$type)
+
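The relabelling step can also be written as a single expression. The following is a minimal sketch on hypothetical toy data (the `toy` data frame is an assumption for illustration, not the course data set):

```r
# Hypothetical toy data standing in for the real training file
toy <- data.frame(
  type = factor(c("mammal", "fish", "bird", "mammal", "insect"))
)

# Collapse every class other than mammal into a single "non-mammal" level
toy$type <- factor(
  ifelse(toy$type == "mammal", "mammal", "non-mammal"),
  levels = c("mammal", "non-mammal")
)

print(levels(toy$type))
print(table(toy$type))
```

Rebuilding the factor with an explicit `levels` argument fixes the level order and leaves no unused levels behind, so no separate drop step is needed.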
+
+

Mining and Analytics:

+

Manual OneR classification

+

Generate frequency tables (attribute value by type) for each attribute, excluding animal.name

+
hair <- count(train, hair, type)
+feathers <- count(train, feathers, type)
+eggs <- count(train, eggs, type)
+milk <- count(train, milk, type)
+airborne <- count(train, airborne, type)
+aquatic <- count(train, aquatic, type)
+predator <- count(train, predator, type)
+toothed <- count(train, toothed, type)
+backbone <- count(train, backbone, type)
+breathes <- count(train, breathes, type)
+venomous <- count(train, venomous, type)
+fins <- count(train, fins, type)
+legs <- count(train, legs, type)
+tail <- count(train, tail, type)
+domestic <- count(train, domestic, type)
+catsize <- count(train, catsize, type)
+gestation <- count(train, gestation, type)
+
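The seventeen near-identical calls could also be produced in a loop. The sketch below uses base R `table()` on hypothetical toy data; the `toy` data frame and the `freq_tables` name are assumptions for illustration only:

```r
# Hypothetical toy data; column names mirror a few of the real attributes
toy <- data.frame(
  hair = c(TRUE, TRUE, FALSE, FALSE, TRUE),
  milk = c(TRUE, TRUE, FALSE, FALSE, FALSE),
  type = factor(c("mammal", "mammal", "non-mammal", "non-mammal", "non-mammal"))
)

# Every column except the class label is a candidate attribute
attrs <- setdiff(names(toy), "type")

# One (attribute value x class) frequency table per attribute
freq_tables <- lapply(attrs, function(a) table(toy[[a]], toy$type))
names(freq_tables) <- attrs

print(freq_tables$hair)
```

Unlike `count()`, `table()` also keeps the zero cells, which makes the majority class per attribute value easy to read off each row.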

Display tables and rules (note: errors were calculated manually)

+
hair
+
## # A tibble: 4 x 3
+##   hair  type           n
+##   <lgl> <fct>      <int>
+## 1 FALSE mammal         1
+## 2 FALSE non-mammal    52
+## 3 TRUE  mammal        34
+## 4 TRUE  non-mammal     4
+

Based on the frequency of mammal and non-mammal we generated the following rules: if no hair -> non-mammal if has hair -> mammal

+
feathers
+
## # A tibble: 3 x 3
+##   feathers type           n
+##   <lgl>    <fct>      <int>
+## 1 FALSE    mammal        35
+## 2 FALSE    non-mammal    39
+## 3 TRUE     non-mammal    17
+

Based on the frequency of mammal and non-mammal we generated the following rules: if no feathers -> non-mammal if feathers -> non-mammal

+
eggs
+
## # A tibble: 3 x 3
+##   eggs  type           n
+##   <lgl> <fct>      <int>
+## 1 FALSE mammal        35
+## 2 FALSE non-mammal     2
+## 3 TRUE  non-mammal    54
+

Based on the frequency of mammal and non-mammal we generated the following rules: if no eggs -> mammal if has eggs -> non-mammal

+
milk
+
## # A tibble: 2 x 3
+##   milk  type           n
+##   <lgl> <fct>      <int>
+## 1 FALSE non-mammal    56
+## 2 TRUE  mammal        35
+

Based on the frequency of mammal and non-mammal we generated the following rules: if no milk -> non-mammal if has milk -> mammal

+
airborne
+
## # A tibble: 4 x 3
+##   airborne type           n
+##   <lgl>    <fct>      <int>
+## 1 FALSE    mammal        33
+## 2 FALSE    non-mammal    37
+## 3 TRUE     mammal         2
+## 4 TRUE     non-mammal    19
+

Based on the frequency of mammal and non-mammal we generated the following rules: if not airborne -> non-mammal if is airborne -> non-mammal

+
aquatic
+
## # A tibble: 4 x 3
+##   aquatic type           n
+##   <lgl>   <fct>      <int>
+## 1 FALSE   mammal        31
+## 2 FALSE   non-mammal    27
+## 3 TRUE    mammal         4
+## 4 TRUE    non-mammal    29
+

Based on the frequency of mammal and non-mammal we generated the following rules: if not aquatic -> mammal if is aquatic -> non-mammal

+
predator
+
## # A tibble: 4 x 3
+##   predator type           n
+##   <lgl>    <fct>      <int>
+## 1 FALSE    mammal        18
+## 2 FALSE    non-mammal    23
+## 3 TRUE     mammal        17
+## 4 TRUE     non-mammal    33
+

Based on the frequency of mammal and non-mammal we generated the following rules: if not predator -> non-mammal if is predator -> non-mammal

+
toothed
+
## # A tibble: 3 x 3
+##   toothed type           n
+##   <lgl>   <fct>      <int>
+## 1 FALSE   non-mammal    36
+## 2 TRUE    mammal        35
+## 3 TRUE    non-mammal    20
+

Based on the frequency of mammal and non-mammal we generated the following rules: if not toothed -> non-mammal if is toothed -> mammal

+
backbone
+
## # A tibble: 3 x 3
+##   backbone type           n
+##   <lgl>    <fct>      <int>
+## 1 FALSE    non-mammal    18
+## 2 TRUE     mammal        35
+## 3 TRUE     non-mammal    38
+

Based on the frequency of mammal and non-mammal we generated the following rules: if no backbone -> non-mammal if has backbone -> non-mammal

+
breathes
+
## # A tibble: 3 x 3
+##   breathes type           n
+##   <lgl>    <fct>      <int>
+## 1 FALSE    non-mammal    20
+## 2 TRUE     mammal        35
+## 3 TRUE     non-mammal    36
+

Based on the frequency of mammal and non-mammal we generated the following rules: if doesn't breathe -> non-mammal if does breathe -> non-mammal

+
venomous
+
## # A tibble: 3 x 3
+##   venomous type           n
+##   <lgl>    <fct>      <int>
+## 1 FALSE    mammal        35
+## 2 FALSE    non-mammal    48
+## 3 TRUE     non-mammal     8
+

Based on the frequency of mammal and non-mammal we generated the following rules: if not venomous -> non-mammal if is venomous -> non-mammal

+
fins
+
## # A tibble: 4 x 3
+##   fins  type           n
+##   <lgl> <fct>      <int>
+## 1 FALSE mammal        32
+## 2 FALSE non-mammal    44
+## 3 TRUE  mammal         3
+## 4 TRUE  non-mammal    12
+

Based on the frequency of mammal and non-mammal we generated the following rules: if no fins -> non-mammal if has fins -> non-mammal

+
legs
+
## # A tibble: 9 x 3
+##    legs type           n
+##   <int> <fct>      <int>
+## 1     0 mammal         2
+## 2     0 non-mammal    19
+## 3     2 mammal         7
+## 4     2 non-mammal    17
+## 5     4 mammal        26
+## 6     4 non-mammal     7
+## 7     5 non-mammal     1
+## 8     6 non-mammal    10
+## 9     8 non-mammal     2
+

Based on the frequency of mammal and non-mammal we generated the following rules: if legs <= 2 -> non-mammal if legs = 4 -> mammal if legs >= 5 -> non-mammal

+
tail
+
## # A tibble: 4 x 3
+##   tail  type           n
+##   <lgl> <fct>      <int>
+## 1 FALSE mammal         5
+## 2 FALSE non-mammal    20
+## 3 TRUE  mammal        30
+## 4 TRUE  non-mammal    36
+

Based on the frequency of mammal and non-mammal we generated the following rules: if no tail -> non-mammal if has tail -> non-mammal

+
domestic
+
## # A tibble: 4 x 3
+##   domestic type           n
+##   <lgl>    <fct>      <int>
+## 1 FALSE    mammal        27
+## 2 FALSE    non-mammal    53
+## 3 TRUE     mammal         8
+## 4 TRUE     non-mammal     3
+

Based on the frequency of mammal and non-mammal we generated the following rules: if not domestic -> non-mammal if domestic -> mammal

+
catsize
+
## # A tibble: 4 x 3
+##   catsize type           n
+##   <lgl>   <fct>      <int>
+## 1 FALSE   mammal         9
+## 2 FALSE   non-mammal    46
+## 3 TRUE    mammal        26
+## 4 TRUE    non-mammal    10
+

Based on the frequency of mammal and non-mammal we generated the following rules: if not catsize -> non-mammal if is catsize -> mammal

+
gestation
+
## # A tibble: 70 x 3
+##    gestation type           n
+##        <int> <fct>      <int>
+##  1         1 non-mammal     1
+##  2         2 non-mammal     2
+##  3         3 non-mammal     2
+##  4         5 non-mammal     2
+##  5         8 non-mammal     1
+##  6        10 non-mammal     3
+##  7        12 non-mammal     2
+##  8        14 mammal         1
+##  9        14 non-mammal     2
+## 10        15 non-mammal     1
+## # ... with 60 more rows
+

if gestation <= 56 -> non-mammal if gestation > 56 -> mammal

+

Errors for each ruleset

+

(note: Errors were calculated manually)

+

Airborne Error = 0.384

+

Aquatic Error = 0.341

+

Backbone Error = 0.384

+

Breathes Error = 0.384

+

catsize Error = 0.208

+

Domestic Error = 0.329

+

Eggs Error = 0.022

+

Feathers Error = 0.384

+

Fins Error = 0.384

+

Hair Error = 0.054

+

Tail Error = 0.384

+

Toothed Error = 0.219

+

Venomous Error = 0.384

+

milk Error = 0.0

+

Predator Error = 0.384

+

legs Error = 0.1758

+

gestation Error = 0.234
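The manual error calculation can be reproduced from any of the frequency tables. The sketch below uses the counts printed in the hair table; `oner_error` is a hypothetical helper name, not a function from the OneR package:

```r
# Counts copied from the hair table: rows are hair = FALSE / TRUE
hair_counts <- matrix(
  c(1, 52,
    34, 4),
  nrow = 2, byrow = TRUE,
  dimnames = list(hair = c("FALSE", "TRUE"), type = c("mammal", "non-mammal"))
)

# 1R error: each attribute value predicts its majority class, so the
# minority counts in every row are the misclassified examples
oner_error <- function(counts) {
  errors <- sum(rowSums(counts) - apply(counts, 1, max))
  errors / sum(counts)
}

print(oner_error(hair_counts))
```

For hair this counts 1 + 4 = 5 misclassified examples out of 91. The same function applies unchanged to the multi-row legs table, since it works row by row.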

+

Getting the 1R rules from the OneR package

+
temp <- subset(train, select = -c(animal.name))
+model <- OneR(temp,  verbose = TRUE)
+
## Warning in bin(data): 2 instance(s) removed due to missing values
+
## 
+##     Attribute Accuracy
+## 1 * milk      100%    
+## 2   eggs      97.75%  
+## 3   hair      94.38%  
+## 4   legs      82.02%  
+## 5   catsize   78.65%  
+## 6   toothed   77.53%  
+## 7   gestation 69.66%  
+## 8   aquatic   66.29%  
+## 8   domestic  66.29%  
+## 10  feathers  60.67%  
+## 10  airborne  60.67%  
+## 10  predator  60.67%  
+## 10  backbone  60.67%  
+## 10  breathes  60.67%  
+## 10  venomous  60.67%  
+## 10  fins      60.67%  
+## 10  tail      60.67%  
+## ---
+## Chosen attribute due to accuracy
+## and ties method (if applicable): '*'
+
modelPredictions <- predict(model, test)
+#two instances will be removed due to missing values, as stated above in the Example Description
+
+
+

Evaluation:

+

OneR package

+
eval_model(modelPredictions, test)
+
## 
+## Confusion matrix (absolute):
+##             Actual
+## Prediction   mammal non-mammal Sum
+##   mammal          6          0   6
+##   non-mammal      0          4   4
+##   Sum             6          4  10
+## 
+## Confusion matrix (relative):
+##             Actual
+## Prediction   mammal non-mammal Sum
+##   mammal        0.6        0.0 0.6
+##   non-mammal    0.0        0.4 0.4
+##   Sum           0.6        0.4 1.0
+## 
+## Accuracy:
+## 1 (10/10)
+## 
+## Error rate:
+## 0 (0/10)
+## 
+## Error rate reduction (vs. base rate):
+## 1 (p-value = 0.006047)
+

F1 Score

+

Precision = TP/(TP+FP) = 6/(6+0) = 1

+

Recall = TP/(TP+FN) = 6/(6+0) = 1

+

F1 Score = (2 * Precision * Recall)/(Precision + Recall) = 1
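As a check on the arithmetic, the three metrics can be computed directly from the confusion matrix counts. This is a minimal sketch that assumes mammal is the positive class and reads the counts off the matrix printed by `eval_model`:

```r
# Counts read from the confusion matrix, with mammal as the positive class
tp <- 6  # predicted mammal, actually mammal
fp <- 0  # predicted mammal, actually non-mammal
fn <- 0  # predicted non-mammal, actually mammal

precision <- tp / (tp + fp)
recall <- tp / (tp + fn)
f1 <- 2 * precision * recall / (precision + recall)

print(c(precision = precision, recall = recall, f1 = f1))
```

With no false positives and no false negatives, all three values come out as 1.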

+

Manual 1R classifier

+
#reloading Test CSV to ensure no errors
+#test <- read.csv("animal-taxonomy-test.csv")
+
+
+reference <- as.data.frame(test$type)
+colnames(reference) <- c("class")
+reference <- as.factor(reference$class)
+#levels(reference) <- c(levels(reference), "mammal","non-mammal")
+
+
+#prepare the list of predictions from the test data for evaluation
+testing <- as.data.frame(test$milk)
+colnames(testing) <- c("milk")
+testing$pred[testing$milk==FALSE] <- "non-mammal"
+testing$pred[testing$milk==TRUE] <- "mammal"
+testing <- as.factor(testing$pred)
+confusionMatrix(testing, reference)
+
## Confusion Matrix and Statistics
+## 
+##             Reference
+## Prediction   mammal non-mammal
+##   mammal          6          0
+##   non-mammal      0          4
+##                                      
+##                Accuracy : 1          
+##                  95% CI : (0.6915, 1)
+##     No Information Rate : 0.6        
+##     P-Value [Acc > NIR] : 0.006047   
+##                                      
+##                   Kappa : 1          
+##                                      
+##  Mcnemar's Test P-Value : NA         
+##                                      
+##             Sensitivity : 1.0        
+##             Specificity : 1.0        
+##          Pos Pred Value : 1.0        
+##          Neg Pred Value : 1.0        
+##              Prevalence : 0.6        
+##          Detection Rate : 0.6        
+##    Detection Prevalence : 0.6        
+##       Balanced Accuracy : 1.0        
+##                                      
+##        'Positive' Class : mammal     
+## 
+

F1 Score

+

Precision = TP/(TP+FP) = 6/(6+0) = 1

+

Recall = TP/(TP+FN) = 6/(6+0) = 1

+

F1 Score = (2 * Precision * Recall)/(Precision + Recall) = 1

+

Both models have an accuracy of 1 and an F1 Score of 1, so they can be considered equivalently good models.

+
+
+

Results

+

Since the hand-generated 1R classifier and the OneR package produced the same result, a 1R classifier based on the milk attribute, and both reach 100% accuracy on the training and test data, we can be confident that the model is a strong one.

+

Our final model is: if milk = True -> mammal if milk = False -> non-mammal
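The final rule is small enough to express as a one-line function. In this sketch `predict_mammal` is a hypothetical helper name, and the input is assumed to be a logical vector like the milk column:

```r
# Hypothetical helper: apply the milk rule to a logical vector
predict_mammal <- function(milk) {
  factor(ifelse(milk, "mammal", "non-mammal"),
         levels = c("mammal", "non-mammal"))
}

print(predict_mammal(c(TRUE, FALSE, TRUE)))
```

Returning a factor with both levels fixed keeps the predictions comparable with the reference classes even when a batch contains only one class.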

+
+
+

References:

+

https://ourcodingclub.github.io/2016/11/24/rmarkdown-1.html#create https://ourcodingclub.github.io/2016/11/13/intro-to-r.html https://data-flair.training/blogs/rstudio-tutorial/ https://www.tutorialspoint.com/r/r_data_frames.htm https://stackoverflow.com/questions/38741997/how-to-solve-the-data-cannot-have-more-levels-than-the-reference-error-when-us https://www.rdocumentation.org/packages/caret/versions/3.45/topics/confusionMatrix https://stats.stackexchange.com/questions/138690/calculate-the-f1-score-of-precision-and-recall-in-r

+
+ + + + +
+ + + + + + + + + + + + + + +