Title: | Machine Learning with AdaBoost on Decision Stumps |
---|---|
Description: | Creates a classifier for binary outcomes using the Adaptive Boosting (AdaBoost) algorithm on decision stumps with a fast C++ implementation. For a description of AdaBoost, see Freund and Schapire (1997) <doi:10.1006/jcss.1997.1504>. This type of classifier is nonlinear, but easy to interpret and visualize. Feature vectors may be a combination of continuous (numeric) and categorical (string, factor) elements. Methods for classifier assessment, predictions, and cross-validation are also included. |
Authors: | Jadon Wagstaff [aut, cre] |
Maintainer: | Jadon Wagstaff <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.2 |
Built: | 2025-01-25 05:04:26 UTC |
Source: | https://github.com/jadonwagstaff/sboost |
Assesses how well an sboost classifier classifies the data.
assess(object, features, outcomes, include_scores = FALSE)
object | sboost_classifier S3 object output from sboost. |
features | feature set data.frame. |
outcomes | outcomes corresponding to the features. |
include_scores | if TRUE, feature_scores are included in the output. |
An sboost_assessment S3 object containing:
Last row of cumulative statistics (i.e. when all stumps are included in assessment).
stump - the index of the last decision stump added to the assessment.
true_positive - number of true positive predictions.
false_negative - number of false negative predictions.
true_negative - number of true negative predictions.
false_positive - number of false positive predictions.
prevalence - actual positives / total.
accuracy - correct predictions / total.
sensitivity - true positives / actual positives.
specificity - true negatives / actual negatives.
ppv - true positives / predicted positives.
npv - true negatives / predicted negatives.
f1 - harmonic mean of sensitivity and ppv.
If include_scores is TRUE, feature_scores lists, for each feature in the classifier, the score for each row in the feature set.
The sboost_classifier object used for assessment.
Shows which outcome was considered as positive and which negative.
Shows the parameters that were used for assessment.
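The statistics listed above can all be derived from the four confusion-matrix counts. The following is a minimal base R sketch of that arithmetic, not the package's own code; the counts are made up for illustration and do not come from any sboost output.

```r
# Hypothetical confusion-matrix counts (illustrative only).
tp <- 40; fn <- 10; tn <- 45; fp <- 5
total <- tp + fn + tn + fp

stats <- c(
  prevalence  = (tp + fn) / total,  # actual positives / total
  accuracy    = (tp + tn) / total,  # correct predictions / total
  sensitivity = tp / (tp + fn),     # true positives / actual positives
  specificity = tn / (tn + fp),     # true negatives / actual negatives
  ppv         = tp / (tp + fp),     # true positives / predicted positives
  npv         = tn / (tn + fn)      # true negatives / predicted negatives
)

# f1 is the harmonic mean of sensitivity and ppv.
stats["f1"] <- 2 * stats[["sensitivity"]] * stats[["ppv"]] /
  (stats[["sensitivity"]] + stats[["ppv"]])

print(round(stats, 3))
```

The harmonic mean reduces to 2 * tp / (2 * tp + fp + fn), which is a convenient cross-check.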
sboost documentation.
# malware
malware_classifier <- sboost(malware[-1], malware[1], iterations = 5, positive = 1)
assess(malware_classifier, malware[-1], malware[1])

# mushrooms
mushroom_classifier <- sboost(mushrooms[-1], mushrooms[1], iterations = 5, positive = "p")
assess(mushroom_classifier, mushrooms[-1], mushrooms[1])
System call data for apps identified as malware and not malware.
malware
A data frame with 7597 rows and 361 variables:
outcomes - 1 if malware, 0 if not.
X1 ... X360 - system calls.
Experimental data generated in this research paper:
M. Dimjašević, S. Atzeni, I. Ugrina, and Z. Rakamarić, "Evaluation of Android Malware Detection Based on System Calls," in Proceedings of the International Workshop on Security and Privacy Analytics (IWSPA), 2016.
Data used for kaggle competition: https://www.kaggle.com/c/ml-fall2016-android-malware
https://zenodo.org/record/154737#.WtoA1IjwaUl
A classic machine learning data set describing hypothetical samples from the Agaricus and Lepiota family.
mushrooms
A data frame with 8124 rows and 23 variables:
class (outcome) - p=poisonous, e=edible
cap shape - bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s
cap surface - fibrous=f, grooves=g, scaly=y, smooth=s
cap color - brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y
bruises - bruises=t, no=f
odor - almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s
gill attachment - attached=a, descending=d, free=f, notched=n
gill spacing - close=c, crowded=w, distant=d
gill size - broad=b, narrow=n
gill color - black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y
stalk shape - enlarging=e, tapering=t
stalk root - bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?
stalk surface above ring - fibrous=f, scaly=y, silky=k, smooth=s
stalk surface below ring - fibrous=f, scaly=y, silky=k, smooth=s
stalk color above ring - brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
stalk color below ring - brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
veil type - partial=p, universal=u
veil color - brown=n, orange=o, white=w, yellow=y
ring number - none=n, one=o, two=t
ring type - cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z
spore print color - black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y
population - abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y
habitat - grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d
Data gathered from:
Mushroom records drawn from The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf
https://archive.ics.uci.edu/ml/datasets/mushroom
Make predictions for a feature set based on an sboost classifier.
## S3 method for class 'sboost_classifier'
predict(object, features, scores = FALSE, ...)
object | sboost_classifier S3 object output from sboost. |
features | feature set data.frame. |
scores | if true, raw scores generated; if false, predictions are generated. |
... | further arguments passed to or from other methods. |
Predictions in the form of a vector, or scores in the form of a vector. The index of the vector aligns the predictions or scores with the rows of the features. Scores represent the sum of all votes for the positive outcome minus the sum of all votes for the negative outcome.
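The relationship between scores and predictions described above can be sketched in base R. This is an illustration only, not the package's implementation: the vote weights and the `favors_positive` matrix are made up, and treating a score of exactly zero as negative is an assumption.

```r
# Hypothetical vote weights for three stumps (illustrative only).
votes <- c(0.9, 0.5, 0.3)

# Rows are instances, columns are stumps: TRUE when the stump
# votes for the positive outcome on that row (made-up values).
favors_positive <- rbind(
  c(TRUE,  TRUE,  FALSE),
  c(FALSE, TRUE,  TRUE),
  c(FALSE, FALSE, FALSE),
  c(TRUE,  FALSE, TRUE)
)

# Score = sum of votes for positive minus sum of votes for negative.
signed <- ifelse(favors_positive, 1, -1)
scores <- as.vector(signed %*% votes)

# Assumed decision rule: a score above zero means "positive".
predictions <- ifelse(scores > 0, "positive", "negative")

print(data.frame(scores, predictions))
```

Each element of `scores` lines up with one row of the feature set, which is the alignment the return value describes.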
sboost documentation.
# malware
malware_classifier <- sboost(malware[-1], malware[1], iterations = 5, positive = 1)
predict(malware_classifier, malware[-1], scores = TRUE)
predict(malware_classifier, malware[-1])

# mushrooms
mushroom_classifier <- sboost(mushrooms[-1], mushrooms[1], iterations = 5, positive = "p")
predict(mushroom_classifier, mushrooms[-1], scores = TRUE)
predict(mushroom_classifier, mushrooms[-1])
A machine learning algorithm using AdaBoost on decision stumps.
sboost(features, outcomes, iterations = 1, positive = NULL, verbose = FALSE)
features | feature set data.frame. |
outcomes | outcomes corresponding to the features. |
iterations | number of boosts. |
positive | the positive outcome to test for; if NULL, the first outcome in alphabetical (or numerical) order will be chosen. |
verbose | If true, progress bar will be displayed in console. |
Factors and characters are treated as categorical features. Missing values are supported.
See https://jadonwagstaff.github.io/projects/sboost.html for a description of the algorithm.
For original paper describing AdaBoost see:
Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1), 119-139 (1997)
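For orientation, one boosting round can be sketched in base R using the common textbook formulation with labels coded as +1 and -1. This is an assumption made for illustration: the source does not state which variant the C++ implementation uses, and the labels and stump predictions below are made up.

```r
# Made-up data: true labels and one hypothetical stump's predictions,
# both coded as +1 (positive) and -1 (negative).
y <- c(1,  1, -1, -1, 1)
h <- c(1, -1, -1, -1, 1)

# Start from uniform instance weights.
w <- rep(1 / length(y), length(y))

# Weighted error of the stump and its vote (textbook form).
err   <- sum(w[h != y])
alpha <- 0.5 * log((1 - err) / err)

# Increase the weight of misclassified rows, decrease the rest,
# then renormalise so the weights sum to one.
w <- w * exp(-alpha * y * h)
w <- w / sum(w)

print(alpha)
print(w)
```

After the update, the misclassified rows carry half of the total weight, so the next stump is pushed toward the instances the previous one got wrong.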
An sboost_classifier S3 object containing:
stump - the index of the decision stump.
feature - name of the column that this stump splits on.
vote - the weight that this stump has on the final classifier.
orientation - shows how outcomes are split. If the feature is numeric, the vote is cast for the left side outcome when the feature value is less than split, and for the right side outcome otherwise. If the feature is categorical, the vote is cast for the left side outcome when the feature value is found in left_categories, and for the right side outcome otherwise.
split - if the feature is numeric, the value where the decision stump splits the outcomes; otherwise, NA.
left_categories - if the feature is categorical, the feature values that sway the vote to the left side outcome on the orientation split; otherwise, NA.
Shows which outcome was considered as positive and which negative.
stumps - how many decision stumps were trained.
features - how many features the training set contained.
instances - how many instances or rows the training set contained.
positive_prevalence - what fraction of the training instances were positive.
Shows the parameters that were used to build the classifier.
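The split rule described for orientation, split, and left_categories can be sketched as a small base R function. `stump_side` is a hypothetical helper written for this illustration and is not part of sboost; it also ignores missing values, whose handling the text above does not specify.

```r
# Hypothetical helper (not an sboost function): decide which side of a
# single decision stump a feature value falls on.
stump_side <- function(value, split = NA, left_categories = NA) {
  if (!is.na(split)) {
    # Numeric feature: left when the value is below the split.
    if (value < split) "left" else "right"
  } else {
    # Categorical feature: left when the value is in left_categories.
    if (value %in% left_categories) "left" else "right"
  }
}

# Numeric stump with a made-up split at 3.
print(stump_side(2.5, split = 3))
print(stump_side(3, split = 3))

# Categorical stump with made-up left categories.
print(stump_side("a", left_categories = c("a", "l", "n")))
print(stump_side("f", left_categories = c("a", "l", "n")))
```

The orientation then says which outcome the left and right sides stand for, and the stump's vote is added to that outcome's total.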
predict.sboost_classifier - to get predictions from the classifier.
assess - to evaluate the performance of the classifier.
validate - to perform cross validation for the classifier training.
# malware
malware_classifier <- sboost(malware[-1], malware[1], iterations = 5, positive = 1)
malware_classifier
malware_classifier$classifier

# mushrooms
mushroom_classifier <- sboost(mushrooms[-1], mushrooms[1], iterations = 5, positive = "p")
mushroom_classifier
mushroom_classifier$classifier
A k-fold cross validation algorithm for sboost.
validate(features, outcomes, iterations = 1, k_fold = 6, positive = NULL, verbose = FALSE)
features | feature set data.frame. |
outcomes | outcomes corresponding to the features. |
iterations | number of boosts. |
k_fold | number of cross-validation subsets. |
positive | the positive outcome to test for; if NULL, the first outcome in alphabetical (or numerical) order will be chosen. |
verbose | If true, progress bars will be displayed in console. |
An sboost_validation S3 object containing:
Final performance statistics for all stumps.
Mean and standard deviations for test statistics generated by assess cumulative statistics for each of the training sets.
Mean and standard deviations for test statistics generated by assess cumulative statistics for each of the testing sets.
sboost_assessment cumulative statistics objects used to generate training_statistics.
sboost_assessment cumulative statistics objects used to generate testing_statistics.
sboost_classifier objects created from training sets.
Shows which outcome was considered as positive and which negative.
Number of testing and training sets used in the validation.
Shows the parameters that were used for validation.
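The general idea of k-fold cross validation can be sketched in base R as follows. This only illustrates how rows might be partitioned into training and testing sets; how validate actually assigns rows to folds is not stated here, and the sizes and seed are made up.

```r
# Made-up sizes; the seed only makes the sketch reproducible.
set.seed(1)
n <- 20
k_fold <- 3

# Assign each row a fold label 1..k_fold in random order.
fold <- sample(rep_len(seq_len(k_fold), n))

for (i in seq_len(k_fold)) {
  test_rows  <- which(fold == i)   # held-out rows for this fold
  train_rows <- which(fold != i)   # rows used to train the classifier
  # A classifier would be trained on train_rows and then assessed
  # on both train_rows and test_rows.
  cat("fold", i, ": train", length(train_rows),
      "test", length(test_rows), "\n")
}
```

Averaging the per-fold assessment statistics, and taking their standard deviation, gives summaries of the kind listed for the training and testing sets.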
sboost documentation.
# malware
validate(malware[-1], malware[1], iterations = 5, k_fold = 3, positive = 1)

# mushrooms
validate(mushrooms[-1], mushrooms[1], iterations = 5, k_fold = 3, positive = "p")