Package 'sboost'

Title: Machine Learning with AdaBoost on Decision Stumps
Description: Creates classifier for binary outcomes using Adaptive Boosting (AdaBoost) algorithm on decision stumps with a fast C++ implementation. For a description of AdaBoost, see Freund and Schapire (1997) <doi:10.1006/jcss.1997.1504>. This type of classifier is nonlinear, but easy to interpret and visualize. Feature vectors may be a combination of continuous (numeric) and categorical (string, factor) elements. Methods for classifier assessment, predictions, and cross-validation also included.
Authors: Jadon Wagstaff [aut, cre]
Maintainer: Jadon Wagstaff <[email protected]>
License: MIT + file LICENSE
Version: 0.1.2
Built: 2025-01-25 05:04:26 UTC
Source: https://github.com/jadonwagstaff/sboost

Help Index


sboost Assessment Function

Description

Assesses how well an sboost classifier classifies the data.

Usage

assess(object, features, outcomes, include_scores = FALSE)

Arguments

object

sboost_classifier S3 object output from sboost.

features

feature set data.frame.

outcomes

outcomes corresponding to the features.

include_scores

if true feature_scores are included in output.

Value

An sboost_assessment S3 object containing:

performance

Last row of cumulative statistics (i.e. when all stumps are included in assessment).

cumulative_statistics

stump - the index of the last decision stump added to the assessment.
true_positive - number of true positive predictions.
false_negative - number of false negative predictions.
true_negative - number of true negative predictions.
false_positive - number of false positive predictions.
prevalence - true positive / total.
accuracy - correct predictions / total.
sensitivity - correct predicted positive / true positive.
specificity - correct predicted negative / true negative.
ppv - correct predicted positive / predicted positive.
npv - correct predicted negative / predicted negative.
f1 - harmonic mean of sensitivity and ppv.

feature_scores

If include_scores is TRUE, for each feature in the classifier lists scores for each row in the feature set.

classifier

sboost sboost_classifier object used for assessment.

outcomes

Shows which outcome was considered as positive and which negative.

call

Shows the parameters that were used for assessment.

See Also

sboost documentation.

Examples

# malware
malware_classifier <- sboost(malware[-1], malware[1], iterations = 5, positive = 1)
assess(malware_classifier, malware[-1], malware[1])

# mushrooms
mushroom_classifier <- sboost(mushrooms[-1], mushrooms[1], iterations = 5, positive = "p")
assess(mushroom_classifier, mushrooms[-1], mushrooms[1])

Malware System Calls

Description

System call data for apps identified as malware and not malware.

Usage

malware

Format

A data frame with 7597 rows and 361 variables: outcomes 1 if malware, 0 if not. X1... X360 system calls.

Details

Experimental data generated in this research paper:

M. Dimjašević, S. Atzeni, I. Ugrina, and Z. Rakamarić, "Evaluation of Android Malware Detection Based on System Calls," in Proceedings of the International Workshop on Security and Privacy Analytics (IWSPA), 2016.

Data used for kaggle competition: https://www.kaggle.com/c/ml-fall2016-android-malware

Source

https://zenodo.org/record/154737#.WtoA1IjwaUl


Mushroom Classification

Description

A classic machine learning data set describing hypothetical samples from the Agaricus and Lepiota family.

Usage

mushrooms

Format

A data frame with 7597 rows and 361 variables:

outcomes

p=poisonous, e=edible

cap_shape

bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s

cap_surface

fibrous=f, grooves=g, scaly=y, smooth=s

cap_color

brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y

bruises

bruises=t, no=f

odor

almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s

gill_attachment

attached=a, descending=d, free=f, notched=n

gill_spacing

close=c, crowded=w, distant=d

gill_size

broad=b, narrow=n

gill_color

black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y

stalk_shape

enlarging=e, tapering=t

stalk_root

bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?

stalk_surface_above_ring

fibrous=f, scaly=y, silky=k, smooth=s

stalk_surface_below_ring

fibrous=f, scaly=y, silky=k, smooth=s

stalk_color_above_ring

brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y

stalk_color_below_ring

brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y

veil_type

partial=p, universal=u

veil_color

brown=n, orange=o, white=w, yellow=y

ring_number

none=n, one=o, two=t

ring_type

cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z

spore_print_color

black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y

population

abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y

habitat

grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d

Details

Data gathered from:

Mushroom records drawn from The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf

Source

https://archive.ics.uci.edu/ml/datasets/mushroom


Make predictions for a feature set based on an sboost classifier.

Description

Make predictions for a feature set based on an sboost classifier.

Usage

## S3 method for class 'sboost_classifier'
predict(object, features, scores = FALSE, ...)

Arguments

object

sboost_classifier S3 object output from sboost.

features

feature set data.frame.

scores

if true, raw scores generated; if false, predictions are generated.

...

further arguments passed to or from other methods.

Value

Predictions in the form of a vector, or scores in the form of a vector. The index of the vector aligns the predictions or scores with the rows of the features. Scores represent the sum of all votes for the positive outcome minus the sum of all votes for the negative outcome.

See Also

sboost documentation.

Examples

# malware
malware_classifier <- sboost(malware[-1], malware[1], iterations = 5, positive = 1)
predict(malware_classifier, malware[-1], scores = TRUE)
predict(malware_classifier, malware[-1])

# mushrooms
mushroom_classifier <- sboost(mushrooms[-1], mushrooms[1], iterations = 5, positive = "p")
predict(mushroom_classifier, mushrooms[-1], scores = TRUE)
predict(mushroom_classifier, mushrooms[-1])

sboost Learning Algorithm

Description

A machine learning algorithm using AdaBoost on decision stumps.

Usage

sboost(features, outcomes, iterations = 1, positive = NULL, verbose = FALSE)

Arguments

features

feature set data.frame.

outcomes

outcomes corresponding to the features.

iterations

number of boosts.

positive

the positive outcome to test for; if NULL, the first outcome in alphabetical (or numerical) order will be chosen.

verbose

If true, progress bar will be displayed in console.

Details

Factors and characters are treated as categorical features. Missing values are supported.

See https://jadonwagstaff.github.io/projects/sboost.html for a description of the algorithm.

For original paper describing AdaBoost see:

Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1), 119-139 (1997)

Value

An sboost_classifier S3 object containing:

classifier

stump - the index of the decision stump
feature - name of the column that this stump splits on.
vote - the weight that this stump has on the final classifier.
orientation - shows how outcomes are split. If feature is numeric shows split orientation, if feature value is less than split then vote is cast in favor of left side outcome, otherwise the vote is cast for the right side outcome. If feature is categorical, vote is cast for the left side outcome if feature value is found in left_categories, otherwise vote is cast for right side outcome.
split - if feature is numeric, the value where the decision stump splits the outcomes; otherwise, NA.
left_categories - if feature is categorical, shows the feature values that sway the vote to the left side outcome on the orientation split; otherwise, NA.

outcomes

Shows which outcome was considered as positive and which negative.

training

stumps - how many decision stumps were trained.
features - how many features the training set contained.
instances - how many instances or rows the training set contained.
positive_prevalence - what fraction of the training instances were positive.

call

Shows the parameters that were used to build the classifier.

See Also

predict.sboost_classifier - to get predictions from the classifier.

assess - to evaluate the performance of the classifier.

validate - to perform cross validation for the classifier training.

Examples

# malware
malware_classifier <- sboost(malware[-1], malware[1], iterations = 5, positive = 1)
malware_classifier
malware_classifier$classifier

# mushrooms
mushroom_classifier <- sboost(mushrooms[-1], mushrooms[1], iterations = 5, positive = "p")
mushroom_classifier
mushroom_classifier$classifier

sboost Validation Function

Description

A k-fold cross validation algorithm for sboost.

Usage

validate(
  features,
  outcomes,
  iterations = 1,
  k_fold = 6,
  positive = NULL,
  verbose = FALSE
)

Arguments

features

feature set data.frame.

outcomes

outcomes corresponding to the features.

iterations

number of boosts.

k_fold

number of cross-validation subsets.

positive

is the positive outcome to test for; if NULL, the first in alphabetical order will be chosen

verbose

If true, progress bars will be displayed in console.

Value

An sboost_validation S3 object containing:

performance

Final performance statistics for all stumps.

training_summary_statistics

Mean and standard deviations for test statistics generated by assess cumulative statistics for each of the training sets.

testing_summary_statistics

Mean and standard deviations for test statistics generated by assess cumulative statistics for each of the testing sets.

training_statistics

sboost sboost_assessment cumulative statistics objects used to generate training_statistics.

testing_statistics

sboost sboost_assessment cumulative statistics objects used to generate testing_statistics.

classifier_list

sboost sboost_classifier objects created from training sets.

outcomes

Shows which outcome was considered as positive and which negative.

k_fold

number of testing and training sets used in the validation.

call

Shows the parameters that were used for validation.

See Also

sboost documentation.

Examples

# malware
validate(malware[-1], malware[1], iterations = 5, k_fold = 3, positive = 1)

# mushrooms
validate(mushrooms[-1], mushrooms[1], iterations = 5, k_fold = 3, positive = "p")