An R6 class for aggregated benchmark results.
Details
This class helps carry out and guide the analysis of models once the results of
resampling have been aggregated. It can be constructed either from mlr3 objects,
for example from the result of mlr3::BenchmarkResult$aggregate
or via as_benchmark_aggr,
or by passing in a custom dataset of results. Custom datasets must include, at the very least,
a character column for learner ids, a character column for task ids, and numeric columns for
one or more measures.
Currently supported for multiple independent datasets only.
References
Demšar J (2006). “Statistical Comparisons of Classifiers over Multiple Data Sets.” Journal of Machine Learning Research, 7(1), 1-30. https://jmlr.org/papers/v7/demsar06a.html.
Active bindings
data
(data.table::data.table)
Aggregated data.
learners
(character())
Unique learner names.
tasks
(character())
Unique task names.
measures
(character())
Unique measure names.
nlrns
(integer())
Number of learners.
ntasks
(integer())
Number of tasks.
nmeas
(integer())
Number of measures.
nrow
(integer())
Number of rows.
col_roles
(character())
Column roles, currently cannot be changed after construction.
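The bindings above can be inspected on a small object. The following is a minimal sketch, assuming mlr3benchmark is installed; the data frame res and its values are made up for illustration and are not taken from the package.

```r
library(mlr3benchmark)

# Hypothetical aggregated results: 3 learners on 4 tasks, one measure.
res = data.frame(
  task_id = rep(c("t1", "t2", "t3", "t4"), each = 3),
  learner_id = rep(c("lm", "rpart", "featureless"), times = 4),
  rmse = c(0.20, 0.35, 0.90, 0.25, 0.30, 0.85,
           0.15, 0.40, 0.95, 0.22, 0.33, 0.80)
)
ba = BenchmarkAggr$new(res)

# Read-only views of the aggregated data
ba$learners   # unique learner names
ba$tasks      # unique task names
ba$measures   # unique measure names
c(ba$nlrns, ba$ntasks, ba$nmeas, ba$nrow)
```

The counts are derived from the data, so they change only if a new object is constructed from different data.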
Methods
Method new()
Creates a new instance of this R6 class.
Usage
BenchmarkAggr$new(
dt,
task_id = "task_id",
learner_id = "learner_id",
independent = TRUE,
strip_prefix = TRUE,
...
)
Arguments
dt
(matrix(1))
A matrix-like object coercible to data.table::data.table. It should include columns named "task_id" and "learner_id" (or the names given by the task_id and learner_id arguments) and at least one numeric measure column. If the ids are not already factors, they are coerced internally.
task_id
(character(1))
String specifying the name of the task id column.
learner_id
(character(1))
String specifying the name of the learner id column.
independent
(logical(1))
Are tasks independent of one another? Affects which tests can be used for analysis.
strip_prefix
(logical(1))
If TRUE (default) then mlr prefixes, e.g. regr., classif., are automatically stripped from the learner_id.
...
ANY
Additional arguments, currently unused.
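As an illustration of the task_id, learner_id and strip_prefix arguments, here is a minimal sketch assuming mlr3benchmark is installed. The column names dataset and model and all values are hypothetical.

```r
library(mlr3benchmark)

# Hypothetical results with non-default id column names and mlr-style prefixes.
res = data.frame(
  dataset = rep(c("t1", "t2"), each = 2),
  model = rep(c("regr.lm", "regr.rpart"), times = 2),
  rmse = c(0.20, 0.35, 0.25, 0.30)
)

# Default: prefixes such as "regr." are stripped from the learner ids
ba_stripped = BenchmarkAggr$new(res, task_id = "dataset", learner_id = "model")
ba_stripped$learners

# Keep the learner ids exactly as supplied
ba_raw = BenchmarkAggr$new(res, task_id = "dataset", learner_id = "model",
  strip_prefix = FALSE)
ba_raw$learners
```

Keeping the prefixes can be useful when the same algorithm appears under more than one prefix and the ids would otherwise collide.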
Method print()
Prints the internal data via data.table::print.data.table.
Arguments
...
ANY
Passed to data.table::print.data.table.
Method summary()
Prints the internal data via data.table::print.data.table.
Arguments
...
ANY
Passed to data.table::print.data.table.
Method rank_data()
Ranks the aggregated data given some measure.
Arguments
meas
(character(1))
Measure to rank the data against, should be in $measures. Can be NULL if only one measure in data.
minimize
(logical(1))
Should the measure be minimized? Default is TRUE.
task
(character(1))
If NULL then returns a matrix of ranks where columns are tasks and rows are learners, otherwise returns a one-column matrix for the specified task, which should be in $tasks.
...
ANY
Passed to data.table::frank().
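A minimal sketch of rank_data(), assuming mlr3benchmark is installed; the data are made up and contain no ties within a task.

```r
library(mlr3benchmark)

# Hypothetical aggregated results: 3 learners on 4 tasks, one measure.
res = data.frame(
  task_id = rep(c("t1", "t2", "t3", "t4"), each = 3),
  learner_id = rep(c("lm", "rpart", "featureless"), times = 4),
  rmse = c(0.20, 0.35, 0.90, 0.25, 0.30, 0.85,
           0.15, 0.40, 0.95, 0.22, 0.33, 0.80)
)
ba = BenchmarkAggr$new(res)

# Ranks for every task: rows are learners, columns are tasks
ranks = ba$rank_data(meas = "rmse", minimize = TRUE)
ranks

# Ranks for a single task only: a one-column matrix
ranks_t1 = ba$rank_data(meas = "rmse", task = "t1")
ranks_t1
```

For a measure where larger values are better (accuracy, for instance), minimize = FALSE would be the appropriate setting.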
Method friedman_test()
Computes Friedman test over all tasks, assumes datasets are independent.
Arguments
meas
(character(1))
Measure to rank the data against, should be in $measures. If no measure is provided then returns a matrix of tests for all measures.
p.adjust.method
(character(1))
Passed to p.adjust if meas = NULL for multiple testing correction. If NULL then no correction applied.
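A minimal sketch of both ways of calling friedman_test(), assuming mlr3benchmark is installed. The data are hypothetical, and reading the p-value with $p.value assumes the single-measure result behaves like the output of stats::friedman.test, which this page does not state explicitly.

```r
library(mlr3benchmark)

# Hypothetical aggregated results: 3 learners on 4 tasks, two measures.
res = data.frame(
  task_id = rep(c("t1", "t2", "t3", "t4"), each = 3),
  learner_id = rep(c("lm", "rpart", "featureless"), times = 4),
  rmse = c(0.20, 0.35, 0.90, 0.25, 0.30, 0.85,
           0.15, 0.40, 0.95, 0.22, 0.33, 0.80),
  mae = c(0.10, 0.21, 0.70, 0.12, 0.19, 0.66,
          0.08, 0.25, 0.72, 0.11, 0.20, 0.61)
)
ba = BenchmarkAggr$new(res)

# Global test for a single measure
ft = ba$friedman_test(meas = "rmse")
ft
ft$p.value  # assumption: result exposes a p.value element

# Tests for all measures at once, with Holm correction across measures
ft_all = ba$friedman_test(p.adjust.method = "holm")
ft_all
```

The correction only matters in the second call, because a single-measure test involves no multiple testing across measures.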
Method friedman_posthoc()
Post-hoc Friedman-Nemenyi tests, computed with
PMCMRplus::frdAllPairsNemenyiTest. If the global $friedman_test
is non-significant then
this is returned and no post-hocs are computed. Also returns the critical difference.
Arguments
meas
(character(1))
Measure to rank the data against, should be in $measures. Can be NULL if only one measure in data.
p.value
(numeric(1))
p.value for which the global test will be considered significant.
friedman_global
(logical(1))
Should a Friedman global test be performed before conducting the post-hoc test? If FALSE, a warning is issued in case the corresponding Friedman global test fails instead of an error. Default is TRUE (raises an error if global test fails).
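A minimal sketch of a post-hoc call, assuming both mlr3benchmark and PMCMRplus are installed. The data are hypothetical and constructed so that the learners are ranked identically on every task, which should make the global test significant at the 0.05 level; the structure of the returned object is not described on this page, so the sketch only prints it.

```r
library(mlr3benchmark)

# Hypothetical aggregated results: 3 learners on 4 tasks, one measure.
res = data.frame(
  task_id = rep(c("t1", "t2", "t3", "t4"), each = 3),
  learner_id = rep(c("lm", "rpart", "featureless"), times = 4),
  rmse = c(0.20, 0.35, 0.90, 0.25, 0.30, 0.85,
           0.15, 0.40, 0.95, 0.22, 0.33, 0.80)
)
ba = BenchmarkAggr$new(res)

# The post-hoc tests rely on PMCMRplus, so guard the call.
if (requireNamespace("PMCMRplus", quietly = TRUE)) {
  ph = ba$friedman_posthoc(meas = "rmse", p.value = 0.05)
  print(ph)
}
```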
Method subset()
Subsets the data by given tasks or learners. Returns data as data.table::data.table.
Arguments
task
(character())
Task(s) to subset the data by.
learner
(character())
Learner(s) to subset the data by.
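A minimal sketch of subset(), assuming mlr3benchmark is installed; the data are made up for illustration.

```r
library(mlr3benchmark)

# Hypothetical aggregated results: 3 learners on 4 tasks, one measure.
res = data.frame(
  task_id = rep(c("t1", "t2", "t3", "t4"), each = 3),
  learner_id = rep(c("lm", "rpart", "featureless"), times = 4),
  rmse = c(0.20, 0.35, 0.90, 0.25, 0.30, 0.85,
           0.15, 0.40, 0.95, 0.22, 0.33, 0.80)
)
ba = BenchmarkAggr$new(res)

# Rows for two of the four tasks (all learners kept)
by_task = ba$subset(task = c("t1", "t2"))
by_task

# Rows for a single learner (all tasks kept)
by_learner = ba$subset(learner = "lm")
by_learner
```

Because the method returns a data.table::data.table rather than a BenchmarkAggr, the result would need to be passed to BenchmarkAggr$new() or as_benchmark_aggr again before calling the test methods on it.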
Examples
# Not restricted to mlr3 objects
df = data.frame(tasks = factor(rep(c("A", "B"), each = 5),
                               levels = c("A", "B")),
                learners = factor(paste0("L", 1:5)),
                RMSE = runif(10), MAE = runif(10))
as_benchmark_aggr(df, task_id = "tasks", learner_id = "learners")
#> <BenchmarkAggr> of 10 rows with 2 tasks, 5 learners and 2 measures
#>      tasks learners        RMSE        MAE
#>     <fctr>   <fctr>       <num>      <num>
#>  1:      A       L1 0.080750138 0.87460066
#>  2:      A       L2 0.834333037 0.17494063
#>  3:      A       L3 0.600760886 0.03424133
#>  4:      A       L4 0.157208442 0.32038573
#>  5:      A       L5 0.007399441 0.40232824
#>  6:      B       L1 0.466393497 0.19566983
#>  7:      B       L2 0.497777389 0.40353812
#>  8:      B       L3 0.289767245 0.06366146
#>  9:      B       L4 0.732881987 0.38870131
#> 10:      B       L5 0.772521511 0.97554784
if (requireNamespaces(c("mlr3", "rpart"))) {
  library(mlr3)
  task = tsks(c("pima", "spam"))
  learns = lrns(c("classif.featureless", "classif.rpart"))
  bm = benchmark(benchmark_grid(task, learns, rsmp("cv", folds = 2)))
  # coercion
  as_benchmark_aggr(bm)
}
#> <BenchmarkAggr> of 4 rows with 2 tasks, 2 learners and 1 measure
#>    task_id  learner_id        ce
#>     <fctr>      <fctr>     <num>
#> 1:    pima featureless 0.3489583
#> 2:    pima       rpart 0.2486979
#> 3:    spam featureless 0.3940450
#> 4:    spam       rpart 0.1075876