Generates plots for BenchmarkAggr objects. All plot types assume that there are multiple independent tasks.
Choices depending on the argument type:
- "mean" (default): Assumes there are at least two independent tasks. Plots the sample mean of the measure for all learners, with error bars computed from the standard error of the mean.
- "box": Boxplots for each learner, calculated over all tasks for a given measure.
- "fn": Plots post-hoc Friedman-Nemenyi results by first calling BenchmarkAggr$friedman_posthoc, then drawing significant pairs as coloured squares and leaving non-significant pairs blank. Useful for simply visualising pairwise comparisons.
- "cd": Critical difference plots (Demšar, 2006). Learners are drawn on the x-axis according to their average rank, with the best performing on the left and performance decreasing to the right. Any learners not connected by a horizontal bar are significantly different in performance. Critical differences are calculated as: $$CD = q_{\alpha} \sqrt{\left(\frac{k(k+1)}{6N}\right)}$$ where \(q_\alpha\) is based on the studentized range statistic, \(k\) is the number of learners and \(N\) is the number of tasks. See references for further details. It's recommended to crop white space using external tools, or the function image_trim() from package magick.
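The critical difference formula above can be sketched in base R. This is a minimal illustration and not the package's internal code: the score matrix is made up, and taking \(q_\alpha\) as the studentized range quantile divided by \(\sqrt{2}\) follows the Nemenyi convention in Demšar (2006), which is assumed here.

```r
# Hypothetical scores (lower is better): rows = tasks, columns = learners.
scores = matrix(
  c(0.30, 0.10, 0.05,
    0.40, 0.20, 0.10,
    0.35, 0.15, 0.12,
    0.50, 0.25, 0.20),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("task", 1:4), c("featureless", "rpart", "xgboost"))
)

k = ncol(scores)  # number of learners
N = nrow(scores)  # number of tasks

# Rank learners within each task (rank 1 = best), then average over tasks.
avg_rank = colMeans(t(apply(scores, 1, rank)))

# Assumed Nemenyi critical value: studentized range quantile over sqrt(2).
alpha = 0.05
q_alpha = qtukey(1 - alpha, nmeans = k, df = Inf) / sqrt(2)

# Critical difference as in the formula above.
cd = q_alpha * sqrt(k * (k + 1) / (6 * N))

# Two learners differ significantly if their average ranks
# differ by more than the critical difference.
rank_diff = abs(outer(avg_rank, avg_rank, "-"))
significant = rank_diff > cd
```

With three learners and four tasks the critical difference is roughly 1.66, so in this made-up data only the pair with an average rank gap of 2 (featureless and xgboost) would be left unconnected in the plot.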
Arguments
- object
(BenchmarkAggr)
The benchmark aggregation object.
- type
(character(1))
Type of plot, see description.
- meas
(character(1))
Measure to plot, should be in object$measures. Can be NULL if only one measure is in object.
- level
(numeric(1))
Confidence level for error bars for type = "mean".
- p.value
(numeric(1))
The p-value threshold at which a difference is considered significant for type = "cd" and type = "fn".
- minimize
(logical(1))
For type = "cd", indicates if the measure is optimally minimized. Default is TRUE.
- test
(character(1))
For type = "cd", critical differences are either computed between all learners (test = "nemenyi"), or to a baseline (test = "bd"). Bonferroni-Dunn usually yields higher power than Nemenyi as it only compares algorithms to one baseline. Default is "nemenyi".
- baseline
(character(1))
For type = "cd" and test = "bd", a baseline learner to compare the other learners to, should be in $learners. If NULL, differences are compared to the best performing learner.
- style
(integer(1))
For type = "cd", two ggplot styles are shipped with the package (style = 1 or style = 2), otherwise the data can be accessed via the returned ggplot.
- ratio
(numeric(1))
For type = "cd" and style = 1, passed to ggplot2::coord_fixed(). Useful for quickly specifying the aspect ratio of the plot, best used with ggplot2::ggsave().
- col
(character(1))
For type = "fn", specifies the color used to fill significant tiles. Default is "red".
- friedman_global
(logical(1))
Should a global Friedman test be performed for type = "cd" and type = "fn"? If FALSE, a warning is issued instead of an error in case the corresponding Friedman post-hoc test fails. Default is TRUE (raises an error if the global test fails).
- ...
(ANY)
Additional arguments, currently unused.
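As a rough illustration of how level relates to the error bars for type = "mean", the sketch below computes a sample mean, its standard error, and a symmetric interval in base R. The scores are hypothetical, and the use of a normal quantile to scale the standard error is an assumption made for illustration, not a statement about the package's internal code.

```r
# Hypothetical scores of one learner, one value per independent task.
scores = c(0.10, 0.20, 0.15, 0.25)
level = 0.95

m = mean(scores)                        # sample mean over tasks
se = sd(scores) / sqrt(length(scores))  # standard error of the mean

# Assumption: the interval is mean +/- z * se with a normal quantile z.
z = qnorm(1 - (1 - level) / 2)
lower = m - z * se
upper = m + z * se
```

A higher level gives a larger z and therefore wider error bars, while more tasks shrink the standard error and narrow them.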
References
Demšar J (2006). “Statistical Comparisons of Classifiers over Multiple Data Sets.” Journal of Machine Learning Research, 7(1), 1-30. https://jmlr.org/papers/v7/demsar06a.html.
Examples
library(mlr3benchmark)  # provides as_benchmark_aggr() and the autoplot method

if (requireNamespaces(c("mlr3learners", "mlr3", "rpart", "xgboost"))) {
  library(mlr3)
  library(mlr3learners)
  library(ggplot2)

  set.seed(1)
  # four independent classification tasks and three learners
  task = tsks(c("iris", "sonar", "wine", "zoo"))
  learns = lrns(c("classif.featureless", "classif.rpart", "classif.xgboost"))
  learns$classif.xgboost$param_set$values$nrounds = 50
  # 3-fold cross-validation for every task-learner combination
  bm = benchmark(benchmark_grid(task, learns, rsmp("cv", folds = 3)))
  obj = as_benchmark_aggr(bm)

  # mean and error bars
  autoplot(obj, type = "mean", level = 0.95)

  if (requireNamespace("PMCMRplus", quietly = TRUE)) {
    # critical differences
    autoplot(obj, type = "cd", style = 1)
    autoplot(obj, type = "cd", style = 2)

    # post-hoc friedman-nemenyi
    autoplot(obj, type = "fn")
  }
}