Classification analysis of uniformly-handled data
uni.handled.simulate.Rd
Perform classification analysis on the uniformly-handled data by repeatedly re-assigning samples to training and test sets. More details can be found in Qin et al. (see References).
Arguments
- seed
an integer used to initialize a pseudorandom number generator.
- N
number of simulation runs.
- biological.effect
the estimated biological effect dataset. This dataset must have rows as probes and columns as samples.
- norm.list
a list of strings naming the normalization methods to be compared in the simulation study. The built-in normalization methods include "NN", "QN", "MN", and "VSN" for "No Normalization", "Quantile Normalization", "Median Normalization", and "Variance Stabilizing Normalization", respectively. Users can add their own normalization methods to the list, provided the corresponding functions are supplied (see also norm.funcs).
- class.list
a list of strings naming the classification methods to be compared in the simulation study. The built-in classification methods are "PAM" and "LASSO" for "prediction analysis for microarrays" and "least absolute shrinkage and selection operator", respectively. Users can add their own classification methods to the list, provided the corresponding model-building and predicting functions are supplied (see also class.funcs and pred.funcs).
- norm.funcs
a list of strings giving the names of user-defined normalization functions, in the order of norm.list, excluding any built-in normalization methods.
- class.funcs
a list of strings giving the names of user-defined classification model-building functions, in the order of class.list, excluding any built-in classification methods.
- pred.funcs
a list of strings giving the names of user-defined classification predicting functions, in the order of class.list, excluding any built-in classification methods.
Value
benchmark analysis results -- a list of training-and-test-set splits, fitted models, and misclassification error rates across simulation runs:
- assign_store
random training-and-test-set splits
- model_store
models for each combination of normalization methods and classification methods
- error_store
internal and external misclassification error rates for each combination of normalization methods and classification methods
Details
The analysis for the uniformly-handled dataset consists of the following main steps:
(1) randomly split the data into a training set and a test set, balanced by sample group of interest
(2) preprocess the training data and the test data
(3) build a classifier using the preprocessed training data
(4) assess the misclassification error rate of the classifier using the preprocessed test data
This analysis is repeated for N random splits into a training set and a test set.
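The balanced random split in step (1), repeated N times, can be sketched in base R as follows. This is only an illustration, not the package's internal code: the helper balanced_split, its train.frac argument (the split proportion is not stated here), and the toy group labels are all assumptions made for the example.

```r
# Hypothetical helper (not part of the package): split sample indices into
# training and test sets so that each group contributes the same fraction.
balanced_split <- function(group, train.frac = 0.5) {
  group <- as.factor(group)
  train.idx <- unlist(lapply(levels(group), function(g) {
    idx <- which(group == g)
    n.train <- floor(length(idx) * train.frac)
    # Index into idx rather than calling sample(idx, ...), which behaves
    # differently when idx has length one.
    idx[sample.int(length(idx), n.train)]
  }))
  list(train = sort(train.idx),
       test = setdiff(seq_along(group), train.idx))
}

set.seed(1)
group <- rep(c("E", "V"), times = c(10, 8))  # toy labels, assumed for the sketch
N <- 3                                       # number of simulation runs
splits <- lapply(seq_len(N), function(i) balanced_split(group))

# Group composition of the first training set
table(group[splits[[1]]$train])
```

Storing every split, as done here in splits, mirrors the role that assign_store plays in the returned value.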
Data preprocessing in (2) includes three steps: log2 transformation, normalization for training data and frozen normalization for test data, and probe-set summarization using median. Normalization methods are specified in norm.list.
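As a rough illustration of these three preprocessing steps, the base R sketch below uses quantile normalization and takes "frozen" to mean that the test data are mapped onto a reference distribution estimated from the training data only. That reading, the helper names quantile_to_ref and summarize_median, and the toy matrices are assumptions for the example; the package's own implementation may differ.

```r
# Toy data (assumed): probes in rows, samples in columns, with each
# probe set represented by two replicate probes sharing a row name.
set.seed(1)
probe.names <- rep(c("p1", "p2", "p3", "p4"), each = 2)
train <- matrix(rexp(40, rate = 0.01) + 1, nrow = 8,
                dimnames = list(probe.names, NULL))
test <- matrix(rexp(24, rate = 0.01) + 1, nrow = 8,
               dimnames = list(probe.names, NULL))

# Step 1: log2 transformation
log.train <- log2(train)
log.test <- log2(test)

# Step 2: quantile normalization. The reference distribution is the mean of
# the sorted training columns; it is estimated from the training data only.
ref <- rowMeans(apply(unname(log.train), 2, sort))

# Hypothetical helper: replace each value by the reference quantile of its rank.
quantile_to_ref <- function(x, ref) {
  out <- apply(x, 2, function(col) ref[rank(col, ties.method = "first")])
  dimnames(out) <- dimnames(x)
  out
}
norm.train <- quantile_to_ref(log.train, ref)
norm.test <- quantile_to_ref(log.test, ref)  # frozen: reuses the training reference

# Step 3: probe-set summarization, taking the median over replicate probes.
summarize_median <- function(x) {
  groups <- split(seq_len(nrow(x)), rownames(x))
  do.call(rbind, lapply(groups, function(i) {
    apply(x[i, , drop = FALSE], 2, median)
  }))
}
sum.train <- summarize_median(norm.train)
sum.test <- summarize_median(norm.test)
dim(sum.train)  # one row per probe set, one column per training sample
```

Reusing ref for the test data keeps information from the test samples out of the normalization applied before model building.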
Classifier building in (3) includes choosing the tuning parameter for each method using five-fold cross-validation and measuring classifier accuracy using the misclassification error rate. Classification methods are specified in class.list. The error rate is evaluated by both external validation on the test data and cross-validation on the training data. For user-defined normalization or classification methods, please refer to the vignette.
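The tuning and error-rate logic described above can be sketched in base R. To keep the example self-contained, a simple nearest-centroid classifier stands in for PAM and LASSO, with the number of retained features k as its tuning parameter. The classifier, the helper names, the grid of k values, and the simulated data (samples in rows here, unlike biological.effect) are all assumptions for illustration.

```r
set.seed(1)
n <- 40; p <- 50
y <- rep(c(0, 1), each = n / 2)
x <- matrix(rnorm(n * p), nrow = n)    # toy training data, samples in rows
x[y == 1, 1:5] <- x[y == 1, 1:5] + 2   # five informative features

# Stand-in classifier (not PAM or LASSO): nearest centroid on the k features
# with the largest absolute difference in class means.
fit_centroid <- function(x, y, k) {
  m0 <- colMeans(x[y == 0, , drop = FALSE])
  m1 <- colMeans(x[y == 1, , drop = FALSE])
  keep <- order(abs(m1 - m0), decreasing = TRUE)[seq_len(k)]
  list(keep = keep, m0 = m0[keep], m1 = m1[keep])
}
predict_centroid <- function(fit, newx) {
  z <- newx[, fit$keep, drop = FALSE]
  d0 <- rowSums(sweep(z, 2, fit$m0)^2)  # squared distance to class-0 centroid
  d1 <- rowSums(sweep(z, 2, fit$m1)^2)  # squared distance to class-1 centroid
  as.numeric(d1 < d0)
}
error_rate <- function(pred, truth) mean(pred != truth)

# Five folds, assigned separately within each class to keep them balanced.
folds <- integer(n)
for (g in unique(y)) {
  idx <- which(y == g)
  folds[idx] <- sample(rep_len(1:5, length(idx)))
}

# Cross-validated misclassification error for each candidate k.
k.grid <- c(1, 5, 10, 25, 50)
cv.error <- sapply(k.grid, function(k) {
  mean(sapply(1:5, function(f) {
    fit <- fit_centroid(x[folds != f, , drop = FALSE], y[folds != f], k)
    pred <- predict_centroid(fit, x[folds == f, , drop = FALSE])
    error_rate(pred, y[folds == f])
  }))
})
best.k <- k.grid[which.min(cv.error)]
internal.error <- min(cv.error)        # cross-validation error on training data

# External validation on independent toy test data.
y.test <- rep(c(0, 1), each = 10)
x.test <- matrix(rnorm(20 * p), nrow = 20)
x.test[y.test == 1, 1:5] <- x.test[y.test == 1, 1:5] + 2
final.fit <- fit_centroid(x, y, best.k)
external.error <- error_rate(predict_centroid(final.fit, x.test), y.test)
```

Reporting internal.error and external.error side by side corresponds to the two error rates described for error_store.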
References
Qin LX, Huang HC, Begg CB. Cautionary note on cross validation in molecular classification. Journal of Clinical Oncology. 2016.
Examples
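if (FALSE) {
# Estimate the biological effect from the uniformly-handled data
biological.effect <- estimate.biological.effect(uhdata = uhdata.pl)
# Identify control probes (row names containing "NC") and exclude them
ctrl.genes <- unique(rownames(uhdata.pl))[grep("NC", unique(rownames(uhdata.pl)))]
biological.effect.nc <- biological.effect[!rownames(biological.effect) %in% ctrl.genes, ]
# Run 3 random splits, comparing two normalization and two classification methods
uni.handled.results <- uni.handled.simulate(seed = 1, N = 3,
                                            biological.effect = biological.effect.nc,
                                            norm.list = c("NN", "QN"),
                                            class.list = c("PAM", "LASSO"))
}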