precision simulation with multi-classification
precision.simulate.multiclass.Rd
Multi-classification analysis applied under different normalization methods and study designs. Performs the simulation study in Qin et al. (see References).
Usage
precision.simulate.multiclass(
seed,
N,
biological.effect.tr,
biological.effect.te,
handling.effect.tr,
handling.effect.te,
group.id.tr,
group.id.te,
train.design.met = "NONE",
test.design.met = "NONE",
train.norm.met = "NN",
test.norm.met = "NN",
class.list = c("SVM", "kNN", "LASSO"),
train.batch.id = NULL,
test.batch.id = NULL,
icombat = FALSE,
isva = FALSE,
iruv = FALSE,
biological.effect.tr.ctrl = NULL,
handling.effect.tr.ctrl = NULL,
norm.funcs = NULL,
class.funcs = NULL,
pred.funcs = NULL
)
Arguments
- seed
an integer used to initialize a pseudorandom number generator.
- N
number of simulation runs.
- biological.effect.tr
the training set of the estimated biological effects. This dataset must have rows as probes and columns as samples.
- biological.effect.te
the test set of the estimated biological effects. This dataset must have rows as probes and columns as samples. It must have the same number of probes and the same probe names as the training set of the estimated biological effects.
- handling.effect.tr
the training set of the estimated handling effects. This dataset must have rows as probes and columns as samples. It must have the same dimensions and the same probe names as the training set of the estimated biological effects.
- handling.effect.te
the test set of the estimated handling effects. This dataset must have rows as probes and columns as samples. It must have the same dimensions and the same probe names as the training set of the estimated handling effects.
- group.id.tr
a vector of sample-group labels for each sample of the training set of the estimated biological effects. It must be a 2-level non-numeric factor vector.
- group.id.te
a vector of sample-group labels for each sample of the test set of the estimated biological effects. It must be a 2-level non-numeric factor vector.
- train.design.met
a string for the study design to be applied to the training set. The built-in designs are "NONE", "CC+", "CC-", "PC+", "PC-", "BLK", and "STR", for "No Rehybridization", "Complete Confounding 1", "Complete Confounding 2", "Partial Confounding 1", "Partial Confounding 2", "Blocking", and "Stratification" in Qin et al.
- test.design.met
a string for the study design to be applied to the test set. The built-in designs are "NONE", "CC+", "CC-", "PC+", "PC-", "BLK", and "STR", for "No Rehybridization", "Complete Confounding 1", "Complete Confounding 2", "Partial Confounding 1", "Partial Confounding 2", "Blocking", and "Stratification" in Qin et al.
- train.norm.met
a string for the normalization method to be applied to the training set. The built-in normalization methods are "NN", "QN", "MN", and "VSN", for "No Normalization", "Quantile Normalization", "Median Normalization", and "Variance Stabilizing Normalization". Users can provide a list of normalization methods given that the functions are supplied (also see norm.funcs).
- test.norm.met
a string for the normalization method to be applied to the test set. The built-in normalization methods are "NN", "MN", "QN", "fMN", "fQN", "pMN", "pQN", and "fVSN", for "No Normalization", "Median Normalization", "Quantile Normalization", "Frozen Median Normalization", "Frozen Quantile Normalization", "Pool Median Normalization", "Pool Quantile Normalization", and "Frozen Variance Stabilizing Normalization". Users can provide a list of normalization methods given that the functions are supplied (also see norm.funcs).
- class.list
list of strings for classification methods to be compared in the simulation study. The built-in classification methods are "PAM", "LASSO", "ClaNC", "ranFor", "SVM", "kNN", and "DLDA", for "Prediction Analysis for Microarrays", "Least Absolute Shrinkage and Selection Operator", "Classification to Nearest Centroids", "Random Forest", "Support Vector Machine", "K-Nearest Neighbors", and "Diagonal Linear Discriminant Analysis". Users can provide a list of classification methods given that the corresponding model-building and predicting functions are supplied (also see class.funcs and pred.funcs). You can also use your own classification method by setting class.met as "custom". For the format of custom.intcv and custom.predict, please refer to the other classification methods in this package.
- icombat
an indicator for ComBat adjustment. By default, icombat = FALSE for no ComBat adjustment.
- isva
an indicator for sva adjustment. By default, isva = FALSE for no sva adjustment.
- iruv
an indicator for RUV-4 adjustment. By default, iruv = FALSE for no RUV-4 adjustment.
- biological.effect.tr.ctrl
the training set of the negative-control probe biological effect data if iruv = TRUE. This dataset must have rows as probes and columns as samples. It must also have the same number of samples and the same sample names as biological.effect.tr.
- handling.effect.tr.ctrl
the training set of the negative-control probe handling effect data if iruv = TRUE. This dataset must have rows as probes and columns as samples. It must also have the same dimensions and the same probe names as biological.effect.tr.ctrl.
- norm.funcs
a list of strings for names of user-defined normalization method functions, in the order of norm.list, excluding any built-in normalization methods.
- class.funcs
a list of strings for names of user-defined classification model-building functions, in the order of class.list, excluding any built-in classification methods.
- pred.funcs
a list of strings for names of user-defined classification predicting functions, in the order of class.list, excluding any built-in classification methods.
- batch.id
a list of array indices grouped by batches when data were profiled. The length of the list must be equal to the number of batches in the data; the number of array indices must be the same as the number of samples. This is required if stratification study design is specified in design.list; otherwise batch.id = NULL.
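The layout constraints listed above can be checked before starting a long simulation. The following is a minimal sketch in base R; check_effect_inputs is a hypothetical helper that is not part of the package, and the toy matrices only stand in for real effect estimates.

```r
# Hypothetical helper (not part of the package): verifies the layout
# constraints described in the Arguments section.
check_effect_inputs <- function(bio.tr, bio.te, hand.tr, hand.te,
                                group.id.tr, group.id.te) {
  stopifnot(
    # training and test biological effects share the same probes
    identical(rownames(bio.te), rownames(bio.tr)),
    # training handling effects match the training biological effects
    identical(dim(hand.tr), dim(bio.tr)),
    identical(rownames(hand.tr), rownames(bio.tr)),
    # test handling effects share the probe names of the training set
    identical(rownames(hand.te), rownames(hand.tr)),
    # one label per sample, two non-numeric levels
    length(group.id.tr) == ncol(bio.tr),
    length(group.id.te) == ncol(bio.te),
    !is.numeric(group.id.tr), length(unique(group.id.tr)) == 2,
    !is.numeric(group.id.te), length(unique(group.id.te)) == 2
  )
  invisible(TRUE)
}

# Toy data: 5 probes, 8 training samples, 4 test samples
set.seed(1)
probes <- paste0("probe", 1:5)
toy <- function(n) {
  matrix(rnorm(5 * n), nrow = 5,
         dimnames = list(probes, paste0("s", seq_len(n))))
}
bio.tr <- toy(8); hand.tr <- toy(8)
bio.te <- toy(4); hand.te <- toy(4)

ok <- check_effect_inputs(bio.tr, bio.te, hand.tr, hand.te,
                          group.id.tr = rep(c("E", "V"), each = 4),
                          group.id.te = rep(c("E", "V"), each = 2))
```

A failed check stops with the offending condition, which is usually easier to read than an error raised deep inside a simulation run.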
Value
simulation study results -- a list of array-to-sample assignments, fitted models, and misclassification error rates across simulation runs:
- assign_store
array-to-sample assignments for the study design, classified by "Train" and "Test"
- model_store
models for the combination of study designs, normalization methods, and classification methods
- error_store
internal and external misclassification error rates for the combination of study designs, normalization methods, and classification methods, classified by "Train" and "Test"
- ari_store
adjusted Rand index of the prediction on test data
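For readers unfamiliar with the adjusted Rand index, the sketch below computes it in base R from the contingency table of true and predicted labels. The function name adjusted_rand_index is an assumption made for illustration; the package may compute ari_store differently.

```r
# Illustrative base R implementation of the adjusted Rand index
# (hypothetical helper, not the package's own code).
adjusted_rand_index <- function(truth, pred) {
  tab <- table(truth, pred)
  n <- sum(tab)
  sum.ij <- sum(choose(tab, 2))            # pairs grouped together in both
  sum.a  <- sum(choose(rowSums(tab), 2))   # pairs together in truth
  sum.b  <- sum(choose(colSums(tab), 2))   # pairs together in pred
  expected  <- sum.a * sum.b / choose(n, 2)
  max.index <- (sum.a + sum.b) / 2
  # NaN when both partitions consist of a single group
  (sum.ij - expected) / (max.index - expected)
}

truth <- c("E", "E", "E", "V", "V", "V")
pred  <- c("E", "E", "V", "V", "V", "V")
ari.same  <- adjusted_rand_index(truth, truth)
ari.mixed <- adjusted_rand_index(truth, pred)
```

The index equals 1 when the two labelings agree up to a renaming of the groups and is close to 0 for agreement at chance level.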
Details
The classification analysis in the simulation study consists of the following main steps:
First, training and test sets are generated in the same way as in precision.simulate. precision.simulate.multiclass requires training and test sets for both the estimated biological effects and the estimated handling effects. The effects can be estimated as follows (using estimate.biological.effect and estimate.handling.effect). The uniformly-handled dataset is used to approximate the biological effect for each sample, and the difference between the two arrays for the same sample (one from the uniformly-handled dataset and the other from the nonuniformly-handled dataset, subtracting the former from the latter) is used to approximate the handling effect for each array in the nonuniformly-handled dataset.
The samples are randomly split into a training set and a test set, balanced by tumor type (in Qin et al., the training-to-test ratio is 2:1). The arrays are then non-randomly split into a training set and a test set (in Qin et al., training set n = 128, the first 64 and last 64 arrays in the order of array processing; test set n = 64, the middle 64 arrays). This setup allows different pairings of arrays and samples across different training-and-test-set splits. Furthermore, the biological signal strength and the confounding level of the handling effects can be modified (using reduce.signal and amplify.handling.effect).
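The first step can be sketched with toy data in base R. This is only an illustration of the idea: the matrices uh and nuh are simulated stand-ins for the uniformly-handled and nonuniformly-handled datasets, their columns are assumed to be matched by sample, and the package functions estimate.biological.effect and estimate.handling.effect are not called.

```r
set.seed(101)
n.probe <- 10
n.sample <- 192
group.id <- rep(c("E", "V"), each = n.sample / 2)

# Toy uniformly-handled (uh) and nonuniformly-handled (nuh) data;
# column j of both matrices is assumed to belong to the same sample.
uh <- matrix(rnorm(n.probe * n.sample), nrow = n.probe,
             dimnames = list(paste0("p", seq_len(n.probe)),
                             paste0("s", seq_len(n.sample))))
nuh <- uh + matrix(rnorm(n.probe * n.sample, sd = 0.5), nrow = n.probe)

# Biological effect: the uniformly-handled data.
# Handling effect: nonuniformly-handled minus uniformly-handled.
biological.effect <- uh
handling.effect <- nuh - uh

# Random sample split, balanced by group, with a 2:1 ratio
train.samples <- c(sample(which(group.id == "E"), 64),
                   sample(which(group.id == "V"), 64))
test.samples <- setdiff(seq_len(n.sample), train.samples)

# Non-random array split: first and last 64 arrays versus the middle 64
train.arrays <- c(1:64, 129:192)
test.arrays <- 65:128

biological.effect.tr <- biological.effect[, train.samples]
biological.effect.te <- biological.effect[, test.samples]
handling.effect.tr <- handling.effect[, train.arrays]
handling.effect.te <- handling.effect[, test.arrays]
```

Because samples and arrays are split independently, any training sample can later be paired with any training array, which is what the re-hybridization step relies on.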
Second, apply "virtual re-hybridization" (using rehybridize) to the training and test sets. There are six methods to choose from, besides no rehybridization, and different methods can be applied to the training set and the test set. These are specified in train.design.met and test.design.met.
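Conceptually, virtual re-hybridization pairs each sample with an array and adds the handling effect of that array to the biological effect of the sample. The sketch below illustrates this under that assumption with two toy assignments; the helper rehyb and both assignment vectors are hypothetical and do not reproduce the package's rehybridize function or its exact design definitions.

```r
set.seed(7)
n.probe <- 6
n <- 8
group.id <- rep(c("E", "V"), each = n / 2)
bio  <- matrix(rnorm(n.probe * n), nrow = n.probe)            # columns = samples
hand <- matrix(rnorm(n.probe * n, sd = 0.3), nrow = n.probe)  # columns = arrays

# Fully confounded toy assignment: group E gets the earliest arrays,
# group V gets the latest ones.
assign.confounded <- integer(n)
assign.confounded[group.id == "E"] <- 1:4
assign.confounded[group.id == "V"] <- 5:8

# Randomized toy assignment: arrays are shuffled across all samples.
assign.random <- sample(n)

# Hypothetical helper: sample j receives the handling effect of the
# array assigned to it.
rehyb <- function(bio, hand, array.for.sample) {
  bio + hand[, array.for.sample, drop = FALSE]
}

sim.confounded <- rehyb(bio, hand, assign.confounded)
sim.random <- rehyb(bio, hand, assign.random)
```

With the confounded assignment, any drift in handling over processing order lines up with the group labels; the randomized assignment breaks that link.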
Third, apply normalization to the training and test sets. Three normalization methods are available for the training set: median normalization, quantile normalization, and variance stabilizing normalization. For each of these, the test set can be processed with the same method or with the corresponding frozen or pool normalization. Alternatively, either set can be left unnormalized. The normalization methods are specified in train.norm.met and test.norm.met. Batch effects can additionally be adjusted for, as specified with icombat, isva, and iruv.
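The difference between normalizing a set on its own and "frozen" normalization can be sketched with median normalization in base R. The helper median_norm is hypothetical, and the choice of the mean of the array medians as the reference is an assumption; the package's "MN" and "fMN" implementations may differ.

```r
# Hypothetical helper: shifts each array (column) so that its median
# equals a reference value. With ref = NULL the reference is taken
# from the data itself; passing a ref "freezes" it to another set.
median_norm <- function(x, ref = NULL) {
  array.med <- apply(x, 2, median)
  if (is.null(ref)) ref <- mean(array.med)
  list(data = sweep(x, 2, array.med - ref), ref = ref)
}

set.seed(3)
# Toy data with array-specific shifts (rows = probes, columns = arrays)
train <- matrix(rnorm(50 * 6), nrow = 50) + rep(c(0, 1, 2, 0, 1, 2), each = 50)
test  <- matrix(rnorm(50 * 3), nrow = 50) + rep(c(3, 4, 5), each = 50)

# Median normalization on the training set
train.mn <- median_norm(train)

# Frozen median normalization: test arrays are aligned to the
# reference learned from the training set.
test.fmn <- median_norm(test, ref = train.mn$ref)
```

After this step every training and test array has the same median, and the test set has not influenced the reference value.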
Fourth, choose classification methods, specified with class.list. The chosen classification methods are fitted on the training set, each with 5-fold cross-validation, and used to predict on the test set. The internal and external validation misclassification error estimates and the adjusted Rand index on the test set are all included in the output.
For a given split of samples into training set versus test set, N datasets will be simulated and analyzed for each array-assignment scheme.
For user-defined normalization or classification methods, please refer to the vignette.
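The relation between the internal (cross-validated) and external (test set) error rates can be sketched in base R. A simple nearest-centroid rule is used here as a stand-in classifier; it is not one of the package's implementations, and fit_centroid, predict_centroid, and make_set are hypothetical names.

```r
set.seed(11)
n.probe <- 20

# Toy data: rows = probes, columns = samples; the first 5 probes
# carry a group difference.
make_set <- function(n.per.group) {
  y <- rep(c("E", "V"), each = n.per.group)
  x <- matrix(rnorm(n.probe * 2 * n.per.group), nrow = n.probe)
  x[1:5, y == "V"] <- x[1:5, y == "V"] + 2
  list(x = x, y = y)
}
train <- make_set(30)
test <- make_set(15)

# Stand-in classifier: one centroid (probe-wise mean) per group
fit_centroid <- function(x, y) {
  sapply(split(seq_along(y), y), function(i) rowMeans(x[, i, drop = FALSE]))
}
# Assign each sample to the group with the nearest centroid
predict_centroid <- function(centroids, x) {
  d <- do.call(cbind, lapply(colnames(centroids), function(g) {
    colSums((x - centroids[, g])^2)
  }))
  colnames(centroids)[max.col(-d, ties.method = "first")]
}

# Internal error: 5-fold cross-validation within the training set
k <- 5
fold <- sample(rep(seq_len(k), length.out = length(train$y)))
cv.pred <- character(length(train$y))
for (f in seq_len(k)) {
  held <- fold == f
  fit <- fit_centroid(train$x[, !held, drop = FALSE], train$y[!held])
  cv.pred[held] <- predict_centroid(fit, train$x[, held, drop = FALSE])
}
internal.error <- mean(cv.pred != train$y)

# External error: fit on the full training set, predict the test set
final.fit <- fit_centroid(train$x, train$y)
external.error <- mean(predict_centroid(final.fit, test$x) != test$y)
```

In the simulation study, these two quantities are recorded for every combination of study design, normalization method, and classifier, so that the cross-validated estimate can be compared with the error on independent test data.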
References
Qin LX, Huang HC, Begg CB. Cautionary note on cross validation in molecular classification. Journal of Clinical Oncology. 2016.
Examples
if (FALSE) {
  set.seed(101)
  # estimate biological and handling effects from the paired datasets
  biological.effect <- estimate.biological.effect(uhdata = uhdata.pl)
  handling.effect <- estimate.handling.effect(uhdata = uhdata.pl,
                                              nuhdata = nuhdata.pl)
  # remove negative-control probes (names containing "NC")
  ctrl.genes <- unique(rownames(uhdata.pl))[grep("NC", unique(rownames(uhdata.pl)))]
  biological.effect.nc <- biological.effect[!rownames(biological.effect) %in% ctrl.genes, ]
  handling.effect.nc <- handling.effect[!rownames(handling.effect) %in% ctrl.genes, ]
  group.id <- substr(colnames(biological.effect.nc), 7, 7)
  # halve the biological signal (the result is not used further in this example)
  redhalf.biological.effect.nc <- reduce.signal(
    biological.effect = biological.effect.nc,
    group.id = group.id,
    group.id.level = c("E", "V"),
    reduce.multiplier = 1/2)
  # randomly split biological effect data into training and test set with
  # equal number of endometrial and ovarian samples
  biological.effect.train.ind <- colnames(biological.effect.nc)[
    c(sample(which(group.id == "E"), size = 64),
      sample(which(group.id == "V"), size = 64))]
  biological.effect.test.ind <- colnames(biological.effect.nc)[
    !colnames(biological.effect.nc) %in% biological.effect.train.ind]
  biological.effect.train.test.split <-
    list("tr" = biological.effect.train.ind,
         "te" = biological.effect.test.ind)
  # non-randomly split handling effect data into training and test set
  handling.effect.train.test.split <-
    list("tr" = c(1:64, 129:192),
         "te" = 65:128)
  biological.effect.nc.tr <- biological.effect.nc[, biological.effect.train.ind]
  biological.effect.nc.te <- biological.effect.nc[, biological.effect.test.ind]
  handling.effect.nc.tr <- handling.effect.nc[, c(1:64, 129:192)]
  handling.effect.nc.te <- handling.effect.nc[, 65:128]
  # Simulation: blocking design with median normalization on the training set,
  # stratification design with frozen median normalization on the test set
  precision.result <- precision.simulate.multiclass(
    seed = 0, N = 3,
    biological.effect.tr = biological.effect.nc.tr,
    biological.effect.te = biological.effect.nc.te,
    handling.effect.tr = handling.effect.nc.tr,
    handling.effect.te = handling.effect.nc.te,
    group.id.tr = substr(colnames(biological.effect.nc.tr), 7, 7),
    group.id.te = substr(colnames(biological.effect.nc.te), 7, 7),
    train.design.met = "BLK",
    test.design.met = "STR",
    train.norm.met = "MN",
    test.norm.met = "fMN",
    class.list = c("LASSO"),
    # batch indices are re-based so that they run over 1..n within each set
    train.batch.id = list(1:40, 41:64, (129:152) - 64, (153:192) - 64),
    test.batch.id = list((65:80) - 64, (81:114) - 64, (115:128) - 64))
}