BIOSTATISTICS FALL FORUM 2010
February 17, 2010
by: Kevin Dobbin
Department of Epidemiology and Biostatistics
University of Georgia
"Optimally splitting cases for training and testing high dimensional microarray classifiers"
Abstract: We consider the problem of designing a study to develop a predictive classifier from high-dimensional data. A common study design is to split the sample into a training set and an independent validation set, where the former is used to develop the classifier and the latter to evaluate its performance. In this paper we address the question of what proportion of the samples should be devoted to the training set, and how this proportion affects the mean squared error (MSE) of the prediction accuracy estimate. Using a model-based simulation, we investigate the range of possible splits for a wide variety of gene expression distributions. We also develop a data-based non-parametric approach that can be applied to a specific dataset. These methods are based on a decomposition of the MSE into three intuitive component parts. By applying these approaches to a number of synthetic and real microarray datasets, we show that in most settings 40% to 80% of the samples should be devoted to the training set. The optimal proportion depends on the overall number of samples available, the number of differentially expressed genes, and the standardized fold change for informative genes. Over a wide range of settings, a 2/3-to-1/3 training-to-validation allocation was found to perform nearly as well as the optimal split and to be more robust than a 1/2-to-1/2 allocation. We also present a resampling approach for determining the best split, which can be applied to any dataset and any predictor development method.
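The question in the abstract, how the training proportion affects the MSE of the accuracy estimate, can be illustrated with a small simulation. The sketch below is not the method presented in the talk. It assumes independent unit-variance Gaussian genes, a plain nearest-centroid classifier, and that the quantity being estimated is the true accuracy of the classifier built on the full dataset. All function names and parameter values are hypothetical and chosen only for illustration.

```python
import math
import random
from statistics import NormalDist


def simulate(n, p, m, delta, rng):
    """Draw n samples with alternating class labels and p genes.
    Assumed model: the first m genes are shifted by delta in class 1."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([rng.gauss(delta if label == 1 and j < m else 0.0, 1.0)
                  for j in range(p)])
        y.append(label)
    return X, y


def fit_centroids(X, y):
    """Return the per-gene mean vector of class 0 and of class 1."""
    p = len(X[0])
    sums = [[0.0] * p, [0.0] * p]
    counts = [0, 0]
    for row, label in zip(X, y):
        counts[label] += 1
        for j, v in enumerate(row):
            sums[label][j] += v
    return [[s / counts[k] for s in sums[k]] for k in (0, 1)]


def predict(centroids, row):
    """Nearest-centroid rule written as a linear score around the midpoint."""
    c0, c1 = centroids
    score = sum((b - a) * (x - (a + b) / 2.0) for a, b, x in zip(c0, c1, row))
    return 1 if score > 0 else 0


def true_accuracy(centroids, m, delta):
    """Exact accuracy of the linear rule under the assumed Gaussian model,
    with equal class priors."""
    c0, c1 = centroids
    w = [b - a for a, b in zip(c0, c1)]
    mid = [(a + b) / 2.0 for a, b in zip(c0, c1)]
    norm = math.sqrt(sum(v * v for v in w))
    phi = NormalDist().cdf
    s0 = sum(wj * (0.0 - cj) for wj, cj in zip(w, mid))
    s1 = sum(wj * ((delta if j < m else 0.0) - cj)
             for j, (wj, cj) in enumerate(zip(w, mid)))
    return 0.5 * phi(-s0 / norm) + 0.5 * phi(s1 / norm)


def mse_of_split(train_frac, n, p, m, delta, reps, seed):
    """Monte Carlo MSE of the validation-set accuracy estimate, measured
    against the true accuracy of the full-data classifier (an assumption)."""
    rng = random.Random(seed)
    n_train = min(max(2, round(train_frac * n)), n - 1)
    total = 0.0
    for _ in range(reps):
        X, y = simulate(n, p, m, delta, rng)
        target = true_accuracy(fit_centroids(X, y), m, delta)
        # Samples are i.i.d., so the first n_train rows form a random split.
        model = fit_centroids(X[:n_train], y[:n_train])
        test_X, test_y = X[n_train:], y[n_train:]
        hits = sum(predict(model, r) == t for r, t in zip(test_X, test_y))
        total += (hits / len(test_y) - target) ** 2
    return total / reps


if __name__ == "__main__":
    for frac in (0.3, 0.5, 2 / 3, 0.8):
        mse = mse_of_split(frac, n=40, p=20, m=5, delta=1.0, reps=200, seed=0)
        print(f"train fraction {frac:.2f}: estimated MSE {mse:.4f}")
```

Varying n, m, and delta in such a sketch gives a rough feel for why the preferred split would depend on the sample size, the number of informative genes, and the effect size, as the abstract states. The numbers it prints reflect only this toy model.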