Department of Epidemiology and Biostatistics
UNIVERSITY OF SOUTH CAROLINA

Arnold School of Public Health

BIOSTATISTICS FALL FORUM 2010

February 17, 2010

by: Kevin Dobbin
Department of Epidemiology and Biostatistics
University of Georgia

"Optimally splitting cases for training and testing high dimensional microarray classifiers"

Abstract: We consider the problem of designing a study to develop a predictive classifier from high dimensional data. A common study design is to split the sample into a training set and an independent validation set, where the former is used to develop the classifier and the latter to evaluate its performance. In this paper we address the question of what proportion of the samples should be devoted to the training set, and how this proportion affects the mean squared error (MSE) of the prediction accuracy estimate. Using a model-based simulation, we investigate the range of possible splits for a wide variety of gene expression distributions. We also develop a data-based non-parametric approach that can be applied to a specific dataset. These methods are based on a decomposition of the MSE into three intuitive component parts. By applying these approaches to a number of synthetic and real microarray datasets, we show that in most settings 40% to 80% of the samples should be devoted to the training set. The optimal proportion depends on the overall number of samples available, the number of differentially expressed genes, and the standardized fold change for informative genes. Over a wide range of settings, we found that a 2/3-to-1/3 training-to-validation allocation performs nearly as well as the optimal split and is more robust than a 1/2-to-1/2 allocation. We present a resampling approach that can be applied to any dataset, using any predictor development method, to determine the best split.
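To make the question in the abstract concrete, the following is a minimal simulation sketch, not the speaker's method. It assumes a simple two-class Gaussian model, a nearest-centroid classifier, and a target quantity defined as the accuracy of the classifier trained on all available samples (approximated with a large independent sample). The function names, parameter names, and default values are all hypothetical and chosen only for illustration.

```python
import numpy as np


def simulate(n, p, n_informative, delta, rng):
    """Draw n samples from two balanced classes with p Gaussian features.

    The first n_informative features are shifted by delta in class 1
    (an assumed stand-in for the standardized fold change).
    """
    y = np.arange(n) % 2
    x = rng.standard_normal((n, p))
    x[:, :n_informative] += delta * y[:, None]
    return x, y


def fit_centroids(x, y):
    """Nearest-centroid 'training': one mean vector per class."""
    return x[y == 0].mean(axis=0), x[y == 1].mean(axis=0)


def predict(centroids, x):
    """Assign each row of x to the class with the closer centroid."""
    c0, c1 = centroids
    d0 = ((x - c0) ** 2).sum(axis=1)
    d1 = ((x - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)


def stratified_split(y, train_frac, rng):
    """Boolean mask of training samples, keeping both classes on each side."""
    mask = np.zeros(y.size, dtype=bool)
    for label in (0, 1):
        idx = rng.permutation(np.flatnonzero(y == label))
        k = min(max(1, int(round(train_frac * idx.size))), idx.size - 1)
        mask[idx[:k]] = True
    return mask


def mse_by_split(fracs, n=60, p=200, n_informative=10, delta=1.0,
                 n_datasets=20, n_splits=10, seed=0):
    """Estimate, for each training fraction, the MSE of the validation-set
    accuracy estimate relative to the full-data classifier's accuracy."""
    rng = np.random.default_rng(seed)
    sq_err = {f: [] for f in fracs}
    for _dataset in range(n_datasets):
        x, y = simulate(n, p, n_informative, delta, rng)
        # Large independent sample approximates the full-data classifier's
        # true accuracy (possible only because the model is simulated).
        x_big, y_big = simulate(5000, p, n_informative, delta, rng)
        full_acc = np.mean(predict(fit_centroids(x, y), x_big) == y_big)
        for f in fracs:
            for _split in range(n_splits):
                mask = stratified_split(y, f, rng)
                model = fit_centroids(x[mask], y[mask])
                est = np.mean(predict(model, x[~mask]) == y[~mask])
                sq_err[f].append((est - full_acc) ** 2)
    return {f: float(np.mean(v)) for f, v in sq_err.items()}


if __name__ == "__main__":
    for frac, mse in mse_by_split([0.4, 0.5, 2 / 3, 0.8]).items():
        print(f"training fraction {frac:.2f}: estimated MSE {mse:.4f}")
```

Changing `n`, `n_informative`, and `delta` in this sketch corresponds loosely to the three factors the abstract names (sample size, number of differentially expressed genes, and standardized fold change). The results of such a toy model should not be read as reproducing the findings reported in the talk.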
