Conference on High-Dimensional Statistics

High-dimensional statistics has grown out of modern research activities in diverse fields such as science, technology, and business, aided by powerful computing. It encompasses several emerging areas of statistics, including high-dimensional inference, dimension reduction, data mining, machine learning, and bioinformatics.

Mark van der Laan

Title of Talk: Targeted Learning with High Dimensional Data

Abstract: Learning from data involves defining 1) the experiment that generated the data, 2) the target parameter of the data-generating distribution that we want to learn, the so-called estimand, 3) the collection of possible data-generating distributions, the so-called statistical model, and 4) its possible parameterization in terms of underlying distributions, often involving non-testable assumptions, giving the so-called model. The statistical model represents our statistical knowledge and should be defined so that it contains the true data-generating distribution. The statistical estimation problem is then defined by the target parameter and the statistical model. Realistic estimation problems thus involve learning a target parameter in very large semiparametric models, often for very high-dimensional data structures. Classical methods such as maximum-likelihood-based estimation, though optimal for small semiparametric models, break down for such large semiparametric models.

In response to this, we developed targeted maximum likelihood estimation and its natural generalization, targeted minimum loss based estimation (TMLE), as a new template for the construction of semiparametric efficient estimators of pathwise differentiable target parameters. The template has two steps. The first defines an initial estimator of the relevant part of the data-generating distribution, which allows the integration of the state of the art in ensemble learning, fully utilizing the power of cross-validation. The second is a targeted bias reduction step, defined by a least favorable parametric submodel through the initial estimator and a loss function used to estimate the amount of fluctuation. The estimator of the target parameter is then the plug-in estimator corresponding to this updated initial estimator. TMLE results in semiparametric efficient substitution estimators that are often robust with respect to various misspecifications.
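As a rough illustration of this template, the following Python sketch applies it to one concrete case chosen here for illustration and not taken from the talk: the average treatment effect of a binary treatment A on a binary outcome Y, adjusting for covariates W. Plain logistic regressions stand in for the cross-validated ensemble learner, the logistic fluctuation with a "clever covariate" is the commonly used least favorable submodel for this parameter, and all function names (`fit_logistic`, `tmle_ate`, `simulate`) and simulation settings are hypothetical.

```python
import numpy as np


def expit(x):
    return 1.0 / (1.0 + np.exp(-x))


def logit(p):
    return np.log(p / (1.0 - p))


def fit_logistic(X, y, offset=None, n_iter=50, tol=1e-10):
    """Logistic regression by Newton-Raphson, with an optional fixed offset."""
    if offset is None:
        offset = np.zeros(len(y))
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(X @ beta + offset)
        grad = X.T @ (y - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


def tmle_ate(W, A, Y):
    """TMLE sketch for the average treatment effect (binary A, binary Y)."""
    n = len(Y)
    ones, zeros = np.ones(n), np.zeros(n)

    # Step 1: initial estimator of Q(A, W) = P(Y = 1 | A, W).
    # Assumption: a parametric logistic fit replaces an ensemble learner.
    XQ = np.column_stack([ones, A, W])
    bQ = fit_logistic(XQ, Y)
    bound = lambda q: np.clip(q, 1e-6, 1.0 - 1e-6)
    Q_A = bound(expit(XQ @ bQ))
    Q_1 = bound(expit(np.column_stack([ones, ones, W]) @ bQ))
    Q_0 = bound(expit(np.column_stack([ones, zeros, W]) @ bQ))

    # Treatment mechanism g(W) = P(A = 1 | W), bounded away from 0 and 1.
    Xg = np.column_stack([ones, W])
    g = np.clip(expit(Xg @ fit_logistic(Xg, A)), 0.01, 0.99)

    # Step 2: targeted bias reduction. Fluctuate the initial fit along
    # logit Q_eps = logit Q + eps * H and estimate eps by maximum likelihood
    # (log-likelihood loss), keeping logit Q fixed as an offset.
    H_A = A / g - (1 - A) / (1.0 - g)
    eps = fit_logistic(H_A[:, None], Y, offset=logit(Q_A))[0]

    # Updated fit and plug-in (substitution) estimator of the target parameter.
    Q_A_star = expit(logit(Q_A) + eps * H_A)
    Q_1_star = expit(logit(Q_1) + eps / g)
    Q_0_star = expit(logit(Q_0) - eps / (1.0 - g))
    psi = np.mean(Q_1_star - Q_0_star)

    # Estimated efficient influence curve, used for a standard error.
    eic = H_A * (Y - Q_A_star) + Q_1_star - Q_0_star - psi
    se = np.sqrt(np.var(eic) / n)
    return psi, se, eic


def simulate(n, seed=0):
    """Hypothetical data-generating experiment, used only for this demo."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n, 2))
    A = rng.binomial(1, expit(0.5 * W[:, 0] - 0.5 * W[:, 1]))
    lin = -0.5 + 0.7 * W[:, 0] + 0.3 * W[:, 1]
    Y = rng.binomial(1, expit(lin + A))
    true_ate = np.mean(expit(lin + 1.0) - expit(lin))
    return W, A, Y, true_ate


W, A, Y, true_ate = simulate(5000)
psi, se, eic = tmle_ate(W, A, Y)
print(f"TMLE estimate: {psi:.3f} (se {se:.3f}); sample truth: {true_ate:.3f}")
```

In this sketch the fluctuation step makes the empirical mean of the estimated efficient influence curve equal to zero up to numerical tolerance, which is the property that drives the bias reduction; the same curve gives an approximate standard error for a Wald-type confidence interval.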

In our talk we will present this template, and demonstrate it in the assessment of effects of genomic variables on an outcome of interest, controlling for the other variables, and the assessment of causal effects of interventions on an outcome of interest in experimental and observational studies, such as in a typical safety analysis or comparative effectiveness analysis.