Introduction
This vignette presents the implementation details of the energy-I-Score, a metric designed to evaluate the quality of imputation methods in incomplete datasets.
The score is based on the energy distance between the observed and imputed distributions. It compares the uncertainty induced by the imputation model with the variability present in the observed data. The procedure is model-agnostic: it can be used with any imputation method and with multiple imputation draws.
The score is distribution-free and can be applied to:
- continuous variables,
- mixed-type data (here the score is calculated on dummy variables),
- multiple imputation methods.
Notation
Let $X$ be an original dataset with missing values, $\tilde{X}$ an imputed dataset, $\mathcal{I}$ an imputation function, and $N$ the number of imputations drawn from $\mathcal{I}$.
Then, for each variable $X_j$ with missing values, we define $L_j$ as the set of row indices for which $X_j$ is observed, $L_j^c$ as the set of row indices for which $X_j$ is missing, and $O_j$ as the set of predictor variables that are fully observed on the rows where $X_j$ is observed.
Finally, we denote the set of variables with missing values by $S$.
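The index sets above can be illustrated on a toy dataset. This is a minimal sketch, not the package's implementation: the list-of-lists layout, the use of `None` for missing values, and all function names are assumptions made for illustration.

```python
# Toy dataset: rows are observations, columns are variables.
# None marks a missing value (an assumption of this sketch).
X = [
    [1.0, 2.0, None],
    [2.0, None, 0.5],
    [3.0, 1.0, 1.5],
    [4.0, 3.0, None],
]


def observed_rows(X, j):
    """Row indices where column j is observed."""
    return [i for i, row in enumerate(X) if row[j] is not None]


def missing_rows(X, j):
    """Row indices where column j is missing."""
    return [i for i, row in enumerate(X) if row[j] is None]


def variables_with_missing(X):
    """Column indices that contain at least one missing value."""
    return [j for j in range(len(X[0])) if missing_rows(X, j)]


def fully_observed_predictors(X, j):
    """Columns other than j with no missing value on the rows where j is observed."""
    rows = observed_rows(X, j)
    return [
        k
        for k in range(len(X[0]))
        if k != j and all(X[i][k] is not None for i in rows)
    ]


for j in variables_with_missing(X):
    print(j, observed_rows(X, j), missing_rows(X, j), fully_observed_predictors(X, j))
```

In this toy example only the first column qualifies as a predictor for either incomplete variable, because the other candidate column has a gap on at least one of the relevant rows.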
Algorithm Overview
The energy-I-Score is computed iteratively for each variable with missing data. The following steps are performed for each such variable $X_j$.
Step 1: Selection of Predictor Set
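We determine the set of predictor variables $O_j$, that is, the variables other than $X_j$ that are fully observed on the rows where $X_j$ is observed:
$$O_j = \{\, k \neq j : X_{ik} \text{ is observed for all } i \in L_j \,\},$$
where $L_j$ is the set of row indices for which $X_j$ is observed.
If $O_j$ is empty, the algorithm automatically selects a fallback variable defined as
$$k^{*} = \arg\max_{k \neq j} \bigl|\{\, i \in L_j : X_{ik} \text{ is observed} \,\}\bigr|,$$
which is the variable with the largest number of observed values on the rows where $X_j$ is observed. This ensures that the imputation model has at least one predictor.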
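The predictor selection with its fallback can be sketched as follows. This is an illustrative sketch under assumptions: `select_predictors` is a hypothetical name, missing values are encoded as `None`, and ties in the fallback are broken by taking the lowest column index, which the text does not specify.

```python
def select_predictors(X, j):
    """Return the predictor columns for target column j.

    Prefers columns that are fully observed on the rows where j is observed.
    If there are none, falls back to the single column with the most
    observed values on those rows (ties go to the lowest index, an
    assumption of this sketch).
    """
    rows = [i for i, row in enumerate(X) if row[j] is not None]
    candidates = [k for k in range(len(X[0])) if k != j]

    fully_observed = [
        k for k in candidates if all(X[i][k] is not None for i in rows)
    ]
    if fully_observed:
        return fully_observed

    # Fallback: count observed values per candidate on the relevant rows.
    counts = {k: sum(X[i][k] is not None for i in rows) for k in candidates}
    return [max(candidates, key=lambda k: counts[k])]


# No column is fully observed where column 0 is observed, so the fallback applies.
X = [
    [1.0, None, 5.0],
    [2.0, 1.0, None],
    [3.0, 2.0, None],
    [None, 3.0, 4.0],
]
print(select_predictors(X, 0))
```

For column 0 the candidate columns have two and one observed values on the relevant rows, so the sketch falls back to column 1.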
Step 2: Data Partitioning
The data are split into training and test sets as follows:
- The training set contains the observed predictor values and missing target values to be imputed.
- The test set contains the observed target values to evaluate imputation quality.
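One possible reading of this split is sketched below: the rows where the target is observed keep their predictor values, the target is masked so that it has to be imputed, and the true target values are held out for evaluation. The function name `partition` and the `None` encoding are assumptions, and a real implementation may include additional rows in the training set.

```python
def partition(X, j, predictors):
    """Split the data for target column j.

    Returns (rows, train, test) where
      rows  - indices of the rows where column j is observed,
      train - predictor values for those rows with the target masked as None,
      test  - the held-out true target values, in the same row order.
    """
    rows = [i for i, row in enumerate(X) if row[j] is not None]
    train = [[X[i][k] for k in predictors] + [None] for i in rows]
    test = [X[i][j] for i in rows]
    return rows, train, test


X = [
    [1.0, 10.0],
    [2.0, None],
    [3.0, 30.0],
]
rows, train, test = partition(X, j=1, predictors=[0])
print(rows)   # rows where the target is observed
print(train)  # predictors kept, target masked
print(test)   # held-out true target values
```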
Step 3: Multiple Imputations
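The missing part of the training set is imputed $N$ times using the imputation function $\mathcal{I}$, which yields $N$ imputed versions $\tilde{X}^{(1)}, \dots, \tilde{X}^{(N)}$ of the masked values.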
Each imputation represents a draw from the conditional distribution of the missing variable given the observed predictors.
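The repeated-draw step can be sketched with any stochastic imputation function passed in as a callable. Here `toy_imputer` is a hypothetical stand-in (first predictor plus Gaussian noise), not a method from the source; only the pattern of calling the imputer several times with a shared random generator is the point.

```python
import random


def toy_imputer(rows, rng):
    """Hypothetical stand-in for an imputation function.

    Fills a missing target (last column) with the first predictor value
    plus standard Gaussian noise. Rows are copied, not modified in place.
    """
    filled = []
    for row in rows:
        if row[-1] is None:
            filled.append(row[:-1] + [row[0] + rng.gauss(0.0, 1.0)])
        else:
            filled.append(list(row))
    return filled


def impute_n_times(rows, imputer, n_draws, seed=0):
    """Call the imputer n_draws times; each call is one independent draw."""
    rng = random.Random(seed)
    return [imputer(rows, rng) for _ in range(n_draws)]


train = [[1.0, None], [2.0, None], [3.0, None]]
draws = impute_n_times(train, toy_imputer, n_draws=5)
print(len(draws), len(draws[0]))
```

Because the generator is shared across calls, the draws differ from one another, which is what lets the score assess the spread of the imputations.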
Step 4: Energy Distance Calculation
For each variable $X_j$ with missing values, the energy-I-Score component combines two terms.
The first term is the internal dispersion of the imputed values, and the second term is the distance between the imputed values and the actual observations. The larger the score, the greater the uncertainty of the imputation relative to the true data.
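The two terms can be computed from the imputation draws as sketched below. The sign and normalisation are assumptions: the sketch uses the conventional sample energy score for a scalar, $\frac{1}{N}\sum_{l} |\tilde{x}^{(l)} - x| - \frac{1}{2N^2}\sum_{l,l'} |\tilde{x}^{(l)} - \tilde{x}^{(l')}|$, averaged over the held-out cells, so that lower values are better. The function names are hypothetical.

```python
def energy_terms(draws, truth):
    """Return (dispersion, distance) for one held-out cell.

    draws - list of N imputed values for the cell
    truth - the observed value that was held out
    """
    n = len(draws)
    # Internal dispersion: mean absolute difference between pairs of draws, halved.
    dispersion = sum(abs(a - b) for a in draws for b in draws) / (2 * n * n)
    # Distance: mean absolute difference between the draws and the true value.
    distance = sum(abs(a - truth) for a in draws) / n
    return dispersion, distance


def energy_score(imputed, observed):
    """Average of (distance - dispersion) over all held-out cells.

    imputed[i] is the list of draws for cell i, observed[i] its true value.
    The sign convention (lower is better) is an assumption of this sketch.
    """
    total = 0.0
    for draws, truth in zip(imputed, observed):
        dispersion, distance = energy_terms(draws, truth)
        total += distance - dispersion
    return total / len(observed)


print(energy_terms([1.0, 3.0], 2.0))
print(energy_score([[1.0, 3.0], [2.0, 2.0]], [2.0, 2.0]))
```

Draws that sit exactly on the true value give a score of zero, while draws that are tightly clustered far from the truth are penalised by the distance term without any offsetting dispersion.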
Practical Interpretation
High values of the score suggest large variability or poor alignment between imputed and observed distributions.
Low values indicate imputations that are close to the observed data distribution (better performance).
When the per-variable components are combined into the overall score, variables with few missing values receive lower weight, while those with many missing values contribute more.
Methods that do not rely on multiple imputation or have a weak/random draw mechanism tend to perform worse, because they underestimate the uncertainty of the missing values.
The energy-I-Score should primarily be used to rank different imputation methods, rather than to interpret its absolute numeric value directly.
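The weighting and ranking described above can be sketched as follows. The exact weighting scheme is not given in this section, so weights proportional to the number of missing values are an assumption, as are the function names, the method names, and every number in the example, which are invented purely for illustration.

```python
def weighted_score(components, n_missing):
    """Combine per-variable components into one score.

    Weights are proportional to the number of missing values per variable
    (an assumption of this sketch).
    """
    total_weight = sum(n_missing[v] for v in components)
    return sum(components[v] * n_missing[v] for v in components) / total_weight


def rank_methods(scores):
    """Order method names from lowest (best) to highest (worst) score."""
    return sorted(scores, key=scores.get)


# Invented illustration values, not results from any real comparison.
n_missing = {"age": 10, "income": 30}
per_variable = {
    "method_a": {"age": 0.4, "income": 0.2},
    "method_b": {"age": 0.1, "income": 0.5},
}
scores = {m: weighted_score(c, n_missing) for m, c in per_variable.items()}
print(rank_methods(scores))
```

Here `method_b` is better on the variable with few missing values, but `method_a` ranks first overall because the variable with more missing values dominates the weighted score.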