Energy-I-Score: First Steps
Introduction
The Iscores package provides tools for evaluating
imputation methods using Imputation Scores (IScores). In particular, the
package implements the DR I-Score and energy-I-Score, which measure the
quality of imputed datasets by comparing the relationships between
observed and imputed values. The methodology is described in detail in
Näf et al. (2022) and Näf et al. (2025).
The package supports:
- numerical datasets,
- mixed datasets containing both numerical and categorical variables,
- evaluation of a single imputation method,
- comparison of multiple imputation methods.
The main functions are:
- energy_IScore(): calculates the energy-I-Score for a single imputation method,
- compare_Iscores(): compares several imputation methods using selected IScores.
This vignette presents the basic workflow for computing Imputation Scores and comparing imputation approaches.
Installation
The stable version of the package can be installed from CRAN (soon):
install.packages("Iscores")
The development version can be installed from GitHub:
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
devtools::install_github("missValTeam/Iscores")
After installation, load the package and set a random seed to ensure reproducibility:
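A minimal sketch of this step. The seed value 10 matches the one used later in this vignette; the requireNamespace() guard is an addition for illustration, so that the snippet also runs when the package is not yet installed.

```r
# Attach the package if it is available (the guard is only for illustration)
if (requireNamespace("Iscores", quietly = TRUE)) {
  library(Iscores)
}

# Fix the random seed so that simulated data and stochastic imputations
# are reproducible across runs
set.seed(10)
```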
Preparing data
The package expects a data.frame containing missing
values represented as NA.
Numerical variables should be stored as numeric vectors. Categorical variables should be stored as factors if they are intended to be treated as categorical during score calculation.
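As a sketch of this preparation step, the toy data frame below stores a categorical variable as character and converts it to a factor. The column names age, income and group are hypothetical and serve only as an illustration.

```r
# Toy data with one categorical column stored as character (illustrative only)
raw <- data.frame(
  age    = c(34, NA, 51, 27),
  income = c(2100.5, 3300.0, NA, 1850.0),
  group  = c("a", "b", NA, "a"),
  stringsAsFactors = FALSE
)

# Convert character columns to factors so they are treated as categorical
is_chr <- vapply(raw, is.character, logical(1))
raw[is_chr] <- lapply(raw[is_chr], factor)

# Inspect the column types and the number of missing values per column
str(raw)
colSums(is.na(raw))
```

Missing entries remain NA after the conversion, in both numeric and factor columns.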
For demonstration purposes, we use randomly generated data with values missing completely at random (MCAR):
set.seed(10)
X <- Iscores:::random_mcar_data(100, 4)
head(X)
#>            X1         X2         X3         X4
#> 1  0.01874617 -0.7618043  1.2155138  1.5025446
#> 2 -0.18425254  0.4193754  0.3308765         NA
#> 3 -1.37133055         NA  1.3902751 -0.6306855
#> 4          NA  0.7115740  0.8720470  0.7923495
#> 5  0.29454513         NA -1.0808170         NA
#> 6  0.38979430  0.5631747  0.4958216  0.3227550
Imputation function
Before computing an Imputation Score, we first need to define the
imputation method that will be applied to the
incomplete dataset. The Iscores package is flexible and
allows the user to evaluate any imputation approach, provided that the
imputation function satisfies a few simple requirements.
Your imputation function:
- Must accept a dataset with missing values as its first argument.
- Must return a completed dataset with the same dimensions as the input data.
- Should return a dataset without missing values.
- Can represent:
  - a simple custom imputation strategy,
  - a wrapper around an external function, package or programming language.
For example, let’s define zero imputation below:
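A minimal sketch of such a function in base R. The body is an illustration consistent with the requirements above and assumes purely numerical data; the name impute_zero matches the one used in the rest of this vignette.

```r
# Zero imputation: replace every missing entry with 0.
# Assumes all columns are numeric; factor columns would need separate handling.
impute_zero <- function(X) {
  X[is.na(X)] <- 0
  X
}

# Quick check on a small data frame with missing values
toy <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))
impute_zero(toy)
```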
The function can now be passed directly to
energy_IScore(), DR_IScore(), or
compare_Iscores().
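Before scoring, the requirements listed above can be verified on a small example. The helper check_imputation_func() below is hypothetical and not part of the package; impute_mean is likewise only an illustrative imputer for numeric data.

```r
# Hypothetical helper (not part of Iscores): checks the basic contract
# that an imputation function is expected to satisfy.
check_imputation_func <- function(imputation_func, X) {
  X_imp <- imputation_func(X)
  # The completed dataset must have the same dimensions as the input
  stopifnot(identical(dim(X_imp), dim(X)))
  # The completed dataset must not contain missing values
  stopifnot(!anyNA(X_imp))
  invisible(TRUE)
}

# Illustrative imputer: column-wise mean imputation for numeric data
impute_mean <- function(X) {
  for (j in seq_along(X)) {
    X[[j]][is.na(X[[j]])] <- mean(X[[j]], na.rm = TRUE)
  }
  X
}

toy <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))
check_imputation_func(impute_mean, toy)
```

A function that leaves NA values in place, such as function(X) X, would fail the second check with an error.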
Calculating the energy-I-Score
Once an imputation function has been defined, we can evaluate it
using energy_IScore().
In the example below, we calculate the energy-I-Score for the
impute_zero() method.
sc <- energy_IScore(X = X, imputation_func = impute_zero)
sc
#> [1] 0.799712
#> attr(,"dat")
#>    column_id weight     score n_columns_used
#> X1         1 0.1771 0.7738732              1
#> X3         3 0.1771 0.7737987              1
#> X2         2 0.1539 0.7941482              1
#> X4         4 0.1411 0.8707366              1
The result is a single weighted score summarizing the imputation performance across all variables containing missing values.
In addition, detailed information for each variable is returned as an attribute of the result. This table contains:
- the variable-specific scores,
- aggregation weights,
- the number of variables used during training.
The table can be accessed using attr():
attr(sc, "dat")
#>    column_id weight     score n_columns_used
#> X1         1 0.1771 0.7738732              1
#> X3         3 0.1771 0.7737987              1
#> X2         2 0.1539 0.7941482              1
#> X4         4 0.1411 0.8707366              1
Note that the overall score is the weighted mean of the variable-specific scores:
dat <- attr(sc, "dat")
sum(dat[["score"]] * dat[["weight"]]) / sum(dat[["weight"]])
#> [1] 0.799712
Important parameters
The energy_IScore() function exposes several parameters
that control the scoring procedure.
Number of imputations: N
The parameter N controls how many times the missing part
is re-imputed during score estimation.
This parameter is mainly relevant for stochastic or multiple
imputation methods. For deterministic methods, setting
multiple = FALSE automatically forces
N = 1.
energy_IScore(X = X, imputation_func = impute_zero, N = 5)
#> [1] 0.799712
#> attr(,"dat")
#>    column_id weight     score n_columns_used
#> X1         1 0.1771 0.7738732              1
#> X3         3 0.1771 0.7737987              1
#> X2         2 0.1539 0.7941482              1
#> X4         4 0.1411 0.8707366              1
Note that when N = 1, or when the imputation method always returns the same completed dataset, the predictive distribution is effectively represented by a single point estimate. In such cases, the energy-I-Score is computed from a degenerate empirical distribution, which may provide a less reliable approximation of uncertainty. Consequently, the score naturally favors imputation methods that generate realistic variability and sample well from the conditional distribution of the missing values, rather than methods that always return fixed imputations.
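To illustrate the distinction, the sketch below shows a simple stochastic imputer: a random hot-deck method that draws each missing entry from the observed values of the same column. The name impute_sample is hypothetical, and the method is meant only as an example of an imputer whose output varies between calls, not as a recommendation.

```r
# Random hot-deck imputation: each missing entry is replaced by a value
# drawn at random (with replacement) from the observed entries of its column.
impute_sample <- function(X) {
  for (j in seq_along(X)) {
    miss <- is.na(X[[j]])
    obs  <- X[[j]][!miss]
    if (any(miss) && length(obs) > 0) {
      X[[j]][miss] <- obs[sample.int(length(obs), sum(miss), replace = TRUE)]
    }
  }
  X
}

set.seed(1)
toy <- data.frame(a = c(1, NA, 3, NA, 5), b = c(NA, 2, NA, 4, 6))

# Repeated calls generally give different completed datasets
imp1 <- impute_sample(toy)
imp2 <- impute_sample(toy)
```

Such a function satisfies the requirements for an imputation function and could be scored with N greater than 1, for example energy_IScore(X = X, imputation_func = impute_sample, N = 5).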
Limiting the number of scored variables: max_length
For datasets with many incomplete variables, score computation may become time-consuming.
The max_length argument allows the user to limit the
number of variables used during score calculation. Variables with the
largest number of missing values are selected first.
By default, max_length = NULL, meaning that all
incomplete variables are included.
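The selection rule described above can be illustrated in base R. This is only a sketch of the ordering by missing-value count, not the package's internal code; toy, n_miss and selected are local illustrative objects.

```r
# Toy data with different amounts of missingness per column
toy <- data.frame(
  a = c(NA, NA, NA, 4),
  b = c(1, NA, 3, 4),
  c = c(1, 2, 3, 4),
  d = c(NA, NA, 3, 4)
)

# Count missing values per column and keep only incomplete columns
n_miss <- colSums(is.na(toy))
n_miss <- n_miss[n_miss > 0]

# Order by decreasing missingness and keep the first max_length columns
max_length <- 2
keep <- seq_len(min(max_length, length(n_miss)))
selected <- names(sort(n_miss, decreasing = TRUE))[keep]
selected
#> [1] "a" "d"
```

In the package, the same limit is set through the max_length argument: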
energy_IScore(X = X, imputation_func = impute_zero, max_length = 2)
#> [1] 0.773836
#> attr(,"dat")
#>    column_id weight     score n_columns_used
#> X1         1 0.1771 0.7738732              1
#> X3         3 0.1771 0.7737987              1
Handling incomplete training sets: skip_if_needed
Some variables may not have enough fully observed predictors available for training.
In such situations:
- skip_if_needed = TRUE (default) removes a minimal number of observations to construct a valid training set,
- skip_if_needed = FALSE returns NA for variables where no complete predictors can be identified.
Energy-I-Score for mixed data
As mentioned earlier, categorical variables must be stored as factors
in order to be handled correctly by energy_IScore(). If the
input data contains at least one factor variable,
energy_IScore() automatically switches to the mixed-data
version of the score. Therefore, users do not need to call a separate
function for mixed datasets.
Below we construct a simple mixed dataset containing both numerical and categorical variables. Missing values are generated according to the MCAR mechanism.
set.seed(10)
X_cat <- Iscores:::random_mcar_mixed_data(100, 4)
head(X_cat)
#>          col1       col2      col3       col4 col5
#> 1  0.01874617 -0.7618043        NA         NA    1
#> 2 -0.18425254         NA 0.3308765  0.5904095    3
#> 3 -1.37133055 -1.0399434 1.3902751 -0.6306855 <NA>
#> 4 -0.59916772  0.7115740 0.8720470  0.7923495    4
#> 5  0.29454513         NA        NA  0.1253846 <NA>
#> 6  0.38979430  0.5631747 0.4958216         NA    3
We use a simple median/mode imputation function. Numerical variables are imputed with the median, while factor variables are imputed with the most frequent category.
impute_median_mode <- Iscores:::median_mode_imputation
The score can be calculated with the same public function as for numerical data:
energy_IScore(X = X_cat, imputation_func = impute_median_mode)
#> Factor variables detected.
#> Calculating the energy-I-Score for mixed data.
#> [1] 0.8155521
#> attr(,"dat")
#>      column_id weight     score n_columns_used
#> col3         3 0.1924 0.7758805              1
#> col1         1 0.1771 0.7280336              1
#> col5         5 0.1539 0.9428090              1
#> col2         2 0.1411 0.8319434              1
#> col4         4 0.1411 0.8243028              1
Internally, categorical variables are transformed using one-hot encoding and the score is then computed through multivariate energy distances.
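As an illustration of the encoding step, a factor can be expanded into indicator columns with base R's model.matrix(). This sketch shows one-hot encoding in general; it is not taken from the package's internal code, which may differ in detail.

```r
# A factor with three levels
f <- factor(c("1", "3", "4", "3"))

# One indicator column per level; "- 1" drops the intercept so that
# no level is absorbed into a baseline
onehot <- model.matrix(~ f - 1)
colnames(onehot) <- levels(f)
onehot
```

Each row of the resulting matrix contains exactly one 1, in the column of the observed level, so a categorical variable becomes a numeric vector to which multivariate distances can be applied.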
For additional implementation details and methodological discussion, see the vignette “Energy-I-Score: Implementation Details”.
Calculating the DR-I-Score
The package also provides the DR_IScore() function based
on density-ratio estimation and random projection forests.
sc_dr <- DR_IScore(X = X,
                   imputation_func = impute_zero,
                   m = 3,
                   n_proj = 10,
                   n_trees_per_proj = 2,
                   n_cores = 1)
sc_dr
#> [1] -8.327919
Unlike energy_IScore(), which evaluates predictive
distributions through scoring rules, DR_IScore() compares
the distributions of observed and imputed data using projected random
forests and density-ratio estimation.
The parameter m controls the number of imputed datasets
generated by the imputation method. Increasing m may
improve score stability for stochastic imputation procedures.
Parameters
The parameters n_proj and n_trees_per_proj
control the complexity of the random projection forests used
internally:
- larger n_proj increases the number of random projections considered,
- larger n_trees_per_proj increases the number of trees grown for each projection.
Increasing these parameters may improve stability and precision of the score, but also increases computational cost.
Comparing multiple imputation methods
The compare_Iscores() function can be used to compare
several imputation methods simultaneously.
Below we define two additional imputation strategies from the
mice package.
library(mice)
# Single imputation by Bayesian linear regression (mice method "norm")
impute_mice_norm <- function(X) {
  imp <- mice(X, m = 1, method = "norm", maxit = 5, printFlag = FALSE)
  complete(imp)
}
# Single imputation by random forests (mice method "rf")
impute_mice_rf <- function(X) {
  imp <- mice(X, m = 1, method = "rf", maxit = 5, printFlag = FALSE)
  complete(imp)
}
Now we place the methods in a named list:
methods_list <- list(zero = impute_zero,
                     mice_norm = impute_mice_norm,
                     mice_rf = impute_mice_rf)
We can now compare the methods using the energy-I-Score:
sc_comparison <- compare_Iscores(X = X,
                                 methods_list = methods_list,
                                 score = "energy_IScore",
                                 N = 10,
                                 silent = TRUE)
#> Calculating the energy_IScore for method zero ...
#> Calculating the energy_IScore for method mice_norm ...
#> Calculating the energy_IScore for method mice_rf ...
sc_comparison
#>       score    score_name    method
#> 1 0.7997120 energy_IScore      zero
#> 2 0.6247271 energy_IScore mice_norm
#> 3 0.6839281 energy_IScore   mice_rf
The resulting data frame contains one row per imputation method.
Comparing methods using multiple scores
The package also allows simultaneous comparison using multiple scoring rules.
comparison_all <- compare_Iscores(X = X,
                                  methods_list = methods_list,
                                  score = c("energy_IScore", "DR_IScore"),
                                  N = 10,
                                  m = 3,
                                  n_proj = 10,
                                  n_trees_per_proj = 2,
                                  silent = TRUE)
#> Calculating the energy_IScore for method zero ...
#> Calculating the energy_IScore for method mice_norm ...
#> Calculating the energy_IScore for method mice_rf ...
#> Calculating the DR_IScore for method zero ...
#> Calculating the DR_IScore for method mice_norm ...
#> Calculating the DR_IScore for method mice_rf ...
comparison_all
#>        score    score_name    method
#> 1  0.7997120 energy_IScore      zero
#> 2  0.6473443 energy_IScore mice_norm
#> 3  0.6753519 energy_IScore   mice_rf
#> 4 -9.2208219     DR_IScore      zero
#> 5  1.4280980     DR_IScore mice_norm
#> 6  2.4883888     DR_IScore   mice_rf
When multiple scores are requested, additional arguments passed to
compare_Iscores() are automatically forwarded to the
corresponding scoring functions.
In the example above:
- N and silent are arguments used by energy_IScore(),
- m, n_proj, and n_trees_per_proj are arguments used by DR_IScore().
Therefore, when combining several scoring rules, users should provide the parameters required for each selected score.
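The long-format result can be rearranged into one row per method and one column per score, which is often easier to read. The sketch below rebuilds the table printed above by hand so that it runs without the package; with the real output, comparison_all can be used in place of the illustrative object res.

```r
# Values copied from the comparison output printed above
res <- data.frame(
  score      = c(0.7997120, 0.6473443, 0.6753519,
                 -9.2208219, 1.4280980, 2.4883888),
  score_name = rep(c("energy_IScore", "DR_IScore"), each = 3),
  method     = rep(c("zero", "mice_norm", "mice_rf"), times = 2)
)

# Cross-tabulate: rows are methods, columns are scores.
# Each (method, score) cell holds exactly one value, so we take the first.
wide <- tapply(res$score, list(res$method, res$score_name), function(x) x[1])
wide
```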
Summary of best practices
- Use multiple = TRUE with genuinely multiple (stochastic) imputers and choose N to obtain stable estimates.
- Consider scale = TRUE when mixing variables on different scales.
- Use max_length for quick experiments; remove it for final runs.
- Keep skip_if_needed = TRUE unless you explicitly want to flag unscorable columns with NA.
Energy score
If you have access to the original dataset before imputation, you can
also use the energy distance as an additional evaluation metric. Our
package provides an easy-to-use wrapper edistance() around
the energy::eqdist.e function from the energy package.
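For intuition, the sample energy distance between two datasets is twice the mean Euclidean distance between rows of different samples, minus the mean distance within each sample. The base-R sketch below computes this quantity directly. The helper name energy_dist is hypothetical; it does not reproduce the interface of edistance() or energy::eqdist.e(), whose output may be scaled differently.

```r
# Sample energy distance between the rows of two numeric datasets
# (V-statistic form: within-sample means include the zero diagonal).
energy_dist <- function(X, Y) {
  X <- as.matrix(X)
  Y <- as.matrix(Y)
  n <- nrow(X)
  m <- nrow(Y)
  D <- as.matrix(dist(rbind(X, Y)))              # all pairwise Euclidean distances
  d_xy <- mean(D[seq_len(n), n + seq_len(m)])     # between-sample distances
  d_xx <- mean(D[seq_len(n), seq_len(n)])         # within X
  d_yy <- mean(D[n + seq_len(m), n + seq_len(m)]) # within Y
  2 * d_xy - d_xx - d_yy
}

set.seed(10)
X_full <- matrix(rnorm(200), ncol = 2)                      # stands in for the original data
X_good <- X_full + matrix(rnorm(200, sd = 0.1), ncol = 2)   # close to the original
X_bad  <- matrix(0, nrow = 100, ncol = 2)                   # every value replaced by zero

energy_dist(X_full, X_good)
energy_dist(X_full, X_bad)
```

A completed dataset whose distribution is close to that of the original data yields a value near zero, while a dataset that collapses all values to a constant yields a clearly larger one.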