Lecture Questions

(Q1) When you suggest methods to deal with missing values to users, the recurrent question is “What is the percentage of missing values which is acceptable in my data set? Is 50% too much but 20% OK?” What is your answer to this question?

The percentage of missing values is not the only thing that matters. If the variables are highly correlated, we can predict the missing values precisely even with a high fraction of missing values. Conversely, if the data set is very noisy to begin with, even a small fraction of missing values can be troublesome. Multiple imputation can be performed and makes it possible to measure the variability of the predictions, which tells us how much we can trust the results obtained from a (very) incomplete data set.

(Q2) Explain the aims of multiple imputation in comparison to single imputation.

Single imputation underestimates the variability of the parameter estimates because it does not account for the uncertainty due to missing values. Multiple imputation aims at providing an estimate of the parameters of interest as well as of their variability, taking into account the uncertainty due to missing values.

(Q3) Your aim is to impute a data set (predict the missing values as precisely as possible). You have 3 imputation methods available. How could you compare them?

You can use a cross-validation strategy: remove some available entries, predict them with the 3 imputation methods, and compute the prediction errors. Repeat this procedure, say, K times. Select the method which minimizes the prediction error.
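This strategy can be sketched in a few lines of base R. This is a minimal illustration, not the lab's code: the data are simulated, and impute_mean is a hypothetical stand-in for each of the 3 candidate methods you would plug in.

```r
set.seed(1)
X <- as.data.frame(matrix(rnorm(200), nrow = 50, ncol = 4))

impute_mean <- function(X) {
  # stand-in imputation method: replace each NA by the column mean
  for (j in seq_along(X)) X[[j]][is.na(X[[j]])] <- mean(X[[j]], na.rm = TRUE)
  X
}

cv_error <- function(impute_fun, X, prop = 0.1, K = 10) {
  # hide a fraction 'prop' of the observed cells, impute them,
  # and measure the mean squared error of prediction, K times
  M <- as.matrix(X)
  obs <- which(!is.na(M))
  errs <- replicate(K, {
    hide <- sample(obs, round(prop * length(obs)))
    Mmask <- M
    Mmask[hide] <- NA
    Mimp <- as.matrix(impute_fun(as.data.frame(Mmask)))
    mean((Mimp[hide] - M[hide])^2)
  })
  mean(errs)
}

cv_error(impute_mean, X)  # compute this for each method and keep the smallest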

Continuous data with missing values - Regression with missing data via Multiple Imputation

First of all, you will need to install the following packages:

install.packages("VIM")
install.packages("naniar")
install.packages("missMDA")
install.packages("Amelia")
install.packages("mice")
install.packages("missForest")
install.packages("FactoMineR")
install.packages("tidyverse")

Air pollution is currently one of the most serious public health issues worldwide. Many epidemiological studies have demonstrated the influence that some chemical compounds, such as sulphur dioxide (SO2), nitrogen dioxide (NO2) and ozone (O3), can have on health. Associations set up to monitor air quality are active all over the world to measure the concentration of these pollutants. They also keep a record of meteorological conditions such as temperature, cloud cover, wind, etc.

We have at our disposal 112 observations collected during the summer of 2001 in Rennes. The variables available are:

  • maxO3 (maximum daily ozone)
  • maxO3v (maximum daily ozone the previous day)
  • T12 (temperature at midday)
  • T9 (temperature at 9am)
  • T15 (temperature at 3pm)
  • Vx12 (projection of the wind speed vector on the east-west axis at midday)
  • Vx9 and Vx15, as well as the nebulosity (cloud cover) variables Ne9, Ne12 and Ne15

Here the final aim is to analyse the relationship between the maximum daily ozone level (maxO3) and the other meteorological variables. To do so, we will perform a regression to explain maxO3 as a function of all the other variables. This data set is incomplete: there are missing values. Indeed, it frequently happens that a measuring machine fails for a day, so that some information is not recorded. We will therefore perform regression with missing values via multiple imputation.

  • Import the data.
ozo <- read.table("ozoneNA.csv",
                  header = TRUE,
                  sep = ",",
                  row.names = 1)

WindDirection <- ozo[, 12]
don <- ozo[, sapply(ozo, is.numeric)]   #### keep the continuous variables

summary(don)
##      maxO3              T9             T12             T15       
##  Min.   : 42.00   Min.   :11.30   Min.   :14.30   Min.   :14.90  
##  1st Qu.: 71.00   1st Qu.:16.00   1st Qu.:18.60   1st Qu.:18.90  
##  Median : 81.50   Median :17.70   Median :20.40   Median :21.40  
##  Mean   : 91.24   Mean   :18.22   Mean   :21.46   Mean   :22.41  
##  3rd Qu.:108.25   3rd Qu.:19.90   3rd Qu.:23.60   3rd Qu.:25.65  
##  Max.   :166.00   Max.   :25.30   Max.   :33.50   Max.   :35.50  
##  NA's   :16       NA's   :37      NA's   :33      NA's   :37     
##       Ne9             Ne12            Ne15           Vx9         
##  Min.   :0.000   Min.   :0.000   Min.   :0.00   Min.   :-7.8785  
##  1st Qu.:3.000   1st Qu.:4.000   1st Qu.:3.00   1st Qu.:-3.0000  
##  Median :5.000   Median :5.000   Median :5.00   Median :-0.8671  
##  Mean   :4.987   Mean   :4.986   Mean   :4.60   Mean   :-1.0958  
##  3rd Qu.:7.000   3rd Qu.:7.000   3rd Qu.:6.25   3rd Qu.: 0.6919  
##  Max.   :8.000   Max.   :8.000   Max.   :8.00   Max.   : 5.1962  
##  NA's   :34      NA's   :42      NA's   :32     NA's   :18       
##       Vx12              Vx15            maxO3v      
##  Min.   :-7.8785   Min.   :-9.000   Min.   : 42.00  
##  1st Qu.:-3.6941   1st Qu.:-3.759   1st Qu.: 70.00  
##  Median :-1.9284   Median :-1.710   Median : 82.50  
##  Mean   :-1.6853   Mean   :-1.830   Mean   : 89.39  
##  3rd Qu.:-0.1302   3rd Qu.: 0.000   3rd Qu.:101.00  
##  Max.   : 6.5778   Max.   : 3.830   Max.   :166.00  
##  NA's   :10        NA's   :21       NA's   :12
head(don)
##          maxO3   T9  T12  T15 Ne9 Ne12 Ne15     Vx9    Vx12    Vx15 maxO3v
## 20010601    87 15.6 18.5   NA   4    4    8  0.6946 -1.7101 -0.6946     84
## 20010602    82   NA   NA   NA   5    5    7 -4.3301 -4.0000 -3.0000     87
## 20010603    92 15.3 17.6 19.5   2   NA   NA  2.9544      NA  0.5209     82
## 20010604   114 16.2 19.7   NA   1    1    0      NA  0.3473 -0.1736     92
## 20010605    94   NA 20.5 20.4  NA   NA   NA -0.5000 -2.9544 -4.3301    114
## 20010606    80 17.7 19.8 18.3   6   NA    7 -5.6382 -5.0000 -6.0000     94
#dim(don)
  • Load the libraries.
library(VIM)
library(FactoMineR)
library(missMDA)

1) Descriptive statistics with missing values

Q1 When could it be a good idea to delete rows or columns with missing values to work with a complete data set? Could you do it here?

dim(na.omit(don))
## [1] 13 11

Deleting rows or columns is possible as long as there is enough data left and the missing values are of the MCAR type (so that the complete cases form a random subsample of the original data). We then obtain unbiased estimators, but with more variance. Here, deleting the observations with missing data leaves a table with only 13 rows, so it is not an option.
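A quick simulation (illustrative only, simulated data rather than the ozone data) shows why complete-case analysis under MCAR is unbiased but wasteful: the estimate stays centred on the true value while its standard error grows.

```r
set.seed(42)
n <- 1e4
x <- rnorm(n, mean = 5)         # fully observed variable, true mean = 5
x_mcar <- x
x_mcar[sample(n, n / 2)] <- NA  # delete 50% of the values completely at random

mean(x)                         # full-data estimate of the mean
mean(x_mcar, na.rm = TRUE)      # complete-case estimate: still close to 5,
                                # but based on half the sample (more variance)
```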

First, we perform some descriptive statistics (how many missing values? how many variables and individuals with missing values?) and try to inspect and visualize the pattern of missing entries to get hints on the mechanism. For this purpose, we use the R package VIM (Visualization and Imputation of Missing Values, by Matthias Templ) as well as Multiple Correspondence Analysis (FactoMineR package). You should install the package VIM, then you can check the documentation by executing

?VIM

The package naniar can also be used. It was developed by Nick Tierney and is based on ggplot2. naniar provides principled, tidy ways to summarise, visualise and manipulate missing data, with minimal deviations from the workflows of ggplot2 and tidy data.

library(naniar)
gg_miss_var(don)

The VIM function aggr calculates and represents the number of missing entries in each variable and for certain combinations of variables (which tend to be missing simultaneously).

res <- summary(aggr(don, sortVar = TRUE))$combinations

## 
##  Variables sorted by number of missings: 
##  Variable      Count
##      Ne12 0.37500000
##        T9 0.33035714
##       T15 0.33035714
##       Ne9 0.30357143
##       T12 0.29464286
##      Ne15 0.28571429
##      Vx15 0.18750000
##       Vx9 0.16071429
##     maxO3 0.14285714
##    maxO3v 0.10714286
##      Vx12 0.08928571
head(res[rev(order(res[,2])),])
##             Combinations Count   Percent
## 1  0:0:0:0:0:0:0:0:0:0:0    13 11.607143
## 45 0:1:1:1:0:0:0:0:0:0:0     7  6.250000
## 10 0:0:0:0:0:1:0:0:0:0:0     5  4.464286
## 35 0:1:0:0:0:0:0:0:0:0:0     4  3.571429
## 41 0:1:0:0:1:1:1:0:0:0:0     3  2.678571
## 28 0:0:1:0:0:0:0:0:0:0:0     3  2.678571

We can see that the most frequent combination is the one where all the variables are observed (13 rows). The second most frequent is the one where T9, T12 and T15 are simultaneously missing (7 rows; 1 means missing, 0 observed — note the 1s for the second, third and fourth variables). The graph in the right panel represents the patterns, with blue for observed and red for missing.

The VIM function matrixplot creates a matrix plot in which all cells of a data matrix are visualized by rectangles. Available data is coded according to a continuous color scheme (gray scale), while missing/imputed data is visualized by a clearly distinguishable color (red). This is useful to check if there is an association between the value of a variable and the missingness of another one.

matrixplot(don, sortby = 2)

## 
## Click in a column to sort by the corresponding variable.
## To regain use of the VIM GUI and the R console, click outside the plot region.
#Here the variable selected is variable 2. 

(Q2) Do you observe any associations between the missing entries? When values are missing on one variable, does it correspond to small or large values on another one?

We observe that the temperature variables T9, T12 and T15 tend to be missing together (probably indicating that the thermometers failed), as do the nebulosity variables Ne9, Ne12 and Ne15: we see more “red” cells for these combinations. We do not see an excess of dark or light cells in the rows where a variable is missing, which would have indicated that, when T9 is missing for instance, it corresponds to high or low values of another variable, suggesting MAR missingness. Here everything points to MCAR values.

The VIM function marginplot creates a scatterplot with additional information on the missing values. If you plot the variables (x, y), the points with no missing values are represented as in a standard scatterplot. The points for which x (resp. y) is missing are represented in red along the y (resp. x) axis. In addition, boxplots of the x and y variables are represented along the axes (in red, the boxplot of x restricted to the rows where y is missing; in blue, restricted to the rows where y is observed).

marginplot(don[,c("T9","maxO3")])

(Q3) Interpret this plot

We can see that the distribution of T9 is the same whether maxO3 is observed or missing. A discrepancy between the two boxplots (red and blue) would have implied that the maxO3 missingness mechanism depends on the values of T9 (all high or all low, for instance). Such a discrepancy would be a reason to suspect an MAR mechanism.

(R1) Create a categorical dataset with “o” when the value of the cell is observed and “m” when it is missing, and with the same row and column names as in the original data. Then, you can perform Multiple Correspondence Analysis with the MCA function from the FactoMineR package.

?MCA

MCA can be seen as the counterpart of PCA for categorical data and is used here to study associations between missing and observed entries. MCA is a straightforward tool to visualise the missing-data pattern even when the number of variables is large. It shows whether missing values occur simultaneously for several variables, or whether missing values on some variables occur when other variables are observed.

data_miss <- data.frame(is.na(don))
data_miss <- apply(X = data_miss, FUN = function(x) if (x) "m" else "o", MARGIN = c(1, 2))

# data_miss <- as_shadow(don) with the naniar package.
res.mca <- MCA(data_miss, graph = F)
plot(res.mca, invis = "ind", title = "MCA graph of the categories", cex  = 0.5)

2) PCA with missing values

Then, before modeling the data, we perform a PCA with missing values to explore the correlation between variables. We use the R package missMDA, dedicated to principal component methods with missing values and to imputing data with PC methods.

(R2) Determine the number of components ncp to keep using the estim_ncpPCA function. Perform PCA with missing values using the imputePCA function and ncp components. Then plot the correlation circle.

library(missMDA)
?estim_ncpPCA
?imputePCA

The package missMDA allows the use of principal component methods for an incomplete data set. To achieve this goal in the case of PCA, the missing values are predicted using the iterative PCA algorithm for a predefined number of dimensions. Then, PCA is performed on the imputed data set. The single imputation step requires tuning the number of dimensions used to impute the data.

nb <- estim_ncpPCA(don,method.cv = "Kfold", verbose = FALSE) # estimate the number of components from incomplete data
#(available methods include GCV to approximate CV)
nb$ncp #2
## [1] 2
plot(0:5, nb$criterion, xlab = "nb dim", ylab = "MSEP")

res.comp <- imputePCA(don, ncp = nb$ncp) # iterativePCA algorithm
res.comp$completeObs[1:3,] # the imputed data set
##          maxO3      T9      T12      T15 Ne9     Ne12     Ne15     Vx9
## 20010601    87 15.6000 18.50000 20.47146   4 4.000000 8.000000  0.6946
## 20010602    82 18.5047 20.86986 21.79932   5 5.000000 7.000000 -4.3301
## 20010603    92 15.3000 17.60000 19.50000   2 3.984066 3.812104  2.9544
##               Vx12    Vx15 maxO3v
## 20010601 -1.710100 -0.6946     84
## 20010602 -4.000000 -3.0000     87
## 20010603  1.950563  0.5209     82
imp <- cbind.data.frame(res.comp$completeObs,WindDirection)

res.pca <- PCA(imp, quanti.sup = 1, quali.sup = 12, graph=FALSE)
plot(res.pca, hab=12, lab="quali");

plot(res.pca, choix="var")

head(res.pca$ind$coord) #scores (principal components)
##               Dim.1      Dim.2       Dim.3      Dim.4      Dim.5
## 20010601 -0.6604580 -1.2048271  0.61855241 -0.6560074 -0.7455363
## 20010602 -1.2317545  1.0465411 -0.01998267 -0.6800051 -0.5946193
## 20010603  0.7984643 -2.7299508 -0.32698535 -0.1565173 -0.4888432
## 20010604  2.5423205 -1.7435774 -2.11763025  0.3045527 -0.1854849
## 20010605 -0.4047517  0.8406578  0.10643553  0.6279866 -0.5625156
## 20010606 -2.6701824  1.6934864 -0.20277160 -0.1967543 -0.8592908

The incomplete data set can be imputed using the function imputePCA performing the iterative PCA algorithm, specifying the number of dimensions through the argument ncp=2. At convergence the algorithm provides both an estimation of the scores and loadings as well as a completed data set. The imputePCA function outputs the imputed data set. The completed data set is in the object completeObs. The imputePCA function also outputs the fitted matrix \(\hat X\) in the object fitted.

(Q4) Could you guess how cross-validation is performed to select the number of components?

The cross-validation is performed with the Kfold method. A percentage pNA of missing values is inserted and predicted with a PCA model using ncp.min to ncp.max dimensions. This process is repeated nbsim times. The number of components which leads to the smallest MSEP (Mean Squared Error of Prediction) is retained.

Through the argument method.cv, the function estim_ncpPCA proposes several cross-validation procedures to choose this number. The default method is the generalised cross-validation method (method.cv=“gcv”). It consists in searching the number of dimensions which minimises the generalised cross-validation criterion, which can be seen as an approximation of the leave-one-out cross-validation criterion. The procedure is very fast, because it does not require explicitly adding missing values and predicting them for each cell of the data set. However, the number of dimensions minimising the criterion can sometimes be ambiguous when several local minima occur. In such a case, more computationally intensive methods, those performing explicit cross-validation, can be used, such as Kfold (method.cv=“Kfold”) or leave-one-out (method.cv=“loo”).

Kfold cross-validation suggests to retain 2 dimensions for the imputation of the data set.

3) Multiple imputation

Generate multiple data sets

We perform multiple imputation under three approaches:

1) Joint modelling (one joint probabilistic model for all the variables together): we use the R package Amelia, which by default assumes a Gaussian distribution;
2) Conditional modelling (one model per variable): we use the R package mice, which by default uses one linear regression model per variable;
3) A PCA-based model: we use the R package missMDA.

For the mice and MIPCA approaches, we generate 100 imputed data sets (and 5 with Amelia in the code below).

library(Amelia)
?amelia
res.amelia <- amelia(don, m = 5)  
## -- Imputation 1 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
##  41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
##  61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
##  81 82 83 84 85 86 87 88
## 
## -- Imputation 2 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
##  41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
##  61 62 63 64 65 66 67 68 69 70
## 
## -- Imputation 3 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
##  41 42 43
## 
## -- Imputation 4 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
##  41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 
## 
## -- Imputation 5 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
##  41 42 43 44 45 46 47 48 49 50 51
#names(res.amelia$imputations) 
#res.amelia$imputations$imp1# the first imputed data set
library(mice)
imp.mice <- mice(don, m = 100, defaultMethod = "norm.boot") # the variability of the parameters is obtained by bootstrap
  • Now generate 100 imputed data sets with the MIPCA method and 2 components. Store the result in a variable called res.MIPCA.
?MIPCA
?plot.MIPCA
res.MIPCA <- MIPCA(don, ncp = 2, nboot = 100) # MI with PCA using 2 dimensions 

The function MIPCA gives as output the data set imputed by the iterative PCA algorithm (in res.imputePCA) and the other data sets generated by the MIPCA algorithm (in res.MI). The number of data sets generated by this algorithm is controlled by the nboot argument, equal to 100 by default. The other arguments of this function are the same as those for the imputePCA function.

Inspect the imputed values

Exploratory analysis is very important, even at this stage of the analysis.

We will inspect the imputed values created to know if the imputation method should require more investigation or if we can continue and analyze the data. A common practice consists in comparing the distribution of the imputed values and of the observed values. Check the compare.density function.

compare.density(res.amelia, var = "T12")

(Q5) Do both distributions need to be close? Could the missing values differ from the observed ones both in spread and in location?

Note that a difference between these distributions does not mean that the model is unsuitable. Indeed, when the missing data mechanism is not MCAR, it could make sense to observe differences between the distribution of imputed values and the distribution of observed values. However, if differences occur, more investigations would be required to try to explain them.

The quality of imputation can also be assessed with cross-validation using the overimpute function. Each observed value is deleted and for each one 100 values are predicted (using the same MI method) and the mean and 90% confidence intervals are computed for these 100 values. Then, we inspect whether the observed value falls within the obtained interval. On the graph, the y=x line is plotted (where the imputations should fall if they were perfect), as well as the mean (dots) and intervals (lines) for each value. Around ninety percent of these confidence intervals should contain the y = x line, which means that the true observed value falls within this range. The color of the line (as coded in the legend) represents the fraction of missing observations in the pattern of missingness for that observation (ex: blue=0-2 missing entries).

overimpute(res.amelia, var = "maxO3")

(Q6) Comment on the quality of the imputation.

We can also examine the variability by projecting as supplementary tables the imputed data sets on the PCA configuration (plot the results of MI with PCA).

#plot(res.MIPCA,choice= "ind.supp")
#plot(res.MIPCA,choice= "var")

The plots represent the projection of the individuals (top) and variables (bottom) of each imputed data set as supplementary elements onto the reference configuration obtained with the iterative PCA algorithm. For the individuals, a confidence area is constructed for each, and if one has no missing entries, its confidence area is restricted to a point. All the plots show that the variability across different imputations is small and a user can interpret the PCA results with confidence.

Perform regression

MI aims at applying a statistical method to an incomplete data set. We now apply a regression model to each imputed data set from the amelia and MIPCA methods.

(R3) Apply a regression model on each imputed data set of the amelia method. Hint: a regression with several variables can be performed as follows ‘lm(formula=“maxO3 ~ T9+T12”, data =don)’. You can also use the function with.

(R4) Now do the same with the imputed datasets of the MIPCA method.

resamelia <- lapply(res.amelia$imputations, as.data.frame)
# A regression on each imputed dataset
fitamelia<-lapply(resamelia, lm, 
                  formula="maxO3~ T9+T12+T15+Ne9+Ne12+Ne15+Vx9+Vx12+Vx15+maxO3v")  
# fitamelia <- lapply(resamelia, with, 
#                     lm(maxO3 ~ T9+T12+T15+Ne9+Ne12+Ne15+Vx9+Vx12+Vx15+maxO3v))
imp.mice <- mice(don, m = 100, defaultMethod = "norm.boot") # the variability of the parameters is obtained by bootstrap
lm.mice.out <- with(imp.mice, lm(maxO3 ~ T9+T12+T15+Ne9+Ne12+Ne15+Vx9+Vx12+Vx15+maxO3v))
res.MIPCA <- lapply(res.MIPCA$res.MI, as.data.frame)
fitMIPCA<-lapply(res.MIPCA,lm, formula="maxO3~T9+T12+T15+Ne9+Ne12+Ne15+Vx9+Vx12+Vx15+maxO3v")
  • Aggregate the results of the regression with multiple imputation according to Rubin’s rules, for MI with amelia and with PCA, using the pool function from the mice package.
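For a scalar parameter, Rubin’s rules combine the \(M\) per-imputation estimates \(\hat\theta_m\) and their estimated variances \(\hat V_m\) as follows (this is what pool computes for each regression coefficient):

\[
\bar\theta = \frac{1}{M}\sum_{m=1}^{M}\hat\theta_m, \qquad
W = \frac{1}{M}\sum_{m=1}^{M}\hat V_m, \qquad
B = \frac{1}{M-1}\sum_{m=1}^{M}\left(\hat\theta_m - \bar\theta\right)^2,
\]
\[
T = W + \left(1 + \frac{1}{M}\right)B,
\]

where \(W\) is the within-imputation variance, \(B\) the between-imputation variance (the extra uncertainty due to the missing values), and \(T\) the total variance whose square root gives the standard errors reported in the pooled summaries.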
poolamelia<-pool(as.mira(fitamelia)) 
summary(poolamelia)
##               estimate   std.error  statistic        df     p.value
## (Intercept) 31.7413516 13.11305165  2.4205923 29.444173 0.017827180
## T9           2.0594446  2.70476602  0.7614132  4.987863 0.448713112
## T12          2.1899467  2.16757686  1.0103202  7.895154 0.315475104
## T15         -1.0386685  1.03964740 -0.9990584 77.767803 0.320864894
## Ne9         -2.0384614  1.23037085 -1.6567862 12.286064 0.101592140
## Ne12        -3.0068079  1.95110007 -1.5410834 12.412691 0.127355940
## Ne15         0.2259209  1.44548922  0.1562937 13.147624 0.876206529
## Vx9          1.5348750  1.70158793  0.9020251  6.006911 0.369828827
## Vx12         0.1828039  1.73979681  0.1050720  9.184823 0.916589212
## Vx15         0.1112062  1.11146246  0.1000540 20.256904 0.920558973
## maxO3v       0.2670900  0.08381708  3.1865818 21.433386 0.002073369
poolMIPCA<-pool(as.mira(fitMIPCA))
summary(poolMIPCA)
##               estimate  std.error  statistic       df     p.value
## (Intercept) 13.2504338 17.9279604  0.7390932 64.01933 0.462489680
## T9           0.9204383  1.3080957  0.7036475 44.37785 0.484145106
## T12          1.6168145  1.0233209  1.5799683 49.28720 0.118929421
## T15          0.8186574  0.9247615  0.8852632 48.70017 0.379251626
## Ne9         -1.2619926  1.1601673 -1.0877678 57.78719 0.280682333
## Ne12        -1.6941487  1.5907938 -1.0649706 47.39347 0.290797636
## Ne15         0.3554946  1.1509317  0.3088755 61.61979 0.758396090
## Vx9          0.8028083  1.0918619  0.7352654 62.69895 0.464801329
## Vx12         0.8698144  1.2074738  0.7203588 58.77838 0.473865952
## Vx15         0.2979689  1.1917371  0.2500291 57.64791 0.803347056
## maxO3v       0.2560124  0.0921200  2.7791181 65.52953 0.007104977
#pool.mice <- pool(lm.mice.out)
#summary(pool.mice)

(R5) Write a function that removes the variables with the largest p-values step by step (each time a variable is removed, the regression model is fitted again) until all remaining variables are significant.

don2 <- don
reg <- lm(maxO3 ~ ., data = don2)
while (any(summary(reg)$coeff[-1, 4] > 0.05)) {
  # drop the variable with the largest p-value and refit
  don2 <- don2[, !(colnames(don2) %in% names(which.max(summary(reg)$coeff[-1, 4])))]
  reg <- lm(maxO3 ~ ., data = don2)
}

We combine the results and perform the regression with missing values on the selected submodel.

# Submodel to compare
fitMIPCA<-lapply(res.MIPCA,lm, formula="maxO3~ T12+Ne9+Vx12+maxO3v")
poolMIPCA<-pool(as.mira(fitMIPCA))
summary(poolMIPCA)
##               estimate   std.error  statistic       df      p.value
## (Intercept) 10.7575019 14.33628678  0.7503688 63.61834 4.554321e-01
## T12          2.9249926  0.60696476  4.8190485 67.59434 7.573511e-06
## Ne9         -2.0162768  1.04165802 -1.9356418 63.22746 5.676463e-02
## Vx12         1.7131555  0.75360677  2.2732751 66.76914 2.593724e-02
## maxO3v       0.3308249  0.08291372  3.9899893 73.37628 1.548847e-04
#lm.mice.out <- with(imp.mice, lm(maxO3 ~ T12+Ne9+Vx12+maxO3v))
#pool.mice <- pool(lm.mice.out)
#summary(pool.mice)
fitamelia<-lapply(resamelia,lm, formula="maxO3~ T12+Ne9+Vx12+maxO3v")
poolamelia<-pool(as.mira(fitamelia))
summary(poolamelia) 
##               estimate  std.error statistic       df      p.value
## (Intercept) 13.9449001 10.1053464  1.379953 67.72540 1.721399e-01
## T12          3.1391473  0.5018982  6.254550 42.60877 3.076559e-08
## Ne9         -3.2082540  0.8439403 -3.801518 24.07200 3.107912e-04
## Vx12         1.1873687  0.6674422  1.778984 28.06404 7.972985e-02
## maxO3v       0.3053774  0.0675465  4.520995 59.53676 2.540656e-05

Single imputation of categorical and mixed data with MCA and FAMD

 ## Another example of imputation of categorical data
data(vnf)

# Look at the pattern of missing values with MCA
MCA(vnf)

## **Results of the Multiple Correspondence Analysis (MCA)**
## The analysis was performed on 1232 individuals, described by 14 variables
## *The results are available in the following objects:
## 
##    name              description                       
## 1  "$eig"            "eigenvalues"                     
## 2  "$var"            "results for the variables"       
## 3  "$var$coord"      "coord. of the categories"        
## 4  "$var$cos2"       "cos2 for the categories"         
## 5  "$var$contrib"    "contributions of the categories" 
## 6  "$var$v.test"     "v-test for the categories"       
## 7  "$ind"            "results for the individuals"     
## 8  "$ind$coord"      "coord. for the individuals"      
## 9  "$ind$cos2"       "cos2 for the individuals"        
## 10 "$ind$contrib"    "contributions of the individuals"
## 11 "$call"           "intermediate results"            
## 12 "$call$marge.col" "weights of columns"              
## 13 "$call$marge.li"  "weights of rows"
#1) select the number of components
#nb <- estim_ncpMCA(vnf, ncp.max = 5) #Time-consuming, nb = 4

#2) Impute the indicator matrix 
res.impute <- imputeMCA(vnf, ncp = 4)
res.impute$tab.disj[1:5, 1:5]
##   Q7.1.1 Q7.1.2 Q7.1.3 Q7.2.1 Q7.2.2
## 1      1      0      0      1      0
## 2      1      0      0      1      0
## 3      1      0      0      1      0
## 4      1      0      0      1      0
## 5      1      0      0      0      0
res.impute$comp[1:5, 1:5]
##   Q7.1 Q7.2 Q7.4 Q8.1 Q8.2
## 1    1    1    3    1    2
## 2    1    1    1    2    1
## 3    1    1    1    2    2
## 4    1    1    1    1    1
## 5    1    3    1    1    1
## 2.2) Single imputation for mixed data with FAMD and with missForest
 #res.ncp <- estim_ncpFAMD(ozo)
 res.famd <-imputeFAMD(ozo, ncp = 2)
 res.famd$completeObs[1:5,1:5]
##          maxO3       T9      T12      T15      Ne9
## 20010601    87 15.60000 18.50000 20.17592 4.000000
## 20010602    82 17.20486 19.57898 20.42791 5.000000
## 20010603    92 15.30000 17.60000 19.50000 2.000000
## 20010604   114 16.20000 19.70000 24.34364 1.000000
## 20010605    94 18.89175 20.50000 20.40000 5.395526
library(missForest)
 res.rf <- missForest(ozo)
##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##   missForest iteration 4 in progress...done!
##   missForest iteration 5 in progress...done!
 res.rf$ximp[1:5,1:5]
##          maxO3     T9    T12    T15  Ne9
## 20010601    87 15.600 18.500 19.344 4.00
## 20010602    82 16.653 18.637 18.754 5.00
## 20010603    92 15.300 17.600 19.500 2.00
## 20010604   114 16.200 19.700 22.058 1.00
## 20010605    94 18.414 20.500 20.400 5.69