Imputing missing values using correlated data?

Hello everyone,

I'm new here, nice to meet you :) As English is not my native language, I apologise if I'm not always clear. Please don't hesitate to ask me to rephrase if that happens.

I have a question for you:
Suppose I have a time series, i.e. the values of a variable measured over time. Some values in this series are missing, and I would like to estimate them.
I also have a second series for a second variable. This one is complete (it has all the values I need). This variable is strongly correlated with the first one (R² close to 1, p-value < 0.001).

Can I estimate the missing values of the first variable from the data of the second variable?
I should specify that although the two variables are very strongly correlated, I can't really assume that one explains the other (so I can't use linear regression, unless I'm wrong).

I hope that my question is clear and I thank you in advance.
Missing data analysis is complicated.

It all depends on how your data are missing (i.e. missing completely at random, missing at random, or missing not at random), how much of the data is missing, the type of data, and what you plan to do with the data once imputed.

If the two variables are so highly correlated (assuming this high correlation is not the result of missing data), why can't you use the complete variable instead for your analyses?


Omega Contributor
I will echo evelyn13's comments: it depends on why they are missing. Look up monotone missing patterns. Does that seem applicable, given that this is time series data?


Fortran must die
In the context of time series, it's often stated that, unlike with regular data, you cannot be missing any points in the series being analyzed. Issues like MAR and MCAR don't apply to it, perhaps because of the issue of seasonality.

Interestingly, in discussions of missing data analysis I have not seen whether the data being a time series has any impact on how you fix the data, i.e. replace missing values.


Omega Contributor
There are slightly different approaches in time series, I believe (last observation carried forward, nearest neighbor, etc.). Also, depending on your data, mixed models can function with missing data, though that approach could be less than ideal if data are MNAR (likelihood-based mixed models generally assume the data are MAR).
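
To make the time series fills mentioned above concrete, here is a minimal sketch using pandas. The series and its values are made up for illustration, and this is only one way such fills could be done. It shows last observation carried forward next to linear interpolation in time.

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperature series with a three-day gap (made-up values).
days = pd.date_range("2020-01-01", periods=7, freq="D")
temp = pd.Series([2.0, 3.0, np.nan, np.nan, np.nan, 7.0, 8.0], index=days)

# Last observation carried forward: each gap day repeats the last seen value.
locf = temp.ffill()

# Linear interpolation in time: gap days lie on a straight line
# between the neighbouring observed values.
interp = temp.interpolate(method="time")

print(pd.DataFrame({"observed": temp, "locf": locf, "interp": interp}))
```

Neither fill uses a second, correlated series, and a gap at the very start of a series has no earlier value to carry forward, so these are mostly suited to short gaps inside the record.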


Fortran must die
There are many approaches to replacing missing time series data. I tend not to think of that as replacing missing data in the MI (multiple imputation) context; the logic appears to be different. I don't think these approaches even consider the issue of MAR or MCAR. I suspect they assume that missing time series data is entirely accidental [it just was not gathered that period], so all such data is MCAR.
Thank you for all your answers. Indeed the data are MCAR, if I'm not mistaken. They are daily mean temperature values over a long period (several years) at different sites. At some sites, data are missing for some months or some years because the device that measures temperature was not active. Most of the time it is the first years that are missing (so I'd say this is a monotone missing pattern?). In total, I'd say about 1/6 of the data are missing.

The "complete" variable is a kind of temperature estimator, less precise than the daily mean temperature measurements. What I'd like to do is explain a third variable with the daily mean temperature variable. I guess I could explain this third variable with the "complete" variable instead, but that variable does not really match my hypothesis (and it tends to overestimate daily temperature, though it is well correlated with it). Actually I don't know, maybe keeping the complete variable as the explanatory variable would be the best option. But my supervisors asked me to estimate the "real" temperature variable (I'm a master's student intern); maybe I can discuss this with them. In any case, I can't do my analysis with missing data, because I'm using a specific statistical tool that needs all the data to be present (at least for several periods within each year).

I also thought about nearest-neighbor methods, but wouldn't that mean I could not use the information given by the correlated variable? I'll gladly take any ideas or advice, I admit I'm only a beginner :) But maybe keeping the complete variable is still the best idea?
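
One way the correlated series could be used is sketched below: fit a linear calibration on the days where both series exist, then predict the measured temperature on the days where only the complete series exists. Using a regression purely for prediction does not require assuming that one variable causes the other. The data, the names `proxy` and `temp`, and the coefficients are all invented for illustration; this is a sketch under those assumptions, not a recommended final method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: `proxy` is the complete, less precise series; `temp` is the
# measured series, assumed missing for the first 60 days.
n = 360
proxy = 10 + 8 * np.sin(np.linspace(0, 2 * np.pi, n)) + rng.normal(0, 0.5, n)
temp = 0.9 * proxy - 1.0 + rng.normal(0, 0.3, n)  # proxy reads too warm
temp[:60] = np.nan

observed = ~np.isnan(temp)

# Fit temp = intercept + slope * proxy on the days where both are available.
slope, intercept = np.polyfit(proxy[observed], temp[observed], deg=1)

# Predict the missing days from the proxy; measured values stay untouched.
filled = temp.copy()
filled[~observed] = intercept + slope * proxy[~observed]

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

Filling gaps with fitted values alone makes the imputed days look less variable than real measurements would be, and the calibration may differ by site or season, so it would be worth checking the fit separately for each site.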


Fortran must die
You are learning the joy [aka pain] of working with real data in the real world - where commonly there are no easy or good solutions.

Multiple imputation, which uses existing variables that the missing data are correlated with to create values for the missing data, might work. I have not seen it used with time series, although I assume it can be. You might look it up and see if it deals with your problem. It's not the simplest of methods, although if you are a master's student it will be useful to learn.
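
The idea behind multiple imputation can be sketched roughly as follows, assuming a simple linear imputation model and made-up data (the names `proxy` and `temp` and all numbers are hypothetical). Several completed copies of the series are created, each with fresh random noise added to the predictions, the analysis is run on each copy, and the results are pooled. This simplified version keeps the regression coefficients fixed and ignores autocorrelation, both of which a full multiple imputation procedure for time series would need to address.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: a complete proxy series and a measured series that is
# assumed missing for the first 60 days.
n = 360
proxy = 10 + 8 * np.sin(np.linspace(0, 2 * np.pi, n)) + rng.normal(0, 0.5, n)
temp = 0.9 * proxy - 1.0 + rng.normal(0, 0.3, n)
temp[:60] = np.nan
observed = ~np.isnan(temp)

# Imputation model fitted on the observed days.
slope, intercept = np.polyfit(proxy[observed], temp[observed], deg=1)
residuals = temp[observed] - (intercept + slope * proxy[observed])
resid_sd = residuals.std(ddof=2)  # two fitted parameters

# Create m completed copies; each adds fresh noise to the predictions.
m = 20
estimates = np.empty(m)
for i in range(m):
    completed = temp.copy()
    noise = rng.normal(0, resid_sd, size=(~observed).sum())
    completed[~observed] = intercept + slope * proxy[~observed] + noise
    estimates[i] = completed.mean()  # stand-in for the real analysis

pooled = estimates.mean()            # pooled point estimate
between_var = estimates.var(ddof=1)  # spread caused by the imputation step

print(f"pooled estimate={pooled:.3f}, between-imputation variance={between_var:.6f}")
```

Under Rubin's rules the total variance of the pooled estimate would also include the average within-copy variance of each analysis, which is left out here because the stand-in analysis is just a mean.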
My apologies, I read the OP in a rush and completely missed that it was time series data.
Out of interest, what analyses are you planning on using?
Thank you for your kind answers!

Sorry for answering so late. I'm planning to use a regression analysis or a time series analysis to interpret the effects of these climatic covariates (in interaction with other ones) on biological variables. My objective is not really to predict the time series, but rather to explain the links between the variables. I still don't really know how I'm going to do this, though. I'm still thinking about it :)