Is this considered data leakage?

Hello again! I work in car insurance. For simplicity, let's say there are two ways to report a claim: by calling a 1-800 number or through a mobile app. I am interested in studying which claim attributes are associated with each report method. So, I might want to make a statement like "intersection accidents increase the odds of reporting by phone by X% relative to single vehicle accidents." In reality, I will probably use a multinomial model to account for n>2 report methods.
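As a rough illustration of the kind of statement above, here is a minimal Python sketch that turns a 2x2 table of counts into an odds ratio and a percent change in odds. The counts are invented for illustration and the function name `odds_ratio` is hypothetical; an actual analysis would more likely fit a logistic or multinomial model that adjusts for other claim attributes.

```python
def odds_ratio(phone_exposed, app_exposed, phone_reference, app_reference):
    """Odds of phone (vs. app) reporting in one group relative to a reference group."""
    odds_exposed = phone_exposed / app_exposed
    odds_reference = phone_reference / app_reference
    return odds_exposed / odds_reference


# Made-up counts, for illustration only:
# intersection accidents:   300 reported by phone, 200 by app
# single vehicle accidents: 200 reported by phone, 300 by app
ratio = odds_ratio(300, 200, 200, 300)

# An odds ratio of r corresponds to a (r - 1) * 100 percent change in the odds.
percent_change = (ratio - 1) * 100

print(f"odds ratio: {ratio:.2f}")
print(f"change in odds of phone reporting: {percent_change:.0f}%")
```

With these invented counts, the odds of phone reporting are about 2.25 times higher for intersection accidents, i.e. roughly a 125% increase.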

I have always thought that we can't use predictor variables that populate after the outcome variable has been determined. What if I'm not trying to predict in real time, but rather collect retrospective data and examine relationships after the claim is settled? For example, can I use claim severity to say something like "for each $1000 increase in damage, the odds that the claim was reported by phone increase by X%"? This is my dilemma, because the damage assessment is not determined until after the claim is reported. But it is plausible that a heavily damaged vehicle might skew towards one report method over another. I could say the same thing about type of accident: we only know it after the customer tells us it was an intersection accident. Is there a better example (in other industries) where this type of retrospective analysis is used?


Less is more. Stay pure. Stay poor.
If you are looking solely for overall predictive value, it is relevant to include causes of the target, effects of the target, and other causes of those effects (spouses). So to best understand your genetics, I would want to know your parents' genes, your kids' genes, and your wife's genes (the other cause of your children). However, I think it may be difficult to figure out the individual contributions of these variables if they are connected to each other in some way. Look up the concept of a Markov blanket. I feel like in prediction you can see effects reported based on reverse causality, e.g., the probability of disease given a positive test. You can't get a positive test unless you already have the disease.
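To make the parents/children/spouses idea concrete, here is a minimal Python sketch that reads a Markov blanket off a small directed graph. The edge list and the function name `markov_blanket` are made up for this example, and it assumes the causal graph is already known, which is rarely the case in practice.

```python
def markov_blanket(edges, target):
    """Return the parents, children, and spouses of `target` in a directed graph.

    `edges` is a list of (cause, effect) pairs.
    """
    parents = {src for src, dst in edges if dst == target}
    children = {dst for src, dst in edges if src == target}
    # Spouses: other causes of the target's children.
    spouses = {src for src, dst in edges if dst in children and src != target}
    return parents | children | spouses


# Hypothetical family graph matching the genetics example.
edges = [
    ("grandmother", "mother"),
    ("mother", "you"),
    ("father", "you"),
    ("you", "child"),
    ("spouse", "child"),
]

blanket = markov_blanket(edges, "you")
print(sorted(blanket))  # ['child', 'father', 'mother', 'spouse']
```

Note that the grandmother is left out: once the mother is known, the grandmother adds nothing further about the target in this graph.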

I haven't investigated this topic in depth, so I would be interested in what you find. But the distinction here is that you are trying to predict, not to estimate causal effects, since you are ignoring the arrow of time. One additional comment: a model doesn't know the difference between:

X -> Y
X <- Y

It is up to you to know the context.
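A small simulation can illustrate this point. The sketch below generates data in which X causes Y, then measures the association in both directions. The data-generating process (a slope of 2 plus Gaussian noise) and the helper `pearson` are assumptions chosen only for illustration.

```python
import random


def pearson(a, b):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5


rng = random.Random(0)

# Simulated truth: X causes Y (assumed slope of 2, unit Gaussian noise).
x = [rng.gauss(0, 1) for _ in range(1000)]
y = [2 * xi + rng.gauss(0, 1) for xi in x]

r_x_to_y = pearson(x, y)  # "predict Y from X"
r_y_to_x = pearson(y, x)  # "predict X from Y"

print(f"association using X to explain Y: {r_x_to_y:.3f}")
print(f"association using Y to explain X: {r_y_to_x:.3f}")
```

The two numbers match, so the strength of association alone says nothing about which way the arrow points.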
Thanks for the input. I will read more into the topic. I'm thinking about your disease/positive test example. I can get a positive test without having the disease. It's just a false positive, right?


Less is more. Stay pure. Stay poor.
Correct, if the screening variable is not that specific, you can get a false positive. We are talking about an epistemological setting where not everything is known down to the micro level, so things are not deterministic. We are in statistics land!
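For anyone who wants to put numbers on the false positive point, here is a minimal Python sketch of Bayes' rule for the probability of disease given a positive test. The prevalence, sensitivity, and specificity values are invented for illustration, and the function name is hypothetical.

```python
def prob_disease_given_positive(prevalence, sensitivity, specificity):
    """Bayes' rule: P(disease | positive test)."""
    true_positive = prevalence * sensitivity
    false_positive = (1 - prevalence) * (1 - specificity)
    return true_positive / (true_positive + false_positive)


# Made-up values: 1% prevalence, 95% sensitivity, 90% specificity.
ppv = prob_disease_given_positive(0.01, 0.95, 0.90)
print(f"P(disease | positive test) = {ppv:.3f}")  # roughly 0.088
```

With these assumed values, fewer than one in ten positive tests comes from someone who has the disease, because false positives from the much larger healthy group outnumber the true positives.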