When the Outcome and Predictor Variables are (sort of) the same

w_scheidel

New Member
Hello, everyone,

I am working on a study with sensitive information, so it may suffice to read only the last two paragraphs, where I pose an analogous situation for my question.

Basically, a certain type of event in Illinois is required to be reported to two agencies: one legal and one social. One event, two actions, required by law. But we found a discrepancy between these two sets of reports, and we want to try to explain it. This is where our study begins.

Our study looks at the number of reports made to the social and legal agencies in each county in Illinois. One limitation is that we don't have access to the individual reports, so we assume that most reports in one agency correspond to reports in the other. In almost all counties (98 of 102), the number of social reports exceeds the number of legal reports, sometimes by a factor of 3. We have good reason to suspect why this might be the case, but assuming the reports mostly correspond to the same events is a limitation, since, again, we do not have individual-level data.

Our outcome variable is the number of social reports minus the number of legal reports. We would like to run four simple negative binomial regressions (we started with Poisson for a count-based approach, but moved to negative binomial due to overdispersion), in which the predictor variable is the count of each of four report types found in the social database. (The four counties with more legal than social reports are excluded, because a count outcome cannot take negative values.)
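As a minimal illustration of the overdispersion check that motivated the move from Poisson to negative binomial, here is a sketch in Python. All counts and variable names are made up for the example, not our actual data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical county-level gaps (social minus legal reports).
# Under a Poisson model the variance should roughly equal the mean;
# count data whose variance is much larger is overdispersed.
gap = rng.negative_binomial(2, 0.2, size=98)  # deliberately overdispersed

mean, var = gap.mean(), gap.var(ddof=1)
dispersion_ratio = var / mean  # ~1 for Poisson, >1 suggests overdispersion
print(dispersion_ratio)
```

If the ratio is well above 1, a negative binomial fit (e.g. `statsmodels`' `sm.GLM` with a `NegativeBinomial` family) is the usual next step.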

To give an analogy, since I cannot disclose the sensitive nature of our actual study: imagine comparing two different agencies' inventory reports of an orchard's apple supply, where the orchard is divided into 102 subplots of various sizes. One agency is more thorough than the other, but assuming the less thorough one isn't simply lazy, could the type of apple predict reporting rates? Maybe Granny Smiths are harder to see than Red Delicious because they blend in with the leaves, and the counters aren't trained to count carefully. But we don't have data on apple type in the lower-quality inventory, because its staff are too embarrassed or underpaid to file that kind of information. So we are asking: is the difference between the two agencies related to the type of apple, and can we analyze this by using the difference between the two inventories as the outcome variable, and the number (or proportion?) of one of four kinds of apples found in each subplot by the more thorough agency as the predictor variable (for a total of four analyses)?

We talked with a stats consultant who said that the predictor variables are too similar to the outcome variable, that it is like trying to answer a question with the question itself. Obviously, if the predictor were the total number of apples found by the more thorough agency, the study would be meaningless; but far fewer Granny Smiths are reported than Red Delicious, and we really want to see whether there is some kind of analysis we can do. We appeal to the wisdom of this forum. Thank you.

Scheidel

noetsi

Fortran must die
Why don't you have access to the reports? In most states they would be public record, and the agencies would have to give them to you.

I am confused about what exactly you are predicting, but trying to predict something with itself seems pretty useless (I agree with the consultant, although again I am not sure exactly what you are trying to do). It would help to know the purpose of the study. When you say you found a difference in the reports, what does that mean? That one organization reports something more often? That it reports it differently? What are you actually trying to explain?

This seems like something you would address with simple summary statistics: this event occurs X times, agency A filed a report on it this percent of the time, and agency B filed a report this percent of the time. If they are legally required to do it, it's hard to believe you cannot obtain this information from the state, unless the state does not want to know (in which case you have a major problem; you can try the Freedom of Information Act, if your state has an equivalent — mine does, for example).

It would help to have more details on what you are actually trying to do.

Miner

TS Contributor
My background is in industrial quality, including industrial statistics. The situation you describe is very familiar to me and is similar to what you see in the news regarding the difference between actual crime rates and those reported to the FBI.

You see similar issues between inspectors, and between customers and suppliers, where one inspector will tell the line operator to fix an issue but will not actually reject it, while another inspector will reject and report it. It ends up being a difference in the operational definition of a nonconformance. The first inspector knows it's not right, but knows rejects are a lot of bother and may result in a heated argument with the line supervisor, so they use any ambiguity in the operational definition to rationalize why they don't need to report it. The other inspector applies a more stringent interpretation.

Yours is a public agency example. In some cases the operational definition is at fault; in others there may be political motives behind it.

w_scheidel

New Member
So we are looking at a specific category of child abuse that is required to be reported both to a state public health database and to child protective services (CPS). Far more cases are reported to the public health database than to CPS, which we believe is because a CPS report often results in family disruption, for better or for worse. However, for the kind of child abuse we're looking at, the law does not allow discretion in reporting. So we want to know: why does this discrepancy exist? We then look at different types of child abuse within this category to see whether the way the child presents to the mandated reporter affects which agency they report to.

Basically, we want to know whether it is possible to use the difference between those two databases as an outcome variable, and the number of a specific type of report within the public health database as a predictor variable. Even just eyeballing the data, one kind of presentation seems to be reported to CPS far less often than the others, and we want an analysis that can show that.
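To make the setup concrete, here is a minimal sketch of how the county-level outcome and predictor would be assembled. All counts and names are hypothetical, not our data:

```python
import numpy as np

rng = np.random.default_rng(1)
n_counties = 102

# Hypothetical per-county totals (made-up numbers).
social = rng.poisson(40, size=n_counties)   # public health database counts
legal = rng.poisson(35, size=n_counties)    # CPS counts, usually lower
type_a_social = rng.binomial(social, 0.3)   # one presentation type's share

gap = social - legal          # outcome: social minus legal reports
keep = gap >= 0               # exclude counties where the gap is negative
outcome = gap[keep]
predictor = type_a_social[keep]
print(outcome.shape, predictor.shape)
```

The `outcome` and `predictor` arrays would then go into one negative binomial regression per presentation type.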

The data were obtained through a FOIA request, but county-level data was the most granular we could get for the public health database.

Miner

TS Contributor
You might try a binary response using reported/not reported to CPS as the two levels.

w_scheidel

New Member
Thanks, Miner. The trouble with dichotomizing whether a public health report was also made to CPS is that we don't actually have the individual public health reports, only the number of reports made in each county for this category. Because we also have the individual reports from CPS for the same category, and the CPS count is usually lower, we assume the overlap between the county totals is substantial (though that is a limitation we need to mention). So we aren't able to dichotomize them. :/ That leaves us arguing that, if the public health database caught most or all of the cases, we can use its data to estimate how many of each kind of presentation there were, and then examine whether presentation type contributed to CPS reporting rates.

In the case of the link you sent, an analogous study would assess whether there was a difference when a bat, a knife, or fists were used in aggravated assault. We think far fewer fist cases were reported relative to bat cases, and we want to show that analytically. I wasn't able to find a similar kind of analysis to mimic in that article.

noetsi

Fortran must die
To me this seems like an issue better addressed by qualitative analysis, using focused interviews to find out what is being reported (and not) and why. It is very difficult to use quantitative models to get at internal processes, IMHO (my dissertation did this, so it's possible I am biased). That is, I don't think you can find out why certain data does and does not get reported with any quantitative process; you have to carefully observe the process and work with those who do the job to understand it.

I don't understand how you cannot have the public health reports. Aren't they required to be on record by law? (If not, that is really surprising to me.) Still, if you have the number of reports for the two agencies, why not simply present the difference between the two as an indicator of an issue? I don't think you need a predictor here, or that you would gain anything if you had one. This goes back to my earlier comment: as I think you believe, and I would agree, this is tied to process or procedural issues in the agency. So you show a problem exists via the discrepancy, and then follow up by analyzing the process and procedures.

If the issue is not why the discrepancy occurred, but how it varied by type, then other than interviews you might try to find a variable (or variables) that you think mimics the rate of each type and see how they compare over time (that is, the ratio of each). You could then argue that the rates should have been at a certain level when in fact they were not. Of course, that assumes you can find something to use this way. There is a name for this type of analysis, which is used in political science, but it has been so long that I cannot remember it.

You might want to look up reliability analysis as a measure when the issue is incorrectly reported data. I have not been involved in that type of analysis, but given how important this issue is, I suspect you would find relevant work. How you proceed honestly depends on whether you are doing this for someone in authority in a state, like a legislative committee, or as private researchers (as at a university). Some of my suggestions are of course useless if the organizations in question will not cooperate.

Have you made a public records request for the information you mention is not publicly released? It is possible the agency in question has the data but does not release it.

w_scheidel

New Member
Yes, that information should be out there somewhere, but getting it requires more communication than just between us and them; there is a lot of bureaucracy to shuffle through. But we do have the overall numbers, so we can show there is some difference, and we can certainly follow that up with other studies to look at why that might be the case. Thank you for both clarifying the difficulty of this question and reinforcing what seems to me one of the best options going forward.