Logistic regression with partly aggregated data

#1
Hi,

I want to use logistic regression on a data set. The y-variable is a binary outcome (adverse effects: yes or no) and the x-variable is the blood concentration of a protein (a continuous variable). I want to test whether higher concentration is related to more frequent adverse effects.

I have 450 observations. 150 observations (1/3 of all observations) have a concentration below the laboratory detection limit (<100 units); the rest lie between 100 and 3500 units.

My question: when doing a logistic regression with these data, where 1/3 are "pooled" at the detection limit, can I simply set these observations equal to the lowest detectable value (100 units), or do I need to make further adjustments to my regression/data to get the best model?

Thank you in advance
 

hlsmith

Not a robit
#2
Is there a known pattern in the values where they could be imputed? I would look to the literature as well to see if this has come up before.
 
#3
There are population-based concentration distributions for the x-variable, but I cannot discriminate between the observations, so I do not see how I could assign them different x-values. Would it be a statistical problem to assign them all the same value (the lowest detectable limit) when using logistic regression?
Alternatively, I may be able to find the most accurate average x-value.
I have yet to find discussions of similar problems.

Thank you in advance
 

hlsmith

Not a robit
#4
Is this the only term you are modeling in the logistic regression?

Hmm, not my area, but this has definitely come up before. Almost any approach is going to result in a loss of information. Does the bioassay say at what level the values become undetectable? Check that and the prior literature. One assumption in logistic regression is linearity in the logit, which in my interpretation means a linear relationship between the continuous predictor and the log-odds of the outcome. I am not sure how experienced you are, so I will throw out a few ideas, but first I will note that running a bunch of approaches and then selecting one risks false discovery of significant results.

The first thing I would think about doing is running a Generalized Additive Model (GAM) on the data you have, excluding the values at the lowest detectable value (LDV). This model will help you understand whether there is a linear relationship in the values that you do have. Another option is to select a couple of candidate values for the LDVs and run the model with each, to see what the choice does to your estimates. You could always run models with multiple values and report them all, if no issues arise. If you do select a single value for the LDVs, you could also simulate around it to introduce a little more variability into those values. I have also seen robust standard errors suggested for logistic regression; I am not sure whether you should consider those as well.
 

j58

Active Member
#5
@Mamaho,

If you're still around, you might try the following. Create a variable z that takes the value 0 if x is below the detection limit, and 1 otherwise; and assign a value of 0 (or any numerical value at all, it won't matter) to x if x is below the detection limit. Then run the following logistic regression model:

logodds(y) = b0 + b1*z + b2*z*x .

In this model, exp(b0) is the predicted odds of y when x is below the detection limit; exp(b2) is the estimated odds ratio for a 1-unit increase in x, given that x is above the detection limit; and exp(b0 + b1 + b2*x) is the predicted odds of y at detectable concentration x. Note that z is a necessary term in the model, but its regression coefficient, b1 (or exp(b1)), alone has no meaningful interpretation.
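A minimal sketch of this coding in Python (the variable names and the use of None to mark below-LOD measurements are my own illustration, not from the thread):

```python
LOD = 100.0  # laboratory detection limit from this thread

def code_row(x_raw):
    """Return (z, z*x) for one observation.

    z = 1 if the concentration is detectable, else 0. The product z*x is
    0 for below-LOD rows, so their (unknown) x value never enters the model.
    """
    if x_raw is None or x_raw < LOD:  # below the detection limit
        return 0, 0.0                 # any constant works here; 0 is convenient
    return 1, x_raw

# None marks a below-LOD measurement in this made-up sample
sample = [None, 250.0, 3500.0, None]
coded = [code_row(x) for x in sample]
# coded == [(0, 0.0), (1, 250.0), (1, 3500.0), (0, 0.0)]
```

The two coded columns, z and z*x, then go into the logistic regression alongside the intercept, exactly as in the formula above.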
 
#6
[Quoting hlsmith's post #4.]
Thank you for your reply, and sorry for my very late reply. I got put on another project and just came back from vacation. I will look into your proposal of running a GAM excluding the LDVs.
 
#7
[Quoting j58's post #5.]
Thank you for your reply, and sorry for my very late reply. I got put on another project and just came back from vacation.
I am not completely sure what result I would get from running this regression, or how it would help me identify problematic data.

Kind regards
 
#8
[Quoting post #6.]
The bioassay can't estimate values below 100 (the LDV). We are looking at a total of 463 observations, with 147 below 100 (LDV) and 316 with an exact value. I just tried assigning the 147 measurements a range of values from 0 to 100 and running logistic regressions. It did not change the model results much, and only slightly changed the p-value. We originally chose the value 50 for all measurements below the detection limit, based on what we estimated to be a reasonable average. We might be able to estimate a rather precise average value using other literature. Since the model did not change much over a wide range of substituted values, would it be statistically sound to choose our best estimate of an average value?
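A sensitivity check of this kind can be sketched in Python. Everything below is illustrative: the data are simulated, the candidate fill values are my own choice, and the hand-rolled Newton-Raphson fit stands in for what Stata's logit command (or statsmodels in Python) would do in a real analysis.

```python
import math
import random

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def fit_logistic(x, y, iters=30):
    """Fit logit P(y=1) = b0 + b1*x by Newton-Raphson (single predictor)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = 0.0          # score (gradient of the log-likelihood)
        h00 = h01 = h11 = 0.0  # observed information (symmetric 2x2)
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            g0 += yi - p
            g1 += (yi - p) * xi
            w = p * (1.0 - p)
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det  # Newton step: H^-1 * gradient
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Made-up data in hundreds of units: detected values in [1, 35], LOD = 1
random.seed(0)
x_det = [random.uniform(1.0, 35.0) for _ in range(316)]
y_det = [1 if random.random() < sigmoid(-2.0 + 0.1 * xi) else 0 for xi in x_det]
y_lod = [1 if random.random() < 0.15 else 0 for _ in range(147)]

LOD = 1.0
for fill in (0.0, 0.5, LOD / math.sqrt(2), LOD):  # candidate substitutions
    b0, b1 = fit_logistic([fill] * 147 + x_det, y_lod + y_det)
    print(f"fill={fill:.3f}  slope={b1:.4f}")
```

If the slope barely moves across the fill values, that is the kind of robustness described in the post above; it supports, but does not prove, that a single well-chosen average is harmless here.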

Kind regards and thanks in advance
 

hlsmith

Not a robit
#9
Do you know what the predictors of the bioassay results are? You could always take a step back and try to model those values for a more informed simulation. For example, if you modeled the values you do have (>100), could you figure out whether, say, age, gender, comorbidities, etc. are predictive of the bioassay values? Then use those variables to inform your simulation of the LDV data.

Also, whichever approach you take, you will likely want to simulate the LDV values, even if it is just random draws from a distribution. That way you are not just calling all 147 values fifty. This will require thinking about what they may actually look like and choosing parameters for that distribution. What I mean is, I imagine those values are left-skewed, with most values closer to 100 and fewer trailing off toward 0. Or I could be wrong.

P.S. Can you post a histogram of your data distribution?
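For instance, a crude version of "random draws from a distribution" in Python, assuming the density rises toward the detection limit as guessed above; the triangular shape and the seed are purely illustrative:

```python
import random

random.seed(42)   # reproducibility
LOD = 100.0
N_BELOW = 147     # below-LOD observations in this thread

# Triangular density on [0, LOD] with its mode at the LOD, i.e. most
# mass near 100 and a thin tail toward 0 -- one simple stand-in for
# the left-skewed shape described above.
simulated = [random.triangular(0.0, LOD, LOD) for _ in range(N_BELOW)]

assert all(0.0 <= v <= LOD for v in simulated)
```

These draws would replace the single substituted value (e.g. 50) before refitting the regression, restoring some variability to the below-LOD third of the data.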
 
#10
[Quoting hlsmith's post #9.]
Thank you. I agree that simulating would make sense. There are no known predictors of assay concentration. I have attached my histogram, without a few high values. Due to units, my data show 1/10 of the above numbers, i.e., 100 = 10.
The concentration distribution is known to start at 0, rise to a peak (depending on race) between 10 and 20, and then decrease in a similar way to the initial increase, but with a long tail toward high concentrations. My data are very much in line with the rest of the literature, with the peak around 12. The LDV values are depicted as a single column at the value 5, but ideally they should be distributed as something close to a parabola with its vertex at the data peak.
I am using Stata. I am not sure how yet, but I guess it is possible to simulate the LDV values in the desired pattern toward the detection limit.

Thank you in advance.
 

Attachments

hlsmith

Not a robit
#11
Not sure if this would be of benefit, but I randomly came across this paper:

Accommodating Measurements Below a Limit of Detection. Am J Epidemiol. 2014;179(8):1018-24.
 

hlsmith

Not a robit
#13
I recently came across two approaches:

A simple, commonly used imputation: substitute LOD / sqrt(2) for each below-LOD value,

or

Impute based on a truncated normal distribution within a Bayesian framework.
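Both approaches can be sketched in Python. The mu and sigma below are made-up illustration values, and the rejection sampler is only a stand-in for a proper Bayesian treatment, which would estimate those parameters jointly with the regression:

```python
import math
import random

random.seed(7)
LOD = 100.0  # detection limit from this thread

# (1) Single-value substitution: LOD / sqrt(2)
fill = LOD / math.sqrt(2)  # about 70.71

# (2) Draws from a normal(mu, sigma) truncated to [0, LOD) via simple
#     rejection sampling; mu=60 and sigma=40 are illustrative guesses,
#     not estimates from the thread's data.
def truncated_normal(mu, sigma, lo, hi):
    while True:
        v = random.gauss(mu, sigma)
        if lo <= v < hi:
            return v

draws = [truncated_normal(60.0, 40.0, 0.0, LOD) for _ in range(147)]
```

The first approach gives every below-LOD observation the same value; the second spreads them over the plausible range, which is closer in spirit to the simulation ideas earlier in the thread.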