# Using logistic regression and z scores to predict the impact of the difference in two variables on a third

#### wildy_stats

##### New Member
Wondering if my logic and use of z-scores and logistic regression makes sense.

I am doing research with outcome data from a psychological treatment program that works with adolescents. A similar survey assessing mental health functioning is given to the adolescents in the program and to their parents at two time points: admission and discharge.

We're interested in what factors may lead to an adolescent deteriorating during their stay at the program (adolescent discharge score - adolescent admission score must meet certain criteria). We set up the dichotomous "deterioration" variable.

One hypothesis we wish to test is whether the difference between parent and adolescent scores at intake has a significant impact on whether or not the adolescent deteriorated. Even though the surveys administered to the parents and adolescents are worded mostly the same, they have different "cutoff scores" (which indicate whether or not a score reflects an individual in the clinical range). Because of this, I'm assuming I need to compare the two scores based on their respective z-scores.

After calculating the z-scores for both of these variables (adolescent_z and parent_z), I want to set up an "intake_difference" variable:

intake_difference = parent_z - adolescent_z

I'm thinking I want to use logistic regression to see which variables predict the deterioration variable. To do this, I plan to bin the continuous intake_difference variable into a discrete variable based on its standard deviation (-2, -1, 0, 1, 2), then enter that new variable into the logistic regression.

Does this approach make sense? Is there a better, cleaner way of doing this? Is binning the data based on standard deviation good practice? Thank you so much for your help!

#### hlsmith

##### Less is more. Stay pure. Stay poor.
You always lose information when you discretize or dichotomize. What makes the thresholds and cutoffs definitive, and why do you have to use them?
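
To make the information-loss point concrete, here is a minimal sketch on simulated data (not the program's data). It assumes, purely for illustration, that the log-odds of deterioration rise linearly with the continuous difference; the names `intake_diff`, `intake_diff_bin`, and the coefficients used to simulate are all made up for this example.

```r
# Toy simulation only: values and effect sizes are invented.
set.seed(1)
n <- 500
intake_diff <- rnorm(n, mean = 0, sd = 1.5)

# Assumption: log-odds of deterioration are linear in the continuous difference
p_deteriorate <- plogis(-1 + 0.6 * intake_diff)
deterioration <- rbinom(n, size = 1, prob = p_deteriorate)

# Binned version: cut at +/-1 and +/-2 standard deviations, coded -2..2
s <- sd(intake_diff)
intake_diff_bin <- findInterval(intake_diff, c(-2 * s, -s, s, 2 * s)) - 2
table(intake_diff_bin)

# Same model, once with the continuous predictor and once with the binned one
fit_cont <- glm(deterioration ~ intake_diff, family = binomial)
fit_bin  <- glm(deterioration ~ intake_diff_bin, family = binomial)

# Both models have two parameters, so a lower AIC means a better fit
AIC(fit_cont, fit_bin)
```

In a setup like this the binned predictor would typically fit worse, because every value inside a bin is treated as identical. How much is lost in real data depends on the true shape of the relationship.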

Please better describe the outcome variable and perhaps present toy data so we can follow along. Also, how are you going to get z-scores out of your instrument data? What serves as the mean, and how good a proxy do you think that is?
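
To show why the choice of mean matters, here is a small sketch contrasting two ways of standardizing the same scores: against the sample itself, and against an external reference. The reference values `norm_mean` and `norm_sd` are hypothetical placeholders, not published norms for any real instrument, and the scores are simulated.

```r
# Toy data only; not real survey scores.
set.seed(2)
adoles_intake <- rnorm(100, mean = 71, sd = 33)

# Option A: standardize against the sample's own mean and sd
z_sample <- (adoles_intake - mean(adoles_intake)) / sd(adoles_intake)

# Option B: standardize against an external reference population
norm_mean <- 50  # hypothetical reference mean
norm_sd   <- 30  # hypothetical reference sd
z_norm <- (adoles_intake - norm_mean) / norm_sd

# Sample-based z-scores have mean 0 and sd 1 by construction
round(c(mean = mean(z_sample), sd = sd(z_sample)), 2)

# Reference-based z-scores keep the sample's offset from the reference
round(c(mean = mean(z_norm), sd = sd(z_norm)), 2)
```

Both versions are linear rescalings of the same scores, so they rank individuals identically. What changes is what "0" means, and therefore what a parent-minus-adolescent difference of z-scores means.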

#### wildy_stats

##### New Member
Thanks for your response! The cutoff and the label of "deterioration" are both defined by the makers of the survey. I would be open to a process that still uses the original data to determine whether the degree of difference between the two intake variables can predict an individual's outcome score. I was just thinking that dichotomizing the outcome and using logistic regression would be easiest, but I'm clearly new to all this and am open to suggestions.

Here is a sample of what I'm trying to do written in R:

Code:
library(tidyverse)

# Setup toy data----

set.seed(10)
adoles_intake <- rnorm(100, mean = 71, sd = 33)
set.seed(20)
adoles_discha <- rnorm(100, mean = 48, sd = 32)
set.seed(30)
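parent_intake <- rnorm(100, mean = 99, sd = 28)

# Combine the toy vectors into the data frame used below
df1 <- tibble(adoles_intake, adoles_discha, parent_intake)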

# Determine which clients experienced deterioration during the program
# Deterioration = yes if:
# delta (adoles_discha - adoles_intake) > 1
# # (Meaning the score changed in a positive direction, indicating a worse (more acute) level of functioning)
# AND adoles_discha >= 47
# # (Meaning client left the program with a score indicating a need for continued treatment)

# Set up change variable
df1$delta <- df1$adoles_discha - df1$adoles_intake

# Determine whether or not adolescent deteriorated
df1$deterioration <- ifelse(df1$delta > 1 & df1$adoles_discha >= 47, 1, 0)

## Determine if the difference between adolescent and parent scores is a useful predictor of deterioration in program----

# This is where I may start getting into trouble (if I'm not already...)

# Set up z scores for adoles_intake and parent_intake in order to compare
df1$adoles_intake_z <- (df1$adoles_intake - mean(df1$adoles_intake)) / sd(df1$adoles_intake)
df1$parent_intake_z <- (df1$parent_intake - mean(df1$parent_intake)) / sd(df1$parent_intake)

# Calculate difference in intake z-scores
# Subtracting adolescent scores from parent scores because parents tend to score higher
df1$intake_diff <- df1$parent_intake_z - df1$adoles_intake_z

## Set up intake difference category ----

# Not sure if I'm allowed to do all of the above
# But if I am, and I then categorize the intake_diff score based on where values fall in standard deviation?

mean(df1$intake_diff) # Basically 0, because going off of z score data
sd(df1$intake_diff)  # ~ 1.50

# Categories desired: (-2, -1, 0, 1, 2)
# So category will be: "-2" if intake_diff is less than -3, "-1" if between -3 and -1.50, etc.

df1 <- df1 %>%
  mutate(intake_diff_cat =
           case_when(intake_diff < -3 ~ -2, # -2 sd from mean (0)
                     intake_diff >= -3 & intake_diff <= -1.50 ~ -1, # -1 sd away
                     intake_diff > -1.50 & intake_diff < 1.50 ~ 0, # 0 sd away
                     intake_diff >= 1.50 & intake_diff < 3 ~ 1, # 1 sd away
                     intake_diff >= 3 ~ 2)) # 2 sd away

# Then run logistic regression to determine if this variable is a useful predictor of deterioration

log_fit <- glm(deterioration ~ intake_diff_cat,
               data = df1, family = "binomial")
summary(log_fit)

Not sure if all/any of this makes sense. Open to all suggestions and grateful for feedback!

Thanks!!
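
For the alternative mentioned at the top of this post (keeping the original scores rather than dichotomizing), one possible version is sketched below. This is only an assumption about what such a model could look like, not a recommendation: it regresses the continuous discharge score on the adolescent intake score and the continuous z-score difference. The name `fit_lm` is made up for the illustration, and the toy data are generated as in the code above, so the variables are independent and no real effect is expected.

```r
# Toy data, generated the same way as above (independent random draws)
set.seed(10)
adoles_intake <- rnorm(100, mean = 71, sd = 33)
set.seed(20)
adoles_discha <- rnorm(100, mean = 48, sd = 32)
set.seed(30)
parent_intake <- rnorm(100, mean = 99, sd = 28)

df1 <- data.frame(adoles_intake, adoles_discha, parent_intake)

# z-scores based on the sample mean and sd; scale() returns a matrix
df1$adoles_intake_z <- as.numeric(scale(df1$adoles_intake))
df1$parent_intake_z <- as.numeric(scale(df1$parent_intake))

# Continuous difference, no binning
df1$intake_diff <- df1$parent_intake_z - df1$adoles_intake_z

# Continuous outcome: discharge score, adjusting for the adolescent's intake score
fit_lm <- lm(adoles_discha ~ adoles_intake + intake_diff, data = df1)
summary(fit_lm)
```

The same continuous `intake_diff` could also go straight into the `glm()` call above in place of `intake_diff_cat` if the dichotomous deterioration outcome has to stay.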