# Link function for proportional outcome

#### trinker

##### ggplot2orBust
I have a data set where the outcome variable is percent passing (ELA and Math tests) for school districts. I will use a two-level multilevel model with various predictors/covariates at levels one and two.

The outcome variable is percent passing. Obviously the outcome is bounded between 0 and 1, so it is not sensible to assume a normal distribution for it (the underlying scores are likely normally distributed); using a Gaussian link could result in predictions > 1 and < 0. A logit might make sense (binomial family), as this is used in logistic regression (0/1), but it seems wrong because the outcome can take any value between 0 and 1.

Poisson deals with count data. I don't have counts.

So what link function is appropriate here and why?

If more details are needed I can furnish them.

#### Dason

Logistic regression works for general binomial data (n > 1); you don't need to have just 0/1. Do you have the value of n for each observation?

#### trinker

##### ggplot2orBust
Dason, I didn't quite follow you. I assume you're saying that I can treat it the same as though the outcome were 1/0 and use the link function as binomial. This actually seems(ed) sensible, but I had seen that using the binomial link was for 1/0 outcomes only. But maybe that was my misinterpretation.

I have observational data. The lowest level I have is district-level information on the percent of students who passed. I also have aggregated demographic characteristics for each district. I have percent passed, but I also have the n for the school districts, so pulling the actual number passed out is doable:

Code:
n_passed = round(percent_passed * n)
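
A minimal sketch of that conversion in Python, assuming made-up district values. The names `percent_passed`, `n`, and `n_passed` follow the post; the `districts` list, the `n_failed` field, and all numbers are hypothetical.

```python
# Hypothetical district-level data: proportion passing and number tested.
districts = [
    {"name": "A", "percent_passed": 0.45, "n": 2000},
    {"name": "B", "percent_passed": 0.82, "n": 1500},
]

for d in districts:
    # Recover the integer count of passing students from the proportion.
    d["n_passed"] = round(d["percent_passed"] * d["n"])
    # The failures are whatever is left of the district total.
    d["n_failed"] = d["n"] - d["n_passed"]

print([(d["name"], d["n_passed"], d["n_failed"]) for d in districts])
```

Rounding is only needed because the reported percentages may themselves have been rounded; if the raw counts are available, they should be preferred.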

#### trinker

##### ggplot2orBust
@vict I'd be inclined to agree except that assumption will give predicted values > 1 and < 0. This is not possible.

#### Dason

trinker said:
Dason, I didn't quite follow you. I assume you're saying that I can treat it the same as though the outcome were 1/0 and use the link function as binomial. This actually seems(ed) sensible, but I had seen that using the binomial link was for 1/0 outcomes only. But maybe that was my misinterpretation.
Yeah, that's your misinterpretation. This is fine for logistic regression. By the way, it's a logit link (not a binomial link) with a binomial family. Basically you're saying that, conditioned on your covariates, the response follows a binomial distribution. The logit link function is how you 'link' the covariates to the success probability: it's what models the form of the relationship between x and p.

trinker said:
I have observational data. The lowest level I have is district-level information on the percent of students who passed. I also have aggregated demographic characteristics for each district. I have percent passed, but I also have the n for the school districts, so pulling the actual number passed out is doable:

Code:
n_passed = round(percent_passed * n)
Yeah, you can do logistic regression with that data.

#### trinker

##### ggplot2orBust
@spunky I'll let the discussion go a bit before I decide, but this seems to be exactly what I'm after. I also have to do this in the HLM program, as the requirement of my multilevel course is that I use this program. Do you know if this is available in HLM? I have never heard of it (which means next to nothing), so maybe it's not a commonly used link function yet?

#### Dason

spunky said:
how come there is no love here for beta regression??
It is more difficult and in the case where you actually have the counts it makes more sense to do something like logistic regression. There isn't really much motivation behind using beta regression in this type of case in my opinion. Plus logistic regression is hard enough for non-math people to interpret and understand but it's a lot easier to understand than beta regression (binomial distribution is pretty simple compared to the beta distribution...)

#### Dason

trinker said:
I have never heard of it (which means next to nothing) so maybe it's not a commonly used link function yet?
I think you have a misunderstanding when it comes to the link function. Beta regression is using the beta distribution as the response distribution (what we call the 'family' in glm) - this doesn't directly specify the link function. The link function is how you "link" the covariates to the mean of the response at those values of the covariates.

#### trinker

##### ggplot2orBust
Dason said:
Yeah that's your misinterpretation - this is fine for logistic regression. By the way it's a logit link (not a binomial link) with a binomial family. Basically you're saying conditioned on your covariates the response follows a binomial distribution. The logit link function is how you 'link' the covariates to the success probability - it's what models the form the of the relationship between x and p.
Thanks for the help on using the correct language. Great explanation.

Can I use the percent passed directly with a logit link and the binomial family, or are you saying to use n_passed (n_passed = round(percent_passed * n))? The n_passed makes less sense to me because I don't have actual data on individual students. I could make up IDs for them arbitrarily and then assign pass/fail based on n_passed, but I don't see what that buys me.

#### spunky

##### Can't make spagetti
Dason said:
Plus logistic regression is hard enough for non-math people to interpret and understand but it's a lot easier to understand than beta regression (binomial distribution is pretty simple compared to the beta distribution...)
this is *exactly* why beta regression needs to be used MORE often. it helps you leave people puzzled and unable to criticize your work. when faced with their own ignorance, they have little option but to think along the lines of "well, this seems complicated enough so it must be right".

but you do have a point though. i assumed the emphasis was on the percentages and not on the counts themselves but if you have the counts then go for logistic regression.

#### Dason

You don't need data for individual students. Did I say something that implied that you did? You need the total count and the total number of passed (the outcome from the 'binomial' experiment) but you don't need the outcomes for each student individually.

#### trinker

##### ggplot2orBust
Dason said:
I think you have a misunderstanding when it comes to the link function.
Yes, this is true. I think it's clearer now. I was thinking the link actually transforms the 0/1 outcomes, but it doesn't; it works on the aggregated outcomes (the percent passed/failed). Is this correct?

#### trinker

##### ggplot2orBust
Dason said:
You don't need data for individual students. Did I say something that implied that you did? You need the total count and the total number of passed (the outcome from the 'binomial' experiment) but you don't need the outcomes for each student individually.
No, but my thinking is: if I supply counts, how will it know what the counts mean? Say I give it 900 students passed in District A and 1230 passed in District B. How will it (the HLM program) know what those numbers mean without either individual data (passed or not passed) or a way to say 900 out of 2000 students?

I mean it's sensible you can do this with equations and figure it out that way but I have to give it a data file.

#### Dason

trinker said:
Yes, this is true. I think it's clearer now. I was thinking the link actually transforms the 0/1 outcomes, but it doesn't; it works on the aggregated outcomes (the percent passed/failed). Is this correct?
No - it doesn't do anything to the data itself. It models the relationship between the covariates and the mean of the response. You don't transform the predictors.

For logistic regression you're assuming that
$$Y_i \sim \mathrm{Bin}(n_i, p_i)$$

which says that the response has a binomial distribution with parameters $n_i$ (the number of observations/students observed for this response) and $p_i$ (the success probability for each observation/student).

That seems simple enough, but the logistic regression part adds the assumption that we can additionally model the $p_i$ as a function of the covariates. This is what allows us to think things like "the success probability increases as the covariates increase". How we actually 'link' the $p_i$ with the covariates depends on ... you guessed it - the link function. For logistic regression we assume

$$\log\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_i$$

So we are saying that if we apply the link function to $p_i$ we get a linear function with respect to the covariates. Notice we don't apply the link function to the covariates - we apply it to $p_i$.
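
As a rough illustration of the point above, the following Python sketch applies the logit link and its inverse. The coefficient values `beta0` and `beta1` and the covariate values are hypothetical, chosen only to show that any linear predictor maps back to a probability strictly between 0 and 1.

```python
import math

def logit(p):
    """Link function: maps a probability in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Inverse link: maps any real linear predictor back into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients, chosen only for illustration.
beta0, beta1 = -1.0, 0.8

for x in [-5.0, 0.0, 2.0, 10.0]:
    eta = beta0 + beta1 * x   # linear predictor on the logit scale
    p = inv_logit(eta)        # implied success probability
    print(f"x={x:5.1f}  eta={eta:6.2f}  p={p:.4f}")
```

The link is applied to the success probability, never to the observed data or to the covariates, which is why predictions cannot leave the unit interval.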

#### Dason

trinker said:
No, but my thinking is: if I supply counts, how will it know what the counts mean? Say I give it 900 students passed in District A and 1230 passed in District B. How will it (the HLM program) know what those numbers mean without either individual data (passed or not passed) or a way to say 900 out of 2000 students?

I mean it's sensible you can do this with equations and figure it out that way but I have to give it a data file.
You would need to tell it what the total count is for each school. Your response for each observation is essentially a vector of length 2 which specifies the number of students that passed and the total number of students that took the test.

I don't know how you do this in HLM though - I've never used that program.
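
Outside HLM, the "passed plus total" response can be sketched in plain Python. This is only an illustration under stated assumptions: the data are made up, there is a single hypothetical covariate `x`, the model is single-level (no random effects), and the fit uses hand-written Newton-Raphson rather than any particular package's routine.

```python
import math

# Hypothetical data: one row per district.
# x = a district-level covariate, passed = students passing, total = students tested.
x      = [0.1, 0.3, 0.5, 0.7, 0.9]
passed = [1700, 1100, 700, 800, 300]
total  = [2000, 1500, 1200, 1800, 900]

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def score_and_information(b0, b1):
    """Gradient of the binomial log-likelihood and its negative Hessian."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for xi, yi, ni in zip(x, passed, total):
        p = inv_logit(b0 + b1 * xi)
        r = yi - ni * p            # observed minus expected passes
        w = ni * p * (1.0 - p)     # binomial variance of the count
        g0 += r
        g1 += r * xi
        h00 += w
        h01 += w * xi
        h11 += w * xi * xi
    return (g0, g1), (h00, h01, h11)

b0, b1 = 0.0, 0.0
for _ in range(25):                # Newton-Raphson iterations
    (g0, g1), (h00, h01, h11) = score_and_information(b0, b1)
    det = h00 * h11 - h01 * h01    # determinant of the 2x2 information matrix
    b0 += (h11 * g0 - h01 * g1) / det
    b1 += (h00 * g1 - h01 * g0) / det

fitted = [inv_logit(b0 + b1 * xi) for xi in x]
print(b0, b1, fitted)
```

No student-level rows appear anywhere: each district contributes only its pair of numbers, and the totals enter the likelihood through `ni`.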

#### trinker

##### ggplot2orBust
Gotcha. It all becomes appallingly clear.

#### GretaGarbo

##### Human
spunky said:
how come there is no love here for beta regression??
There is a lot of "love" for beta regression. Maartenbuis gave (thanks again for that!) a great link to beta regression and also suggested the possibility of fractional logit in this thread.

But I guess that beta regression is mainly for variables that are not built from 0/1 variables, but instead from variables like: fraction of income spent on food, fraction of time spent on talkstats etc.

#### trinker

##### ggplot2orBust
@Dason there was one response there that was interesting and I think would work with HLM:

Peter Ehlers said:
Yes, and you can also use the proportions directly; just specify
the corresponding vector of number of trials as the 'weights'
Thoughts?
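
One way to see why the weights suggestion lines up with supplying counts: for a given success probability, the binomial log-likelihood written from counts equals the proportion-based version multiplied by the number of trials. A small Python check, assuming a hypothetical district with 900 passes out of 2000 (the function names are made up for this sketch, and whether HLM accepts such weights is not something this shows):

```python
import math

def loglik_counts(passed, total, p):
    """Binomial log-likelihood kernel from counts (binomial coefficient dropped)."""
    return passed * math.log(p) + (total - passed) * math.log(1.0 - p)

def loglik_weighted_proportion(prop, weight, p):
    """Same kernel written as a weighted contribution of a proportion."""
    return weight * (prop * math.log(p) + (1.0 - prop) * math.log(1.0 - p))

passed, total = 900, 2000   # hypothetical district
prop = passed / total

results = []
for p in [0.2, 0.45, 0.7]:
    a = loglik_counts(passed, total, p)
    b = loglik_weighted_proportion(prop, total, p)
    results.append((p, a, b))
    print(p, a, b)
```

Because the two expressions agree at every value of the success probability, maximizing either one gives the same coefficient estimates.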