Probit Regression?

#1
I was working on a logistic regression with a binary DV (dependent variable), but it now turns out that the DV actually consists of probabilities, i.e. values between 0 and 1.

Do I want to switch my focus onto Probit Regression now? I am finding some material but not a whole lot.

A push in the right direction would be greatly appreciated!
 

Dason

Ambassador to the humans
#2
When you say that the response is probabilities do you mean that it's actually a proportion? Or are we talking about a continuous outcome that could take any value between 0 and 1?
 

bukharin

RoboStataRaptor
#4
You can model proportions using a GLM with a logit link. See eg here. I can't think of any reason you couldn't also do it with a probit link (I just tried and it worked), but I generally prefer the logistic model because the coefficients are so much easier to interpret.
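To make the logit-versus-probit comparison concrete, here is a minimal sketch of the two link functions using only the Python standard library. The function names are my own, not from any package mentioned in the thread. It only illustrates the transformations themselves, not a fitted model: the logit is the log-odds, which is why exponentiated logistic coefficients read as odds ratios, while the probit is the standard normal quantile and has no comparably simple reading.

```python
import math
from statistics import NormalDist


def logit(p):
    """Logit link: the log-odds of p, for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inv_logit(eta):
    """Inverse logit: maps a linear predictor back to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def probit(p):
    """Probit link: the standard normal quantile of p, for 0 < p < 1."""
    return NormalDist().inv_cdf(p)


def inv_probit(eta):
    """Inverse probit: the standard normal CDF of the linear predictor."""
    return NormalDist().cdf(eta)


if __name__ == "__main__":
    # Both links map (0, 1) onto the whole real line and are zero at p = 0.5.
    for p in (0.10, 0.25, 0.50, 0.75, 0.90):
        print(f"p={p:.2f}  logit={logit(p):+.3f}  probit={probit(p):+.3f}")
```

Both curves have the same S shape on the probability scale, so the two links usually lead to very similar fitted probabilities; the practical difference is mostly in how the coefficients are interpreted.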
 
#5
My DV (dependent variable) is the probability that a failure was caused by a certain ingredient. The DV is masked just because of the way this place takes in data, so I have to use a probability instead of a pass/fail scenario. The probabilities do not sum to 1. I do not have the data yet, but it looks like logistic might still be the way to go?
 

noetsi

Fortran must die
#7
If your DV is expressed as a probability (from 0 to 1, but with essentially infinitely many values in between, so it's not a binary variable), why can't you simply use OLS (ordinary least squares), which is much easier to interpret and run diagnostics on?
 
#8
Logit is about 0/1 values. But it can also be about a proportion, for example 15 successes out of 20 trials (that is, 15/20). That, however, is basically a sum of 0/1 variables.
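The "sum of 0/1 variables" point can be checked numerically. The sketch below is my own illustration in plain Python, with made-up function names. It compares the binomial log-likelihood for 15 successes in 20 trials with the Bernoulli log-likelihood of the 20 individual 0/1 outcomes. The two differ only by a constant that does not depend on p, so they lead to the same estimate.

```python
import math


def bernoulli_loglik(outcomes, p):
    """Log-likelihood of individual 0/1 outcomes with success probability p."""
    return sum(math.log(p) if y == 1 else math.log(1.0 - p) for y in outcomes)


def binomial_loglik(successes, trials, p):
    """Log-likelihood of a success count out of a known number of trials."""
    log_coef = math.log(math.comb(trials, successes))
    return (log_coef
            + successes * math.log(p)
            + (trials - successes) * math.log(1.0 - p))


if __name__ == "__main__":
    outcomes = [1] * 15 + [0] * 5  # 15 successes out of 20 trials
    for p in (0.50, 0.75, 0.90):
        diff = binomial_loglik(15, 20, p) - bernoulli_loglik(outcomes, p)
        # The difference is log(C(20, 15)) for every value of p.
        print(f"p={p:.2f}  difference={diff:.6f}")
```

Note that this only works when the number of trials behind each proportion is known; a bare number between 0 and 1 does not carry that information.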

If it is values that can take any value between 0 and 1, then a beta-distribution might be useful.

I must admit I don’t understand this so I might have misunderstood this post:
The DV is masked just b/c
It brings to mind “The Ecologists” posting instruction:
"So don't use instant-messaging [SMS] shortcuts. Spelling "you" as "u" makes you look like an semi-literate dud who just saved two entire keystrokes."
I don’t want to go that far, but if you want to be understood and answered, don’t use abbreviations.
 

Dason

Ambassador to the humans
#10
should be a semi-literate not an. Pot calling kettle black. Hopefully my posts are more understandable now:)
Greta was just quoting this thread on posting guidelines (sure, there are mistakes in it, but the point still stands). She hates abbreviations though. English isn't everybody's first language here (Greta included), so she was just trying to make you more conscious of that.
 
#12
The only reason it kind of bothered me is that as soon as somebody sidetracks a thread it becomes defunct, and then I get no more useful posts. But as Dason points out, I will make sure to make everything crystal clear for those who might not understand English so well. I do appreciate everyone's help though. I am looking to use the beta distribution with the adjustment for the 0s and 1s, so I can keep my range as (0,1) instead of [0,1].
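For anyone following along, here is a minimal sketch of that adjustment in plain Python, using the formula x' = (x(N-1)+s)/N quoted elsewhere in this thread with s = 0.5. The function name and the example values are hypothetical; N is taken to be the sample size.

```python
def squeeze_to_open_interval(x, n, s=0.5):
    """Map x in [0, 1] into the open interval (0, 1).

    Implements x' = (x * (n - 1) + s) / n, where n is the sample size
    and s is a constant chosen in (0, 1).
    """
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    if n < 2:
        raise ValueError("n must be at least 2")
    if not 0.0 < s < 1.0:
        raise ValueError("s must lie strictly between 0 and 1")
    return (x * (n - 1) + s) / n


if __name__ == "__main__":
    values = [0.0, 0.2, 0.5, 1.0]  # made-up data for illustration only
    n = len(values)
    for x in values:
        print(f"{x:.3f} -> {squeeze_to_open_interval(x, n):.3f}")
```

The size of the shift depends on both N and s, so the adjusted values (and anything fitted to them) change a little if a different s is chosen.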
 

trinker

ggplot2orBust
#13
Well as long as the conversation is side tracked :) I'll help you with the latex tags you used above. You used latex rather than tex or MATH. I prefer math as tex would not have displayed your info correctly.

So...

[noparse]\(x' = \frac{x(N-1)+s}{N}\)[/noparse]

Gives you this...

\(x' = \frac{x(N-1)+s}{N}\)

Sorry to sidetrack further, but this may be helpful to you in posting here; I know it was for me.

OK let's get this thread back to its original intent everyone :D
 
#14
As Dason pointed out, I was literally quoting “The Ecologists” forum guidelines (not “posting instructions”, sorry for that). I am not a native English speaker and I guess that The Ecologist is not either. (And yes, I also noticed that “an”, but I didn’t want to change the quotation.) And besides, I said that I did not want to go that far. I interpret it as: don’t shorten words when it is not necessary.

But that is the formulation given in “How to post”. It has been there for years. Anybody who has read the guidelines and is good at English could have suggested a correction.

@Smoothjohn, I did not say “semi-literate”. It was the forum guidelines.


She hates abbreviations though.
No, I don’t hate abbreviations. I just feel sorry for those who post and are not understood. This is an international site. What is obvious for some in one country might not be understandable in other countries, for example for our friends in India and Nigeria. What is obvious for the psychometrician might be unknown for chemometricians.


We can talk about glm, hglm, gee, gllamm, gam, gmm, gamlss and glmm, and the pros and cons of each of them. I am sure that Dason understands this and could give a lecture about each of them, but (and this is my point) would the hundreds of readers here understand all of that? And it only takes one abbreviation to lose the reader.

@Autobot. Now you can ask yourself which you prefer: to be given a suggestion in broken English [you seem to have accepted the idea of the beta distribution] and be told that I had not understood your abbreviation, or to be left without a suggestion?

This is a suggestion for the improvement of this community, for Autobot and other writers: If you want to be understood and answered, don’t use abbreviations!
 

bukharin

RoboStataRaptor
#15
s=arbitrary number in (0,1), I am choosing 0.5
This is why I don't like this approach - you need to choose an arbitrary number to get your model to run. If you change the arbitrary number then you get slightly different results. Using a generalised linear model (GLM - see Greta I'm listening!) you do not need to do this kind of fudge.

Logit is about 0/1 values.
Only by convention. It's just another data transformation. It happens to be very useful for modelling binary proportions, which is why it's used for that; but as far as I know there is absolutely no statistical or mathematical reason why it shouldn't be used for arbitrary proportions between 0 and 1.

Of course you can always try a couple of different models and then choose the one that provides the best fit for your data.
 

Dason

Ambassador to the humans
#16
Only by convention. It's just another data transformation. It happens to be very useful for modelling binary proportions, which is why it's used for that; but as far as I know there is absolutely no statistical or mathematical reason why it shouldn't be used for arbitrary proportions between 0 and 1.
Sure, the logistic transformation is just a transformation. But logistic regression does need binomial data, because that is the distributional assumption being made. We could fit a logistic curve using nonlinear regression though.
 

bukharin

RoboStataRaptor
#17
Sure, but as Greta said, a proportion is just a combination of binary values. I forgot to mention in my initial post to use a robust variance estimator for the GLM model. I would be (very) interested to see a real-life example where using a GLM with a logit link, or using nonlinear regression to fit a logistic curve, gave meaningfully different results.
 

Dason

Ambassador to the humans
#18
Sure, but as Greta said, a proportion is just a combination of binary values.
Yes a proportion is - but we would need to know the number of successes and number of trials. A "probability" could just be an estimate - I have no idea how the OP is getting these "probabilities" so we can't assume that it's actually a proportion.
 

bukharin

RoboStataRaptor
#19
That's true - in fact in general I think the OP (sorry Greta "original poster" ;)) needs to give more info, since it's unclear to me whether any of the approaches we've discussed would be appropriate.

I am signing off but thanks for the interesting discussion.
 
#20
After talking with some other people, I realized I had misinterpreted what the data was saying. I will now be doing a logistic regression of pass/fail (where any number of complaints is a pass/fail) and a separate Poisson regression, since there are days with multiple complaints. If I remember correctly, there is practically no overdispersion, so I am not planning to use a quasi-Poisson generalized linear model. Should I still use quasi? Anyway, thanks for all the help, and I will keep my posts foreign-friendly :D
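On the overdispersion question, one rough check is the Pearson dispersion statistic, which should be close to 1 if the Poisson assumption holds. The sketch below is an assumption-laden illustration in plain Python: the function name and the daily counts are made up, and it covers only an intercept-only model, where every fitted mean equals the sample mean. With predictors, the fitted means from the regression and the residual degrees of freedom (n minus the number of parameters) would be used instead.

```python
def pearson_dispersion(counts):
    """Pearson dispersion for an intercept-only Poisson model.

    Computes sum((y - mean)^2 / mean) / (n - 1). Values near 1 are
    consistent with Poisson variation; values well above 1 suggest
    overdispersion.
    """
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(counts) / n
    if mean <= 0:
        raise ValueError("mean count must be positive")
    pearson_chi2 = sum((y - mean) ** 2 / mean for y in counts)
    return pearson_chi2 / (n - 1)


if __name__ == "__main__":
    # Hypothetical complaints per day, for illustration only.
    daily_complaints = [0, 1, 0, 2, 1, 0, 3, 1, 0, 1, 2, 0]
    print(f"dispersion = {pearson_dispersion(daily_complaints):.3f}")
```

A quasi-Poisson fit gives the same coefficient estimates as a plain Poisson fit and scales the standard errors by the square root of the estimated dispersion, so when the dispersion is close to 1 the two approaches report nearly the same thing.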