Multiple regression with non-normal distribution?

#1
I have some data that I would like to analyse by using multiple regression, as I am interested in the predictive ability of a model (variables influencing the likelihood of being bullied). I have just completed an initial dataset and have been looking at some of the possible variables.

Unfortunately, three of my main variables are far from normally distributed. These are questionnaire totals (e.g. bullying [DV], where most children are not bullied, so there is a very strong positive skew). Other data, such as academic data, are likely to be closer to normally distributed.

My question is whether a predictive model can accommodate several seriously skewed variables alongside others that are not skewed. I understand that it is possible to transform data, but wonder if I then need to transform all variables (even those with a normal distribution) in order to keep consistency across variables. Or am I going to be limited to non-parametric correlations?

All advice gratefully received - I am pretty new to this so the learning curve is steep.
Many thanks in advance :)
 
#3
The distribution is non-normal, i.e. there is no "bell-curve" - the scores are piled up at the low end with a long tail to the right (positive skew) for the three variables related to the questionnaire. One (bullying) is the DV and the two others are IVs (I'm looking at totals for the 3 questionnaire sections, so it is the distribution of the response totals that is non-normal). Hope I've not totally misunderstood something here!
 

Dason

Ambassador to the humans
#4
Well typically we don't actually care whether the DV or the IVs are normally distributed. What we care about is whether the residuals are normally distributed (technically it's the 'error' term that we want to be normally distributed, but we can never truly observe that, so we use the residuals as an adequate substitute for assessing whether or not that assumption is met).

So it's hard to say before you actually do the regression if you'll have a problem with something being 'non-normal'.
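To make that concrete, here is a minimal sketch in Python with NumPy on simulated data (not the bullying data). The variable names, the exponential predictor and the coefficients are all assumptions chosen purely for illustration. The predictor and the response are both strongly skewed, yet the residuals come out roughly symmetric because the error term was generated as normal.

```python
import numpy as np

def skewness(v):
    """Sample skewness: the third standardized moment."""
    v = np.asarray(v, dtype=float)
    d = v - v.mean()
    return float((d**3).mean() / (d**2).mean() ** 1.5)

def ols_fit(X, y):
    """Least-squares fit with an intercept; returns (coefficients, fitted, residuals)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    fitted = A @ beta
    return beta, fitted, y - fitted

# Simulated data (illustrative assumption, not real questionnaire scores)
rng = np.random.default_rng(0)
n = 2000
x = rng.exponential(scale=2.0, size=n)   # strongly right-skewed predictor
errors = rng.normal(0.0, 1.0, size=n)    # normally distributed error term
y = 1.0 + 3.0 * x + errors               # response inherits the skew of x

beta, fitted, resid = ols_fit(x, y)

skew_x, skew_y, skew_resid = skewness(x), skewness(y), skewness(resid)
print("skew of predictor:", round(skew_x, 2))
print("skew of response: ", round(skew_y, 2))
print("skew of residuals:", round(skew_resid, 2))
```

The point of the sketch is only that skew in the raw variables does not by itself tell you the residuals will be skewed. With real data you still have to fit the model and look.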
 
#5
I've had a go at a regression ... what I was worried about was that any results would not be valid if there was a non-normal distribution at the outset - so thanks for that! Is there a paper that could be referenced for that?

That then brings me on to stage 2 - in terms of the regression, what should I be looking out for? I pulled it up again (I'm using SPSS). I've worked an example using the Pallant text. So far I can see:
Multicollinearity = OK. Tolerance and VIF are fine; the DV correlates with both IVs, and the IVs are not too highly correlated with each other
P-P plot = close to the line, but not exactly on it
Scatterplot = mostly clustering around the middle, but with one "strange" looking trend
Everything else looking OK ... both IVs making a significant contribution, one much more than the other as expected. I've attached it, just in case anyone has time to look at the "strange" plot! View attachment 1399
Other IVs that I will be using at a later stage will be less problematic (I hope). I note in the Pallant SPSS Survival Manual that this non-normal distribution is common in the social sciences ... but what to do about it seems less straightforward. I have a good sample size and an interesting area (education and special educational needs).
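For anyone wanting to see what the tolerance and VIF figures mentioned above actually measure, here is a rough sketch in Python with NumPy (not SPSS output). The function name `vif_and_tolerance` and the simulated predictors are hypothetical. Each predictor is regressed on the other predictors; tolerance is 1 minus that R-squared, and VIF is 1 divided by tolerance. Commonly quoted rules of thumb flag tolerance below about 0.1 or VIF above about 10.

```python
import numpy as np

def vif_and_tolerance(X):
    """Return a list of (tolerance, VIF) pairs, one per column of X.

    Each column is regressed on the remaining columns (with an intercept);
    tolerance = 1 - R^2 of that regression and VIF = 1 / tolerance.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    results = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ beta
        r_squared = 1.0 - resid.var() / target.var()
        tolerance = 1.0 - r_squared
        results.append((tolerance, 1.0 / tolerance))
    return results

# Two moderately correlated simulated predictors (illustrative only)
rng = np.random.default_rng(42)
iv1 = rng.normal(size=300)
iv2 = 0.5 * iv1 + rng.normal(size=300)
X = np.column_stack([iv1, iv2])

stats = vif_and_tolerance(X)
for name, (tol, vif) in zip(["iv1", "iv2"], stats):
    print(f"{name}: tolerance={tol:.3f}, VIF={vif:.3f}")
```

With only two predictors, both get the same tolerance, equal to 1 minus their squared correlation.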

Many thanks for help and advice - much appreciated :)
 

Dason

Ambassador to the humans
#6
Yeah your residual by predicted plot doesn't look too good. It looks like the response is bounded below. Can you describe the response a little bit more?
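Here is a small sketch of why a response bounded below shows up in that plot, using Python with NumPy and simulated scores (the data-generating recipe is an assumption for illustration, not a model of the questionnaire). Since residual = observed - predicted, a response that cannot go below 0 forces every residual to be at least -predicted, and every observation with a score of exactly 0 sits on the straight line residual = -predicted. That produces a sharp diagonal edge in the residual by predicted plot.

```python
import numpy as np

# Simulated non-negative scores with many zeros (illustrative assumption)
rng = np.random.default_rng(1)
n = 500
x = rng.normal(0.0, 1.0, size=n)
latent = -1.0 + 1.0 * x + rng.normal(0.0, 2.0, size=n)
y = np.clip(np.round(latent), 0, None)   # totals cannot go below 0

# Ordinary least squares with an intercept
A = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ beta
resid = y - fitted

# Every residual is bounded below by -fitted because y >= 0
floor_ok = bool(np.all(resid >= -fitted - 1e-9))

# Observations with y == 0 lie exactly on the line resid = -fitted
at_floor = y == 0
on_line = bool(np.allclose(resid[at_floor], -fitted[at_floor]))

print("share of zero scores:", round(float(at_floor.mean()), 2))
print("all residuals >= -fitted:", floor_ok)
print("zero scores on the diagonal edge:", on_line)
```

Plotting `resid` against `fitted` for data like this shows a downward-sloping band of points along the lower edge, which is the kind of pattern that suggests a floor in the response.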
 
#7
The DV and 2 IVs are total scores on a questionnaire. Each response was scored 0, 1, 2 or 3 (Likert scale) and a total score generated, but due to the nature of the topic, there is a real skew towards people answering positively (i.e. for bullying, most are not bullied, so there are lots of totals of 0; similar for behaviour; the reverse is true for positive relationships).

This could be of interest: http://gradworks.umi.com/32/43/3243035.html
And this, but it is getting beyond me: (wileyonlinelibrary.com) DOI: 10.1002/sim.4155

It's getting late here now, so will have to turn in. Thank you for the advice so far - if there are any other pointers you can add, I would be very grateful. I'm seeing my supervisor on Monday, so the more info I have the better. I am prepared to put in whatever work is necessary to expedite the analysis process, but I can see that analysing this dataset could be lengthy and complex. This is for a preliminary (tentative) analysis to be presented at a conference soon, but the full dataset will be compiled in August. Looks like the steep learning curve will continue :)
 
#8
Hi,
Any further comments on the viability of my non-normal variables? I can run correlations using a non-parametric test, but would very much like to do a full regression if there is any way of doing it in a robust manner. If I go down the route of transforming my skewed variables, does it then mean that I have to transform all of my variables in the same way to keep consistency?
Thank you.