Hello, this tutorial may be of some use to you: https://www.youtube.com/watch?v=kFasmbrDG5I
Gender - code them as 0 (Male) and 1 (Female)
a few basic questions. I'm just starting to learn regression so I'm trying to apply what I'm learning.
I have a set of data with ages ranging from 15 to 75, and the number of car accidents they have.
I scatterplot my data (age on the X axis, # of accidents on Y axis ). The result looks like a curve. # of accidents increase gradually from age 16, peaks at around 24, and decreases until 75, and zeros out. correlation is about -.75 according to Excel.
1st question:
The # of accidents is not normally distributed (it skews to the left a bit). Can I still do a regression analysis on this? my 'stats' book says it needs to be normally distributed. If so, how do I resolve this?
2nd question:
If it doesn't need to be normal, how do I explain the data? since the data looks like a curve, I applied a polynomial trend line to it. But when I plugged in random ages to see if it make sense, it seems to be undershooting the number of accidents. I increased the polynomial order to 6 and the graph line seemed to fit the scatterplot better, but when I plug in certain age ranges, the result is a negative # of accidents (which does not make sense).
3rd question:
I want to add gender to my regression. doesn't my independent variable need to be numerical? how do I apply gender if that is the case?
Hello, this tutorial may be of some use to you: https://www.youtube.com/watch?v=kFasmbrDG5I
Gender - code them as 0 (Male) and 1 (Female)
thanks. that video helped. I was able to move forward but I have a couple more questions:
1. So I understand that it is the residuals that need to be normally distributed and 'random' in order for my model to work. Once I ran the regression in Excel, I also plotted my residuals, and there is a clear pattern similar to my actual Y values, so it means that my model is not good. How do I fix this issue? Do I need to pick different independent variables? will adding more variables help? What does it mean exactly?
2. For the non-numerical variables, am I allowed to use more than one? for example, I have male = 0, female = 1. If I want to add marital status, do I just use 2 more dummy variables? I,e married = 3, single = 4?
The residuals of the regression [if that is what you are doing] need to be normal for linear regression not the accident rates. You would plot the residuals and see if they look normal. Note this only effects the confidence intervals and not the other assumptions of regression. So your slope estimates are not effected, but your ability to decide if the results are statistically significant are.
If there is a pattern in your data you need to figure out why. It might be non-linearity, this is solved commonly by adding a quadratic term, it might be the predictors have another non-linear relationship - this can be dealt with by transformations [which also are commonly used to deal with non-normal residuals].
I am not sure how you are "explaining" the data if all you are doing is plotting accidents. If you have predictors you explain the data by their impact on accidents and their test of statistical significance. You never really explain what your predictors are.
Regression does not require interval [what you mean by numeric] data. You can code male 1 and female 0 and add it to the regression as a predictor. The slope you get will be the mean difference in accidents between males and females.
Using excel for regression is less than ideal. If you can use another form of software that would be much better. Performing a box-tidwel test on the predictors you think are non-linear is a good start. Note that if you are plotting accidents over time you need to do a test for autocorrelation....Durbin Watson is a basic form, there are a lot better tests.
"Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995
Thanks for the reply. Apologize as I'm not fully understanding. My objective is to see what impacts age and gender have on accidents.
Right now, My data has 3 columns. Age, Gender (0,1), and accident counts.
Before doing any regression, I did plot accidents on the Y axis, and age on the X axis just to see what it looks like. What I get is a curve (accidents counts increase until around age 24 and then slowly decreases). the correlation is -.55, and I add another variable (gender) to the mix.
I do the regression and I get my result.
So since my data is a curve, are you saying that I need to add a quadratic term? do I just add a squared term? ?? not really understanding....
I think you are confusing normality with linearity to start with. And second you are looking the levels of the dependent variable at level of a single predictor when you should be looking at the residuals of the regression. You need to generate the residuals and look at them for violations of normality and linearity not the raw data.
Forget a scatterplot. They are useless entirely with more than one predictor. And of limited value with even one.
Do you have access to any software other than Excel? It is not a good tool for regression.
"Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995
Thanks.
And just to be clear(and feel free to let me know if I'm doing it wrong), I am looking at the residuals. Basically, I looked at the raw data just to see if there is any correlation, but when regress it in excel, I mark the checkbox to also display the residual plot. The result is that the residual plot curves in a similar way to my raw data.
As for alternative software, I do not have any other access. Maybe to R, but I would need to learn it first
I would learn R if you plan to do regression. Excel provides too few diagnostics. And in honesty I don't trust it, I know there are errors in other Excel add ons like solver.
I think you can do the Box -Tidwel test of linearity, you might look that up. I could not find any examples that did not involve logistic regression. Solutions for non-linearity include transformations, like Box-Tidwel, splines, and general additive models I believe although I have not worked with the latter in a while. Specifying a quadratic term may address the issue if it is caused by time. Try Box Tidwel test first.
"Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995
As for softwares, if you do not want to use R, you could give PAST (free) a try. It's a program devised for palaeonthologists but it is widely used by many researchers. It has a spreadsheet-like interface.
Just try googling for it, and be sure you download the new version (3.0 or newer)
Best
Gm
EDIT:
http://folk.uio.no/ohammer/past/
http://cainarchaeology.weebly.com/
Paleontology - I heard that is a dead science.
Stop cowardice, ban guns!
noetsi (03-08-2016)
Thanks. I just found out my workplace has access to SAS EG, so I'll try to learn that since that is what my company uses.
If you know the statistics EG is a piece of cake. Point and click...
"Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995
semidevil (03-09-2016)
Thanks. I'm actually learning Statistics on my own, and trying to learn EG so I can apply it, so I'm still a beginner here.
So...with that in mind, I do want to ask if my steps and process for regression is correct(Im using a basic stats book for reference).
1. Pick Dependent and independent variables
Pick the independent variables. Going back to square 1, I have my dependent variable as 'accident counts'. I have age as my independent, and I went ahead and added two other independent variables (marital status (0,1), Gender (0,1).
2. check correlations.
I checked the correlations of my dependent with my independent in SAS EG and I'm getting -.334, .033, and .43. This tells me that there is not much correlation. If that is the case, does this mean I need to scrap my dependent variables and pick ones with higher correlation?
One problem is that linear regression [which is what I am assuming you are using, PROC REG or PROC GLM does this in SAS] does not assume that the DV is a count. Formally you are supposed to use something like Poisson regression for counts. But I don't think that matters for this type of count data [ I am not an expert in that so you might want to check].
If you are running regression you are not looking at the correlations. You are looking at the slopes and the t test [the P values particularly]. Look at the p values and see if they are below .05. If they are than this likely is a useful variable although you need to run diagnostics like VIF, look at the qq plot of the residuals for normality etc. Only if the assumptions are met can you assume the results matter.
Here is an example of me doing this, these include a series of diagnostics for regression assumptions. The QQ plot generated is the best single test for normality.
I asked for White's SE because looking at the residuals suggested that there might be heteroscedascity so my p values might not work.
PROC REG DATA=WORK.SORTTempTableSorted
PLOTS(ONLY)=RSTUDENTBYPREDICTED
PLOTS(ONLY)=COOKSD
PLOTS(ONLY)=QQPLOT
PLOTS(ONLY)=DIAGNOSTICSPANEL
PLOTS(ONLY)=RESIDUALPANEL
/* Start of custom user code. */
PLOTS(MAXPOINTS= 11000 )
/* End of custom user code. */
;
Linear_Regression_Model: MODEL DV = Age14 OJT2 timeinserv EDDUM Female Sev1 Sev2 INCOST Race_W Ethnicity_H Private SEDUM TSA2 CM DD LD OR SE SA
/ SELECTION=NONE
TOL
/* Start of custom user code. */
WHITE
/* End of custom user code. */
;
RUN;
"Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995
Tweet |