I am new to stats, and this may be very easy for the seniors here. In my regression analysis I want to find how “% change” is affected by other variables: the day of the week (Sunday, Monday, etc.) and the occurrence number of that day within that particular month (first occurrence of Sunday = 1, second occurrence = 2, and so on for the other days too). Please note: this number is not the week number in which the particular day falls (Sunday of week one, Sunday of week two, and so on).

I have a few questions here:

1) In the dataset there are some “% change” values that deviate from the small values it generally has (5% and over are considered outliers). These large values are not errors; they are part of the business and occur now and then. Should I remove the large values before running the regression? Will their presence affect my regression? If you suggest reducing the large values, how should I do that? Also, you will see that on Saturday, Sunday and holidays the % change is 0%, as there is no business on those days. Should I remove these from the analysis too? How will it affect things if they are removed?
Also, I am using linear regression; I hope that is the type of regression I should use. Correct me if I am wrong.

The data below covers the month of April for four years: 2007, 2008, 2009 and 2010. Ignore the first column; I just gave it for your reference. I want to find how column 4 is affected by columns 2 and 3. Basically, I want to see if I can find a pattern in April and make predictions about April 2011, April 2012 and so on.

Date Occurrence # Day %change

04/01/2007 1 Sunday 0.00%
04/02/2007 1 Monday 3.57%
04/03/2007 1 Tuesday -1.32%
04/04/2007 1 Wednesday -1.88%
04/05/2007 1 Thursday -1.95%
04/06/2007 1 Friday 4.05%
04/07/2007 1 Saturday 0.00%
04/08/2007 2 Sunday 0.00%
04/09/2007 2 Monday 1.59%
04/10/2007 2 Tuesday 1.27%
04/11/2007 2 Wednesday -0.64%
04/12/2007 2 Thursday -0.39%
04/13/2007 2 Friday -1.01%
04/14/2007 2 Saturday 0.00%
04/15/2007 3 Sunday 0.00%
04/16/2007 3 Monday 3.77%
04/17/2007 3 Tuesday 1.30%
04/18/2007 3 Wednesday -1.47%
04/19/2007 3 Thursday -0.40%
04/20/2007 3 Friday 0.15%
04/21/2007 3 Saturday 0.00%
04/22/2007 4 Sunday 0.00%
04/23/2007 4 Monday 1.01%
04/24/2007 4 Tuesday -9.59%
04/25/2007 4 Wednesday -1.07%
04/26/2007 4 Thursday -0.60%
04/27/2007 4 Friday -0.88%
04/28/2007 4 Saturday 0.00%
04/29/2007 5 Sunday 0.00%
04/30/2007 5 Monday 5.31%
04/01/2008 1 Tuesday 0.29%
04/02/2008 1 Wednesday -0.28%
04/03/2008 1 Thursday -1.48%
04/04/2008 1 Friday 1.25%
04/05/2008 1 Saturday 0.00%
04/06/2008 1 Sunday 0.00%
04/07/2008 1 Monday 2.03%
04/08/2008 2 Tuesday -1.49%
04/09/2008 2 Wednesday -0.93%
04/10/2008 2 Thursday -0.39%
04/11/2008 2 Friday -0.21%
04/12/2008 2 Saturday 0.00%
04/13/2008 2 Sunday 0.00%
04/14/2008 2 Monday 2.67%
04/15/2008 3 Tuesday -1.60%
04/16/2008 3 Wednesday -1.24%
04/17/2008 3 Thursday -0.60%
04/18/2008 3 Friday 0.12%
04/19/2008 3 Saturday 0.00%
04/20/2008 3 Sunday 0.00%
04/21/2008 3 Monday -0.10%
04/22/2008 4 Tuesday -1.13%
04/23/2008 4 Wednesday -1.52%
04/24/2008 4 Thursday -1.07%
04/25/2008 4 Friday 0.48%
04/26/2008 4 Saturday 0.00%
04/27/2008 4 Sunday 0.00%
04/28/2008 4 Monday 3.65%
04/29/2008 5 Tuesday -0.66%
04/30/2008 5 Wednesday -1.69%
04/01/2009 1 Wednesday 0.57%
04/02/2009 1 Thursday -1.19%
04/03/2009 1 Friday 0.41%
04/04/2009 1 Saturday 0.00%
04/05/2009 1 Sunday 0.00%
04/06/2009 1 Monday 1.60%
04/07/2009 1 Tuesday 0.30%
04/08/2009 2 Wednesday -0.96%
04/09/2009 2 Thursday -0.45%
04/10/2009 2 Friday 10.88%
04/11/2009 2 Saturday 0.00%
04/12/2009 2 Sunday 0.00%
04/13/2009 2 Monday 1.40%
04/14/2009 2 Tuesday -0.17%
04/15/2009 3 Wednesday -0.88%
04/16/2009 3 Thursday -1.44%
04/17/2009 3 Friday 0.69%
04/18/2009 3 Saturday 0.00%
04/19/2009 3 Sunday 0.00%
04/20/2009 3 Monday 1.76%
04/21/2009 3 Tuesday 0.35%
04/22/2009 4 Wednesday -1.35%
04/23/2009 4 Thursday -0.89%
04/24/2009 4 Friday 0.68%
04/25/2009 4 Saturday 0.00%
04/26/2009 4 Sunday 0.00%
04/27/2009 4 Monday 1.84%
04/28/2009 4 Tuesday 1.86%
04/29/2009 5 Wednesday -0.59%
04/30/2009 5 Thursday 0.04%
04/01/2010 1 Thursday -0.84%
04/02/2010 1 Friday -0.69%
04/03/2010 1 Saturday 0.00%
04/04/2010 1 Sunday 0.00%
04/05/2010 1 Monday 1.59%
04/06/2010 1 Tuesday 0.58%
04/07/2010 1 Wednesday 0.67%
04/08/2010 2 Thursday -0.19%
04/09/2010 2 Friday -0.18%
04/10/2010 2 Saturday 0.00%
04/11/2010 2 Sunday 0.00%
04/12/2010 2 Monday 1.26%
04/13/2010 2 Tuesday -0.57%
04/14/2010 2 Wednesday -1.20%
04/15/2010 3 Thursday -1.62%
04/16/2010 3 Friday -0.31%
04/17/2010 3 Saturday 0.00%
04/18/2010 3 Sunday 0.00%
04/19/2010 3 Monday 1.23%
04/20/2010 3 Tuesday 1.72%
04/21/2010 3 Wednesday -1.03%
04/22/2010 4 Thursday -0.79%
04/23/2010 4 Friday -0.81%
04/24/2010 4 Saturday 0.00%
04/25/2010 4 Sunday 0.00%
04/26/2010 4 Monday -0.20%
04/27/2010 4 Tuesday 0.08%
04/28/2010 4 Wednesday -0.28%
04/29/2010 5 Thursday 0.93%
04/30/2010 5 Friday -0.79%

If you can please help me with this example, other similar examples would be a cakewalk.

Regards,
Amandeep

You shouldn't eliminate data points that are valid (that is, ones which were not created by, say, a mistake in coding) without an extremely good reason, even if they are outliers.

## The Following User Says Thank You to noetsi For This Useful Post:

amandeepsharma89 (06-09-2012)

Thank you noetsi. What about Saturday and Sunday? They always have a 0% change; can I remove them? Also, linear regression should be the way to go, right?

Amandeep

I also had a question here: do I have to treat Saturday, Sunday, Monday, etc. as different variables, or have day as a single variable?
Waiting for your replies; I am in a fix.

Regards.
Amandeep

Also, should the day occurrence number be treated the same way, with 1, 2, 3, etc. as different variables?

Please help, seniors; I am in a fix and have to solve this really quickly.
Waiting for replies.

Amandeep

Amandeep,

If the business is closed on Sat and Sun, and if that is why %Deviation is 0.0 on weekends, then they probably should NOT be included in your analysis.

Do not otherwise modify your data. Outliers are outliers, and should be included. There are more advanced ways to deal with them that are simply not worth your time exploring until you understand the basics of regression better.

Presumably, you have a theory for why the day of the week, and the number of occurrences of that day in a given month, affect Percent Deviation. If so, then your model will look something like this:

PctDev = a + b1*Monday + b2*Tuesday + b3*Wednesday + b4*Thursday + b5*occ1 + b6*occ2 + b7*occ3 + b8*occ4 + error

After you've obtained your estimates, you can check them for robustness to outliers by removing said outliers one or two at a time, rerunning your regression, and comparing your results to the model as it runs on the full data set. Or you can use your proposed decision rule "exclude if PctDev > 5.0" and see if your estimates change. This will tell you whether the outliers are driving your results. If so, that is not a failure of the data or the model; rather, it is something you need to include in your report.
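As an illustration of the model and robustness check described above, here is a minimal sketch in Python (an assumption on my part; the thread itself only discusses Excel and Weka). The column names and the mini-sample are hypothetical, shaped like the April data, with one artificial 10.88% outlier:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample shaped like the April data; weekends already
# dropped. The 10.88 value plays the role of an outlier like 04/10/2009.
df = pd.DataFrame({
    "day": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
            "Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
    "occurrence": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "pct_change": [3.57, -1.32, -1.88, -1.95, 10.88,
                   1.59, 1.27, -0.64, -0.39, -1.01],
})

def fit_ols(data):
    # 0/1 indicator columns for each day and occurrence number.
    # drop_first leaves one baseline level per factor (here Friday and
    # occurrence 1), so the intercept is identified (no dummy trap).
    X = pd.get_dummies(data[["day", "occurrence"]].astype(str),
                       drop_first=True).astype(float)
    X.insert(0, "intercept", 1.0)
    beta, *_ = np.linalg.lstsq(X.to_numpy(),
                               data["pct_change"].to_numpy(), rcond=None)
    return dict(zip(X.columns, beta))

full = fit_ols(df)
# Robustness check from the post: refit with |%change| > 5 excluded
# and compare coefficient estimates against the full-data fit.
trimmed = fit_ols(df[df["pct_change"].abs() <= 5.0])
```

Comparing `full` and `trimmed` coefficient by coefficient shows how much the outlier drives the estimates, which is exactly the check suggested above.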


Hi eyesack_kn, that is a very good explanation of my query. Thanks a lot for your prompt reply; I get what I have to do. I have one problem now, i.e. restructuring the data to put into a tool. In my post you saw how the data is structured. How would you suggest I structure it so that I can run it in Excel or Weka?

It would be of great help if you could reply to this, the reason being that I have to analyze tons of similar data and make a report out of it. Also, let me know if it is fine if I PM you.

Regards,
Amandeep

No PM'ing unless I'm getting paid to consult =)

Presumably you wish to create a bunch of indicator variables for each possible value of "occurrence" and "day of the week". A series of columns in Excel with "=IF(day = "Sunday",1,0)" would work. I don't know anything about Weka.

Thanks a lot eyesack_kn. Can you provide me with a sample, for just one month, say April 2007, of how to create the indicators for occurrence and day and match the corresponding %change value? I can take it forward from there. It would be of great help if you could attach such an example.

Regards,
Amandeep

Not sure I understand what you're asking. I'm not going to write out all the commands you would need to do this in Excel; you should be able to figure that out based on what I said above. If you're asking what the data would look like, it would be N rows of indicator variables taking on the value of 1 for the occurrence and day of that observation, and zero everywhere else. You could also create dummies for the month and/or the year, if you wanted. The rest is just Excel functions, though, and Excel has an excellent help file to assist you with those matters. Good luck!
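To make the layout described above concrete, here is a small sketch in Python (my own choice of tool; the thread uses Excel) building the indicator columns for the first trading week of April 2007 from the data above. The column names are hypothetical:

```python
import pandas as pd

# First trading week of April 2007 from the posted data; Saturday and
# Sunday dropped, as suggested earlier in the thread.
df = pd.DataFrame(
    [("04/02/2007", 1, "Monday",     3.57),
     ("04/03/2007", 1, "Tuesday",   -1.32),
     ("04/04/2007", 1, "Wednesday", -1.88),
     ("04/05/2007", 1, "Thursday",  -1.95),
     ("04/06/2007", 1, "Friday",     4.05)],
    columns=["date", "occurrence", "day", "pct_change"],
)

# One 0/1 column per day and per occurrence number, mirroring the
# =IF(...) columns described for Excel: 1 for the observation's own
# day/occurrence, 0 everywhere else.
for d in ["Monday", "Tuesday", "Wednesday", "Thursday"]:   # Friday = baseline
    df[d] = (df["day"] == d).astype(int)
for k in [1, 2, 3, 4]:                                     # occurrence 5 = baseline
    df[f"occ{k}"] = (df["occurrence"] == k).astype(int)
```

Each row then has exactly one 1 among the day columns (unless it is the baseline day) and one among the occurrence columns, which is the N-rows-of-indicators layout described in the reply.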


Ok got it. That is what I was asking for. Thanks a ton again!!

Regards,
Amandeep

Hello Seniors,
Thanks for your help. I was able to solve the problem and got the following result:

Multiple R 0.408429271
R Square 0.16681447
Standard Error 0.008296363
Observations 108

ANOVA
df SS MS F Significance F
Regression 10 0.001364278 0.000136428 2.477634319 0.010937384
Residual 99 0.006814135 6.88296E-05
Total 109 0.008178412

Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 0.000987554 0.002315966 0.426411062 0.670734106 -0.003607826 0.005582933 -0.003607826 0.005582933
X Variable 1 0 0 65535 #NUM! 0 0 0 0
X Variable 2 -0.001625844 0.002560313 -0.63501741 0.526881998 -0.006706061 0.003454373 -0.006706061 0.003454373
X Variable 3 -0.000170694 0.002534512 -0.067347844 0.946440618 -0.005199715 0.004858327 -0.005199715 0.004858327
X Variable 4 -0.001044758 0.002534512 -0.41221271 0.681074747 -0.006073779 0.003984263 -0.006073779 0.003984263
X Variable 5 0.00552943 0.002534512 2.181655038 0.031501592 0.000500409 0.010558451 0.000500409 0.010558451
X Variable 6 0 0 65535 #NUM! 0 0 0 0
X Variable 7 0.000847719 0.002346566 0.361259256 0.718675457 -0.003808377 0.005503814 -0.003808377 0.005503814
X Variable 8 -0.003499521 0.002346566 -1.491337328 0.139052552 -0.008155617 0.001156574 -0.008155617 0.001156574
X Variable 9 -0.004545338 0.002346566 -1.937016822 0.055593767 -0.009201433 0.000110758 -0.009201433 0.000110758
X Variable 10 0.002046103 0.003378997 0.605535541 0.546208916 -0.004658561 0.008750767 -0.004658561 0.008750767

There were some dependencies which were correctly shown. However, the value of R^2 is really low. What does that mean? How should I interpret it and write about it in the report? I know that a low R^2 means a less efficient model.
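As a quick sanity check on the output above, R^2 is just the ratio of the regression sum of squares to the total sum of squares, both of which appear in the pasted ANOVA table:

```python
# Sums of squares copied from the ANOVA table above
ss_regression = 0.001364278
ss_total = 0.008178412

r_squared = ss_regression / ss_total
# Matches the reported "R Square 0.16681447": the model explains about
# 17% of the variance in %change; the remaining ~83% is unexplained.
```

So a low R^2 here means day-of-week and occurrence number account for only a small share of the day-to-day variation, even though some individual coefficients are significant.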

Can someone please elaborate on the above result and explain the R^2 value.

Regards,
Amandeep Sharma