Multiple linear regression

Hello everyone, I am new here (and to stats in general).

Anyway, I am working on a project where I want to predict the value of an outcome, and I have a few variables known beforehand that might be related. This is how I planned to go about finding a formula that will more or less predict that outcome. Is it correct?

List potentially influential variables
Gather samples (I can do as many as I like)
Calculate the Pearson correlation between each variable and the outcome
Discard all obviously unrelated variables
Regress the outcome on the first variable, then regress the residuals from that fit on the next variable, and so on
Compare the full regression against a regression excluding each variable in turn to test its importance
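
For concreteness, the last step of the list above could be sketched roughly like this in Python with NumPy. The data are synthetic and the helper name `r_squared` is made up for illustration, so treat it as a sketch of the idea (compare R squared of the full fit with R squared after leaving each variable out), not as a recommended procedure.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on the columns of X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

# Synthetic data (an assumption for illustration): y depends on x0 and x1 but not x2.
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

r2_full = r_squared(X, y)

# Refit with each variable left out and record how much R^2 falls.
drops = {}
for j in range(X.shape[1]):
    X_reduced = np.delete(X, j, axis=1)
    drops[j] = r2_full - r_squared(X_reduced, y)

for j, drop in drops.items():
    print(f"dropping x{j}: R^2 falls by {drop:.3f}")
```

A large fall suggests the variable carries information the others do not; a fall near zero suggests it adds little once the others are in the model.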

Is this how I should go about doing it? How do I know that a multi linear regression is the way to go? And is there any free software out there that would make my life easier?


JohnLock said:
Is this how I should go about doing it? How do I know that a multi linear regression is the way to go? And is there any free software out there that would make my life easier?
You seem to be describing hierarchical multiple regression, looking at the changes in R squared when previous variables are eliminated. This is an approach I often use, but as for whether this is the way to go... You haven't really provided any information about your data or research questions, which are what dictate the choice of test. The best way to know if you're making a sound choice about the test you use is to become educated on the purpose, strengths and weaknesses of each test. This is a researcher decision that must be justified.
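
As a rough illustration of what looking at changes in R squared means, here is a minimal Python/NumPy sketch on made-up data. The variable names (`x_a`, `x_b`), the order of entry, and the helper `r_squared` are all assumptions for the example; in a real analysis the order should come from your research question.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on the columns of X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

# Synthetic data (an assumption for illustration): two correlated predictors.
rng = np.random.default_rng(1)
n = 400
x_a = rng.normal(size=n)
x_b = 0.6 * x_a + rng.normal(size=n)
y = 1.5 * x_a + 1.0 * x_b + rng.normal(size=n)

# Enter the predictors one step at a time and record the change in R^2.
steps = [("x_a", x_a), ("x_b", x_b)]
entered = []
prev_r2 = 0.0
changes = {}
for name, column in steps:
    entered.append(column)
    r2 = r_squared(np.column_stack(entered), y)
    changes[name] = r2 - prev_r2
    prev_r2 = r2

for name, change in changes.items():
    print(f"adding {name}: R^2 rises by {change:.3f}")
```

Because the predictors are correlated, the change credited to each one depends on the order in which they enter, which is one reason the choice has to be justified.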

Now onto the free software that will make life easier. One letter: R. We have a thread for R resources HERE. Initially the program/language takes some time, but once you get it you'll wonder what you did before it.
So when I studied for my bachelor's in economics I didn't pay attention to statistics; the way it was explained, it seemed more like an academic exercise than a useful tool.
Now wiser, and soon to begin my master's, I am beginning to really regret not knowing statistics. Since I only really feel I have learned something once I have done it, I have made it a personal project to model European-style options. I expect there isn't any very reliable formula that can do it, but the exercise is worth a try.

So when you buy a (call) option, you get the right to purchase a stock at a later date at a price fixed now. So let's say Company X trades at 100 and you buy 1 option for 10 with an expiration date in one month. At the expiration date you can buy the stock at 100, no matter what everyone else is trading it at. So if everyone else is trading at 120, you can buy the stock for 100, sell it, and profit 20 minus the initial 10. If the stock trades at 80 on the expiration date, you would be better off not buying it for 100, and you have lost your initial 10. In practice, you just get paid whatever you could have earned based on the stock price at the expiration date (no need to buy and sell).
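
To make the arithmetic concrete, here is a small Python sketch of the payoff at expiration using the numbers from the example above. The function names `call_payoff` and `call_profit` are just made up for illustration, and it ignores fees and interest.

```python
def call_payoff(stock_price, strike):
    """Value of a call option at expiration: exercise only if it pays."""
    return max(stock_price - strike, 0.0)

def call_profit(stock_price, strike, premium):
    """Payoff at expiration minus the price originally paid for the option."""
    return call_payoff(stock_price, strike) - premium

strike = 100.0   # price at which the stock may be bought
premium = 10.0   # price paid for the option

for price_at_expiration in (120.0, 80.0):
    profit = call_profit(price_at_expiration, strike, premium)
    print(f"stock at {price_at_expiration:.0f}: profit {profit:+.0f}")
# prints:
# stock at 120: profit +10
# stock at 80: profit -10
```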

Anyway, you can of course sell your option to others, and the price of your option is determined by the chance of it being profitable at the expiration date. It is these fluctuations I would like to try to model. I just realized that I might need another model, since the price will move more and more towards one of two extremes: either 0, when the option looks to be worthless, or the share price minus the exercise price, when a profit looks likely.

Did it make any sense?
Is there no one who can point me in the right direction as to what I should do?
My thinking right now is to continue with the linear regression approach, but having one of the variables be time left.

What do you think?
Thank you for your reply.
You are right, I have never really done anything stats or regression related before.
I have read up on multi linear regression as best I could and I am pretty sure I would be able to do it.
I just needed to know if it indeed was the way to go in my particular situation.
Anyway I will try it out the way I initially wanted to, and see how it turns out.
Maybe I will have another question along the way :)


Ambassador to the humans
I'm not a huge fan of your original plan.

If your ultimate interest is in real scientific progress, I'd suggest that you ignore that sentence (and any conclusion drawn subsequent to it).
-- Andy Liaw (in response to a question on the meaning of the sentence: 'Independent variables whose correlation with the response variable was not significant at 5% level were removed')
R-help (March 2010)
That quote summarizes my thoughts on the reason I'm not a fan of the original plan.

Note that it's quite possible for variables that are "obviously not related to Y" to turn out to be important predictors in the presence of other variables.
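
Here is a small made-up simulation of that point, in Python with NumPy. Everything in it (the variable names, the way the data are generated, the helper `r_squared`) is an assumption chosen to make the effect obvious: `x2` is essentially uncorrelated with `y` on its own, yet adding it to a model that already contains `x1` makes the fit nearly perfect.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on the columns of X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(42)
n = 5000
signal = rng.normal(size=n)
noise = rng.normal(size=n)

y = signal            # the outcome
x1 = signal + noise   # a noisy measurement of the outcome
x2 = noise            # the contamination in x1, unrelated to y by itself

corr_x2_y = np.corrcoef(x2, y)[0, 1]
r2_x1 = r_squared(x1, y)
r2_both = r_squared(np.column_stack([x1, x2]), y)

print(f"correlation of x2 with y: {corr_x2_y:.3f}")
print(f"R^2 using x1 alone:       {r2_x1:.3f}")
print(f"R^2 using x1 and x2:      {r2_both:.3f}")
```

Screening on the correlation with Y alone would throw `x2` away, even though `y = x1 - x2` holds exactly in this setup.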