Impact of factors on search engine rankings

I read a study that I think is very flawed. It's in German, though, and it was probably flawed intentionally by the webmaster, but it raised a question I've asked myself before in another context. I'll use the search engine example:

Say you have 10 factors that affect a search engine ranking. Can you look at 100,000 websites (which should be enough for statistical significance at a very high confidence level), consider only 1 factor, and check whether it correlates with higher search engine rankings?

For example, having a keyword in the page title and in the header tags helps with search engine rankings for that keyword (it's a simplification). Can you now look at 100,000 websites and see how having the keyword in the page title multiple times correlates with high search engine rankings?

And can you expect a negative correlation between search engine ranking and the number of times the keyword is mentioned on the page, since all the other factors should follow a random distribution (exp?)?

I assume this approach would only work if the variable we look at is in no way correlated with any of the other 9 factors for the search engine ranking?
Is this simply multi-collinearity (exp) in a regression model? And would I have to use specific tests to detect possible multi-collinearity?
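
To make the worry concrete, here is a rough simulation sketch, not a description of what the study did. Everything in it is an assumption made for illustration: `factor_a` has no direct effect on a made-up `score` (higher is better), `factor_b` does, and `overlap` sets how strongly the two factors move together.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(overlap, n=20000, seed=0):
    """Correlation between factor_a and score when factor_a has NO direct effect.

    overlap (0..1) is the hypothetical correlation between factor_a and
    factor_b; only factor_b actually drives the score.
    """
    rng = random.Random(seed)
    a_values, scores = [], []
    for _ in range(n):
        factor_b = rng.gauss(0, 1)
        # factor_a is partly a copy of factor_b, partly independent noise.
        factor_a = overlap * factor_b + (1 - overlap ** 2) ** 0.5 * rng.gauss(0, 1)
        # The score depends on factor_b and noise only; factor_a is absent.
        score = factor_b + rng.gauss(0, 1)
        a_values.append(factor_a)
        scores.append(score)
    return pearson(a_values, scores)


independent = simulate(overlap=0.0)  # factor_a unrelated to factor_b
entangled = simulate(overlap=0.8)    # factor_a tends to come with factor_b

print(f"corr(factor_a, score), independent factors: {independent:.3f}")
print(f"corr(factor_a, score), entangled factors:   {entangled:.3f}")
```

In this toy setup the one-factor-at-a-time chart looks clean only when the factor is unrelated to the others. Once the factors overlap, a factor with no effect at all still shows a clear correlation, and a larger sample only makes that misleading number more precise.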

I assume that unless I run a regression with all the factors (which, because of the 200 or so factors in search engine optimization, would be close to impossible), that would pose another problem?
You are regressing search engine ranking on the 10 factors that affect it, right? And one of those variables which should be positive shows up negative, right?

In my experience, weird things sometimes happen when a lot of variables are in the equation.

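I don't think this is multicollinearity. I think multicollinearity mainly inflates the standard errors, which makes the t-statistic low, so the coefficient is not statistically significant. Because the estimates become unstable it can occasionally flip the sign of a coefficient too, but I wouldn't assume that is what happened here.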

You don't have to have zero correlation between the variables. For multicollinearity to be a problem, the r^2 between two variables should be pretty high, about 0.9.
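
One common way to put a number on this is the variance inflation factor (VIF). A minimal sketch, assuming the simplest case of exactly two predictors, where the VIF depends only on the r^2 between them (with more predictors you would use the r^2 from regressing one predictor on all the others). The function name and the sample r^2 values are only for illustration.

```python
def variance_inflation_factor(r_squared):
    """VIF = 1 / (1 - r^2): how much the variance of a coefficient
    estimate is inflated by its overlap with the other predictor."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must be in [0, 1)")
    return 1.0 / (1.0 - r_squared)


for r_squared in (0.0, 0.5, 0.9):
    vif = variance_inflation_factor(r_squared)
    # The standard error grows with the square root of the VIF.
    print(f"r^2 = {r_squared:.1f}  VIF = {vif:5.2f}  std. error x {vif ** 0.5:.2f}")
```

An r^2 of 0.9 gives a VIF of about 10, which is a frequently quoted rule-of-thumb threshold, so the "about 0.9" guess is in the usual range. The standard error is then roughly three times larger than it would be with unrelated predictors.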

I suppose you could add variables to the equation one by one and see when weird things start to happen; then you will know the "problem" variable.
I know there are a lot of guesses in my post. I hope somebody else replies.
Thanks, I think you misunderstood me a bit, though:

There are about 200 factors that affect a ranking in Google, and this company (allegedly) analyzed 100,000 websites and checked which of a sample of those factors were present in the websites that ranked high (I hope this makes sense?).

So they always looked at one factor out of those 200 (or so) factors and drew a chart with that one metric on the y-axis and the ranking on the x-axis. Their theory was basically that if a certain factor was more prevalent in high-ranking websites than in low-ranking ones, that factor should play a role (they did this for each of the 10 or so factors they looked at, always in isolation)...

and I'm wondering if it's really as simple as that.

Another example (for which no search engine knowledge is necessary) would be this:

This is a flawed assumption, but let's just say taller people can run faster.

Let's look at 100,000 sprinters and see how their height (measured in centimeters) affects their quickness (measured by their 100m dash time).

We would expect a negative correlation between these 2 metrics (because we measure 100m time, and the lower it is, the faster the person is; sorry for not being able to think of an example with a positive correlation right now).

If we find such a correlation between those 2 metrics in a chart, can we assume it is real, and assume that all the other factors that can affect speed don't play a role because the sample size is big enough (100,000)?

P.S.: I just realized that this search engine study might also be flawed in that...there's a correlation, but correlation doesn't necessarily mean causation, right?
Yes, I think I misunderstood you :).

P.S.: I just realized that this search engine study might also be flawed in that...there's a correlation, but correlation doesn't necessarily mean causation, right?
Yes, of course. A and B can be correlated, but both A and B can be caused by C.
Thanks. I don't know that much about statistics yet. I'm mostly an internet marketing/web analytics guy, and even though I wouldn't need any advanced statistical techniques, being very solid at the basics does help. That's why I ask such beginner questions ;)


A and B can be caused by C!! I'm glad to hear that, because it means the assumption I made the other day was right: as people get older, they lose hair. They also earn more money (at least the means will show that). However, factor C here is age, and it doesn't mean that the more money somebody makes, the more hair they lose, nor does it mean that if somebody loses hair they will earn more money LOL.
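
The hair/income example can be checked with a partial correlation, which asks how much of the A-B correlation is left once C is accounted for. This is only a sketch on invented data: the age range, the noise levels, and the units of `hair_loss` and `income` are arbitrary assumptions, and by construction both depend on age and nothing else.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def partial_correlation(xs, ys, zs):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5


rng = random.Random(1)
ages, hair_loss, income = [], [], []
for _ in range(20000):
    age = rng.uniform(20, 65)
    ages.append(age)
    # Both variables are driven by age plus independent noise;
    # neither one influences the other in this made-up data.
    hair_loss.append(age + rng.gauss(0, 10))
    income.append(age + rng.gauss(0, 10))

raw = pearson(hair_loss, income)
controlled = partial_correlation(hair_loss, income, ages)
print(f"raw correlation:               {raw:.3f}")
print(f"partial correlation given age: {controlled:.3f}")
```

Here the raw correlation is clearly positive, while the partial correlation given age is close to zero, which is what "both are caused by C" looks like in numbers. It only works for a C you have thought of and measured, so it cannot rule out a different hidden factor.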

But how about, for example, the correlation between height and income? Can we assume that height and income are most likely not correlated with a third factor C, and, as income will not make people grow (which could be tested for, too), that being taller really helps people earn more money on the job?

How about this search engine study: can the correlation between a certain factor (keyword in the page title, for example) and a high search engine ranking really just be there because both are caused by, or correlated with, an individual's knowledge of search engine optimization? Meaning: somebody who places the keywords in the title will also do many other things to get a good search engine ranking, and thus a single factor really isn't the cause of the high rankings?

What does correlation really tell us? It seems as if it would mean pretty much nothing if we can't find a causation, but I assume it can give us an idea and a new hypothesis for causation to test?