What statistical test have I re-invented?


Goran_L
08-18-2008, 08:46 AM
As part of my PhD dissertation, I have come up with a new way to address a common problem in my field (which is industry location). My method seems so simple and straightforward that it is probably a standard statistical method, but I can't figure out which one it is. I need someone to tell me "Oh that's easy, that's just a typical Schmittmeyer's Q. It's nothing new - just Google it!"

Here is the problem and my approach.

Problem:

We have employment statistics for two industries - Red and Green - and for the entire population, called Anyone. The statistics are available for four regions, R1-R4.

We now want to determine whether industry Red tends to locate in the same regions as industry Green.

Data example:


         R1   R2   R3   R4   Total
Red       3    1    0    2       6
Green     2    3    3    0       8
Anyone   50   30   20   50     150

(The actual data I use has millions of employees in hundreds of industries and hundreds of regions.)

There are many standard ways to do this, but here is the approach I "invented":

Solution:

I use combinatorial probabilities to calculate the likelihood that a Red employee and a Green employee are in the same region, and compare that with the likelihood that a Red employee is in the same region as Anyone.

I calculate like this:

In region R1 there are 3 Red and 2 Green, and they form 3*2=6 Red-Green pairs
In R2: 1*3=3 Red-Green pairs
In R3: 0*3=0 Red-Green pairs
In R4: 2*0=0 Red-Green pairs
Total: 6+3+0+0=9 Red-Green pairs
Maximum possible number: 6*8=48 Red-Green pairs
Likelihood: 9/48 = 0.19

So there is a 19% chance that a random Red will be in the same region as a random Green.

If we now do the same for Red and Anyone we get:
Likelihood = (3*50 + 1*30 + 0*20 + 2*50)/(6*150) = 280/900 = 0.31

So the chance for a Red to be in the same region as Anyone is much higher, 31%. This suggests that there is NO tendency for Red to colocate especially with Green, quite the opposite.
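The arithmetic above can be reproduced in a few lines of Python. This is only a sketch built on the example table; the variable and function names are made up for illustration.

```python
# Employment per region (R1..R4), taken from the example table.
red = [3, 1, 0, 2]
green = [2, 3, 3, 0]
anyone = [50, 30, 20, 50]


def same_region_pairs(a, b):
    """Number of cross-group pairs located in the same region, per region."""
    return [x * y for x, y in zip(a, b)]


rg_pairs = same_region_pairs(red, green)    # [6, 3, 0, 0]
ra_pairs = same_region_pairs(red, anyone)   # [150, 30, 0, 100]

# Same-region pairs divided by all possible pairs.
p_red_green = sum(rg_pairs) / (sum(red) * sum(green))     # 9 / 48
p_red_anyone = sum(ra_pairs) / (sum(red) * sum(anyone))   # 280 / 900

print(round(p_red_green, 2), round(p_red_anyone, 2))  # 0.19 0.31
```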

Mathematically, we can write this (pardon my poor syntax here, I haven't learned how to write proper formulas on the web yet):

P(i,j) = [Sum over all regions r of (EMPLri * EMPLrj)] / [(Sum over all regions r of EMPLri) * (Sum over all regions r of EMPLrj)]

where
P(i,j) is the probability that a random employee in industry i is in the same region as a random employee in industry j
EMPLri is the number of employees in region r in industry i
EMPLrj is the number of employees in region r in industry j
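For many industries and regions, the formula for P(i,j) could be wrapped in a small function and applied to every pair of industries. The sketch below assumes a hypothetical data layout (one dict of region -> employee count per industry); that layout and the function name are illustrative choices, not something specified above.

```python
from itertools import combinations


def colocation_probability(empl_i, empl_j):
    """P(i,j): chance that a random employee of industry i and a random
    employee of industry j are in the same region.

    empl_i, empl_j: dicts mapping region name -> employee count
    (an assumed layout, chosen for illustration).
    """
    regions = set(empl_i) | set(empl_j)
    same_region = sum(empl_i.get(r, 0) * empl_j.get(r, 0) for r in regions)
    all_pairs = sum(empl_i.values()) * sum(empl_j.values())
    if all_pairs == 0:
        raise ValueError("both industries need at least one employee")
    return same_region / all_pairs


employment = {
    "Red": {"R1": 3, "R2": 1, "R3": 0, "R4": 2},
    "Green": {"R1": 2, "R2": 3, "R3": 3, "R4": 0},
    "Anyone": {"R1": 50, "R2": 30, "R3": 20, "R4": 50},
}

# P(i,j) for every unordered pair of industries.
p = {
    (i, j): colocation_probability(employment[i], employment[j])
    for i, j in combinations(employment, 2)
}
for pair, value in p.items():
    print(pair, round(value, 3))
```

Because the formula is symmetric in i and j, P(i,j) = P(j,i), so only unordered pairs need to be computed.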

There is more I could say about what this calculation can be used for, but it will probably not interest you. My question is simply:

Question:

Do you recognise this method? Could you tell me what statistical test it is I am performing?

(I checked some reasonable suspects, like Chi2, and it turns out that this one is similar to Chi2 but different.)

Many thanks!
Göran

TheEcologist
08-18-2008, 02:21 PM
I don’t really see that you have created a test. Right now it seems to me that you are simply calculating probabilities and then comparing them. That’s not a statistical test.

Let’s divide statistical inference (broadly) into two classes: modern information theory (e.g. model selection) and classical statistics (tests of a null hypothesis). Your problem seems to me to be one for the classical tests, so we’ll ignore information theory.

Now, if you have invented a new “test” in the classical sense, what are you testing for? Or better, what are you testing against?

Classical statistical tests all “test” against a null hypothesis: for instance, that there is no trend, or that the real difference is zero, etc.

What is the null hypothesis you’re testing against?

That, for one, is important because it also helps define your test statistic, which in turn determines which probability distribution corresponds to it. You want to determine whether there really is a pattern, or whether it is likely to occur simply by chance.

I think a logistic regression with green areas defined as a dummy variable could answer your question. Classify all areas that contain a red company as success (1) and areas without reds as failure (0); this is your response/dependent variable. Then do the same for green companies, and use that as an explanatory variable. That should tell you whether there is a (significantly) higher probability of finding a red company in an area with a green one.
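As a rough illustration of the coding step only (not the regression fit itself), the example table from the first post could be turned into region-level 0/1 variables like this. The names are illustrative, and four regions are of course far too few to fit anything.

```python
from collections import Counter

# Region-level employment counts from the example in the first post.
red = {"R1": 3, "R2": 1, "R3": 0, "R4": 2}
green = {"R1": 2, "R2": 3, "R3": 3, "R4": 0}

# Response variable: 1 if the region contains any Red employment, else 0.
red_present = {r: int(n > 0) for r, n in red.items()}
# Explanatory dummy: 1 if the region contains any Green employment, else 0.
green_present = {r: int(n > 0) for r, n in green.items()}

# Cross-tabulation of regions, keyed by (green_present, red_present).
crosstab = Counter((green_present[r], red_present[r]) for r in red)

for r in red:
    print(r, "green =", green_present[r], "red =", red_present[r])
print(dict(crosstab))
```

The two 0/1 columns would then be passed to whatever logistic regression routine is at hand, with `red_present` as the response and `green_present` as the predictor.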

Good luck

Goran_L
08-19-2008, 05:25 AM
TheEcologist wrote: "I don’t really see that you have created a test. Right now it seems to me that you are simply calculating probabilities and then comparing them. That’s not a statistical test."

Many thanks for your answer!

Yes, you are of course correct - it is not yet a test. For that, I need to know the properties of the statistic I have "invented". And since I can formulate the statistic but not derive its properties, I am hoping that someone can tell me the name of what I have formulated, so I can look it up.

Your suggestion of logistic regression is interesting, but it doesn't quite do the trick I'm trying to pull off, which is using the individual employees as the unit of analysis, not the regions. If regions were the unit of analysis, I could use a simple correlation measure between Red and Green employment, but for various reasons I want a fundamentally different approach, which treats the employees as the observations. Hence my attempt with this probability statistic.

It seems to me that my statistic is not very different from statistics like Somers' d or Kendall's tau, which I believe are also based on combinatorial probabilities. (I'm sure I'm not using the right terms here, but I hope I'm clear anyhow.)

Will this formulation of the problem maybe help?

A box has a number of compartments of different (unknown) sizes. A large number of black balls have been dropped randomly into the compartments, so that the number of black balls reflects the size of each compartment: big compartments have many black balls, small ones have few black balls. The distribution of black balls is therefore a good reference distribution.

In addition, a (large) number of green balls and red balls have been placed in the compartments.

Problem: Is the distribution of red balls biased towards green balls, away from green balls, or independent of the distribution of green balls?

To test this, we calculate the statistic G(X,Y), which is the probability of finding a random ball of colour X in the same compartment as a random ball of colour Y.

G(X,Y) = (number of X-Y pairs in same compartment) / (total number of X-Y pairs)

H0: G(red,green) = G(red, black)
H1: G(red,green) <> G(red,black)

What distribution does G follow? How do I calculate confidence intervals for the statistic G?
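Without knowing the analytic distribution of G, one generic way to get a feel for its variability is simulation. The sketch below does not identify which named statistic G is. It assumes one particular null model, chosen here purely for illustration: each red ball is dropped independently into a compartment with probability proportional to the black-ball count, while the green balls stay fixed. The function names, the number of simulations, and the seed are all arbitrary.

```python
import random


def g(x, y):
    """G(X,Y): share of all X-Y pairs that fall in the same compartment."""
    return sum(a * b for a, b in zip(x, y)) / (sum(x) * sum(y))


def simulate_null_g(n_red, green, black, n_sims=2000, seed=0):
    """Simulate G(red, green) under the ASSUMED null model in which red
    balls are placed at random in proportion to the black-ball counts."""
    rng = random.Random(seed)
    compartments = range(len(black))
    values = []
    for _ in range(n_sims):
        draws = rng.choices(compartments, weights=black, k=n_red)
        sim_red = [draws.count(c) for c in compartments]
        values.append(g(sim_red, green))
    return values


red = [3, 1, 0, 2]
green = [2, 3, 3, 0]
black = [50, 30, 20, 50]

observed = g(red, green)
sims = sorted(simulate_null_g(sum(red), green, black))

# Central 95% band of the simulated null values (not a confidence interval for G).
lower = sims[int(0.025 * len(sims))]
upper = sims[int(0.975 * len(sims))]
print("observed G(red, green):", round(observed, 4))
print("simulated 95% null band:", round(lower, 4), "to", round(upper, 4))
```

An observed value far outside the simulated band would suggest that red is not placed independently of green under this null model; a different null model would give a different band.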

I hope this explains my questions better, and I am very grateful if anyone can point me in the right direction.