What software can do this?

#1
I'm working on a project involving some minor statistics that requires software to complete. However, I'm unsure which software I should use for it.

I'll try to explain what I want to do. I have 500+ data items, each with roughly 100 labeled columns of data.

For example, data item #1, named Jack, might have these columns of data:
Height (in.): 72
Age: 43
etc...

While data item #2, named Jill, might have these columns of data:
Height (in.): 60
Age: 68
etc...

I would like to set up software to compare each column in a data item to the same column of every other data item and look for similarities. For instance, if 80% of the data items had an Age value greater than their Height value, I'd want to know that. I'd also want to know if 90% of the Age values were over 55. And finally, I'd want to know if the standard deviation of the Height column across all 500 data items was very small. Basically, I'm looking for software that can be set up to run scans for these types of situations and others.

Can anyone recommend a piece of software that might suit these purposes? I would be very grateful!
 
#2
It appears that you are interested in associations between variables (columns, in this case), as opposed to predicting an outcome. If you know in advance which associations you are interested in, you can just make the calculations manually. However, you have too many variables for this. One solution to this problem is to use a data mining technique that finds "Association Rules". These rules will look something like this:

Age > 50, Height < 6.0, Weight > 200, Sex = M, 75%.

Now to explain! This rule states that in 75% of your data items, their age is greater than 50, height is less than 6.0, weight is greater than 200, and sex is male.
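To make the 75% figure concrete, here is a minimal Python sketch that computes the proportion of data items satisfying the rule above. The four records, their field names, and their values are made up purely for illustration; they are not real data.

```python
# Hypothetical records: field names and values are invented for illustration.
records = [
    {"age": 62, "height": 5.8, "weight": 210, "sex": "M"},
    {"age": 55, "height": 5.9, "weight": 225, "sex": "M"},
    {"age": 71, "height": 5.7, "weight": 240, "sex": "M"},
    {"age": 34, "height": 6.1, "weight": 180, "sex": "F"},
]


def rule_holds(r):
    """True when a record meets every condition of the example rule."""
    return (
        r["age"] > 50
        and r["height"] < 6.0
        and r["weight"] > 200
        and r["sex"] == "M"
    )


def support(rows, predicate):
    """Proportion of rows for which the predicate is true."""
    return sum(1 for r in rows if predicate(r)) / len(rows)


rule_support = support(records, rule_holds)
print(f"{rule_support:.0%}")
```

Three of the four made-up records satisfy all four conditions, so the rule applies to 75% of them.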

A process that generates these association rules is called an "association rule miner" or "association rule finding algorithm". These algorithms will generate rules like the one above, and you specify the minimum proportion of the data items that a rule must apply to. Notice also that most algorithms will find rules that do not contain all of the variables. This makes sense, since very complex rules will (probably) apply to very few data items.
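As a rough illustration of what such a miner does, the sketch below brute-forces every combination of conditions and keeps those that meet a minimum proportion. It is not what Weka or RWeka run internally; real miners such as Apriori prune the search far more cleverly. The function name, the `max_size` parameter, and the data are assumptions for the example, and the columns are assumed to be already discretized into conditions like "age>50".

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=4):
    """Return {conditions: support} for every combination meeting min_support."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        found_any = False
        for combo in combinations(items, size):
            # Count the data items that satisfy every condition in combo.
            count = sum(1 for t in transactions if t.issuperset(combo))
            if count / n >= min_support:
                result[combo] = count / n
                found_any = True
        if not found_any:
            break  # if no combination of this size qualifies, no larger one can
    return result


# Hypothetical, already-discretized data items (made up for illustration).
transactions = [
    {"age>50", "height<6.0", "weight>200", "sex=M"},
    {"age>50", "height<6.0", "weight>200", "sex=M"},
    {"age>50", "height<6.0", "weight>200", "sex=M"},
    {"age>50", "height<6.0", "sex=F"},
]

rules = frequent_itemsets(transactions, min_support=0.75)
for conditions, share in rules.items():
    print(", ".join(conditions), f"{share:.0%}")
```

Only the threshold (0.75 here) is supplied by the user; which conditions end up in a rule falls out of the search. The cut-off values themselves (50, 6.0, 200) come from how the columns were discretized beforehand.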

So if this is the type of information you are interested in, there are computer programs that find association rules. One program that I have used for a similar task is the open-source data analysis environment R. R cannot find association rules on its own; it needs an add-on package such as RWeka. However, R is a command-line program and not too simple for beginners. Another open-source program that is commonly used for data mining is Weka. It has a nice GUI (graphical user interface) and is much easier for beginners. In either case, you may have to read a little from the manuals to make it work. The good part: both of these packages are open-source and can be downloaded, used, and redistributed completely free.

~Matt
 
#3
Ahh, yes, I have actually used (or tried to use) R before for another project, but as you said, it was very difficult to understand, and there didn't seem to be a great deal of documentation on its many uses.

What you described does indeed sound like the type of software I am looking for. I had a little trouble describing my intentions clearly.

Thanks for the reply. I'll definitely give Weka a try -- if it is slightly easier to work with and understand than R, I should be able to figure out how to use it correctly.
 
#4
I am actually interested in doing the exact same thing. I'm assuming that I would not have to specify the numbers 50, 6.0, and 200, but I could specify 75% and the program would find the commonalities?

Can SPSS do this type of analysis?

Apparently there are GUIs available with R as well, so it might not be more difficult than Weka if R can do it.
 

zeloc

#5
I'm going to go ahead and answer my own question.

It seems that association rules are what I am interested in, but they are mainly used in data mining with non-numeric data, whereas the data in my example and in the one above are numeric. It is possible to discretize the data in various ways, but I think the simplest way to get the information is to use data analysis instead of data mining for this purpose. I'm going to read about the various data analysis techniques, which are more geared toward numeric data.
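For what it's worth, the three checks from the first post can be done directly on the numeric columns without any rule mining. Below is a minimal Python sketch using only the standard library. Jack and Jill are taken from the first post; the other three rows, the field names, and the `share` helper are made up for illustration.

```python
from statistics import pstdev

# Jack and Jill come from the first post; the remaining rows are hypothetical.
rows = [
    {"name": "Jack", "height_in": 72, "age": 43},
    {"name": "Jill", "height_in": 60, "age": 68},
    {"name": "Row3", "height_in": 66, "age": 70},
    {"name": "Row4", "height_in": 64, "age": 81},
    {"name": "Row5", "height_in": 68, "age": 59},
]


def share(data, predicate):
    """Proportion of rows for which the predicate is true."""
    return sum(1 for r in data if predicate(r)) / len(data)


# Proportion of rows where Age is greater than Height.
age_over_height = share(rows, lambda r: r["age"] > r["height_in"])

# Proportion of rows where Age is over 55.
age_over_55 = share(rows, lambda r: r["age"] > 55)

# Population standard deviation of the Height column.
height_spread = pstdev(r["height_in"] for r in rows)

print(f"Age > Height: {age_over_height:.0%}")
print(f"Age > 55: {age_over_55:.0%}")
print(f"Height standard deviation: {height_spread:.1f} in.")
```

With 100 columns, the same `share` helper could be looped over every pair of columns, reporting only the comparisons that exceed a chosen threshold such as 80%.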