Finding a set of 3/4/5/6 uncorrelated variables

#1
Hi all,
I have a set of 20 variables that measure different aspects of objects. All variables are real numbers. I want to investigate how much these variables differ from one another, and to identify the specific set of n variables that is least correlated with the remaining 20-n.
The problem with Spearman and similar measures is that they only compare pairs of variables. The problem with PCA is that the first component loads on 17 variables, so I cannot tell, for example, which set of 3 variables is most different from the remaining 17.

To better explain my problem, suppose I have the following 5 variables related to a car: price, HP, MPG, insurance cost, and the number of seconds for 0 to 60. How can I find the set of 3 variables least correlated with the remaining 2?
 
#5
Actually, that is not really what I am looking for. What I am looking for is the set of 3 variables that are most different from the remaining 17, then the set of 4 variables that are most different from the remaining 16, etc.
 

spunky

Can't make spagetti
#6
Actually, that is not really what I am looking for. What I am looking for is the set of 3 variables that are most different from the remaining 17, then the set of 4 variables that are most different from the remaining 16, etc.
You first need to define what "different" implies. Does "different" imply the variables have different means? Different variances? Different ranges?

Without a quantitative, measurable definition of "different", it will be hard to come up with a strategy.
 
#7
You first need to define what "different" implies. Does "different" imply the variables have different means? Different variances? Different ranges?

Without a quantitative, measurable definition of "different", it will be hard to come up with a strategy.
You are right. I do not have any specific measure in mind. Suppose you do not have the money to measure all 20 variables when observing a phenomenon, but only 3 of them: how would you choose those 3?
 

spunky

Can't make spagetti
#8
You are right. I do not have any specific measure in mind. Suppose you do not have the money to measure all 20 variables when observing a phenomenon, but only 3 of them: how would you choose those 3?
Depends on what you want to know, of course.

I am not being facetious, by the way. It seems to me your quandary has more to do with your research design than proper statistical approach. Design always dictates analysis.
 
#9
Depends on what you want to know, of course.

I am not being facetious, by the way. It seems to me your quandary has more to do with your research design than proper statistical approach. Design always dictates analysis.
Ok but you are not solving my clear example either :) I am flexible in what "different" means.
And, by the way, this data is mined and hence it is as is, I have not designed it.
 

spunky

Can't make spagetti
#10
Ok but you are not solving my clear example either :) I am flexible in what "different" means.
Of course not, because your example is not clear. Without a specific research hypothesis it's virtually impossible to help you decide what strategy you need. You need a specific hypothesis that you want to test, the hypothesis needs to be operationalizable (e.g. what does "different" mean?) and from there you can obtain an analysis strategy. It's part of the steps of the scientific method. You are leaving a lot of details out.

To make it more concrete, what you're asking would be conceptually equivalent to me as if I were to tell you: "I want to eat my favourite food, but I will not tell you what my favourite food is. Can you tell me what to eat?". Doesn't really give you many options as far as what to do, correct?

I honestly do not think you are very "flexible" with what "different" means. If you are, then here's an easy solution:

Define "different" as "difference in range" (i.e., max-min values in a variable). Find the range of each variable, order them from largest to smallest and there you go. The most "different" ones will be at the top of the list.
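
A minimal Python sketch of the range-based definition above. The `cars` data and the `rank_by_range` helper are hypothetical, invented for illustration; they are not taken from the poster's data set.

```python
def rank_by_range(columns):
    """Return (name, range) pairs sorted from largest to smallest range."""
    ranges = {name: max(values) - min(values) for name, values in columns.items()}
    return sorted(ranges.items(), key=lambda item: item[1], reverse=True)


# Hypothetical car data, made up for this example only.
cars = {
    "price": [18000, 25000, 42000, 31000],
    "hp": [120, 180, 300, 210],
    "mpg": [38, 30, 19, 26],
}

ranking = rank_by_range(cars)
for name, value_range in ranking:
    print(name, value_range)
```

Note that raw ranges depend on units, so a variable measured in dollars will tend to outrank one measured in miles per gallon.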

And, by the way, this data is mined and hence it is as is, I have not designed it.
That does not matter. Whether explicit or implicit, there's always a design. If this data came from, say, a telephone survey as opposed to an email survey, that's design for you. If it was collected with a computer as opposed to paper-and-pencil, that's design.

Quite frankly, there seem to be a few conceptual steps that you are either missing or not sharing which makes your question impossible to answer. Unless you truly are 'flexible' with what 'different' means, in which case I just provided you with a solution. ¯\_(ツ)_/¯
 
#11
Here is a clear example: I have a website selling cars, and I need to decide which variables to show for each car other than price, make and model; the problem is that I can show only 5 out of 20 variables for each car. For instance, since HP is "correlated" with engine size, a user would likely not need both variables.
So I guess by "different" I mean weakly correlated.
 
#12
For selling cars, the answer could be whichever 5 car features are most important to the specific audience you are marketing the cars to. After obtaining the preference hierarchy from market research on a representative sample, you want to know how to decide which 5 out of the 20 to use? One might go with the top 5 selected features.
 
#13
For selling cars, the answer would be whichever 5 car features are most important to the specific audience you are marketing the cars to. After obtaining the preference hierarchy from a representative sample (market research including website testing), you want to know how to decide which 5 out of the 20 to use? If it's really about car features, go with the 5 most-selected features.
I know, but suppose you cannot survey the audience and you need to apply an unsupervised approach...
 
#14
Are you looking for a tool to compare all possible models (restricted to 5 independent variables (IVs)) from a pool of 20 IVs, to obtain the 5-IV model with the optimum [your desired coefficient here]?
 
#15
Are you looking for a tool to compare all possible models (restricted to 5 independent variables (IVs)) from a pool of 20 IVs, to obtain the 5-IV model with the optimum [your desired coefficient here]?
I think I am looking for a technique that gives me the 5 variables, out of 20, that are the least correlated among themselves and/or the most correlated with the remaining 20-5.
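
One way to make this request concrete, as a sketch only: score every k-variable subset by its mean absolute pairwise Pearson correlation and keep the lowest-scoring subset. The scoring rule, the function names, and the toy `cars` data are assumptions made for illustration. The sketch covers only the "least correlated among themselves" half of the request, and it assumes no column is constant.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (neither may be constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def least_correlated_subset(columns, k):
    """Return the k names (k >= 2) with the lowest mean absolute pairwise correlation."""
    names = list(columns)
    # Precompute |r| once for every pair of variables.
    abs_corr = {
        frozenset((a, b)): abs(pearson(columns[a], columns[b]))
        for a, b in combinations(names, 2)
    }

    def score(subset):
        pairs = list(combinations(subset, 2))
        return sum(abs_corr[frozenset(p)] for p in pairs) / len(pairs)

    # Exhaustive search over all k-subsets.
    return min(combinations(names, k), key=score)


# Hypothetical data: hp and engine_cc move together, doors does not track hp.
cars = {
    "hp": [100, 200, 300, 400],
    "engine_cc": [1000, 2000, 3000, 4250],
    "doors": [4, 2, 2, 4],
}

best = least_correlated_subset(cars, 2)
print(best)
```

For 5 out of 20 variables there are 15,504 subsets, so an exhaustive search like this stays cheap once the pairwise correlations are precomputed.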
 
#17
You say least correlated among them. Do you mean 'least correlated with the response (dependent) variable'?
I do not have a dependent variable. The variables are unsupervised. Otherwise I would have used infogain or other feature selection techniques. Thanks!
 

spunky

Can't make spagetti
#19
Here is a clear example: I have a website selling cars, and I need to decide which variables to show for each car other than price, make and model; the problem is that I can show only 5 out of 20 variables for each car. For instance, since HP is "correlated" with engine size, a user would likely not need both variables.
So I guess by "different" I mean weakly correlated.
Ok. So... would calculating the correlation matrix of all the variables and sorting the correlations from top to bottom (by absolute value) answer this question? You would just have to look at the variables at the bottom of the list (i.e., those that are the least correlated) and use those.
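
A rough Python sketch of that suggestion: compute the Pearson correlation for every pair of variables and sort the pairs by absolute value. The toy `cars` data and the function names are hypothetical, and the sketch assumes no column is constant.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (neither may be constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranked_pairs(columns):
    """Return (name_a, name_b, r) tuples sorted by |r|, strongest first."""
    pairs = [
        (a, b, pearson(columns[a], columns[b]))
        for a, b in combinations(columns, 2)
    ]
    return sorted(pairs, key=lambda item: abs(item[2]), reverse=True)


# Hypothetical data, made up for this example only.
cars = {
    "hp": [100, 200, 300, 400],
    "engine_cc": [1000, 2000, 3000, 4250],
    "doors": [4, 2, 2, 4],
}

ranked = ranked_pairs(cars)
for name_a, name_b, r in ranked:
    print(f"{name_a:>10} {name_b:>10} {r:+.3f}")
```

The sorted list ranks pairs rather than sets, so choosing more than two variables from the bottom of it still needs some rule for combining pairs.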