# Finding Correlations

#### Ewa

##### New Member
I'm trying to find some clusters/segments or correlations in data from survey results. I've attached a simplified version of the data. Can anyone help me make sense of it? I would be most obliged. Thanks.

#### Attachments

• 4.1 KB Views: 7

#### obh

##### Active Member
Hi Ewa,

I'm not sure if this is what you are after.

Here is the correlation matrix (I removed two partial rows):

Y-Design X1-Food X2-menu X3-existing X4-Tech
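For anyone who wants to reproduce a matrix like this outside a stats package, here is a minimal Python sketch. The column names come from the header above, but the six rows of answers are made up for illustration; they are not the attached survey data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# HYPOTHETICAL answers, one list per survey column (not the real data).
columns = {
    "Y-Design":    [1, 4, 3, 5, 2, 4],
    "X1-Food":     [1, 4, 4, 5, 2, 3],
    "X2-menu":     [1, 4, 5, 5, 1, 3],
    "X3-existing": [2, 4, 4, 3, 3, 4],
    "X4-Tech":     [8, 4, 5, 3, 9, 6],
}

# Correlation of every column with every other column.
matrix = {a: {b: pearson(columns[a], columns[b]) for b in columns}
          for a in columns}

for name in columns:
    row = " ".join(f"{matrix[name][other]:6.2f}" for other in columns)
    print(f"{name:<12}{row}")
```

Each cell is the correlation between the row's column and the column in that position, so the diagonal is always 1 and the matrix is symmetric.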

#### Ewa

##### New Member
Thank you for replying to my question, which clearly was not stated very well :S After watching a number of YouTube videos, I learned that the matrix shows a negative correlation between design abilities, food and menu creation, and tech savviness. What does that mean exactly? What else can I discern from this? Is there a way I can segment this population based on this information?

#### obh

##### Active Member
Hi Ewa,

First, not all the correlation values in the matrix are necessarily significant; that needs to be tested, even though the sample size is large.
For example, -0.004 may not be significant (I didn't test it).
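A rough way to check this is the Fisher z-transformation, sketched below with only the Python standard library. The sample size of 300 is an assumption based on "over 300" responses mentioned in this thread, the helper name `correlation_p_value` is made up, and the approximation assumes roughly normal data, which survey scales only loosely satisfy.

```python
import math
from statistics import NormalDist

def correlation_p_value(r, n):
    """Approximate two-sided p-value for H0: the true correlation is zero.

    Uses the Fisher z-transformation, a large-sample approximation.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))

n = 300  # ASSUMPTION: approximate number of survey responses

p_tiny = correlation_p_value(-0.004, n)   # the value quoted above
p_larger = correlation_p_value(0.2, n)    # hypothetical value for contrast

print(f"r = -0.004: p = {p_tiny:.3f}")
print(f"r =  0.200: p = {p_larger:.5f}")
```

Under these assumptions the p-value for -0.004 is far above 0.05, so that correlation would not be distinguishable from zero, while a correlation of 0.2 at the same sample size would be.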

The usual approach is the opposite: first you decide what you want to achieve, and then you choose the statistics.

What are you looking for?

#### Ewa

##### New Member
I'm trying to find a way to segment the population based on the answers in the survey, so I can understand people's behaviors, their needs and frustrations in order to build personas. But I have no inkling about how to parse the data I've collected. If it were only a few responses, I could probably wing it, but with over 300 I'm at a loss. I've asked all the right questions, I just need to figure out how to turn this data into meaningful clusters of people who interact with the system in similar ways. Does that make sense?

#### obh

##### Active Member
So you want to group together (1,1,1,2,8) with (1,2,1,3,9) and (4,4,4,4,4) with (3,4,5,4,5)?
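To make "group together" concrete: one common way to measure how alike two respondents are is the straight-line (Euclidean) distance between their answer vectors. This is only a sketch of that idea; Euclidean distance is an assumption here, and other distance measures may suit survey scales better.

```python
import math

# The four example respondents from the question above.
a = (1, 1, 1, 2, 8)
b = (1, 2, 1, 3, 9)
c = (4, 4, 4, 4, 4)
d = (3, 4, 5, 4, 5)

# Small distance = similar answers, large distance = different answers.
dist_ab = math.dist(a, b)
dist_cd = math.dist(c, d)
dist_ac = math.dist(a, c)

print(f"a vs b: {dist_ab:.2f}")
print(f"c vs d: {dist_cd:.2f}")
print(f"a vs c: {dist_ac:.2f}")
```

The first two pairs are much closer to each other than `a` is to `c`, which is why they would end up in the same group.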

#### Ewa

##### New Member
For a second there I thought I understood: I figured that the numbers in parentheses represented values in my columns. But the 8 and 9 threw me off. Am I looking at two segments? One that represents people who hardly ever create or edit menus and are not very tech or design savvy and another that represents people who create and edit menus often and are quite tech and design savvy?

#### noetsi

##### Fortran must die
It's probably not a very helpful answer, but I would run a factor analysis. Structural equation models are even better, but they take most people years to learn.

I would also review the literature on this topic; someone may have already made suggestions. Generally, you have a theory before you run the data.

#### Miner

##### TS Contributor
Factor analysis will help you condense the number of questions down into a smaller number of factors underlying the questions. Your data appear to condense to 2 factors of importance.

You can also go directly into Cluster analysis. I recommend Cluster Observations. This should accomplish what you requested. After a cursory look at your data, you should be able to easily condense these into 13 clusters of differing customer types.

#### Ewa

##### New Member
Thanks for all your help! I'm afraid I need more tangible answers. You say my data appear "to condense to 2 factors of importance"? What are they? I should be looking at 13 clusters? How did you get 13? Is it possible to condense even further? How do I perform cluster analysis? Do I need specialized tools to do it?

Are there any resources available out there... written in a language I can understand? Tutorials? Anything? I would also be willing to provide compensation for someone to teach me.

#### Miner

##### TS Contributor
After running a factor analysis, the 2 questions (tech savvy and rating design) consolidate to a single factor, the two questions (create new food and create new menus) consolidate to a second factor, and the question (edit existing) falls out as not significant.
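As a rough illustration of how questions can collapse into fewer dimensions, the sketch below looks at the largest eigenvalues of a correlation matrix. This is a principal-component style check, a related but simpler technique than a full factor analysis, and the correlation values are invented to mimic the pattern described above; they are not the thread's actual numbers.

```python
import math

# HYPOTHETICAL correlations, column order: Design, Food, Menu, Existing, Tech.
R = [
    [1.00, 0.05, 0.05, 0.02, 0.60],
    [0.05, 1.00, 0.70, 0.03, 0.05],
    [0.05, 0.70, 1.00, 0.03, 0.05],
    [0.02, 0.03, 0.03, 1.00, 0.02],
    [0.60, 0.05, 0.05, 0.02, 1.00],
]

def matvec(m, v):
    """Multiply matrix m by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def top_eigenpair(m, iterations=1000):
    """Power iteration: largest eigenvalue and its unit eigenvector."""
    v = [float(i + 1) for i in range(len(m))]
    for _ in range(iterations):
        w = matvec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    value = sum(a * b for a, b in zip(v, matvec(m, v)))
    return value, v

def deflate(m, value, v):
    """Remove an eigenpair so the next-largest one can be found."""
    n = len(m)
    return [[m[i][j] - value * v[i] * v[j] for j in range(n)]
            for i in range(n)]

eigenvalues = []
M = [row[:] for row in R]
for _ in range(3):
    value, vector = top_eigenpair(M)
    eigenvalues.append(value)
    M = deflate(M, value, vector)

print([round(v, 2) for v in eigenvalues])
```

With this made-up matrix, two eigenvalues stand clearly above 1 and the third sits near 1, which is the kind of pattern that suggests two underlying dimensions plus one question that stands alone.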

Using cluster observations and plotting the number of clusters vs. similarity, there is a natural break at 13 clusters. Selecting the number of clusters is a combination of judgment and domain knowledge of what the theoretical number of clusters SHOULD be. If there is no theory, the concept of diminishing marginal returns comes into play. The text file shows the cluster number by response ID.
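For readers without Minitab, the general idea can be sketched in plain Python. This is a minimal agglomerative (bottom-up) clustering assuming Euclidean distance and complete linkage, with the "natural break" approximated by the largest jump in merge distance. Those choices, the function names, and the four example responses are assumptions for illustration and may not match the settings behind the 13-cluster result.

```python
import math

def agglomerate(points):
    """Complete-linkage clustering; returns (merge_distance, clusters) per step."""
    clusters = [[i] for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        best = None
        # Find the two clusters whose farthest members are closest together.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
        history.append((d, [sorted(c) for c in clusters]))
    return history

def cut_at_largest_jump(history):
    """Keep the clustering just before the biggest jump in merge distance."""
    distances = [d for d, _ in history]
    jumps = [distances[k + 1] - distances[k] for k in range(len(distances) - 1)]
    return history[jumps.index(max(jumps))][1]

# HYPOTHETICAL responses, indexed 0..3 (stand-ins for response IDs).
responses = [(1, 1, 1, 2, 8), (1, 2, 1, 3, 9), (4, 4, 4, 4, 4), (3, 4, 5, 4, 5)]

history = agglomerate(responses)
groups = cut_at_largest_jump(history)
for number, members in enumerate(groups, start=1):
    print(f"cluster {number}: responses {members}")
```

The printed list has the same shape as a "cluster number by response ID" table. The pairwise search is slow for large data sets but should be workable for a few hundred responses.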

#### Attachments

• 7.1 KB Views: 1

#### Ewa

##### New Member
So the way I should be reading this is that responses 7, 19, 31, 35, 56, 89, 118, 143, 171, 180, 268, 269, 276, and 342 all belong to the same cluster.... and so on and so forth.... Is that correct? You said there were 13 clusters but the spreadsheet shows 18. Does that mean that the similarity between the last 5 is negligible?


#### Miner

##### TS Contributor
Sorry. I did not attach the correct (latest) file. Here it is. And your interpretation of the responses vs cluster membership is correct.

#### Attachments

• 7 KB Views: 1

#### Ewa

##### New Member
Thank you!!!!! This is soooo helpful I could kiss you! What would I need in order to spit out this spreadsheet, dendrogram, and scatterplot (besides an advanced degree in mathematics)? Can I do it in Excel? Do I need other software? When I tried a scatterplot in Excel, it only graphed one column of data. Can someone point me in the direction of Cluster Analysis for Dummies or just provide step-by-step instructions on how to generate the lovely spreadsheet above? And again, I am super grateful! Thank you for tolerating my ignorance!

#### Miner

##### TS Contributor
I performed the analysis in Minitab. Excel does not have the capability to perform a cluster analysis. You can generate the scatter plot in Excel using a scatter (X-Y) plot (see attached), but you would need the output from other software.