multidimensional scaling (with clusters)

I am writing my bachelor's thesis, which is based on a study by Buma, Bakker and Oudejans (2014). Unfortunately, I have just realised that I misinterpreted the way they had done their analysis, so I am back to square one. Your help would be most appreciated!

I have 8 participants who together formed 51 statements about a situation (in my case, musical performance under pressure). To learn how they see this situation, I asked them to sort these 51 statements into meaningful clusters. They were allowed to choose the number of clusters (at least 2), and they had to use every statement exactly once.

Now I have 8 different categorisations of the same 51 statements, ranging from two to five categories per participant. I would like to use multidimensional scaling (the analysis used in the original study; would you recommend a different one?) to see which statements appear together in the same category more frequently. In the original study they got 6 categories, and the results looked like this:

View attachment 6039

How do I compute a proximity matrix out of my data? The original study used it to perform multidimensional scaling and the coordinates obtained from it were used in Ward's hierarchical cluster analysis. How and why?
I thank you for your help in advance!

Best regards,


*Buma, L. A., Bakker, F. C. and Oudejans, R. D. (2014). Exploring the thoughts and focus of attention of elite musicians under pressure. Psychology of music, 20, 1-14. doi: 10.1177/0305735613517285
My data is not numeric. It looks something like this:
Participant 1: Category 1: Thoughts about the audience (I think about the size of the audience; I think about whether they like the way I play; etc.)
Category 2: Thoughts about the music (I think about the next note; I think about the feelings I would like to express; etc.)
Category 3: ...
Participant 2: Category 1: Bodily feelings (I feel my heart pumping; I want to stop trembling; etc.)
Category 2: Thoughts about emotions (I think about the feelings I would like to express; I am happy to perform; etc.)
Participant 3: Category 1: My instrument (...)
Category 2: ...
Participant 8: ... Category 5:...

So I have 8 participants, and they have different categorizations of the same 51 statements. Some participants have only 2 categories, some 3, some 4 and some 5. Each participant named the categories he/she had created, so each category has a different name.


TS Contributor
Do you think you can use MDS without numeric data?
Usually, you start with a table where observations (e.g., individuals) are cross-tabulated against some categories. Then you can apply a dimension reduction and/or a clustering technique.

If you are not clear about the basis of the technique you wish to use, and if you do not provide people here with a clear description of your data and/or goals, it would be very hard to provide sound help.
That is exactly my problem. The data is nominal, but the authors (Buma, Bakker and Oudejans, 2014) converted it to numbers and I do not understand how. First they say the participants had to sort the statements into meaningful categories, and then they jump straight to processing those results into a proximity matrix. I do not understand the step in between: how did they turn the data into numbers so that they could process it further?

They explained the computation of maps like this: "Results from the sorting task were first processed into a proximity matrix, which represents how often statements were clustered with each other by the participants. Multidimensional scaling uses this matrix to locate each statement as a point on a map (that is, the point map): statements located close to each other on the map have been clustered more frequently by participants (Borg & Groenen, 2005; Davison, 1983; Trochim, 1989a). The coordinates of each statement then served as input for Ward’s hierarchical cluster analysis which grouped the statements into different clusters based on these coordinates. [...] The end result was the cluster rating map" (Buma, Bakker and Oudejans, 2014).
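One common reading of "how often statements were clustered with each other" is a co-occurrence count: for every pair of statements, count the participants who placed both in the same category. The following is only a minimal sketch of that idea, not necessarily the exact procedure of the original study. The function name `cooccurrence_matrix` and the toy data (3 participants, 5 statements indexed 0 to 4) are hypothetical.

```python
import numpy as np

def cooccurrence_matrix(sortings, n_statements):
    """Count, for each pair of statements, how many participants
    placed both statements in the same category."""
    counts = np.zeros((n_statements, n_statements), dtype=int)
    for piles in sortings:          # one participant's categorisation
        for pile in piles:          # one category of that participant
            idx = np.array(pile)
            # every pair inside this category co-occurs once more
            counts[np.ix_(idx, idx)] += 1
    return counts

# Hypothetical toy data: each inner list is one category (statement indices)
toy_sortings = [
    [[0, 1], [2, 3, 4]],        # participant 1: two categories
    [[0, 1, 2], [3, 4]],        # participant 2: two categories
    [[0], [1, 2], [3, 4]],      # participant 3: three categories
]

similarity = cooccurrence_matrix(toy_sortings, n_statements=5)
print(similarity)
```

The category names and the number of categories per participant play no role here; only the fact that two statements share a category matters, which is why differently named categories across participants are not an obstacle.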

I imagine the 8 participants must be on the y-axis and the 51 statements on the x-axis. How should I code the categories numerically so that I can see which statements are more strongly associated? Let's say my first and second participants have 3 categories each, the third participant has two categories, and so on. Is it OK if I number the first participant's categories 1, 2 and 3, the second participant's 4, 5 and 6, the third participant's 7 and 8, and so on? (I cannot number them 1, 2, 3 for the first participant and then again 1, 2, 3 for the second one, because they do not have the same categories.)

I made an example below. Is this even useful? It is very difficult to explain ... I am sorry, but thank you so much for helping me.

View attachment 6040


TS Contributor
Why don't you cross-tabulate statements vs categories? In other words, you should build a table with statements in rows and categories in columns (or the other way round, it does not matter), with each cell reporting the frequency with which each statement was 'put' into each category. Then you should be able to analyse the data to see how statements cluster together, using MDS or other techniques. I could elaborate more on the latter, but it is better to know first whether the above makes sense to you.
It does make sense. The only problem is that all the frequencies are 1, because each participant has his own unique categories. If I cross-tabulate statements and categories, I get what is shown in the image below. 1, 2, 3...51 are statements; A, B, C...AB are categories. Black vertical lines separate different participants (participant 1 created 5 categories, participant 2 created 4, and so on). Is this tabulation useful? You can still see that, for example, statements 1, 2, 3 and 6 frequently appear in a category together, so I suppose you can calculate their correlation? But I am not sure how ...

View attachment 6041


TS Contributor
The situation is not so clear to me either.
I understand that you have a number of statements (rows of your table), and then you have a number of something I do not understand (columns of your table; I will call them "things"). The latter have been grouped into 8 groups, each one comprising a variable number of "things".

Now, given that the situation is not altogether clear, to the best of my understanding you are faced with different choices:
a) assessing how statements relate to the "things" ----- > this implies using MDS (or Correspondence Analysis, for example) to look for clusters among statements or among "things". In this case, there is no harm in having presence/absence data (i.e., 0s/1s)
b) assessing how statements relate to the 8 groups --------> this again implies using MDS (or CA) to look for clusters. In this case you have to add up the columns ("things") belonging to each of the 8 groups, in order to eventually get a table of size 51x8 (i.e., statements by groups).

The "things" are categories of statements, created by the 8 participants. I am sorry, it is not an easy one to explain :S
Yes, what I am looking for is your option a), I believe. I need to find clusters among statements according to how many times they were put into the same category, a.k.a. "thing". So statement 1 and statement 2 would be much "closer" (because they were put into the same category / thing 8 times) than statement 1 and statement 20 (which were put into the same category / thing only 5 times). I need to make a proximity matrix, if I understand this correctly.
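To make the "closer" idea concrete: a co-occurrence count is a similarity, while MDS expects dissimilarities. One simple conversion (an assumption here, not something stated in the original study) is the number of participants minus the co-occurrence count. The sketch below then applies classical (Torgerson) MDS with plain NumPy; SPSS procedures such as PROXSCAL or ALSCAL use other algorithms and may give different maps. The function name `classical_mds` and the toy matrix (3 participants, 5 statements) are hypothetical.

```python
import numpy as np

def classical_mds(dissimilarity, n_dims=2):
    """Classical (Torgerson) MDS: embed points so that Euclidean
    distances approximate the given dissimilarities."""
    d2 = np.asarray(dissimilarity, dtype=float) ** 2
    n = d2.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ d2 @ centering        # double-centred matrix
    eigvals, eigvecs = np.linalg.eigh(b)         # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_dims]   # keep the largest ones
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)

# Hypothetical co-occurrence counts for 5 statements sorted by 3 participants
n_participants = 3
similarity = np.array([
    [3, 2, 1, 0, 0],
    [2, 3, 2, 0, 0],
    [1, 2, 3, 1, 1],
    [0, 0, 1, 3, 3],
    [0, 0, 1, 3, 3],
])

# Assumed conversion: pairs that are always sorted together get distance 0
dissimilarity = n_participants - similarity
coords = classical_mds(dissimilarity, n_dims=2)
print(coords)
```

Each row of `coords` is the position of one statement on the point map; statements that were always sorted together (the last two in this toy example) land on the same point.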
Now I am not sure how to put this data into SPSS ... My first variable is the statement, and then I would have another 28 variables, one for each category / thing. Is that correct?
I think I figured it out! I would just need somebody to take a look at the procedure and tell me whether it is correct. I would appreciate it very much, because I am using this analysis for the first time.
The clusters I ended up with after Ward's hierarchical cluster analysis are very meaningful, but I have no other way of knowing whether every step was done correctly.
My question was: How often did the participants put certain statements in the same category? Based on this association between statements, how many meaningful clusters can I create?
The procedure was as follows:

1. I inserted statements (n=51) as cases, and categories (n=28) as variables. Number 1 means a certain statement is present in a certain category and 0 means its absence.
2. I selected hierarchical cluster analysis (cluster by cases);
range of solutions: 5 (because that is the highest number of categories created by a participant);
method: Ward's,
measure: binary (squared Euclidean distance)
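The steps above can be sketched outside SPSS as a rough cross-check. With a 0/1 statement-by-category table in which every participant uses each statement exactly once, the squared Euclidean distance between two statements equals 2 × (number of participants who put them in different categories), so it carries the same information as the co-occurrence counts. The sketch below verifies this on hypothetical toy data (3 participants, 6 statements) and then runs Ward's method with SciPy. Note that this clusters the indicator table directly, as in the procedure described here, whereas the original study clustered MDS coordinates; SciPy's Ward implementation may also differ in detail from the SPSS one.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy sortings: each inner list is one category (statement indices)
toy_sortings = [
    [[0, 1, 2], [3, 4, 5]],
    [[0, 1], [2], [3, 4, 5]],
    [[0, 1, 2], [3, 4], [5]],
]
n_statements = 6

# Step 1: statements as rows (cases), every participant's categories as columns
columns = []
for piles in toy_sortings:
    for pile in piles:
        col = np.zeros(n_statements, dtype=int)
        col[pile] = 1               # 1 = statement is in this category
        columns.append(col)
indicator = np.column_stack(columns)    # shape (6 statements, 8 categories)

# Squared Euclidean distance between every pair of rows
diff = indicator[:, None, :] - indicator[None, :, :]
sq_dist = (diff ** 2).sum(axis=2)

# Same information as the co-occurrence counts
cooccurrence = indicator @ indicator.T
assert np.array_equal(sq_dist, 2 * (len(toy_sortings) - cooccurrence))

# Step 2: Ward's hierarchical clustering of the statements, cut into 2 clusters
tree = linkage(indicator.astype(float), method="ward")
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)
```

The cut into 2 clusters is only for the toy data; the choice of 5 clusters for the real data remains a judgment call that the dendrogram can help with.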

View attachment 6064

View attachment 6065

As I have said, the clusters are meaningful (the two at the top of the dendrogram a bit less so, but still).

Is this a correct procedure? Should I check for something else? Have I skipped something?

Thank you for your help in advance!