Help with non-linear multidimensional scaling?

#1
Hi!
I'm analysing multiple versions of old texts, tracing their relationship to each other.

How I'm doing this so far: I created a spreadsheet with one row per version and one column per letter.
There's a column for *every* letter, and I put either 1 or 0 in each cell to show whether that letter is present in the document.

For example, if text A has 'old', text B has 'olde', text C has 'awld', it would look like this:
Screenshot 2020-05-17 at 20.29.58.png

I invented this method, and then what I now know is called a 'distance matrix', before I discovered that this approach is already in use.
I'm wondering if the way I do this is good or if there's a more efficient way. I used these simple numbers because they were within my capacity and easy to work with, since I also had to invent a way to create those distance matrices.
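For readers following along, here is a minimal sketch of how a distance matrix can be built from 1/0 rows like these by counting the columns in which two versions differ (often called the Hamming distance). The column layout below is an assumption made up for the 'old' / 'olde' / 'awld' example, since the screenshot is not reproduced here, and the function names are hypothetical.

```python
# Hypothetical presence/absence table for the 'old' / 'olde' / 'awld' example.
# The column layout is an assumption; the original spreadsheet may differ.
columns = ["o", "a", "w", "l", "d", "e"]
rows = {
    "A": [1, 0, 0, 1, 1, 0],  # old
    "B": [1, 0, 0, 1, 1, 1],  # olde
    "C": [0, 1, 1, 1, 1, 0],  # awld
}


def hamming(u, v):
    """Count the columns in which two 0/1 rows differ."""
    if len(u) != len(v):
        raise ValueError("rows must have the same number of columns")
    return sum(1 for a, b in zip(u, v) if a != b)


def distance_matrix(table):
    """Return a nested dict: dist[p][q] is the distance between rows p and q."""
    names = list(table)
    return {p: {q: hamming(table[p], table[q]) for q in names} for p in names}


dist = distance_matrix(rows)
for name, row in dist.items():
    print(name, row)
```

In this toy table A and B differ in one column, while C differs from both in three or four, so A and B end up closest. Other distance measures exist; this is only the simplest one for 1/0 data.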

Now I've found the 'Orange' graphing app (orange.biolab.si), which can actually generate distance matrices automatically from data of the above type - great! Saves me time.
But I'm wondering if it also works directly with letters. For example, would it work if I just made the spreadsheet like this?
Screenshot 2020-05-17 at 20.46.45.png

And would it give identical results to the above? If so, that would be convenient for keeping track of things, since I have more than 200 columns and about 60 rows. Also, perhaps I should mention that the language I'm working in does not use the English alphabet - so if it does work with letters, does it matter what kind of glyphs the input is? Will the distance matrix simply treat a matching glyph as a match and a non-match or absence as a non-match, thus staying in the binary method of recognition, if you know what I mean?
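As a rough illustration of why the two layouts need not give identical numbers, the sketch below compares a one-letter-per-cell table with its 1/0 expansion. It says nothing about how Orange itself treats text columns; the slot layout, the use of an empty string for "no letter here", and the function names are all assumptions made for the example.

```python
# Hypothetical letters-in-cells layout: one column per aligned slot,
# with "" meaning no letter in that slot.
letters = {
    "A": ["o", "", "l", "d", ""],   # old
    "B": ["o", "", "l", "d", "e"],  # olde
    "C": ["a", "w", "l", "d", ""],  # awld
}


def mismatches(u, v):
    """Count the columns in which two rows hold different values."""
    return sum(1 for a, b in zip(u, v) if a != b)


def one_hot(table):
    """Expand each slot into one 0/1 column per distinct non-empty letter."""
    n_slots = len(next(iter(table.values())))
    cols = []
    for i in range(n_slots):
        for letter in sorted({row[i] for row in table.values() if row[i]}):
            cols.append((i, letter))
    return {
        name: [1 if row[i] == letter else 0 for i, letter in cols]
        for name, row in table.items()
    }


binary = one_hot(letters)
pairs = [("A", "B"), ("A", "C"), ("B", "C")]
letter_d = {(p, q): mismatches(letters[p], letters[q]) for p, q in pairs}
binary_d = {(p, q): mismatches(binary[p], binary[q]) for p, q in pairs}
print("letters:", letter_d)
print("binary: ", binary_d)
```

In the letter layout a substitution ('o' versus 'a') counts once, while in the 1/0 layout it flips two columns, so the distances differ even though the ordering of the pairs happens to stay the same here. Python compares strings by Unicode code point, so the comparison itself works the same way for non-English alphabets.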

My next question would be, does anyone feel that there's a better way to do this task? I feed this into both non-linear multidimensional scaling graphs and also clustering plots. This helps me to trace the history of the variations.
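To make the clustering step a little less of a black box, here is a toy sketch of single-linkage agglomerative clustering working directly from a distance matrix. It is a simplified stand-in, not necessarily the method or settings Orange uses, and the distances are the made-up values for the three-text example rather than real data.

```python
def single_linkage(dist, names):
    """Repeatedly merge the two closest clusters; return the merge history.

    The distance between two clusters is taken as the smallest distance
    between any member of one and any member of the other.
    """
    clusters = [frozenset([n]) for n in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical distances for the 'old' / 'olde' / 'awld' example.
dist = {
    "A": {"A": 0, "B": 1, "C": 3},
    "B": {"A": 1, "B": 0, "C": 4},
    "C": {"A": 3, "B": 4, "C": 0},
}
merges = single_linkage(dist, ["A", "B", "C"])
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

The merge history is what a dendrogram draws: versions that join early at a small distance are the ones that share the most readings.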

Many thanks! Also to mention, I have no training in statistics and haven't done maths since I was 15 so it might be that I will find some difficulty understanding some specialist terms but I will greatly appreciate your help!
 
#2
I see that 28 people have read this first post of mine, but no responses. Did I post this in the wrong sub-forum? Help for a newbie much appreciated!

By the way, I ran a test and found that my binary method cannot be replaced with glyphs. This seems to be due to glyphs which represent more than one letter: sometimes 2 or more letters are needed in the input to trigger a single glyph, so perhaps the number of letters typed differing from the number of glyphs shown causes the results to change.

So it seems I have to stick to my binary method.
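One possible contributor, offered as an assumption rather than a diagnosis of the test above: software compares the stored characters, not the glyph drawn on screen, so two cells that look the same can still count as different if they were typed as different character sequences. The sketch uses a Latin 'é' as a stand-in because the actual script is not stated, and `normalise` is a hypothetical helper name.

```python
import unicodedata

composed = "\u00e9"     # 'é' stored as a single code point
decomposed = "e\u0301"  # 'e' followed by a combining acute accent

# Both display as the same glyph, but the stored sequences differ.
same_before = composed == decomposed
lengths = (len(composed), len(decomposed))


def normalise(cell):
    """Bring a cell's text to one canonical Unicode form (NFC)."""
    return unicodedata.normalize("NFC", cell)


same_after = normalise(composed) == normalise(decomposed)
print(same_before, lengths, same_after)
```

If something like this is happening, normalising every cell the same way before comparing may help; whether it applies depends on the script and on how the text was entered.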

Still, I would love any advice from anyone who has any experience in using MDS or clustering to analyse language, or DNA or protein sequences, which I think should follow the same principle. Or if anyone would be willing to chat for 5 minutes or so online? A short conversation with someone with experience should give me a great leap forward with this work!

Thanks!