Hi!

I'm doing some analysis of multiple versions of old texts, tracing their relationship to each other.

How I'm doing this so far: I created a spreadsheet with each version as a row and each letter as a column.

There's a column for *every* letter, and I put either 1 or 0 in each cell to show whether that letter is present in the document.

For example, if text A has 'old', text B has 'olde', text C has 'awld', it would look like this:
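A minimal sketch of one way to read that layout, assuming one column per distinct letter (position in the word ignored) and columns ordered by first appearance. The names `texts`, `columns` and `table` are made up for illustration only:

```python
# Hypothetical illustration of the 1/0 table for the three example words.
texts = {"A": "old", "B": "olde", "C": "awld"}

# Assumption: one column per distinct letter, in order of first appearance.
columns = []
for word in texts.values():
    for letter in word:
        if letter not in columns:
            columns.append(letter)

# 1 if the letter occurs in that version, 0 if it does not.
table = {
    name: [1 if letter in word else 0 for letter in columns]
    for name, word in texts.items()
}

print("  " + " ".join(columns))
for name, row in table.items():
    print(name + " " + " ".join(str(cell) for cell in row))
```

Under those assumptions the columns come out as o, l, d, e, a, w, and text C gets a 0 under 'o' and a 1 under 'a' and 'w'.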

I invented this method, and then what I now know to be called a 'distance matrix', before I discovered that this system is already in use.

I'm wondering if the way I do this is good or if there's a more efficient way. I used these simple numbers since they were within my capacity, and since I had to invent a way to create those distance matrices myself, simple numbers were easy to work with.
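For reference, one common way to turn 1/0 rows into a distance matrix is to count the cells where two rows differ (the Hamming distance). This is a sketch under that assumption, not necessarily the calculation used here; the row values and the helper `hamming` are illustrative:

```python
# Hypothetical 1/0 rows for three versions (columns: o, l, d, e, a, w).
rows = {
    "A": [1, 1, 1, 0, 0, 0],
    "B": [1, 1, 1, 1, 0, 0],
    "C": [0, 1, 1, 0, 1, 1],
}

def hamming(x, y):
    """Count the columns in which two rows disagree."""
    return sum(1 for a, b in zip(x, y) if a != b)

names = list(rows)
# Square matrix: entry [i][j] is the distance between version i and version j.
matrix = [[hamming(rows[p], rows[q]) for q in names] for p in names]

for name, line in zip(names, matrix):
    print(name, line)
```

The matrix is symmetric with zeros on the diagonal, because each version is at distance 0 from itself.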

Now I've found the 'Orange' graphing app (orange.biolab.si), which can actually generate distance matrices automatically from the above type of data - great! Saves me time.

But I'm wondering if it also works directly with letters? For example, would it work if I just made the spreadsheet like this?

And would it give identical results to the above? If so, that would be convenient for keeping track of things, since I have more than 200 columns and about 60 rows. Perhaps I should also mention that the language I'm working in doesn't use the English alphabet - so if it does work with letters, does it matter what kind of glyphs the input contains? Will the distance matrix simply treat a matching glyph as a match, and a non-matching or absent glyph as a non-match, thus staying within the binary method of recognition, if you know what I mean?
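To make the question concrete, here is a sketch of the general principle only; it says nothing about how Orange itself handles letter columns. It assumes each cell holds either the letter or an empty string, and that distance is the count of mismatching cells. The data and the helper `mismatches` are illustrative:

```python
import unicodedata
from itertools import combinations

# The same three versions, once as 1/0 and once as letters ("" = absent).
binary = {
    "A": [1, 1, 1, 0, 0, 0],
    "B": [1, 1, 1, 1, 0, 0],
    "C": [0, 1, 1, 0, 1, 1],
}
letters = {
    "A": ["o", "l", "d", "", "", ""],
    "B": ["o", "l", "d", "e", "", ""],
    "C": ["", "l", "d", "", "a", "w"],
}

def mismatches(x, y):
    """Count cells that differ; works for numbers or for any glyphs."""
    return sum(1 for a, b in zip(x, y) if a != b)

# With one column per letter, both encodings give the same counts.
same = all(
    mismatches(binary[p], binary[q]) == mismatches(letters[p], letters[q])
    for p, q in combinations(binary, 2)
)
print(same)

# Glyphs outside the English alphabet compare just like any other text...
non_english = mismatches(["þ", "æ", ""], ["þ", "", "ð"])

# ...but visually identical glyphs can be stored in different Unicode forms,
# so normalizing first (here to NFC) avoids false mismatches.
composed, decomposed = "\u00e9", "e\u0301"
raw_equal = composed == decomposed
normalized_equal = unicodedata.normalize("NFC", decomposed) == composed
```

So under these assumptions the letter version stays binary in effect: a cell either matches or it doesn't. Whether a given tool does the same depends on how it treats text columns.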

My next question would be: does anyone feel that there's a better way to do this task? I feed this into both non-linear multidimensional scaling graphs and clustering plots. This helps me to trace the history of the variations.
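For anyone unfamiliar with what a clustering plot does with a distance matrix, here is a small sketch of average-linkage agglomerative clustering, which is one common choice and an assumption here, not necessarily the method used in the plots above. The distances and function names are illustrative:

```python
from itertools import combinations

# Hypothetical pairwise distances between three versions.
names = ["A", "B", "C"]
dist = {("A", "B"): 1, ("A", "C"): 3, ("B", "C"): 4}

def d(p, q):
    """Look up a distance regardless of the order of the pair."""
    return dist[(p, q)] if (p, q) in dist else dist[(q, p)]

def average_linkage(c1, c2):
    """Mean distance over all pairs drawn from the two clusters."""
    return sum(d(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

# Start with every version in its own cluster, then repeatedly merge
# the two closest clusters until only one remains.
clusters = [(n,) for n in names]
merges = []
while len(clusters) > 1:
    c1, c2 = min(combinations(clusters, 2),
                 key=lambda pair: average_linkage(*pair))
    merges.append((c1, c2, average_linkage(c1, c2)))
    clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]

for c1, c2, height in merges:
    print(c1, "+", c2, "at distance", height)
```

The merge order and heights are what a dendrogram draws: here A and B join first, and C joins them later at a larger distance.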

Many thanks! I should also mention that I have no training in statistics and haven't done maths since I was 15, so I may have some difficulty understanding specialist terms, but I will greatly appreciate your help!
