Similarity, Dissimilarity Test

#1
Hello all,
I am a graduate student. For my research, I am working with traffic data for the same road at rush hour in different time periods (such as June 2014 vs. July 2015). So I have one set of data from June 2014 and another for the same road at rush hour from July 2015. Now I would like to compare these two sets of data and determine whether they are significantly similar or dissimilar.
For example, I have 3 sets of data:
Set 1 (August 20, 2015)
1620, 1902, 2112, 1542, 1566, 1524, 1326, 1626, 1890, 1698, 1416, 1764, 2004, 2070, 1680, 1950, 1806, 1554, 2004, 1662, 1686, 1512, 1530, 1248, 1452, 1596, 1458, 1668, 1254, 1062, 1002
Set 2 (July 24, 2014)
1710, 1656, 1980, 1920, 1824, 1520, 1298, 1572, 1930, 2052, 1380, 1654, 1954, 2106, 1732, 1902, 1788, 1806, 1710, 1728, 1686, 1548, 1542, 1266, 1404, 1308, 1320, 1452, 1074, 1110, 1080
Set 3 (June 12, 2015)
1716, 1896, 1986, 1980, 1776, 1884, 1938, 2064, 1920, 2088, 2010, 1878, 1896, 1842, 1824, 1914, 1866, 1854, 1662, 1818, 1716, 1488, 1518, 1440, 1158, 1278, 1344, 1218, 1332, 1230, 990

When I plot them in Excel, I get the following figure:


Now I would like to know which statistical test or value would let me say that "Set 1 vs Set 2 are more similar than Set 1 vs Set 3" or "Set 2 data are more significantly similar to Set 1 than Set 3 data".
P.S.: I am using SPSS for analysis.

Thanks for your help
 
#2
Hi,

maybe you can use a t-test: for each measured time point t, you can calculate the difference Z_t = Y_t - X_t (where X_t is, e.g., from 2014 and Y_t from 2015) and then check with a one-sample t-test whether the mean of the Z_t values differs from zero. In this way you can get rid of the time dependence of the data, and I think you don't have to model this dependency explicitly.
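A minimal sketch of this differencing idea in Python with SciPy, using Set 2 (July 24, 2014) as X_t and Set 1 (August 20, 2015) as Y_t from post #1. It assumes the i-th value in each set refers to the same time slot on both days, which the thread does not state explicitly; the variable names are only illustrative.

```python
import numpy as np
from scipy import stats

# X_t: Set 2 (July 24, 2014); Y_t: Set 1 (August 20, 2015), copied from post #1.
# Assumption: position i in both arrays is the same time slot of the day.
x_2014 = np.array([1710, 1656, 1980, 1920, 1824, 1520, 1298, 1572, 1930, 2052,
                   1380, 1654, 1954, 2106, 1732, 1902, 1788, 1806, 1710, 1728,
                   1686, 1548, 1542, 1266, 1404, 1308, 1320, 1452, 1074, 1110,
                   1080], dtype=float)
y_2015 = np.array([1620, 1902, 2112, 1542, 1566, 1524, 1326, 1626, 1890, 1698,
                   1416, 1764, 2004, 2070, 1680, 1950, 1806, 1554, 2004, 1662,
                   1686, 1512, 1530, 1248, 1452, 1596, 1458, 1668, 1254, 1062,
                   1002], dtype=float)

z = y_2015 - x_2014                               # Z_t = Y_t - X_t

# One-sample t-test of H0: mean(Z_t) = 0.
one_sample = stats.ttest_1samp(z, popmean=0.0)

# The paired t-test on the raw series is the same test written differently.
paired = stats.ttest_rel(y_2015, x_2014)

print(f"mean difference = {z.mean():.2f}")
print(f"one-sample: t = {one_sample.statistic:.3f}, p = {one_sample.pvalue:.3f}")
print(f"paired:     t = {paired.statistic:.3f}, p = {paired.pvalue:.3f}")
```

Note that a large p-value here only means the average difference cannot be distinguished from zero; on its own it does not show that the two days are similar point by point.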

A non-parametric alternative might be the two-sample Kolmogorov–Smirnov test. This test measures a kind of distance between the empirical cumulative distribution functions of the two samples. However, I don't have much experience with this test.
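For reference, a hedged sketch of the two-sample Kolmogorov–Smirnov test with `scipy.stats.ks_2samp`, again on Set 1 and Set 2 from post #1. One caveat worth keeping in mind: the test pools each day's values into a distribution and ignores the order of the time points, so two days with the same values in a different order would look identical to it.

```python
from scipy import stats

# Set 1 (August 20, 2015) and Set 2 (July 24, 2014), copied from post #1.
set1 = [1620, 1902, 2112, 1542, 1566, 1524, 1326, 1626, 1890, 1698, 1416,
        1764, 2004, 2070, 1680, 1950, 1806, 1554, 2004, 1662, 1686, 1512,
        1530, 1248, 1452, 1596, 1458, 1668, 1254, 1062, 1002]
set2 = [1710, 1656, 1980, 1920, 1824, 1520, 1298, 1572, 1930, 2052, 1380,
        1654, 1954, 2106, 1732, 1902, 1788, 1806, 1710, 1728, 1686, 1548,
        1542, 1266, 1404, 1308, 1320, 1452, 1074, 1110, 1080]

# The statistic is the largest vertical gap between the two empirical CDFs.
ks = stats.ks_2samp(set1, set2)
print(f"KS statistic D = {ks.statistic:.3f}, p = {ks.pvalue:.3f}")
```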
 
#3
Thanks for your suggestions. I have tried the paired t-test, but it did not give a conclusive result, since it only looks at the mean value of the distribution.

How about the Wilcoxon signed-rank test? Can it be used?
 

bryangoodrich

#4
I would investigate the distance matrix that represents the pairwise distances between days. Each row and column represents a day to be compared; the cell value is the Euclidean (or some other) distance between those two days, computed from their point-wise differences. If two days are similar, we should expect the square root of the sum of squared point-wise differences between their measurements to be small.

https://www.youtube.com/watch?v=dIJKBQbBCto

I assume they'll talk about it here, as it's the same data object you would use in hierarchical clustering. Not sure what test you might use, however.
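As a rough illustration of the distance matrix described above, here is a sketch with NumPy and SciPy on the three sets from post #1. It assumes the values in each set are aligned by time slot; `pdist` with `metric="euclidean"` computes exactly the square root of the sum of squared point-wise differences.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

labels = ["Set 1 (2015-08-20)", "Set 2 (2014-07-24)", "Set 3 (2015-06-12)"]

# One row per day, one column per time slot (values copied from post #1).
days = np.array([
    [1620, 1902, 2112, 1542, 1566, 1524, 1326, 1626, 1890, 1698, 1416,
     1764, 2004, 2070, 1680, 1950, 1806, 1554, 2004, 1662, 1686, 1512,
     1530, 1248, 1452, 1596, 1458, 1668, 1254, 1062, 1002],
    [1710, 1656, 1980, 1920, 1824, 1520, 1298, 1572, 1930, 2052, 1380,
     1654, 1954, 2106, 1732, 1902, 1788, 1806, 1710, 1728, 1686, 1548,
     1542, 1266, 1404, 1308, 1320, 1452, 1074, 1110, 1080],
    [1716, 1896, 1986, 1980, 1776, 1884, 1938, 2064, 1920, 2088, 2010,
     1878, 1896, 1842, 1824, 1914, 1866, 1854, 1662, 1818, 1716, 1488,
     1518, 1440, 1158, 1278, 1344, 1218, 1332, 1230, 990],
], dtype=float)

# pdist returns the condensed pairwise distances; squareform expands them
# into the symmetric day-by-day matrix with zeros on the diagonal.
D = squareform(pdist(days, metric="euclidean"))

for i, name in enumerate(labels):
    print(name, np.round(D[i], 1))
```

For these three days the Set 1 / Set 2 distance comes out smaller than both the Set 1 / Set 3 and the Set 2 / Set 3 distances. That is a descriptive ranking only, not a significance test.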
 

bugman

#5
Euclidean sounds like a good option.

Then use ANOSIM (analysis of similarities) and pairwise comparisons between groups. Not sure if SPSS does this. You may need R or PRIMER.
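To show what ANOSIM does with such a distance matrix, here is a small hand-rolled sketch in Python rather than R or PRIMER. It is not the implementation of either package. ANOSIM needs several days per group (for example, several 2014 days vs. several 2015 days), which post #1 does not provide, so the data below are synthetic and purely hypothetical; the function name `anosim` and its parameters are illustrative too.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import rankdata


def anosim(dist, groups, n_perm=999, seed=0):
    """Return (R, p) for a square distance matrix and one group label per row."""
    dist = np.asarray(dist, dtype=float)
    groups = np.asarray(groups)
    n = dist.shape[0]
    rows, cols = np.triu_indices(n, k=1)      # each pair of days once
    ranks = rankdata(dist[rows, cols])        # rank distances, ties averaged
    half_m = n * (n - 1) / 4.0                # M / 2, with M = n(n-1)/2 pairs

    def r_stat(labels):
        within = labels[rows] == labels[cols]
        # R = (mean between-group rank - mean within-group rank) / (M / 2)
        return (ranks[~within].mean() - ranks[within].mean()) / half_m

    r_obs = r_stat(groups)
    rng = np.random.default_rng(seed)
    # Permutation test: shuffle the group labels and recompute R each time.
    r_perm = np.array([r_stat(rng.permutation(groups)) for _ in range(n_perm)])
    p_value = (np.sum(r_perm >= r_obs) + 1) / (n_perm + 1)
    return r_obs, p_value


# Synthetic illustration only (NOT the data from post #1): five days per group,
# 31 time slots each, with group "B" shifted upward by 300 vehicles.
rng = np.random.default_rng(42)
profile = np.linspace(2000.0, 1000.0, 31)
group_a = profile + rng.normal(0.0, 10.0, size=(5, 31))
group_b = profile + 300.0 + rng.normal(0.0, 10.0, size=(5, 31))
days = np.vstack([group_a, group_b])
labels = np.array(["A"] * 5 + ["B"] * 5)

D = squareform(pdist(days, metric="euclidean"))
r_obs, p_value = anosim(D, labels)
print(f"R = {r_obs:.3f}, p = {p_value:.3f}")
```

R close to 1 means days within a group are more alike than days from different groups, while R near 0 means the grouping explains little. With more than two groups, the same function could be run on each pair of groups for the pairwise comparisons.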