Research about stock comovements due to similar ticker symbols

Dear all,

I have a small question about a small research project I would like to do on stock symbol similarities and abnormal returns. I have a sample of about 10,000 stocks and I calculate how much their symbols differ from each other, which gives a 10K x 10K matrix of their "difference measure".

I was planning to test this by first grouping the stocks whose distance is below a certain threshold, dropping those with a larger distance, and then checking whether the grouped stocks have similar returns even though they are not in the same industry.

I have a few questions. Is this a good method to test for this? Or should I use the full 10K x 10K matrix? I need to check totally dissimilar stocks too, right?

Furthermore, am I forgetting something? I am trying to replicate the study by Rashes (2001) on a larger scale.

Re: Research about stock comovements due to similar ticker symbols

With proprietary trading, if you are doing something simple, this is rarely a "good method"... But you may start with using hierarchical cluster analysis to split the stocks into several cohorts. The cluster analysis will be based on some variation of the distance measure you have developed.

Re: Research about stock comovements due to similar ticker symbols

Well, I am going to use the Levenshtein distance to measure stock similarity. Could you point me in the right direction what kind of statistical methods I should use?

I can in fact categorize all the stock symbols into their most similar groups. However, this would give me a large number of groups, and I was wondering how I should continue from there.

I am not trying to use data mining techniques for this. I was thinking more of using regression techniques to confirm whether stocks with similar ticker symbols experience comovement.

Re: Research about stock comovements due to similar ticker symbols

I suggested hierarchical clustering over other types of cluster analysis because it allows you to control the number of clusters. There will not be too many.
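
To illustrate how hierarchical clustering lets you fix the number of clusters, here is a minimal sketch in plain Python. It assumes single linkage (other linkage rules are equally possible), and the tickers and distance values are made up for illustration. In practice you would use a library routine such as R's hclust or SciPy's scipy.cluster.hierarchy rather than this naive loop, which is far too slow for thousands of stocks.

```python
from itertools import combinations

def single_linkage(labels, dist, n_clusters):
    """Agglomerative clustering: merge the two closest clusters
    until only n_clusters remain."""
    clusters = [[label] for label in labels]
    while len(clusters) > n_clusters:
        # Single linkage: the distance between two clusters is the
        # smallest distance between any pair of their members.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: min(
                dist[a][b] for a in clusters[ij[0]] for b in clusters[ij[1]]
            ),
        )
        clusters[i].extend(clusters[j])
        del clusters[j]  # j > i, so index i is still valid
    return clusters

# Hypothetical tickers: two families whose members differ by one character.
family_1 = ["ABCD", "ABCE", "ABCF"]
family_2 = ["WXYZ", "WXYY"]
labels = family_1 + family_2

# Hand-built symmetric distance table: 1 within a family, 4 across families.
dist = {a: {} for a in labels}
for a in labels:
    for b in labels:
        same_family = (a in family_1) == (b in family_1)
        dist[a][b] = 0 if a == b else (1 if same_family else 4)

clusters = single_linkage(labels, dist, n_clusters=2)
print(clusters)
```

Asking for two clusters recovers the two families here. With real tickers the choice of the number of clusters (or of a cut-off distance) is a judgment call.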

Regressions are not appropriate here. When you decided to use the distance measure, you stepped into the unsupervised learning area yourself... The complexity of the optimal method depends somewhat on the amount of information. How many stocks are there in the picture? How many observations? Are you calculating the Levenshtein distance based on ups and downs or something else?

Re: Research about stock comovements due to similar ticker symbols

Well, suppose I have a set of NASDAQ stocks with ticker symbols A, B, C and some OTC stocks with ticker symbols A1, A2, B1, B2, C1, C2. What I want to investigate is whether A going up causes A1 and A2 to go up, while I expect B1 ... C2 to do nothing. Likewise, I expect that if A goes down, A1 and A2 go down, while the rest do nothing. I correct for market effects by using only stocks that are not in the same industry. So the groups are A*, B*, C* (where the star stands for a number or nothing).
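
The expectation above can be sketched as a comparison of return correlations between pairs with similar tickers and pairs with dissimilar tickers. This is only an illustration: the daily returns below are made up, the cut-off of distance <= 1 is an assumption, and a real test would need many pairs and a proper significance test. Correlation also says nothing about which stock moves first.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equally long return series."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Made-up daily returns, for illustration only.
returns = {
    "A":  [0.01, -0.02, 0.03, 0.00, -0.01],
    "A1": [0.02, -0.01, 0.02, 0.01, -0.02],
    "B1": [-0.01, 0.01, 0.00, 0.02, -0.01],
}

# Ticker distances for the pairs of interest (A vs A1 is 1, A vs B1 is 2).
distance = {("A", "A1"): 1, ("A", "B1"): 2}

similar, dissimilar = [], []
for (s1, s2), d in distance.items():
    rho = pearson(returns[s1], returns[s2])
    # Assumed cut-off: distance <= 1 counts as "similar ticker".
    (similar if d <= 1 else dissimilar).append(rho)

print("mean correlation, similar tickers:   ", mean(similar))
print("mean correlation, dissimilar tickers:", mean(dissimilar))
```

Keeping the dissimilar pairs as a control group is what lets the two averages be compared at all.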

Now let's calculate their Levenshtein distance:

A - A1 and A2 have a distance of 1 (one deletion or addition suffices)
A - B1 to C2 have a distance of 2 (one deletion and one substitution)
B - B1 and B2 have a distance of 1
...etc...
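
For reference, the distances listed above can be reproduced with a short dynamic-programming implementation of the Levenshtein distance. This is a minimal sketch in plain Python, and the variable names (nasdaq, otc, matrix) are only illustrative.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))  # distances from "" to prefixes of t
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(
                prev[j] + 1,         # delete cs
                curr[j - 1] + 1,     # insert ct
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]

nasdaq = ["A", "B", "C"]
otc = ["A1", "A2", "B1", "B2", "C1", "C2"]

# One row per NASDAQ ticker, one column per OTC ticker.
matrix = {n: {o: levenshtein(n, o) for o in otc} for n in nasdaq}
for n in nasdaq:
    print(n, matrix[n])
```

Note that under this measure A, B, and C are themselves at distance 1 from each other (one substitution), as are A1 and B1, so very short tickers end up close to many others.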

Now suppose I have the matrix with their distances, how should I proceed? While this example gives only a small matrix (3 x 6 here), I can download up to about 5K ticker symbols for the "NASDAQ" set and 5K for the other, which makes the distance matrix alone 5K x 5K, or about 25 million entries.

And that is before counting observations: on average I have about 5 years of daily observations per stock, which amounts to about 2K, making at least about 5K observations in total. I personally think this is a lot.

By the way, I noticed this is a bit similar to the gravity model used in economics. I think I would rather opt for that option, or is there something I am missing?

I also realized I have been a bit rude and have not yet thanked you for your responses and help so far. I really appreciate you helping me, so thank you very much for your efforts!

Re: Research about stock comovements due to similar ticker symbols

Thanks, but I'm afraid some of my questions still stand:

1] How many stocks will you be splitting into groups?

2] How many observations (ticks/days/weeks) are used to calculate the distances etc?

3] Please elaborate on your calculation of distances. On which series are you calculating those? What are your X1 X2 ... Xn and Y1 Y2 ... Yn for any two stocks?

Re: Research about stock comovements due to similar ticker symbols

1] I am not sure how many stocks, as I haven't split them yet. I was going to split them based on similarities as I explained above. So all stocks whose symbols differ by one character from each other form a group (or in my terms, distance <= 1).

2] I only use their ticker symbols to calculate the distance. In case a stock has changed its ticker symbol, I will have to check why and decide then whether to include it in my sample. I think the crucial thing in this research is that I am checking whether: RETURN_OF_SIMILAR_TICKER_STOCK_1 * B + CONST = RETURN_OF_SIMILAR_TICKER_STOCK_2. Previous research only focused on stocks that have similar names and excluded the stocks that do not, which means those results have a selection bias, right?

3] I do not think this is needed, but you can check it here (http://en.wikipedia.org/wiki/Levenshtein_distance). You can imagine it as a function f(x, y) where I input a string x (in this case the ticker symbol of stock 1) and a string y (in this case the ticker symbol of stock 2) and get their "distance of strings".
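
The relation in point 2] above, RETURN_OF_SIMILAR_TICKER_STOCK_1 * B + CONST = RETURN_OF_SIMILAR_TICKER_STOCK_2, can be estimated for a single pair of stocks by ordinary least squares. The sketch below is plain Python with made-up returns; the second series is constructed from the first so that the true slope and constant are known. It leaves out standard errors and any market or industry controls, which a real test would need.

```python
from statistics import mean

def ols(x, y):
    """Least-squares fit of y = const + b * x; returns (const, b)."""
    mx, my = mean(x), mean(y)
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b

# Made-up daily returns of stock 1, for illustration only.
r1 = [0.01, -0.02, 0.03, 0.00, -0.01]
# Stock 2 is built as 0.5 * r1 + 0.001, so the fit should recover these values.
r2 = [0.5 * r + 0.001 for r in r1]

const, b = ols(r1, r2)
print("CONST =", const, "B =", b)
```

A slope B that is reliably different from zero for similar-ticker pairs, but not for dissimilar pairs, would be the pattern the hypothesis predicts.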

I hope you understand my problem now, if there are any questions, please let me know!

Re: Research about stock comovements due to similar ticker symbols

Before, when you said that you were calculating the Levenshtein distances from stock symbols, I thought you were using the word "symbol" incorrectly. I could not believe you would actually use names instead of real historical market performance. To me, an interesting alternative would be preprocessing price action into some discrete series (like ranked returns) and then applying Levenshtein distances to those...
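
A rough sketch of that alternative, in plain Python: each return series is turned into a string of symbols and the Levenshtein distance is applied to the strings. The coding used here is a simplifying assumption (U for up, D for down, F for flat, with a hypothetical `flat` tolerance) rather than ranked returns, and the return values are made up.

```python
def levenshtein(s, t):
    """Minimum number of single-character edits turning s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def encode(returns, flat=0.0):
    """Map each return to U (up), D (down), or F (within +/- flat)."""
    return "".join(
        "U" if r > flat else "D" if r < -flat else "F" for r in returns
    )

# Made-up daily returns for two stocks, for illustration only.
x = encode([0.01, -0.02, 0.03, 0.00, -0.01])
y = encode([0.02, -0.01, 0.02, 0.01, -0.02])

print(x, y, levenshtein(x, y))
```

Because insertions and deletions shift the alignment, this distance tolerates small timing offsets between two series, which may or may not be desirable.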

Anyway, for what you are trying to do hierarchical cluster analysis would be a good start. It could be done relatively easily in SPSS or R. Look up this methodology and you will see what I mean.