# Which distance metric should I use for county clustering?

#### mthelm

##### New Member
I'm trying to cluster U.S. counties based on the following characteristics:
1. Median wages
2. Unemployment rate
3. Average educational attainment
4. Population
Many clustering algorithms require a distance matrix, but I'm having trouble weighing the pros and cons of the available options. I'm normalizing the data because the variables are in different units, and I'm currently using simple Euclidean distance between the resulting vectors.
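
For concreteness, here is a minimal sketch of that setup: z-score each column, then compute pairwise Euclidean distances. The four rows of county values are made up purely for illustration, and z-scoring is only one of several ways to normalize.

```python
import numpy as np

# Hypothetical values for four counties (rows). Columns are:
# median wage, unemployment rate (%), avg. years of schooling, population.
X = np.array([
    [42000.0, 5.1, 12.8, 25000.0],
    [58000.0, 3.9, 14.1, 480000.0],
    [39000.0, 7.4, 12.1, 12000.0],
    [61000.0, 3.5, 14.6, 950000.0],
])

# z-score each column so no variable dominates just because of its units
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Pairwise Euclidean distances via broadcasting; D has shape (n, n)
diff = Z[:, None, :] - Z[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))
```

After standardizing, every variable contributes on the same scale, which also means every variable is implicitly given equal weight.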

Can someone tell me if there are any significant drawbacks to using the Euclidean distance in this context? Is there another distance metric that would be more appropriate for this use case?

#### hlsmith

##### Less is more. Stay pure. Stay poor.
Not my area, but at least they are all continuous - I know it gets tricky when you have a blend of data types. Question, should you also think about weighting them, so they aren't all treated as having a comparable weight?
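
One way weighting could look in practice is a weighted Euclidean distance, sketched below. The function name `weighted_euclidean` and the weights are hypothetical; choosing the weights is a judgment call that the sketch does not settle.

```python
import numpy as np

def weighted_euclidean(Z, weights):
    """Pairwise weighted Euclidean distances between the rows of Z."""
    w = np.asarray(weights, dtype=float)
    # Scaling each column by sqrt(w) multiplies its squared differences by w
    Zw = Z * np.sqrt(w)
    diff = Zw[:, None, :] - Zw[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Toy standardized data: three observations, two variables
Z = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

D_equal = weighted_euclidean(Z, [1.0, 1.0])     # plain Euclidean distance
D_weighted = weighted_euclidean(Z, [4.0, 1.0])  # first variable counts 4x
```

With all weights equal to 1 this reduces to the ordinary Euclidean distance.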

#### mthelm

##### New Member
> Not my area, but at least they are all continuous - I know it gets tricky when you have a blend of data types. Question, should you also think about weighting them, so they aren't all treated as having a comparable weight?

I've thought about weighting them, but it's not yet clear how to do so. The other thing I worry about is the possibility that these variables correlate with one another. Does that even matter in this context? For example, if average educational attainment correlates with median wages across counties, would it be better to simply toss out the educational variable and do the clustering based on wages alone?
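
To make the correlation concern concrete, the sketch below checks the correlation matrix and then computes a Mahalanobis distance, which uses the inverse covariance matrix so that correlated variables are not effectively counted twice. The data are synthetic stand-ins (not real county figures), and the helper `mahalanobis` is written here for illustration. Whether this is preferable to dropping a variable depends on the goals of the analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins: "educ" is built to correlate with "wages";
# "unemp" is generated independently of both.
wages = rng.normal(size=n)
educ = 0.8 * wages + 0.6 * rng.normal(size=n)
unemp = rng.normal(size=n)
X = np.column_stack([wages, educ, unemp])

# Correlation matrix: large off-diagonal entries flag redundant variables
corr = np.corrcoef(X, rowvar=False)

# Inverse covariance matrix used by the Mahalanobis distance
VI = np.linalg.inv(np.cov(X, rowvar=False))

def mahalanobis(u, v, VI):
    """Mahalanobis distance between vectors u and v given inverse covariance VI."""
    d = u - v
    return float(np.sqrt(d @ VI @ d))

d01 = mahalanobis(X[0], X[1], VI)
```

Euclidean distance on standardized data treats each variable as an independent dimension, so two highly correlated variables pull the distance in the same direction twice; the Mahalanobis distance discounts that shared variation.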

Ultimately, the goal is to group counties with similar labor market characteristics, and I will be exploring a large number of candidate variables. For now, I'm simply trying to wrap my head around the distance metric issue, and I'm having trouble deciding what makes sense for this use case. I may be overthinking it, and Euclidean distance might be just fine.