I'm trying to measure 'consistency' in baseball scoring between two different sample groups.

For example, in samples of 162 games each, one team had a mean of 5.06 runs scored with a standard deviation of 3.43, while the other had a mean of 3.94 with a standard deviation of 2.62.

Which team has been more 'consistent'? I don't know what the right method is here, and I don't think it's as simple as something linear like stdev/mean. What I'd like to be able to say is that one team was more or less consistent than the other, and I wonder how I'd arrive at that answer.

Using the standard deviation of their scores seems appropriate to me because it is simply a measure of variation standardized by the number of observations.

An alternative would be the coefficient of variation (CV = st.dev/mean), which you mentioned. But that doesn't seem appropriate when comparing within the same sport. If you were comparing consistency across sports, I would use the CV.

For example, if you are comparing basketball scores (often in the 70s) to soccer scores (usually in the 0-3 range), you would need to standardize by the mean (i.e. use the CV), because basketball scores would automatically have more variation.
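To make the comparison concrete, here is the CV worked out for the two teams in the question (the means and standard deviations are the ones given above; the `cv` helper is just my own shorthand):

```python
# Coefficient of variation: standard deviation scaled by the mean,
# so teams (or sports) with different scoring levels become comparable.
def cv(sd, mean):
    return sd / mean

# Figures from the question's two 162-game samples.
team_a = cv(3.43, 5.06)  # ≈ 0.678
team_b = cv(2.62, 3.94)  # ≈ 0.665
print(team_a, team_b)
```

On this measure the two teams are nearly indistinguishable, even though their raw standard deviations look quite different.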

But just to set up an extreme example, to get at the question of principle: say one team, due to its environment or something, scored 11 runs/game with a standard deviation of 2.0, while another team, due to its environment, scored 4 runs/game with a standard deviation of 1.9. Does it strike you as accurate to say the low-run team's performance was less variable than the high-run team's?
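Running that extreme example through the CV shows how sharply the two measures can disagree (again, `cv` is just a shorthand for st.dev/mean):

```python
def cv(sd, mean):
    return sd / mean

# High-scoring environment: 11 runs/game, SD 2.0
# Low-scoring environment:   4 runs/game, SD 1.9
print(cv(2.0, 11))  # ≈ 0.18 — small relative to the team's scoring level
print(cv(1.9, 4))   # ≈ 0.48 — large relative to the team's scoring level
```

The raw standard deviations say the low-run team is (barely) steadier; the CVs say the high-run team is far steadier relative to its own scoring level.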

I keep thinking there must be some straightforward adjustment to make here...

It seems reasonable to me to use the standard deviation as a measure of consistency. I'm not sure how you could test the 'significance' of such a statement, though, especially when the standard deviations are so close (1.9 vs. 2.0).

Maybe someone else knows how to test for significant differences in variation. You could possibly use bootstrapping to get confidence intervals around the standard deviations, but that seems like overkill to me.
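For what it's worth, the bootstrap idea is only a few lines. The question doesn't give game-by-game scores, so the `runs` array below is a hypothetical stand-in (Poisson draws at the question's mean); with real data you'd substitute the actual 162 scores:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for one team's 162 game-by-game run totals;
# replace with the actual scores in practice.
runs = rng.poisson(lam=5.06, size=162)

# Resample the games with replacement and recompute the SD each time.
n_boot = 10_000
boot_sds = np.empty(n_boot)
for i in range(n_boot):
    resample = rng.choice(runs, size=runs.size, replace=True)
    boot_sds[i] = resample.std(ddof=1)

lo, hi = np.percentile(boot_sds, [2.5, 97.5])
print(f"sample SD = {runs.std(ddof=1):.2f}, 95% bootstrap CI ≈ ({lo:.2f}, {hi:.2f})")
```

If the two teams' intervals overlap heavily (as they likely would for 1.9 vs. 2.0), you couldn't call the difference in variability significant.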

One tricky thing here is that runs scored is a count variable. Mean and standard deviation are positively related in most distributions used to model count data, e.g. Poisson and negative binomial. That is, teams with a higher mean runs scored will tend to have a higher standard deviation; does that really mean they're less consistent?
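A quick simulation makes this concrete. Under a Poisson model the variance equals the mean, so two teams generated by exactly the same kind of process, just at different scoring rates, will show different standard deviations (the rates 4 and 11 below echo the extreme example earlier in the thread):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two Poisson processes: identical mechanism, different scoring rates.
low  = rng.poisson(lam=4.0,  size=100_000)
high = rng.poisson(lam=11.0, size=100_000)

# For a Poisson variable, variance = mean, so SD grows like sqrt(mean);
# the high-scoring "team" has a larger SD without being any more erratic.
print(low.mean(),  low.std(ddof=1))   # ≈ 4,  SD ≈ 2.0
print(high.mean(), high.std(ddof=1))  # ≈ 11, SD ≈ 3.3
```

So a larger standard deviation can be a mechanical consequence of a higher scoring rate, not evidence of inconsistency.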

If we take 'consistency' here to mean less variable, or what stock analysts would call less volatile, and take into account the point CowboyBear makes about count data, then comparing the CVs would do the job, but I do not know of a statistical test for comparing two CVs.
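I don't know of a textbook two-sample test for CVs either, but the bootstrap mentioned earlier in the thread extends naturally to this: resample each team's games and look at the distribution of the difference in CVs. The game-by-game data below are hypothetical Poisson stand-ins at the question's means, since the raw scores aren't given:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical game-by-game runs for two teams (stand-ins for real scores).
team_a = rng.poisson(lam=5.06, size=162)
team_b = rng.poisson(lam=3.94, size=162)

def cv(x):
    return x.std(ddof=1) / x.mean()

# Bootstrap the difference in CVs between the two teams.
n_boot = 10_000
diffs = np.empty(n_boot)
for i in range(n_boot):
    a = rng.choice(team_a, size=team_a.size, replace=True)
    b = rng.choice(team_b, size=team_b.size, replace=True)
    diffs[i] = cv(a) - cv(b)

lo, hi = np.percentile(diffs, [2.5, 97.5])
# If this interval excludes 0, the CV difference is unlikely to be noise.
print(f"95% CI for CV(A) - CV(B): ({lo:.3f}, {hi:.3f})")
```

This is only a sketch, not an established procedure; with real data you would plug in the actual 162-game score vectors.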
oparairoegbu