When is a distribution non-normal, judging from the mean, median and mode?

#1
I have a poll data table that will be filled with between two and six thousand entries (each entry contains some 30 data points). It is a table in SQL, a database language. The table will contain primarily numeric data.

I want to compute the correlation strengths between certain variables, for which I need to know whether their distribution is normal. I cannot use the usual normality tests, because SQL has only a few native statistical functions, the most advanced being the population standard deviation, and I do not know enough SQL to write complicated test code myself. I have written the correlation formula code, but that is as far as my knowledge goes.
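To give an idea of what I mean: a correlation query along these lines (a simplified sketch with a hypothetical table answers and numeric columns x and y, not my actual table) is roughly the limit of what I can manage with native functions:

-- Pearson's r built from native MySQL aggregates only
SELECT
  (AVG(x * y) - AVG(x) * AVG(y))
    / (STDDEV_POP(x) * STDDEV_POP(y)) AS pearson_r
FROM answers
WHERE x IS NOT NULL AND y IS NOT NULL;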

Instead, I want/have to compare the mean, median and mode, which should all lie at (more or less) the same position on the x-axis in a normal distribution. But where does the cut-off point lie? When must I conclude that the distribution is not normal? Have difference percentages ever been established for that? Or is there an experienced statistician who could give me an educated guess?
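Computing the three statistics themselves is not the problem. A sketch of what I have in mind (again with a hypothetical table answers and numeric column x, and assuming MySQL 8.0 so window functions are available):

-- mean
SELECT AVG(x) AS mean_x
FROM answers
WHERE x IS NOT NULL;

-- median: average of the middle value(s) after ordering
SELECT AVG(x) AS median_x
FROM (
  SELECT x,
         ROW_NUMBER() OVER (ORDER BY x) AS rn,
         COUNT(*)     OVER ()           AS n
  FROM answers
  WHERE x IS NOT NULL
) ranked
WHERE rn IN (FLOOR((n + 1) / 2), CEIL((n + 1) / 2));

-- mode: the most frequent value (ties broken arbitrarily)
SELECT x AS mode_x, COUNT(*) AS freq
FROM answers
WHERE x IS NOT NULL
GROUP BY x
ORDER BY freq DESC
LIMIT 1;

My question is only about how far apart those three results may lie before I have to give up on the normality assumption.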
 

noetsi

#2
PROC SQL is an inherent part of SAS, so you can compute any statistic SAS can. But I am assuming you are not using that form of SQL.

Can you test whether your data is normal in some statistical software before you run it in SQL?

Probably a good starting point is to ask which SQL you are using and what statistical software, if any, you have access to. :)

If you generate a Tukey boxplot (assuming you can), the box it generates (the median will lie within the box, although that is not critical to this analysis) should sit roughly in the middle of the whiskers. Formally, however, this does not mean you have a normal distribution, although it is often interpreted that way.

If the data is perfectly normal, the mean and median will be in the same place. But I don't believe there is any test of whether the data is normal based on how far apart the mean and median are.
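If you cannot generate the plot itself, the same symmetry check can be done numerically from the quartiles. A rough sketch in SQL terms, whichever dialect you end up on, assuming window functions are available and a hypothetical table answers with numeric column x (quartiles approximated with NTILE):

-- approximate quartiles via NTILE(4); compare the gaps (Q2 - Q1) and (Q3 - Q2)
SELECT
  MAX(CASE WHEN q = 1 THEN x END) AS q1_approx,
  MAX(CASE WHEN q = 2 THEN x END) AS median_approx,
  MAX(CASE WHEN q = 3 THEN x END) AS q3_approx
FROM (
  SELECT x, NTILE(4) OVER (ORDER BY x) AS q
  FROM answers
  WHERE x IS NOT NULL
) t;

Roughly equal gaps on either side of the median are what the boxplot check is looking for; again, that suggests symmetry, it does not prove normality.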
 
#5
Thanks for jumping in, guys.

@noetsi: I am actually a front-end web designer (HTML, CSS, JavaScript, basic PHP and basic MySQL). I developed a customer research tool written in MySQL with which customers no longer need statistical software to analyze the data. And that is quite a selling point. The downside is that the tool's analytical options are limited to correlation strength between variables, but that is generally all that is needed in customer research. I could have access to all kinds of statistical software; I could simply buy it. But I need to be able to work without it, using only MySQL's and PHP's native statistical functions, of which there are only a few.

@BMG: That link is exactly what I needed! Thanks a million! Although I still have the question of how to interpret the value of the statistic you pointed to there, the OP in that thread made some very helpful comments! :)
 

noetsi

#6
Thanks, frankc, for clarifying that. I use SQL a lot as the front end of statistical analysis, but the statistics are the important point. For that reason I need a SQL like PROC SQL that lends itself to statistical tools.

One thing you might want to consider (or not, depending on the knowledge of your audience) is that correlations between dichotomous variables will be calculated incorrectly if you use Pearson's r (you need polychoric correlations), and the same problem occurs with ordinal data (at least data with only a few levels). You need polychoric correlations or Spearman's rho for that.

In my experience a lot of practitioner data is ordinal (Likert scale) or dichotomous (yes/no).
 
#7
@noetsi: I am aware of when to use parametric and non-parametric analyses. And although you are right that only being able to calculate Pearson's r is a severe limitation, questionnaires can often be devised in such a manner that the answers form a continuous scale. Such as: "How do you rate the friendliness of the staff, on a scale from 0 to 10?" And: "How likely is it that you will recommend our company to others, on a scale from 0 to 10?"

Further, sometimes just showing the means of the different groups on certain questions is all that is necessary. The same goes for dichotomous variables. Lastly, although not easy, I could write the MySQL code for the Spearman's rho formula.
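A sketch of how I imagine that could work in MySQL 8.0: rank both variables with a window function (average ranks for ties) and then apply the Pearson formula to the ranks. Again a hypothetical table answers with columns x and y:

-- Spearman's rho = Pearson's r applied to the ranks of x and y
SELECT
  (AVG(rx * ry) - AVG(rx) * AVG(ry))
    / (STDDEV_POP(rx) * STDDEV_POP(ry)) AS spearman_rho
FROM (
  SELECT
    -- average rank for ties: minimum rank + (tie group size - 1) / 2
    RANK() OVER (ORDER BY x) + (COUNT(*) OVER (PARTITION BY x) - 1) / 2 AS rx,
    RANK() OVER (ORDER BY y) + (COUNT(*) OVER (PARTITION BY y) - 1) / 2 AS ry
  FROM answers
  WHERE x IS NOT NULL AND y IS NOT NULL
) ranked;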
 

Dason

#8
How much data are we talking here? It's not too bad to have R make a MySQL call and analyze the resulting data.
 

noetsi

#9
An interesting point about making the question interval-like, frankc. I suggested this to a group of statisticians and they pointed out an obvious problem: it assumes that individuals can actually assess something on ten distinct levels and that those levels are equally spaced, which is doubtful, if not unknowable. It also assumes, of course, that individuals will interpret performance consistently in terms of the scale, which I suspect becomes less true as the number of levels increases and when the levels are unanchored by terms such as "least", "most", etc.
 
#10
@noetsi:
The 0-10 scale has been tested extensively in medical research settings and has been shown to be sufficiently reproducible (i.e. under unaltered circumstances, the test result this week must be [as good as] the same as the result next or last week). That only leaves validity (does it give the same result as the gold standard?), but the question is what should be regarded as the gold standard: this test or the other test?

@Dason:
As I wrote before, I must be able to do it without any statistical software, using only the native MySQL statistical functions, which are just a few very basic ones, the standard deviation and the square root being the most advanced.
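That said, a rough normality check does seem feasible with just those native aggregates: the moment-based skewness and excess kurtosis can be built from AVG, STDDEV_POP and POWER alone, and both should be near 0 for a normal distribution. A sketch, again with a hypothetical table answers and numeric column x:

-- moment-based skewness and excess kurtosis from native aggregates only
-- (both are approximately 0 when the data are normally distributed)
SELECT
  AVG(POWER((a.x - m.mean_x) / m.sd_x, 3))     AS skewness,
  AVG(POWER((a.x - m.mean_x) / m.sd_x, 4)) - 3 AS excess_kurtosis
FROM answers AS a
CROSS JOIN (
  SELECT AVG(x) AS mean_x, STDDEV_POP(x) AS sd_x
  FROM answers
  WHERE x IS NOT NULL
) AS m
WHERE a.x IS NOT NULL;

That still leaves my original question of where the cut-off should lie, of course; it only gives me two more numbers to judge by.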