reliable results?

Hi, my name is Dave. I am a recent graduate trying to pass a licensing exam with the state of California. I think I passed the test (my score was 73%), but they use a criterion-referenced method, where results are compared to a criterion (their passing score of 75%) rather than to the mean. The idea is that the difficulty of the test varies from one administration to the next, and by comparing a score to the criterion, it does not matter how the group performs. But I think the group's performance should be consistent with the results. For example, in the second-to-last iteration the test was deemed more difficult, so the pass-point was set lower, at 118 of 175 correct, and 66% passed. The last iteration was deemed easier, so the pass-point was set higher, at 132 of 175, and only 48% passed. So the "harder" test had a higher pass rate and the "easier" test a lower pass rate. Those results seem backwards to me, and I think the criterion is wrong.
I don't know much about statistics, but I think that if the test really does vary in difficulty, and the pass-point is moved reliably to compensate, then approximately the same percentage of test-takers should pass each time. I don't think this is the case.
I plugged the group passing percentages from the last 20 iterations of the test into online calculators for standard deviation and z-score. Each of these 20 results represents a group averaging about 750 test-takers, all recent graduates of a Masters program, taking a test that is offered every six months. So this is not a random population.
Mean: 56.55
Sample standard deviation: 6.56526
Sample variance: 43.10263
Population standard deviation: 6.39902
Population variance: 40.9475
Value (most recent passing percentage): 48
z-score: -1.30230943
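For anyone who wants to check the arithmetic, here is a short Python sketch of the same calculations. The actual 20 pass rates were not included in the post, so the list below is purely hypothetical; only the formulas match what the online calculators do (sample SD divides by n - 1, population SD divides by n).

```python
import statistics

# Hypothetical pass rates (%) for 20 test iterations.
# The real values were not posted, so these are for illustration only.
pass_rates = [48, 50, 52, 55, 56, 58, 60, 57, 54, 62,
              63, 49, 59, 61, 53, 56, 58, 55, 60, 65]

mean = statistics.mean(pass_rates)          # arithmetic mean
sample_sd = statistics.stdev(pass_rates)    # sample SD (divides by n - 1)
pop_sd = statistics.pstdev(pass_rates)      # population SD (divides by n)

# z-score of the most recent result (48% passing) against the history
z = (48 - mean) / sample_sd

print(mean, round(sample_sd, 3), round(pop_sd, 3), round(z, 3))
```

The sample SD is always slightly larger than the population SD, which matches the two values quoted above (6.56526 vs. 6.39902).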

Do these look like reliable results? A mean of 56.55, a standard deviation of 6.5, and a z-score of -1.3?
These are the last 20 results in case I calculated incorrectly:


Ambassador to the humans
I don't quite understand what you're trying to do. What do you mean by "Do these look like reliable results"?


TS Contributor
If their criterion is based upon valid and sound arguments, then the results should reflect this; but 52% failed.
Do you mean to suggest that you expect 75% of the students to pass the exam? This is unclear. The 75% criterion is just the points threshold for determining pass/fail (a binary condition); 95% of test-takers could pass, or 20% could. If by "reliable" you mean "does this year's result appear consistent with the past 20 scores," then it looks like this year was a fairly typical year.

Maybe some additional information, like Dason requested?
Are the standard deviation and z-score typical of a standardized test? I included the last 20 iterations (the percentage of the total group that passed each of the last 20 times the test was given) just in case I calculated the mean, standard deviation, and z-score wrong. Each of the 20 represents a group of between 500 and 1,000 test-takers, all recent graduates of a Masters program in Acupuncture. I think the results are both very low, with a mean of 56% of the group passing, and very volatile (a range of 68 - 47 = 21 percentage points). I don't think a homogeneous population on a standardized test should produce results like these, and I suspect something is wrong.
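As a sanity check on whether a 21-point spread across 20 iterations is surprising on its own: for 20 independent draws from a normal distribution, the expected range (maximum minus minimum) is roughly 3.7 standard deviations. A quick simulation sketches this, taking the thread's reported mean of about 56.55 and SD of about 6.5 as assumptions:

```python
import random
import statistics

random.seed(0)   # fixed seed for reproducibility

MEAN = 56.55     # mean pass rate reported in the thread
SD = 6.5         # standard deviation reported in the thread
N = 20           # number of test iterations
TRIALS = 10_000  # number of simulated 20-iteration histories

# For each trial, draw 20 normal pass rates and record the spread.
ranges = []
for _ in range(TRIALS):
    sample = [random.gauss(MEAN, SD) for _ in range(N)]
    ranges.append(max(sample) - min(sample))

mean_range = statistics.mean(ranges)
print(round(mean_range, 1))  # typically around 3.7 * SD, i.e. ~24 points
```

So, under these assumptions, a 21-point spread is about what chance alone would produce with an SD of 6.5; the spread by itself does not prove anything is broken. Whether an SD of 6.5 (or a mean of 56%) is acceptable for this kind of exam is the separate, more interesting question.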
The criterion-referenced scoring in this case uses a criterion that a group of "experts" arrives at. Somewhat like a jury, they deliberate over each question and assign it a level of difficulty. The levels are averaged, and this becomes the criterion, or pass-point. I believe this is a highly subjective process.
Do a mean of 56, a standard deviation of 6.5, and a z-score of -1.3 look like results from a standardized test? I don't know what good results would look like, but I suspect these are not reliable. I need the help of someone who understands statistics. Thank you.
75% was the passing score. The passing percentages for the last twenty iterations are in the original post; they range from 47% of the students passing to 68%. I would expect the variation of a group of recent graduates to be both higher and closer together. Thank you for your input. I need help!


TS Contributor
I would expect the variation of a group of recent graduates to be both higher and closer together.
I'm not sure I would expect higher variation. Or do you mean a higher mean relative to past years?

Anyway, the distribution of the last 20 tests looks approximately normal, so z-scores are a good way to show the relative performance of test-takers (though usually within a single testing year or period, not between years, as you've done here). A z-score of -1.3 (meaning that this year's score is only 1.3 standard deviations below the mean of all years' results) means the score isn't particularly out of the ordinary. When you see a z-score of 2.5 or -2.5, that's when you can start to say something about large differences. So it seems to me that you did a sound calculation and found that this year's score fell well within the usual range of all years' scores. To me (and others might disagree) this suggests that the experts have a decent ability to keep their assessments fairly consistent. That's good for the field of acupuncture, for sure!
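If it helps to translate that z-score into a percentile: assuming the yearly pass rates are roughly normal, as noted above, Python's standard library can convert z = -1.3 into the fraction of years expected to come in below this year's result.

```python
from statistics import NormalDist

z = -1.3
percentile = NormalDist().cdf(z)  # standard normal CDF at z
print(round(percentile, 3))       # about 0.097, i.e. roughly the 10th percentile
```

In other words, about one year in ten would be expected to come in this low or lower purely by chance, which is consistent with calling the result unremarkable; a z-score of -2.5 would put a year below roughly the 1st percentile, which is where "something unusual happened" becomes a reasonable claim.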
The national acupuncture results are much more consistent: the passing rate averages between 70% and 75%. Nurses average between 80% and 90%, and MDs between 92% and 95%. These are all results I would expect from standardized tests with populations of more than 500. California acupuncturists average between 47% and 68% passing. Are we a bunch of pot smokers? j/k
One big difference is that the California acupuncture test is written not by educators but by practicing acupuncturists, most of whom are Asian and speak English as a second (or additional) language. In short, the English is not reliable. When they run the workshop where they determine the pass-point, they don't verify English ability. I don't have the data available without filing a freedom-of-information request, but if I could plot the pass-points from the last 20 iterations on a scatter plot, I don't think they would vary as much as the passing percentages do. I think the pass-points are false. For example, the second-to-last iteration was deemed a difficult test, so the pass-point was set at 118/175, and 66% passed; the last iteration was deemed an easy test, so the pass-point was raised to 132/175, and 48% passed. If the last iteration was really easier, why was there such a drop in the passing percentage?