I am helping my sister with some statistics for her master's dissertation in the behavioral/social sciences. I have taken a couple of courses in statistics and probability, but my proficiency could certainly be better.

I have the book "An Introduction to Mathematical Statistics and Its Applications" by Larsen & Marx, which I find very useful and which has already helped me quite a bit.

This is the problem: there are two treatments, PT and DT, and 5 pupils, tested 3 times. We want to test whether there is a difference in retention (of multiplication problems) between PT and DT at any time point, and/or overall.

Doing a paired t-test for each of the 3 time points increases the probability of making at least one Type I error, so I advised against doing that, but I don't really understand what should be used instead. Adding up all 3 time points could of course obscure important trends as well.
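To make the concern concrete, here is a minimal Python sketch of how the chance of at least one false positive grows with the number of tests. It assumes the tests are independent, which is not really true here since the same pupils are measured each time, so the numbers are only a rough guide. The function name `familywise_error_rate` is my own.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one Type I error across n_tests tests.

    Assumes the tests are independent (an assumption, not a fact about
    this study: the same pupils are tested at every time point).
    """
    return 1.0 - (1.0 - alpha) ** n_tests


for k in (1, 2, 3):
    rate = familywise_error_rate(0.05, k)
    print(f"{k} test(s) at alpha = 0.05: familywise error rate = {rate:.4f}")
# 1 test(s) at alpha = 0.05: familywise error rate = 0.0500
# 2 test(s) at alpha = 0.05: familywise error rate = 0.0975
# 3 test(s) at alpha = 0.05: familywise error rate = 0.1426
```

Without independence, the union bound still caps the rate at 3 x 0.05 = 0.15 for three tests.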

The paired t-tests gave p-values of approximately 0.038, 0.4 and 0.6 for the three time points. The first showed DT superior; for the other two, the DT-PT difference was, of course, close to zero.

Doing an ANOVA for any difference in (DT-PT) between the three time points gave a p-value of 0.09, i.e. no significant difference in the average DT-PT between time points. But this isn't the same as testing the null hypothesis DT-PT = 0 at each time point, collectively, which is ideally what I would like to do. Could you do three t-tests, each at a level of p = 0.017, to reach a collective significance level of 0.05? My reasoning is that the probability of making at least one Type I error across three tests, each at level p, is 1 - (1 - p)^3, and solving 1 - (1 - p)^3 = 0.05 gives p ≈ 0.017.
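For reference, this is how I solved the formula above for the per-test level, written as a small Python sketch. As far as I know this is usually called the Šidák correction, and it assumes independent tests. I included the simpler Bonferroni level (0.05 / 3) for comparison. The function names are my own, and the p-values are the approximate ones I quoted above.

```python
def sidak_alpha(familywise_alpha: float, n_tests: int) -> float:
    """Per-test level p that solves 1 - (1 - p)**n_tests = familywise_alpha.

    Assumes independent tests.
    """
    return 1.0 - (1.0 - familywise_alpha) ** (1.0 / n_tests)


def bonferroni_alpha(familywise_alpha: float, n_tests: int) -> float:
    """Per-test level from the union bound; needs no independence assumption."""
    return familywise_alpha / n_tests


reported_p = [0.038, 0.4, 0.6]  # approximate p-values from the three paired t-tests
threshold = sidak_alpha(0.05, 3)

print(f"Sidak per-test level:      {threshold:.5f}")
print(f"Bonferroni per-test level: {bonferroni_alpha(0.05, 3):.5f}")
for time_point, p in enumerate(reported_p, start=1):
    verdict = "significant" if p < threshold else "not significant"
    print(f"time point {time_point}: p = {p} -> {verdict} at the Sidak level")
```

The two levels come out at about 0.01695 and 0.01667, so they barely differ for three tests. With either one, the first time point (p ≈ 0.038) would no longer count as significant.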

I also understand that for the ANOVA the samples should be independent, but since the pupils are the same throughout the study, this condition isn't satisfied, I guess. Is that a big problem? Could it be remedied by using another type of test?

"Bonus questions", if you are interested in more details. Consider it entertainment or something. This is more to satisfy my own curiosity than to provide information my sister will certainly use.

Normality assumption. I calculated the distribution of the test statistic, which is a discrete score, and produced these graphs:

It is a pretty good fit, but are the discrepancies important? Short of random modelling, can these things be (easily) calculated? I don't expect anyone to do this, of course.

I guess you could do a randomized block design with an F-test, using 3 time points and PT/DT, 6 blocks. But the data set is very small and the measurements would be even more dependent.

Power and type II errors. Is this straightforward to calculate for this example?
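The only approach I can think of myself is simulation. Below is a rough Python sketch for a single two-sided paired t-test with 5 pupils. It rests on several assumptions of mine: the DT-PT differences are normally distributed, the effect size is a hypothetical value expressed in standard deviations of the differences (I do not know the true one), and the critical value 2.776 is the tabulated two-sided 5% value for Student's t with 4 degrees of freedom. It ignores the multiple-testing issue above.

```python
import math
import random

N_PUPILS = 5
T_CRIT_DF4 = 2.776  # two-sided 5% critical value, Student's t, 4 degrees of freedom


def simulated_power(effect_size: float, n_sims: int = 20000, seed: int = 1) -> float:
    """Monte Carlo estimate of the power of a two-sided paired t-test.

    effect_size is a hypothetical mean DT-PT difference, in units of the
    standard deviation of the differences (assumed normal).
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        # Simulated paired differences for the 5 pupils.
        diffs = [rng.gauss(effect_size, 1.0) for _ in range(N_PUPILS)]
        mean = sum(diffs) / N_PUPILS
        var = sum((d - mean) ** 2 for d in diffs) / (N_PUPILS - 1)
        t_stat = mean / math.sqrt(var / N_PUPILS)
        if abs(t_stat) > T_CRIT_DF4:
            rejections += 1
    return rejections / n_sims


for d in (0.0, 0.5, 1.0, 1.5):
    power = simulated_power(d)
    print(f"effect size {d}: power ~ {power:.3f}, Type II error ~ {1 - power:.3f}")
```

As a sanity check, an effect size of 0 should give a rejection rate close to 0.05. The Type II error probability is simply one minus the estimated power.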

I am doing my calculations in a spreadsheet.