# Thread: which type of hypothesis test?

1. ## which type of hypothesis test?

Hello guys,

I have a sample of 10,000 accommodations covering 2013-2016.

I want to test two things but I am not sure I am on the right path.

The first thing I want to test is whether we have fewer openings (new hotels/apartments) in 2016 than in previous years. Will a simple t-test between the openings in 2013-2015 and 2016 be enough? Before proceeding to any hypothesis test, will I need to check whether the variance is the same between the two groups (F-test)?

The second thing I want to test is whether the rental price of the properties in 2016 is higher on average than in the previous years. Is a simple t-test between the prices in 2013-2015 and 2016 OK?

And in general, do I really need hypothesis testing with such a big sample?

Thank you very much in advance

Best,
Diana

2. ## Re: which type of hypothesis test?

Hi Diana,

I think a t-test is perfect in both cases. However, you have to check the assumptions of a parametric t-test, which are: (1) normality of the data in each sample (e.g., via the Shapiro-Wilk test or QQ-plots), and (2) equal variances (e.g., via Levene's test or graphically). Since you have huge samples, I would recommend checking these assumptions graphically: the Shapiro-Wilk test and Levene's test can be significant even when violations of normality or homogeneity are small enough to be neglected. This is because these tests have very high statistical power with huge samples.
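To illustrate the point about power, here is a minimal sketch on simulated data (not Diana's data; the group sizes, means, and standard deviations are assumptions chosen for illustration). Two large groups differ in standard deviation by only 3%, yet Levene's test flags the difference as highly significant.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data: two large groups whose standard deviations differ by only 3%.
n = 200_000
a = rng.normal(loc=100.0, scale=10.0, size=n)
b = rng.normal(loc=100.0, scale=10.3, size=n)

# Levene's test (SciPy centers on the median by default).
levene_stat, levene_p = stats.levene(a, b)

# Practical size of the difference: ratio of the sample standard deviations.
sd_ratio = b.std(ddof=1) / a.std(ddof=1)

print(f"Levene p-value: {levene_p:.3g}")
print(f"SD ratio (b / a): {sd_ratio:.3f}")
```

The p-value alone says nothing about whether a 3% difference in spread matters in practice, which is why a plot or an effect-size summary such as the ratio above is more informative at this sample size.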

If both assumptions are met, perform a Student's t-test. If only homogeneity of variance is violated, you can perform Welch's t-test. If normality is violated, you can perform a non-parametric alternative, such as the Mann-Whitney U test or a permutation test.
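As a rough sketch of the options just listed, the snippet below runs a hand-rolled permutation test on the difference in means next to Welch's t-test and the Mann-Whitney U test. The price data are simulated log-normal values and the group sizes are made up; the one-sided alternative ("2016 is more expensive") mirrors the second question in the thread.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical, right-skewed prices (simulated, not the poster's data).
earlier = rng.lognormal(mean=4.0, sigma=0.5, size=3_000)  # stands in for 2013-2015
latest = rng.lognormal(mean=4.1, sigma=0.5, size=1_000)   # stands in for 2016

observed = latest.mean() - earlier.mean()

# Permutation test: shuffle the group labels and recompute the mean difference.
pooled = np.concatenate([latest, earlier])
n_latest = latest.size
n_perm = 2_000
perm_diffs = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[:n_latest].mean() - shuffled[n_latest:].mean()

# One-sided p-value; the +1 correction keeps the estimate away from exactly 0.
perm_p = (np.sum(perm_diffs >= observed) + 1) / (n_perm + 1)

# Welch's t-test and the Mann-Whitney U test on the same data, for comparison.
welch_p = stats.ttest_ind(latest, earlier, equal_var=False, alternative="greater").pvalue
mwu_p = stats.mannwhitneyu(latest, earlier, alternative="greater").pvalue

print(f"permutation p = {perm_p:.4f}, Welch p = {welch_p:.3g}, Mann-Whitney p = {mwu_p:.3g}")
```

Note that the Mann-Whitney U test addresses a shift in the distributions rather than a difference in means, so it answers a slightly different question than the other two.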

3. ## Re: which type of hypothesis test?

Given such a large sample size, (assuming each year/grouping has a large number of cases), would normality really be a concern due to the central limit theorem? My suspicion is that normality is somewhat irrelevant in this case due to the large number of cases and the applicability of the CLT.
If the normality assumption is to be checked, though, I would agree with using normal probability plots, since the formal tests tend to be highly sensitive to immaterial departures from normality.

I think this boils down to Welch's test vs. Student's t-test (depending on the variances, which might still be less of an issue) given the CLT (and assuming this only concerns two groups; otherwise, the CLT is definitely not applicable).
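A small simulation can make the CLT argument concrete. This is only a sketch under assumed conditions: the population is exponential (strongly right-skewed) and each sample has 2,500 cases, both of which are illustrative choices rather than anything known about the accommodation data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical skewed population: exponential, with theoretical skewness 2.
n_per_sample = 2_500
n_replicates = 2_000
raw = rng.exponential(scale=100.0, size=(n_replicates, n_per_sample))

# One mean per replicate sample: this is the quantity the t-test is built on.
sample_means = raw.mean(axis=1)

raw_skew = stats.skew(raw[0])          # skewness of a single raw sample
means_skew = stats.skew(sample_means)  # skewness of the sampling distribution of the mean

print(f"skewness of raw data:     {raw_skew:.2f}")
print(f"skewness of sample means: {means_skew:.2f}")
```

The raw data stay heavily skewed, while the distribution of the sample means is close to symmetric, which is what the t-test actually relies on.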

Thoughts?

4. ## The Following User Says Thank You to ondansetron For This Useful Post:

mmercker (01-18-2017)

5. ## Re: which type of hypothesis test?

Thank you, ondansetron, for this very useful remark.

Indeed, for t-tests and simple linear regression it seems that we don't have to check for normality if sample sizes are sufficiently large, cf.:

http://www.annualreviews.org/doi/pdf....100901.140546

So, Diana, Welch's t-test or Student's t-test should be optimal for your data, depending on your variances.

6. ## Re: which type of hypothesis test?

Now, the only caution I would offer is that, unless your sample size is quite large (several thousand, as it is here), you should still check for extreme departures from normality (although you can be less concerned with a larger sample or one that appears to be from a closer-to-normal distribution). For example, OLS is widely known to be "robust" with respect to several assumptions, including normality of the error term. In other words, the errors can depart moderately from normality and OLS will still perform well, but we should still investigate the assumption, just to be safe.

For the t-test (in this case with presumably thousands of cases in each group), I would imagine you needn't worry too much unless the sample indicates the population may be highly non-normal. If it is potentially a problem, I would either transform the variable of interest (in an attempt to normalize it) and rerun the parametric test to see how the conclusion changes, or run a non-parametric "equivalent" on the variable to see whether the qualitative results are substantially different (does the t-test say mu(a) > mu(b), and does the Wilcoxon rank-sum test indicate that population a is shifted to the right of population b, i.e. tends to have larger values?). If they don't disagree, you can be less concerned about the assumptions (either they're violated but not enough to affect the conclusion, or they're reasonably satisfied).
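The sensitivity check described above could be sketched roughly as follows. The helper `compare_conclusions` is a hypothetical name introduced here, the log transform is just one possible choice (it requires strictly positive values), and the data are simulated.

```python
import numpy as np
from scipy import stats


def compare_conclusions(a, b, alpha=0.05):
    """One-sided checks of 'a tends to be larger than b' using three approaches."""
    p_values = {
        # Welch's t-test on the raw scale.
        "welch_raw": stats.ttest_ind(a, b, equal_var=False, alternative="greater").pvalue,
        # Welch's t-test after a log transform (values must be strictly positive).
        "welch_log": stats.ttest_ind(np.log(a), np.log(b), equal_var=False,
                                     alternative="greater").pvalue,
        # Rank-based test (Wilcoxon rank-sum / Mann-Whitney U).
        "mann_whitney": stats.mannwhitneyu(a, b, alternative="greater").pvalue,
    }
    decisions = {name: bool(p < alpha) for name, p in p_values.items()}
    return p_values, decisions


rng = np.random.default_rng(3)

# Hypothetical right-skewed data for the two groups.
a = rng.lognormal(mean=4.1, sigma=0.6, size=2_500)
b = rng.lognormal(mean=4.0, sigma=0.6, size=7_500)

p_values, decisions = compare_conclusions(a, b)
all_agree = len(set(decisions.values())) == 1

print(decisions)
print("all approaches agree:", all_agree)
```

If the three decisions agree, the assumptions are unlikely to be driving the conclusion; if they disagree, that is the signal to look at the data more closely.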

Finally, remember that the central limit theorem can't be called upon if there are more than 2 groups being compared at once (such as in an ANOVA with at least 3 groups). In that case, check all assumptions.

Edit: I couldn't find a source for another issue, so I decided to remove it. Can anyone comment on the CLT's applicability to the homogeneity of variance assumption with respect to t-tests, both independent and paired? I thought I had heard that the CLT allows this to be relaxed as well, but I can't find a source right now.

UPDATE: I found a few texts indicating that large-sample t-tests (paired and independent) can relax the homogeneity of variances assumption in addition to the normality assumption.

7. ## Re: which type of hypothesis test?

Originally Posted by ondansetron
Now, the only caution I would offer is that, unless your sample size is quite large (several thousand, as it is here), you should still check for extreme departures from normality (although you can be less concerned with a larger sample or one that appears to be from a closer-to-normal distribution).
IIRC Rogojel once presented a simulation study here on talkstats which demonstrated that regression results are robust even in case of very non-normal residuals, if n > 40 or so.

Originally Posted by ondansetron
Finally, remember that the central limit theorem can't be called upon if there are more than 2 groups being compared at once (such as in an ANOVA with at least 3 groups).
That is a special case of the general linear model, so the same principles as with linear regression apply (the residuals from the ANOVA should preferably be normally distributed, but with large enough sample size, the CLT guarantees robustness of the F-test).
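A quick way to probe this claim is to simulate the type I error rate of the one-way ANOVA F-test when the data are clearly non-normal. This is a sketch under assumed settings (three groups of 500 drawn from the same exponential distribution, 2,000 simulated datasets), not a general proof.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

n_sims = 2_000
n_per_group = 500
alpha = 0.05

rejections = 0
for _ in range(n_sims):
    # Three groups from the same skewed (exponential) population, so the null is true.
    g1, g2, g3 = rng.exponential(scale=1.0, size=(3, n_per_group))
    if stats.f_oneway(g1, g2, g3).pvalue < alpha:
        rejections += 1

# With a robust test this should stay close to the nominal 5%.
type_i_rate = rejections / n_sims
print(f"empirical type I error rate: {type_i_rate:.3f}")
```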

With kind regards

K.

8. ## The Following User Says Thank You to Karabiner For This Useful Post:

ondansetron (01-18-2017)

9. ## Re: which type of hypothesis test?

I'm going to follow up with this since I had further interest in it. Given that we don't know how many observations you'll have in each group of the test, we can't give you a much better answer (slight imbalances leave the t-test fairly robust in large sample sizes, but larger imbalances with unequal variances might be an issue).
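To show why the group sizes matter, here is a minimal simulation with made-up numbers: a small group with a large variance and a large group with a small variance, both with the same true mean. It compares how often Student's and Welch's t-tests reject a true null hypothesis.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

n_sims = 2_000
alpha = 0.05
student_rejections = 0
welch_rejections = 0

for _ in range(n_sims):
    # Hypothetical imbalance: the smaller group has the larger spread; means are equal.
    small_noisy = rng.normal(loc=0.0, scale=3.0, size=50)
    large_quiet = rng.normal(loc=0.0, scale=1.0, size=500)

    if stats.ttest_ind(small_noisy, large_quiet, equal_var=True).pvalue < alpha:
        student_rejections += 1
    if stats.ttest_ind(small_noisy, large_quiet, equal_var=False).pvalue < alpha:
        welch_rejections += 1

student_rate = student_rejections / n_sims
welch_rate = welch_rejections / n_sims
print(f"Student's t type I error: {student_rate:.3f}")
print(f"Welch's t type I error:   {welch_rate:.3f}")
```

In this setup the pooled variance is dominated by the large, low-variance group, so Student's test understates the standard error and rejects far too often, while Welch's test stays near the nominal level.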

I'll post this thread from Stack Exchange, which was pretty interesting and gives arguments for each side (Welch's test vs. Student's vs. Wilcoxon's). Essentially, you can post some output here for guidance, or you can decide on your own, but you'll have to get a feel for your data in terms of the sample variance and the number of observations in each group.

http://stats.stackexchange.com/quest...edirect=1&lq=1

Good luck!
