t-test for large sample size - bike hire data

Hello everyone,

I'm taking a graduate-level statistics course at the University of Glasgow which includes a methods assignment. The assignment requires us to correctly carry out several statistical tests, including one- and two-sample t-tests.

To complete my assignment, I'm using a large dataset of bike share data that spans 25 months (discrete hires, n = 221,484).

For the two-sample t-test, my idea is to test for a significant difference between two means (obviously) in annual bike hire data. The goal is to test the null hypothesis, “there is no difference in the average number of bike hires per day between the first and second year”.

The objectives are to:
1. count the number of hires per day for the entire study period (frequency),
2. cut out the first and second years to create two new datasets, so that n = 365 for both new samples, and
3. execute (in R): t.test(YearOne$Freq, YearTwo$Freq, paired = FALSE, conf.level = 0.9999)
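
The steps above can be sketched roughly as follows. This is only a minimal illustration under assumptions: the data frame name daily, its Date column, the start date, and the simulated Poisson counts are all hypothetical stand-ins, since the real hire data are not shown here. Only YearOne, YearTwo, Freq and the t.test() call come from the post.

```r
# Hypothetical stand-in for the real data: one row per day with a hire count.
set.seed(1)
days <- seq(as.Date("2015-01-01"), by = "day", length.out = 730)  # assumed start date
daily <- data.frame(
  Date = days,
  Freq = rpois(length(days), lambda = 300)  # simulated counts, not real hires
)

# Step 2: cut out the first and second years (365 days each).
YearOne <- daily[1:365, ]
YearTwo <- daily[366:730, ]

# Step 3: two-sample t-test. With paired = FALSE and the default
# var.equal = FALSE, R runs Welch's unequal-variance version.
result <- t.test(YearOne$Freq, YearTwo$Freq, paired = FALSE, conf.level = 0.9999)
print(result)
```

Note that conf.level only changes the width of the reported confidence interval; the t statistic and p-value are the same whatever level is chosen.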

However, I’m confused about whether this data is eligible for a t-test, because:
1. n > 30, and
2. I can calculate the standard deviation (SD) for both datasets.

The reason I’m confused is that my book (Andy Field: Discovering Statistics Using R) lists degrees of freedom above 30 (actually, up to 100, then infinity). Furthermore, the standard deviation of the samples can be calculated, but I may not be able to calculate the SD for the overall programme’s lifecycle.

Any thoughts? May I use an independent t-test to test the null hypothesis?

This may broadly be a misunderstanding of what a t-test is, and what a sample is.

Thank you all in advance.


What do you think is the difference between a two-sample t-test and an independent-samples t-test?

You can absolutely use the t-test with n > 30; it is when n is below 30 that some people question its use. I don't follow your SD concerns. Do you mean the population SD and the sample SD?
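
To illustrate the point about large n, here is a small sketch comparing two-sided 5% critical values of the t distribution with the normal (z) value. The particular degrees of freedom are just example choices (364 and 728 are roughly what samples of 365 days would give); nothing here depends on the bike data.

```r
# Example degrees of freedom, chosen only for illustration.
df <- c(10, 30, 100, 364, 728)

# Two-sided 5% critical values: t distribution vs. standard normal.
t_crit <- qt(0.975, df = df)
z_crit <- qnorm(0.975)

# The t critical value shrinks towards the z value as df grows.
print(data.frame(df = df, t_crit = round(t_crit, 3), z_crit = round(z_crit, 3)))
```

So with large samples the t-test and a z-test give almost the same answer, which is why a printed t table can stop at df = 100 and then jump to infinity.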

One comment: you are probably fine using this data, but is it really continuous? Can you have 2.5 hires? Or is it count data?

Hi Hlsmith,

Thanks for your response, yea – I guess I was a little unclear.

Yes, count data, daily count data more accurately. I used the cut() function with lubridate’s argument, “days” - followed by table() to create a data frame of bike hires per discrete day.
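
For anyone following along, the counting step might look something like the sketch below. It is an assumption-laden illustration: hire_time and the simulated timestamps are hypothetical, and it relies on base R's cut() method for date-times, which accepts breaks = "days", so it may not match exactly what I ran.

```r
set.seed(42)
# Hypothetical hire timestamps spread over 30 days; the real data would supply these.
start <- as.POSIXct("2015-01-01 00:00:00", tz = "UTC")
hire_time <- start + sort(runif(5000, min = 0, max = 30 * 24 * 3600))

# Bin each timestamp into its calendar day.
day_bin <- cut(hire_time, breaks = "days")

# Tabulate hires per day and turn the table into a data frame.
daily <- as.data.frame(table(day_bin), stringsAsFactors = FALSE)
names(daily) <- c("Date", "Freq")
daily$Date <- as.Date(daily$Date)

print(head(daily))
```

Days with no hires at all still appear with Freq = 0, because cut() creates a level for every day in the range.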

I understand that an independent t-test is a two-sample t-test in which each sample's data come from separate individuals (in this case, days), as opposed to a dependent t-test, where the sample data come from the same individuals.

The standard deviation (SD) issue has more to do with the logic of a t-test. Isn’t it the case that one would use a t-test only when the standard deviation of the global population is unknown?

I’m unsure because, in this case, I know the global population SD and both samples’ SDs… or would the relevant population be the whole bike hire project lifecycle, whose SD is therefore unknown?
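
To make the role of the SD concrete, here is a rough sketch of how the (Welch) two-sample t statistic is built from the two sample SDs, each standing in as an estimate of an unknown population SD. The vectors x and y are simulated placeholders, not the bike data.

```r
set.seed(7)
# Hypothetical daily counts for two years (simulated, not real hires).
x <- rpois(365, lambda = 300)
y <- rpois(365, lambda = 310)

# The sample SDs are used because the population SDs are treated as unknown.
se <- sqrt(sd(x)^2 / length(x) + sd(y)^2 / length(y))
t_manual <- (mean(x) - mean(y)) / se

# The same statistic as reported by t.test() (Welch by default).
t_builtin <- unname(t.test(x, y)$statistic)
print(c(manual = t_manual, builtin = t_builtin))
```

If the population SDs really were known, they would replace sd(x) and sd(y) in the standard error, and the statistic would be compared against the normal distribution instead of the t distribution.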