How to account for unequal sample sizes in proc logistic (SAS)

LZC

New Member
#1
Hello!

I am running a logistic regression with a binary outcome ('helmet') and three categorical predictor variables ('personalebike', 'personalcbike' and 'personalescoot'). Each predictor has 3 levels: shared vehicle ('2'), personal vehicle ('1') and neither ('0'). I am comparing the odds that riders of personal e-bikes, c-bikes and e-scooters wear a helmet versus their shared counterparts (so '1' vs. '2'). Below is the SAS code (feel free to comment if you would adjust something!)

proc logistic data=bikehelmets;
   class personalebike (ref='2') personalcbike (ref='2') personalescoot (ref='2') / param=ref;
   /* EVENT='1' models P(helmet=1), so the DESCENDING option is not needed */
   model helmet (event='1') = personalebike personalcbike personalescoot / link=logit technique=fisher;
run;

My concern is that the sample sizes are pretty unequal and I am worried it is distorting my odds ratios. Helmet use (yes vs. no) by group:

- Personal c-bikes: n=2770 vs n=462; shared c-bikes: n=47 vs n=71
- Personal e-scooters: n=74 vs n=60; shared e-scooters: n=12 vs n=56 (a less dramatic imbalance, but much smaller samples)
- Personal e-bikes: n=201 vs n=32; shared e-bikes: n=267 vs n=420
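
For reference, a quick crosstab reproduces these counts (same dataset and variable names as in the code above; it will also show the '0' = neither level):

proc freq data=bikehelmets;
   /* raw cell counts behind the odds ratios */
   tables helmet*(personalebike personalcbike personalescoot) / norow nocol nopercent;
run;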

How would you guys go about accounting for the unequal sample sizes between personal and shared vehicles?

Thanks!
 

hlsmith

Less is more. Stay pure. Stay poor.
#2
Well, it is what it is. Almost all observational data are going to be unbalanced in the independent variables, and you can end up with some sparsity in subgroups. There really isn't anything you can do about it; the result will be larger SEs and wider confidence intervals for the groups with fewer observations, simply because the estimates for those groups are based on less information.
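
To see how the smallest cells drive the interval width, you can hand-compute the crude (unadjusted) 2x2 odds ratio and its Wald confidence interval. A sketch using the e-scooter counts from your post (note how the 1/12 term dominates the variance of the log odds ratio):

data or_ci;
   a = 74; b = 60;   /* personal e-scooter: helmet yes / no */
   c = 12; d = 56;   /* shared e-scooter:   helmet yes / no */
   or  = (a*d) / (b*c);
   se  = sqrt(1/a + 1/b + 1/c + 1/d);   /* SE of log(OR); smallest cell dominates */
   lcl = exp(log(or) - 1.96*se);
   ucl = exp(log(or) + 1.96*se);
run;

proc print data=or_ci noobs; run;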

There are approaches for rebalancing outcomes (oversampling, undersampling, both, creating synthetic data, etc.), but given your scenario I would recommend running the data the way they are. When you rebalance data you artificially change the dynamics and can introduce selection bias or the need for post-estimation corrections. If you are unhappy with the precision, you will have to collect more data, but you would need to report that change in protocol to whomever you report results to, so they know why you did it.
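
Just to illustrate what I mean, not as a recommendation: undersampling the majority outcome class could look something like the sketch below, after which the intercept and any predicted probabilities would need a post-hoc correction for the artificial event rate. The stratum size of 1101 is an assumption (the total of the 'no helmet' counts in your post), and the seed is arbitrary.

proc sort data=bikehelmets out=bikesorted; by helmet; run;

/* Hypothetical sketch only: keep all ~1101 non-wearers and draw */
/* a random 1101 of the wearers so the outcome classes match.    */
proc surveyselect data=bikesorted out=balanced method=srs
                  sampsize=(1101 1101) seed=12345;
   strata helmet;   /* strata in sort order: helmet=0 first, then helmet=1 */
run;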