Selection bias in categorical independent variable in regression analysis


I have an issue with selection bias in my independent variable and I am uncertain of the best approach to correct for it.

I'm looking at soccer data and trying to forecast how players from different amateur level clubs might be expected to perform at the professional level. The categorical variable is the league that the player's club is in (low level or high level).

The issue is that there are a few players from lower level leagues who actually make it to the professional ranks and perform well. The regression model suggests that coming from a lower level league predicts higher future performance than playing at a higher level. This obviously is not correct. The problem here is selection bias, as the few lower level players who make it are not representative of the larger population from that level. In contrast, a larger number of players from the higher level leagues make it to pro status, but some of them do not pan out as good players.
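To make the mechanism concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration, not the actual data: the ability distributions, the cutoffs for turning pro, and the names `simulate` and `settings` are all hypothetical. It assumes high level players are better on average, but low level players need a much higher ability to get noticed at all.

```python
import random
import statistics

def simulate(n_per_league=20000, seed=0):
    """Simulate who turns pro and how they perform (all numbers are made up)."""
    rng = random.Random(seed)
    # Hypothetical settings: (mean ability in the league, ability cutoff to turn pro).
    # By construction the high level league is better on average (1.0 vs 0.0),
    # but low level players face a much stricter cutoff.
    settings = {"low": (0.0, 2.0), "high": (1.0, 0.5)}
    pros = {"low": [], "high": []}
    for league, (mean_ability, cutoff) in settings.items():
        for _ in range(n_per_league):
            ability = rng.gauss(mean_ability, 1.0)
            if ability > cutoff:
                # Pro performance = ability + noise; only observed for those selected.
                pros[league].append(ability + rng.gauss(0.0, 1.0))
    return pros

pros = simulate()
for league, performance in pros.items():
    print(league, "pros:", len(performance),
          "mean pro performance:", round(statistics.mean(performance), 2))
```

Under these assumptions, far fewer low level players turn pro, yet their average pro performance comes out higher than that of the high level group, even though the high level league is better overall. A regression fit only to the players who turned pro would reproduce exactly this reversed league effect.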

As such, is there a way to correct for the selection bias among the lower level players, and for the substantial difference in sample size between the classes of the categorical predictor in my model?


So are there few poor low level players in your data, or none? And the good low level players make low level players as a whole look better on average. I don't think this is selection bias, since poor low level players who succeed don't exist. You are just missing a variable that better characterizes the players. It is just like looking at hospital survival in patients admitted from home versus from a skilled care facility: if I look only at where they came from and not at how sick they are, I have a pretty crude model. There is a third variable (a confounder) associated both with where they come from and with whether they will succeed.
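The hospital analogy can be sketched with a small simulation. All probabilities below are invented for illustration, and the function names `simulate_patients` and `survival_rate` are hypothetical. The sketch assumes that survival depends only on severity, and that patients from a skilled care facility are more often severely ill.

```python
import random

def simulate_patients(n=40000, seed=1):
    """Return (from_facility, severe, survived) tuples; all rates are made up."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        from_facility = rng.random() < 0.3
        # Assumption: facility patients are more often severely ill (the confounder).
        severe = rng.random() < (0.7 if from_facility else 0.2)
        # Assumption: survival depends on severity only, not on admission source.
        survived = rng.random() < (0.6 if severe else 0.95)
        rows.append((from_facility, severe, survived))
    return rows

def survival_rate(rows, from_facility, severe=None):
    """Survival rate for one admission source, optionally within one severity stratum."""
    picked = [s for f, sv, s in rows
              if f == from_facility and (severe is None or sv == severe)]
    return sum(picked) / len(picked)

rows = simulate_patients()

# Crude comparison: ignores how sick the patients are.
crude_gap = survival_rate(rows, False) - survival_rate(rows, True)
print("crude gap (home minus facility):", round(crude_gap, 3))

# Stratified comparison: compare sources within the same severity level.
stratified_gaps = {}
for severe in (False, True):
    gap = survival_rate(rows, False, severe) - survival_rate(rows, True, severe)
    stratified_gaps[severe] = gap
    print("severe =", severe, "gap:", round(gap, 3))
```

In this setup the crude comparison shows a clear survival advantage for patients admitted from home, while the gap within each severity stratum is close to zero. The apparent effect of admission source is carried entirely by severity, which is the role a measure of player ability would play in the soccer model.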


Thanks for your comment, hlsmith. This makes a lot of sense. There are indeed a substantial number of low level players who are not good and never get drafted up to the MLS from college. Because they never make it, and because they come from a lower level university (where no real data is collected on their ability), I don't have any other information to go on. It really is more an issue of lacking the right data to answer the question than anything else.


I can get goals scored, but I'd like more context than that, and that isn't accessible at those levels right now.