Missing Level on linear model tests

#1
I'm an R newbie, and we're attempting to use R for some regression analysis at work on some of our data sets. To start, we wanted to take a very simple data set that we had and fit a linear model to it.

The problem I'm running into is that once I import the data file and run lm(), I lose one of my levels. I don't understand where it has gone, or whether I'm just interpreting the output wrong.

The output looks like this:

Code:
> fit <- lm(TotalPercPaid120 ~ AgeBucket)
> summary(fit)

Call:
lm(formula = TotalPercPaid120 ~ AgeBucket)

Residuals:
    Min      1Q  Median      3Q     Max 
-90.496  -0.495   0.264   0.317  45.452 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       0.597990   0.009114  65.611  < 2e-16 ***
AgeBucket25      -0.087959   0.015536  -5.662 1.51e-08 ***
AgeBucket30      -0.104718   0.016645  -6.291 3.17e-10 ***
AgeBucket35      -0.102780   0.016092  -6.387 1.70e-10 ***
AgeBucket40      -0.072274   0.015402  -4.693 2.70e-06 ***
AgeBucket45      -0.039197   0.014904  -2.630  0.00854 ** 
AgeBucket50       0.033393   0.013828   2.415  0.01574 *  
AgeBucket55       0.085377   0.012923   6.607 3.96e-11 ***
AgeBucket60       0.116011   0.012731   9.113  < 2e-16 ***
AgeBucket65       0.162438   0.012688  12.802  < 2e-16 ***
AgeBucket70       0.109460   0.013485   8.117 4.85e-16 ***
AgeBucket75       0.086453   0.014602   5.921 3.22e-09 ***
AgeBucket80       0.121772   0.015791   7.711 1.26e-14 ***
AgeBucket85       0.137719   0.017063   8.071 7.08e-16 ***
AgeBucket90       0.154927   0.021803   7.106 1.21e-12 ***
AgeBucket95       0.163869   0.038299   4.279 1.88e-05 ***
AgeBucketPlus100  0.052145   0.068789   0.758  0.44843    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.728 on 64449 degrees of freedom
Multiple R-squared: 0.01448,    Adjusted R-squared: 0.01423 
F-statistic: 59.18 on 16 and 64449 DF,  p-value: < 2.2e-16

There should be another level, "AgeBucket16"; it's the first level of the AgeBucket factor.

I get the same problem when I run an ANOVA using the same factors: I lose the first level of both my "AgeBucket" factor and my "Hospital" factor.

Code:
> fit2 <- aov(TotalPercPaid120 ~ AgeBucket + Hospital)
> summary(fit2)
               Df Sum Sq Mean Sq F value Pr(>F)    
AgeBucket      16    502   31.36   59.24 <2e-16 ***
Hospital        1     39   38.72   73.15 <2e-16 ***
Residuals   64448  34118    0.53                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
> coef(fit2)
     (Intercept)      AgeBucket25      AgeBucket30      AgeBucket35      AgeBucket40      AgeBucket45      AgeBucket50      AgeBucket55      AgeBucket60      AgeBucket65 
      0.63715260      -0.10455534      -0.11868351      -0.11665114      -0.08450580      -0.05072533       0.02153494       0.07419295       0.10470710       0.15132806 
     AgeBucket70      AgeBucket75      AgeBucket80      AgeBucket85      AgeBucket90      AgeBucket95 AgeBucketPlus100  HospitalSt Mary 
      0.09947468       0.07662257       0.11098498       0.12526939       0.14027786       0.15178454       0.03671460      -0.05010209

Can anyone shed some light on what I'm missing?
 

Jake

Cookie Scientist
#2
The default contrasts for factors in R are based on "dummy coding" (R calls these treatment contrasts), where one level of the factor serves as the reference group against which the other levels are compared. By default, R uses the first level of the factor as the reference group, and the levels are sorted alphabetically unless you specify otherwise. So you don't get a coefficient for the first group, because it wouldn't make sense to compare a group to itself. Instead, the predicted value for the reference group is encoded in the intercept. In your first model, the intercept of 0.598 is the mean for AgeBucket16, and each AgeBucket coefficient is that group's difference from AgeBucket16.
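To make that concrete, here is a minimal sketch on made-up data. The variable names `age` and `paid` and all the values are hypothetical stand-ins, not the poster's data. It shows which level R treats as the reference and checks that the intercept is the reference group's mean while each other coefficient is a difference from it.

```r
# Hypothetical toy data: three age buckets, four observations each.
age  <- factor(rep(c("16", "25", "30"), each = 4))
paid <- c(0.60, 0.62, 0.58, 0.60,
          0.50, 0.52, 0.51, 0.51,
          0.48, 0.50, 0.49, 0.49)

levels(age)     # the first level listed is the reference group
contrasts(age)  # the dummy (treatment) coding matrix R will use

fit <- lm(paid ~ age)
coef(fit)       # (Intercept), age25, age30 -- no coefficient for level 16

# The intercept is the mean of the reference group ...
ref_mean <- mean(paid[age == "16"])
intercept_matches <- isTRUE(all.equal(unname(coef(fit)["(Intercept)"]), ref_mean))

# ... and each coefficient is that group's mean minus the reference mean.
diff_25 <- mean(paid[age == "25"]) - ref_mean
slope_matches <- isTRUE(all.equal(unname(coef(fit)["age25"]), diff_25))
```

If a different baseline is more natural, `relevel(age, ref = "25")` returns a copy of the factor with that level moved to the front, so the model is reported relative to it instead.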

Note, however, that with multiple dummy-coded factors, as in your second example, the intercept will not exactly equal the observed mean of any particular cell. It is the model's fitted value for an observation at the reference level of every factor. With balanced data, the intercept here equals the mean for reference group 1 + the mean for reference group 2 - the grand mean. This equation becomes more complicated in the presence of unbalanced factors and/or continuous covariates.
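The balanced-data identity can be checked numerically. The sketch below uses a hypothetical balanced design (3 age buckets by 2 hospitals, 2 observations per cell); the data frame `d`, the hospital name "General", and all values are invented for illustration.

```r
# Hypothetical balanced design: every age x hospital cell has 2 observations.
d <- expand.grid(rep  = 1:2,
                 age  = c("16", "25", "30"),
                 hosp = c("General", "St Mary"),
                 stringsAsFactors = TRUE)
d$paid <- c(0.60, 0.64, 0.50, 0.52, 0.47, 0.51,   # General
            0.55, 0.57, 0.49, 0.45, 0.40, 0.46)   # St Mary

# Additive model with two dummy-coded factors (no interaction).
fit2 <- lm(paid ~ age + hosp, data = d)
intercept <- unname(coef(fit2)["(Intercept)"])

# Marginal mean of each reference level, minus the grand mean.
expected <- mean(d$paid[d$age == "16"]) +
            mean(d$paid[d$hosp == "General"]) -
            mean(d$paid)
identity_holds <- isTRUE(all.equal(intercept, expected))

# The observed mean of the reference cell is generally a different number.
cell_mean <- mean(d$paid[d$age == "16" & d$hosp == "General"])
c(intercept = intercept, expected = expected, cell_mean = cell_mean)
```

The intercept and the observed cell mean differ here because the additive model does not fit each cell separately; adding the interaction term `age:hosp` would make the fitted values reproduce the cell means.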
 
#3
What happens when you have a data set with multiple factors and multiple levels in each factor? Even if they're balanced, it seems that folding every first level into the intercept would leave a lot to be desired from an analysis standpoint. For the example I showed, we ran normality tests, which showed that those specific models were not a good fit for the data, so we wouldn't use them for our work anyway, but that won't always be true.

Is there some kind of workaround for this?