First, what is a lagged dependent variable? Is it the value from the previous period, say consumption today depending on consumption yesterday?

Second, how do we make a lagged dependent variable part of a multiple regression in R?

Third, if we can make it part of the lm model, does that mean there is also a corresponding coefficient for it when we call coefficients(lmfit)?

Suppose your dependent variable is consumption. As you've said, if consumption today affects consumption at future time points, the observed values of consumption will be correlated with each other (this is called autocorrelation). To account for this autocorrelation, lagged values of the dependent variable can be included in the model as explanatory variables.
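Written out, a model with a one-period lag of the dependent variable looks roughly like the following. The symbols are my own choice for illustration (y for the dependent variable, x for a regressor, the betas for the coefficients), not notation from the original post:

```latex
y_t = \beta_0 + \beta_1 x_t + \beta_2 y_{t-1} + \varepsilon_t, \qquad t = 2, \dots, n
```

The coefficient \beta_2 on y_{t-1} is the one reported for the lag term, and the first observation is lost because y_0 is not observed.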

In R, there is a package called "dyn" which does this.

Code:

require(dyn);
# example data
data<-structure(list(y = c(34L, 24L, 35L, 53L, 24L, 68L, 86L, 73L,
34L), x = c(3L, 4L, 2L, 4L, 2L, 5L, 2L, 4L, 5L)), .Names = c("y",
"x"), class = "data.frame", row.names = c(NA, -9L))
y x
1 34 3
2 24 4
3 35 2
4 53 4
5 24 2
6 68 5
7 86 2
8 73 4
9 34 5
# Specify time series properties (y and x are columns of the data frame above)
y_1 <- ts(data$y)
x_1 <- ts(data$x)
# Fit the lagged dependent variable as an explanatory variable
m1<-dyn$lm(y_1 ~ x_1+lag(y_1, -1))
summary(m1)
Call:
lm(formula = dyn(y_1 ~ x_1 + lag(y_1, -1)))
Residuals:
2 3 4 5 6 7 8 9
-6.882 -3.674 10.072 -22.003 11.005 30.310 14.235 -33.062
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.3952 22.8796 1.066 0.335
x_1 6.1071 6.2471 0.978 0.373
lag(y_1, -1) -0.1685 0.6409 -0.263 0.803 # coeff for lag
Residual standard error: 24.42 on 5 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.2526, Adjusted R-squared: -0.04633
F-statistic: 0.845 on 2 and 5 DF, p-value: 0.4829
#--------------------------------------------------
# not a very good example/model, as the adjusted R-squared is negative. A non-lagged linear model could have done better, I think (for this not so good example). In fact, we can test it using a simple F test.
# non-lagged
m2<-dyn$lm(y_1 ~x_1)
summary(m2)
# compare m2 against m1 (m2 is nested within m1)
anova(m2,m1)
> anova(m2,m1)
Analysis of Variance Table
Model 1: y_1 ~ x_1
Model 2: y_1 ~ x_1 + lag(y_1, -1)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 6 3023.2
2 5 2982.0 1 41.195 0.0691 0.8032

Oh Thou Perelman! Poincare's was for you and Riemann's is for me.

Also, is there a way to get through what I need without downloading the package dyn?

There must be, but I am not sure how.
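One base R possibility, as a minimal sketch: shift y down by one row yourself and pass the shifted column to lm. The names dat, y_lag1 and fit_base are made up for this illustration, and the numbers are the example data from above typed as plain numerics.

```r
# Example data from above, as plain numeric vectors
dat <- data.frame(
  y = c(34, 24, 35, 53, 24, 68, 86, 73, 34),
  x = c(3, 4, 2, 4, 2, 5, 2, 4, 5)
)

# One-period lag: row i holds y from row i - 1; the first row has no predecessor
dat$y_lag1 <- c(NA, head(dat$y, -1))

# lm drops the row with the NA by default, so the fit uses rows 2 to 9
fit_base <- lm(y ~ x + y_lag1, data = dat)
coefficients(fit_base)   # one entry each for the intercept, x and y_lag1
```

This is the same model form as m1 (y on x and the previous y), so the lag gets its own entry in coefficients(), just like any other regressor.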

Do I have to put L beside the figures, as you have in y = c(34L, 24L, 35L, 53L, 24L, 68L, 86L, 73L, 34L)?

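No. You don't have to worry about the L. The L suffix just tells R to store the number as an integer rather than a double, and it shows up when data are exported with dput(). Plain numbers such as c(34, 24, 35) work just as well.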
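A quick check you can run yourself to see what the L changes (y_int and y_dbl are just illustrative names):

```r
y_int <- c(34L, 24L, 35L)   # with L: stored as integers
y_dbl <- c(34, 24, 35)      # without L: stored as doubles

class(y_int)                # storage type of the first vector
class(y_dbl)                # storage type of the second vector
all(y_int == y_dbl)         # the values compare as equal
```

For a regression it makes no practical difference which of the two you type.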

How can I include the lagged dependent variable in my existing formula:

lmfit1<-lm(Data1$C~Data1$Y+qtr)

So, you have 4 quarters. In this case, would you want to add separate lags for each quarter?
I am afraid I am not sure how to add different lags for different quarters (and I don't want to give you a vague answer). We never went beyond a simple one-page example in our course. I wish more of it had been covered.

I've never used the dyn package, but I think having your data in a time series object (ts) makes these sorts of regressions easier. In any case, I'm not going to sit here and try to explain the whole theory behind autocorrelation (and I hope you already know multiple regression). The basic idea, though, is that (in the common approach) you literally put a variable into your model that holds the value from the prior period(s). This modifies your error term, though, because the error now depends on previous periods (the algebra isn't that hard).

I've not done too many of these in practice, but when playing around with lagged variables in R, I usually just use index sequences in a convenient way that mirrors the usual notation.

For instance, suppose my dependent variable is 'y' and has length n (length(y) == n # reports TRUE). Then I make myself an index

Code:

t <- 2:n

Why did I choose 2? Because with a lag of one the first observation has no predecessor, so the usable series is one shorter than the full size n. This also makes it convenient to deal with the sequence 1, ..., n-1: all I have to do is look at t-1. R handles the vector arithmetic by subtracting 1 from each element. In other words, t is 2:n and t-1 is 1:(n-1). This gives us our current series y[t] and our lagged series y[t-1]. Nice syntax, right? So now I fit my lagged model with something like

Code:

fit <- lm(y[t] ~ x[t] + qtr[t] + y[t-1], df)
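To make that line something you can actually run, here is a minimal sketch with made-up quarterly data. Everything in df below is hypothetical and simulated only so the call has something to work on; the columns are named y, x and qtr to match the formula.

```r
# Hypothetical quarterly data, simulated purely for illustration
set.seed(42)
n  <- 16
df <- data.frame(
  x   = rnorm(n, mean = 100, sd = 10),
  qtr = factor(rep(paste0("Q", 1:4), length.out = n))
)
df$y <- 20 + 0.6 * df$x + rnorm(n, sd = 5)

t <- 2:n   # current periods; t - 1 is then 1:(n - 1)

# y[t] is the current series and y[t - 1] is the one-period lag
fit <- lm(y[t] ~ x[t] + qtr[t] + y[t - 1], data = df)
coefficients(fit)   # intercept, x, three quarter dummies, and the lag term
```

The column names are looked up in df, while t is picked up from the workspace, which is why the index has to exist before the call.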

There are actually functions that handle the shifting for you, in the sense that you can specify the lag you want on a variable: lag (see ?lag) shifts a ts object, and embed (see ?embed) builds a matrix of lagged columns. The diff function (see ?diff) is related but different: it returns lagged differences such as y[t] - y[t-1], not the lagged series itself. The problem is that these are useful for a given variable, but controlling 't' like I do lets me easily supply it to my other vectors. I can also apply it to the data frame itself. In this respect, I might do something like

Code:

fit <- lm(y ~ x + qtr + lagy, cbind(df[t, ], lagy = df$y[t-1]))

Here I am returning only the t-row subset of df and creating the lag variable (named lagy, as it is used in the formula) on the fly. If the model calls for changes rather than levels, this is where diff comes in handy, since it computes y[t] - y[t-1].
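For comparison, a small sketch of what two base R helpers return on a plain vector, using the y values from the example data earlier in the thread. The names lagged and d are just for illustration.

```r
y <- c(34, 24, 35, 53, 24, 68, 86, 73, 34)
n <- length(y)

# embed(y, 2) returns a matrix with n - 1 rows:
# column 1 is y[t] and column 2 is y[t - 1], for t = 2, ..., n
lagged <- embed(y, 2)

# diff(y) returns the changes y[t] - y[t - 1], not the lagged values
d <- diff(y)
```

So embed gives the same pairing as the y[t] and y[t-1] indexing, while diff is the thing to reach for when the model is in first differences.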

Some resources for time series in R: link, link, and link

Aside from literally coding the lagged variables myself, is there a way to have R print them? Like when you taught me about letting R create the dummy variables: model.matrix(~Data2+qtr-1) prints the data along with the dummy variables as additional columns.

For any regression, you can use model.matrix to return the X matrix used in the regression Y ~ Xb. Another useful function is model.frame, which returns the data frame used in the regression.
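As a minimal sketch of both on a fit with a lag in it, using the example data from earlier in the thread (dat, y_lag1, X and mf are names I made up here):

```r
dat <- data.frame(
  y = c(34, 24, 35, 53, 24, 68, 86, 73, 34),
  x = c(3, 4, 2, 4, 2, 5, 2, 4, 5)
)
dat$y_lag1 <- c(NA, head(dat$y, -1))   # one-period lag of y, NA in the first row

fit <- lm(y ~ x + y_lag1, data = dat)

X  <- model.matrix(fit)   # columns: intercept, x and the lagged y
mf <- model.frame(fit)    # y, x and y_lag1 for the rows actually used
```

Printing X or mf shows the lagged column next to the others, with the first row gone because its lag is missing.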