Okay, if you saw the chatbox, then you're obviously a regular here. You may have also noticed I asked a question regarding some research I'm doing at my new job (energy supplier). Another student and myself are tasked with creating a model to explore the relationship between household energy consumption (in KWh) and their income, given characteristics about the time (month), the house (sqft, year built), and the owner characteristics (education level, age group, race, etc.).
In particular, we're interested in estimating the income elasticity of demand for KWh (and later price elasticity), which is just a way of saying how sensitive is their demand for energy given increases in their income. This is measured by . Like you look at the derivative to get instantaneous velocity, we want to consider the derivative of this relationship. In economics, you typically take the log(Y) ~ log(X). The log(X) coefficient just is that elasticity. You don't have to use logged variables, but it is not as clear if you don't.
In any case, the observations have two different levels (say "clusters"). One is at the household level which is constant over the year in this data. The other level is the monthly information which is predominately the KWh. Other information is basically the time--month or recoded as the season to which it belongs: summer, winter, or spring-fall. So, for instance, income is listed as their monthly income, but it's the same each observation. So if you plot KWh ~ income, you basically see what looks like plotting a categorical variable: you have it at repeated measures at various levels of income because it's the same contract account ("household").
Analysis
So Jake saw this and says I should consider multilevel modeling. I said I thought about repeated measures, which is a form of it, of course. Great. But I've never actually done it, at least not with any competence lol
(1) Is lme4 the go to package for this? What others do you consider to be especially useful?
(2) Let's consider the more basic model, KWh ~ Income repeated on contract account (an ID field). How would I specify this in lmer? Is this formula correct?
Code:
kwh ~ income + (1|contract)
Do I include the other variables (fixed effects?) as in the normal linear model formula? When I include things in the parentheses, I'm basically saying that the given variable is a sample from a different population with its own error structure, right? This is stuff I usually find in graduate level statistics, and I haven't had the luxury of such a course, unfortunately. I want to make sure I'm comprehending it correctly and not just having R give me output I'm clueless about!
(3) What limitations should I consider in such a model? In this case, doesn't the number of repeated measures matter? Some observations I know are being excluded because we're simply lacking a monthly measure (less than 1% of the observations, but they're there). We're also missing a lot of the demographic data (sqft isn't always recorded or we don't know the ethnicity of the owner, e.g.).
Any help is appreciated here.
Cheers
(Note, I debated whether I should put this in the statistics or R forum because half of this is about the theory and the other half is about how to implement it in R. I'll accept deferring me to literature to brush up on the theory, but I'd like help making sure I implement it correctly so when I actually am interpreting results, validating my model, etc., it's actually the model I'm trying to specify!!)
I'm thinking the most basic model would be something like
Code:
fit <- lmer(kwh ~ income + (income|contract), df)
I'll have to see the output to start trying to make sense of it and ways to explore this data as repeated measures. I'll add other things I find interesting or useful.
One thing that I need to figure out is where to I include time (month, say, or season), because that trend should be considered in even the most basic model (before considering things like the demographics, etc.).
It would really help me wrap my head around the structure of the dataset if you could dput a small piece of it. Say, 9 households. (I say 9 because that number would lend itself well to some nice lattice plots of the resulting coefficients by household, and still has enough clusters to give a sense of what is happening in the dataset as a whole).
“In God we trust. All others must bring data.”
~W. Edwards Deming
I'll see what I can do, but the data is proprietary?
The data might look something like this
Code:
contract income race education rate kwh month season
111 1000 white some HS A 540 1 winter
111 1000 white some HS A 620 2 winter
111 1000 white some HS A 700 3 winter
111 1000 white some HS A 680 4 shoulder
111 1000 white some HS A 700 5 shoulder
111 1000 white some HS A 800 6 summer
111 1000 white some HS A 710 7 summer
222 2000 asian college B 300 4 shoulder
222 2000 asian college B 420 5 shoulder
222 2000 asian college B 340 6 summer
222 2000 asian college B 440 7 summer
222 2000 asian college B NA 8 summer
222 2000 asian college B 320 9 summer
Fields 2-5 are contract variables, the monthly income they report they make (consider it an average for the year), their education, the rate they pay for energy (there's 6 levels we've aggregated, but at minimum it would be electric vs gas rates). Fields 6-8 are usage variables which is the KWh they actually used that month, which month it is, and the season that month is considered--shoulder is spring and fall. The contract number links the two sets of variables (I keep them in separate tables, actually, and merge them as needed). As I indicated, though, there are other variables we have available, such as sqft of the house, the year the house was built, the length the person has lived in the house, the family size, and more. Are they all needed for the analysis? Of course not, but they may prove useful for a more developed model. I'm just trying to get a simple own down first! lol
Right now we're just looking at one year worth of data, so as most we'll have 12 months worth of usage data for a contract. The contract variables, though, are also incomplete, as some information just isn't recorded. I don't have the numbers, but a full NA omit removes about 20% of the data I believe. We have a few hundred-thousand contracts, so there is plenty of data. I was wondering if the number of months would be too limited, though, in a repeated measures design.
Looking at the author's website for chapters 13 and 14, they seem to use aov with differing error structures and lme a lot. I like the lmer syntax better than lme, but I'll have to look into that function and read up on it. I think ultimately, I need to just understand mixed models a bit better and find some examples to understand how to interpret results.
The basic model that you want to fit is the one you wrote in the first post -- not the one from your next post, where contracts have random income slopes. Since each contract has only 1 income value, it is not possible to estimate a different income slope for every contract. Rather, the overall income slope is estimated across contracts. This is illustrated in the plot produced below.
I put together some code where I generate data like yours, fit the basic model, and plot some results:
The error bars around the data points in the plot represent our degree of uncertainty about that data point. Different contracts have different degrees of uncertainty because there is randomly missing data, so not all contracts are contributing the same amount of information. Contracts with more complete data have a stronger influence on the regression line.
This is only a first step. There is more that I want to say and more that will need to be added to the model (e.g., the influence of seasons, which could have random slopes across contracts), but it is late now and I need to go to bed...
“In God we trust. All others must bring data.”
~W. Edwards Deming
I wanted to say a little more about basic issues in multilevel modeling that you will want to know before setting out on fitting some multilevel models.
First is what it means to designate effects as fixed or random. As you know, the syntax for these specifications are generally of the form "(predictor|group)". (More details on syntax HERE.) This can be read as: the effect of predictor has a random component (in addition to the fixed component) which varies across groups. Basically, this allows each group to have a different slope for predictor. (Actually, this particular piece of syntax also allows for random intercepts and a slope-intercept covariance, as mentioned in the link above). Each group's slope is calculated as the overall or fixed effect of predictor + the group's random effect of predictor. Here is some illustrative code still in the context of the kwh~income example, except I am using month as the predictor since this is something which varies within contracts.
In the plot produced by this code, the thin, solid regression lines represent the OLS regression line that would be fitted to only that contract's 12 data points, completely independent of the other contracts. Gelman refers to these as the "no pooling model" estimates. The thicker, dashed lines are the fitted contract-level regression lines according to various specifications of multilevel model. A couple things are noteworthy here. First and most obviously, when a particular effect is fixed, it is the same across all contracts. For example, in the panels where month slope is fixed, all contracts have the same slope for month. Second and less obvious is the "shrinkage" that is applied to the contract-level coefficients in the multilevel models. Gelman refers to this phenomenon as "partial pooling" -- the estimates for each contract are "shrunk" toward the fixed or population-level estimates to a certain degree, this being a function of our degree of uncertainty about that contract (which, in turn, is a function of the amount of data for that contract, the variance of the predictor values for that contract, and the difference between the contract-level and population-level estimates). You can also observe this shrinkage phenomenon in the code from my first post by comparing the "blups" variable to the simple contract-level kwh means. The idea is that we estimate each contract in light of what we know about all the other contracts, in a pseudo-Bayesian fashion. So we are pooling together common information about contracts, but still retaining individual estimates for each contract. (The "complete pooling model" is a simple OLS regression model where we completely ignore groupings in the data due to contract, essentially treating all the data as if it came from a single contract.)
Now I want to say a quick word about correlated random effects and centering. In multilevel models we can easily estimate correlations between, say, the intercepts and slopes for a contract (e.g., perhaps contracts with low average energy usage show strong, increasing temporal trends in energy usage, while contracts with high average energy usage show relatively flat temporal trends -- a negative correlation between intercepts and slopes). But you need to be careful when doing this. One of the basic difficulties here is the fact that, just as in ordinary regression, the intercept represents the predicted value when all the predictors are 0. So if 0 is not a meaningful value of the predictors and falls far outside the range of your observed data, it is easy to introduce spurious correlations between slopes and intercepts simply because of how the predictors are coded. Mean-centering the predictors goes a long way toward addressig this. This is illustrated below with some code that I have repurposed from a few months ago (so the example context is a bit different, and I'm not actually fitting any multilevel models, but the ideas are still illustrated nicely I think):
The thin black lines are the independent state-level regression lines, and the bold red line is the overall slope for the predictor (from the "complete pooling model").
So you need to think about not only when and whether you should center your predictors (in general it rarely hurts), but also about how to center them. Should you center them about their grand means, or center them about their group-specific means? There is no straightforward answer here, I just note that this can affect the interpretation substantially so it needs to be thought about carefully for each situation at hand. My impression is that usually people center about the grand mean, but you certainly see both on occasion.
“In God we trust. All others must bring data.”
~W. Edwards Deming
My post got too long so I had to continue it in a second one.
Originally Posted by bryangoodrich
(1) Is lme4 the go to package for this? What others do you consider to be especially useful?
lme4 is now the preferred package. nlme is older (and increasingly outdated it seems), but is more well developed and better documented, and lets you do some cool stuff not implemented yet in lme4, like flexible modeling of spatial/temporal autocorrelation, heteroscedasticity, etc. lme4 is technically still under development, but is much faster than nlme, handles crossed random effects far more gracefully, and has a nicer syntax. There are also a variety of other multilevel model packages (LINK).
Originally Posted by bryangoodrich
(2) Let's consider the more basic model, KWh ~ Income repeated on contract account (an ID field). How would I specify this in lmer? Is this formula correct?
I think I covered this above but let me know if some of it isn't clicking yet.
Originally Posted by bryangoodrich
(3) What limitations should I consider in such a model? In this case, doesn't the number of repeated measures matter? Some observations I know are being excluded because we're simply lacking a monthly measure (less than 1% of the observations, but they're there). We're also missing a lot of the demographic data (sqft isn't always recorded or we don't know the ethnicity of the owner, e.g.).
The missing data is not a problem.
As for modeling the monthly/seasonal trends in energy usage, I suspect that after controlling for the season variable (via a couple contrast codes, e.g., shoulder vs. non-shoulder and summer vs. winter), there won't still be much dependence due to the individual months, so they will just be considered as repeated measures within each season.
Originally Posted by bugman
I would like to know , since I am not a coding "master" what is the recode function actually doing here?
It is just a convenient way of converting between numeric and factor types and vice versa when dealing with factors that have several levels. The syntax in the recode function is basically like "if it's A, turn it into X; if it's B, turn it into Y; if it's anything else, turn it into Z;" etc.
“In God we trust. All others must bring data.”
~W. Edwards Deming
absolutely awesome simulations! Very insightful. I even pulled the other student over so I could explain how awesome the stuff was, and he was like "yeah, that's way over my head" because like most crappy econ programs, they just teach OLS and done lol
Busy day today, so I'll have more to say later, and I'm still actually doing some exploratory analysis on our data set (lots of data, nobody has analyzed it! I can spend years at this place looking through this stuff!).
In any case, I ran a model on the logs of KWh and income (plus 1 on the entire data set to avoid 0 issues) both with OLS and with a few multilevel models and the estimates on income were pretty much the same, around 0.00527 vs like 0.00522. The result is not unexpected. I honestly don't see income having any short-term influence on energy consumption, period. Other research apparently supports a very low income elasticity to energy consumption in the short-run.
EDIT:
I'll add that I may look into centering as I explore our data. We do have, obviously, values at 0. Though, this raises some questions about our data. Do people really use no energy? Should they be excluded? There's also pretty much people that use 1 KWh, 2 KWh, ... which begs the question, too. I have no doubt there's just a lot of bad data here. The income being 0 isn't unusual because we're focusing on our economically disadvantaged contracts and how best to serve them and all our customers. We do have some, however, that have BS incomes listed. Like 8 are over $100,000 a month? Riiiight. As it stands, we have nearly 100,000 contracts in our data set.
@ Mr BG. Some friends of mine a few years ago were investigating electricity consumption in households. But it turned out, they noticed this quite late after they had started, that the data quality was very low. Like 5% or 10% of the data values were completely wrong. Consumption could be zero in February and March and then all of it was “attributed” to April. Customer weren’t cheated but data got wrong. The measurement system had all kind of peculiarities. It might be the same difficulties in electricity companies all over the world.