geoid income race familysize
06 1000 white 2
06 1200 black 1
06 1500 asian 1
06 2000 white 3
06 800 white 1
07 3000 white 2
07 1200 black 1
07 2500 asian 5
07 2000 white 3
07 800 black 2
08 100 white 1
08 4200 black 1
08 1500 asian 2
08 1000 asian 3
08 800 white 1
I want to reshape it for mapping purposes in a GIS, which would be better served by this
Code:
geoid meanincome white black asian meanfamily
06 1300 3 1 1 1.6
07 1900 2 2 1 2.6
08 1520 2 1 2 1.6
Does that make sense? I was trying to do this in base R, as I know there are the plyr and reshape packages out there to make this a bit easier.
(1) I'm still not familiar with those packages since I never use them
(2) I'm still interested in a base approach.
I initially thought you could pass multiple functions, sort of like I can do with xtabs or something. Then I thought I'd get creative and create a function that operates on the multiple fields appropriately, like
Unfortunately, aggregate does some weird **** when it passes arguments! (seriously, try a print statement or something in your function). My next guess is to try doing something with by or multiple tapply, but I thought I'd see what ya'll thought were efficient ways to handle this situation.
Re: Reshape and Aggregation - Long to Wide Format with Summaries
I could do something pretty ugly too, the point was efficiency! :P
This prototyping was for a quick example. There's a dozen or so variables with a variety of factors at most of them that I'm going to repeat this process for, and I'll have to do it at different levels of the GEOID (it encodes blocks, block groups, census tracts, counties, and states). I like the table idea, though. I'll have to investigate that. I forget that when I'm doing counts, it's better than summary (but I always think of summary, first). I want to avoid merge, though. It's ridiculously slow! I have another script where I did it twice and if I got rid of it, I could probably put my 30 second process down to 10! The problem is doing a dual matching instead of a single field match (for which I created a wrapper vlookup that basically simulates vlookup in Excel, and I use that to recode factor levels on the spot). I have a lookup table for a given rate class during a given season of the year. It has to match both of them to grab a given KWh minimum value for a classification. I used merge because it's straight-forward. But that's another problem altogether.
Re: Reshape and Aggregation - Long to Wide Format with Summaries
I haven't tested this solution for efficiency, but it at least has the advantages of being straightforward and transparent in terms of implementation:
Code:
# first, this guy here wants to be numeric
dat$familysize <- as.numeric(as.character(dat$familysize))
do.call(rbind, by(dat, dat$geoid, function(x){
c(meanincome = mean(x$income),
meanfamily = mean(x$familysize),
table(x$race))
}))
# meanincome meanfamily asian black white
# 06 1300 1.6 1 1 3
# 07 1900 2.6 1 2 2
# 08 1520 1.6 2 1 2
An important thing to notice about the returned data frame here is that the grouping variable, geoid, gets put into the rownames rather than being included as one of the variables. Putting it back into the data frame is of course simple if you decide you need to do this.
In God we trust. All others must bring data.
~W. Edwards Deming
Re: Reshape and Aggregation - Long to Wide Format with Summaries
I like that solution. It's what I was going to try once Dason reminded me of table for counting categorical variables. Looks pretty nice. I will need to bind it to the GEOID since I'm going to push this into a SQLite (spatialite) database where I put my TIGER shapefiles (and subsets for my business area). Then I can join the table by the GEOID and start doing maps based on these characteristics. I'll just have to come up with my own summary function that aggregates the fields the way I want them aggregated. If I do some conditional logic, I may be able to break it into two steps: if factor/character do table, else take mean/median. This way I won't have to know what I'm summarizing. It just gets done! This calls for running apply on the by function
Only problem I see above is that apply seems to turn 'x' into a matrix, which can only be one data type, which in this case forces everything to be character ... hmmm
Re: Reshape and Aggregation - Long to Wide Format with Summaries
I haven't timed it or even know if it's practical for your actual situation (since you have to construct each column for race separately) but data.table might be quite quick?
Re: Reshape and Aggregation - Long to Wide Format with Summaries
That data.table syntax is pretty nice, but this still requires a lot of coding and knowing the levels in your factors. As I said, I have about a dozen characteristics, most of them factors, some with a variety of levels. That would mean I'd have to sit there and code out maybe 30 sum statements, which isn't feasible, especially if I want to do any recoding. My approach above finds the relevant information, but it isn't in the good format that Jake provided in his method. The problem is his approach does each field as its own vector and concatenates them, while mine treats them more tabular and returns such an object. That's because I used cbind. If I replace it with c, I get the desired information for each geoid, but the field names are like race.asian, race.black, etc.
I guess what I could do is add a post processing to remove that portion of the name ... hmmm
Code:
names(x) <- gsub(".*\\.", "", names(x))
This works at renaming the fields, but the outer do.call (rbind) doesn't respect the names that by maintains.
Re: Reshape and Aggregation - Long to Wide Format with Summaries
I don't know how much of a speed gain data.table even gets in a situation like this (if any) but you could probably construct the relevant data.table statement programmatically and use eval. But that's not very clean and probably too much work especially if you already have some working solutions.
I don't use data.table but I'm wondering if there is a way to essentially construct those columns automatically...
"His programming is malfunctioning. It begins! Get your weapons, he's going to become a killbot!!!" - bryangoodrich
Re: Reshape and Aggregation - Long to Wide Format with Summaries
Actually, I take that back. It's the data.frame version of rbind I was using that screws it up (**** you data.frame and your renaming my stuff!). This works beautifully and automates the complexity of the data summarization.
Code:
# Read in the prototyping data
df = data.frame(scan(what = list(geoid = "", income = 0, race = "", familysize = 0)))
06 1000 white 2
06 1200 black 1
06 1500 asian 1
06 2000 white 3
06 800 white 1
07 3000 white 2
07 1200 black 1
07 2500 asian 5
07 2000 white 3
07 800 black 2
08 100 white 1
08 4200 black 1
08 1500 asian 2
08 1000 asian 3
08 800 white 1
# Summarizes a field based on its class
summarize <- function(x) {if (is.character(x) | is.factor(x)) table(x) else mean(x)}
# Applies the field summary to each field in the geoid frame passed here, but returns a properly named vector
f <- function(geoid) {
x = do.call('c', lapply(geoid, summarize))
names(x) <- gsub(".*\\.", "", names(x))
return(x)
}
do.call('rbind', by(df[-1], df$geoid, f))
The result?
Code:
income asian black white familysize
06 1300 1 1 3 1.6
07 1900 1 2 2 2.6
08 1520 2 1 2 1.6
Yes, I need to give a better name to the 'f' function, but I'm not creative this morning lol
Note: This has the problem that the GEOID are in the rownames, but this is an easy fix
Code:
x <- do.call('rbind', by(df[-1], df$geoid, f))
x <- transform(x, geoid = rownames(x)) # return as factor
Re: Reshape and Aggregation - Long to Wide Format with Summaries
If you're looking for pure speed I think the split lapply approach will always win out. That basically means taking jake's answer and doing a few tweaks:
Re: Reshape and Aggregation - Long to Wide Format with Summaries
That doesn't account for how long it takes to change the code when the data set changes. Mine adjusts automagically!
i.e., my solution applies to any tabular data set!
Now I should add an options variable for the numeric fields just in case I want to do a different summary (e.g., median). Even better, let the user pass that function, just in case maybe I want to calculate something's geometric mean, instead. Even better better, let the user pass more than one function and have it return each of those summaries. Something like
Code:
tabulate(. ~ geoid, df, list(mean, median, max))
That would be a general solution. What summaries might one want to also do for factors in this case? I'm thinking maybe proportions. Have it return the count divided by the respective geoid nrow value. If one has a vector of each geoid nrow (not hard to calculate: tapply(df[, 1], df$geoid, length)), one could also convert back and forth afterwards.