+ Reply to Thread
Results 1 to 11 of 11

Thread: Reshape and Aggregation - Long to Wide Format with Summaries

  1. #1
    Probably A Mammal
    Points: 14,507, Level: 78
    Level completed: 15%, Points required for next Level: 343
    bryangoodrich's Avatar
    Location
    Sacramento, California, United States
    Posts
    1,956
    Thanks
    223
    Thanked 419 Times in 387 Posts

    Reshape and Aggregation - Long to Wide Format with Summaries



    For prototyping, let's consider this data set

    Code: 
    structure(list(geoid = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 
    2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("06", "07", "08"), class = "factor"), 
        income = c(1000, 1200, 1500, 2000, 800, 3000, 1200, 2500, 
        2000, 800, 100, 4200, 1500, 1000, 800), race = structure(c(3L, 
        2L, 1L, 3L, 3L, 3L, 2L, 1L, 3L, 2L, 3L, 2L, 1L, 1L, 3L), .Label = c("asian", 
        "black", "white"), class = "factor"), familysize = structure(c(2L, 
        1L, 1L, 3L, 1L, 2L, 1L, 4L, 3L, 2L, 1L, 1L, 2L, 3L, 1L), .Label = c("1", 
        "2", "3", "5"), class = "factor")), .Names = c("geoid", "income", 
    "race", "familysize"), row.names = c(NA, -15L), class = "data.frame")
    That looks like this

    Code: 
    geoid	income	race	familysize
    06	1000	white	2
    06	1200	black	1
    06	1500	asian	1
    06	2000	white	3
    06	800	white	1
    07	3000	white	2
    07	1200	black	1
    07	2500	asian	5
    07	2000	white	3
    07	800	black	2
    08	100	white	1
    08	4200	black	1
    08	1500	asian	2
    08	1000	asian	3
    08	800	white	1
    I want to reshape it for mapping purposes in a GIS, which would be better served by this

    Code: 
    geoid	meanincome	white	black	asian	meanfamily
    06	1300		3	1	1	1.6
    07	1900		2	2	1	2.6
    08	1520		2	1	2	1.6
    Does that make sense? I was trying to do this in base R, as I know there are the plyr and reshape packages out there to make this a bit easier.

    (1) I'm still not familiar with those packages since I never use them

    (2) I'm still interested in a base approach.

    I initially thought you could pass multiple functions, sort of like I can do with xtabs or something. Then I thought I'd get creative and create a function that operates on the multiple fields appropriately, like

    Code: 
    aggregate(cbind(income, race) ~ geoid, data, function(x) {cbind(mean(x$income), summary(race))})
    Unfortunately, aggregate does some weird **** when it passes arguments! (seriously, try a print statement or something in your function). My next guess is to try doing something with by or multiple tapply, but I thought I'd see what ya'll thought were efficient ways to handle this situation.

    Thoughts?

  2. #2
    RotParaTon
    Points: 46,287, Level: 100
    Level completed: 0%, Points required for next Level: 0
    Awards:
    Discussion EnderPosting AwardFrequent PosterCommunity AwardMaster Tagger
    Dason's Avatar
    Location
    Ames, IA
    Posts
    9,083
    Thanks
    211
    Thanked 1,609 Times in 1,379 Posts

    Re: Reshape and Aggregation - Long to Wide Format with Summaries

    If I need to come up with a solution in 15 seconds this is what I would do...

    I would use ddply (or aggregate or whatever you want) to get the meanincome and meanfamily.

    I would use tapply along with table to get the three columns of counts for race. 'cast' from reshape could be used for this too though

    Code: 
    dat <- structure(list(geoid = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 
    2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("06", "07", "08"), class = "factor"), 
        income = c(1000, 1200, 1500, 2000, 800, 3000, 1200, 2500, 
        2000, 800, 100, 4200, 1500, 1000, 800), race = structure(c(3L, 
        2L, 1L, 3L, 3L, 3L, 2L, 1L, 3L, 2L, 3L, 2L, 1L, 1L, 3L), .Label = c("asian", 
        "black", "white"), class = "factor"), familysize = structure(c(2L, 
        1L, 1L, 3L, 1L, 2L, 1L, 4L, 3L, 2L, 1L, 1L, 2L, 3L, 1L), .Label = c("1", 
        "2", "3", "5"), class = "numeric")), .Names = c("geoid", "income", 
    "race", "familysize"), row.names = c(NA, -15L), class = "data.frame")
    
    #mn <- ddply(dat, .(geoid), summarize, meanincome = mean(income), meanfamily = mean(familysize))
    mn <- aggregate(cbind(income, familysize) ~ geoid, mean, data = dat)
    rc <- as.data.frame(do.call(rbind, tapply(dat$race, dat$geoid, table)))
    rc$geoid <- rownames(rc)
    output <- merge(mn, rc)
    
    #alternatively forget about adding the geoid column to rc and just do this...
    cbind(mn, rc)
    Very kludgy but it works?
    "His programming is malfunctioning. It begins! Get your weapons, he's going to become a killbot!!!" - bryangoodrich

  3. #3
    Probably A Mammal
    Points: 14,507, Level: 78
    Level completed: 15%, Points required for next Level: 343
    bryangoodrich's Avatar
    Location
    Sacramento, California, United States
    Posts
    1,956
    Thanks
    223
    Thanked 419 Times in 387 Posts

    Re: Reshape and Aggregation - Long to Wide Format with Summaries

    I could do something pretty ugly too, the point was efficiency! :P

    This prototyping was for a quick example. There's a dozen or so variables with a variety of factors at most of them that I'm going to repeat this process for, and I'll have to do it at different levels of the GEOID (it encodes blocks, block groups, census tracts, counties, and states). I like the table idea, though. I'll have to investigate that. I forget that when I'm doing counts, it's better than summary (but I always think of summary, first). I want to avoid merge, though. It's ridiculously slow! I have another script where I did it twice and if I got rid of it, I could probably put my 30 second process down to 10! The problem is doing a dual matching instead of a single field match (for which I created a wrapper vlookup that basically simulates vlookup in Excel, and I use that to recode factor levels on the spot). I have a lookup table for a given rate class during a given season of the year. It has to match both of them to grab a given KWh minimum value for a classification. I used merge because it's straight-forward. But that's another problem altogether.

  4. #4
    Cookie Scientist
    Points: 5,945, Level: 49
    Level completed: 98%, Points required for next Level: 5
    Jake's Avatar
    Location
    Boulder, CO
    Posts
    797
    Thanks
    18
    Thanked 315 Times in 241 Posts

    Re: Reshape and Aggregation - Long to Wide Format with Summaries

    I haven't tested this solution for efficiency, but it at least has the advantages of being straightforward and transparent in terms of implementation:
    Code: 
    # first, this guy here wants to be numeric
    dat$familysize <- as.numeric(as.character(dat$familysize))
    
    do.call(rbind, by(dat, dat$geoid, function(x){
      c(meanincome = mean(x$income),
        meanfamily = mean(x$familysize),
        table(x$race))
      }))
    #    meanincome meanfamily asian black white
    # 06       1300        1.6     1     1     3
    # 07       1900        2.6     1     2     2
    # 08       1520        1.6     2     1     2
    An important thing to notice about the returned data frame here is that the grouping variable, geoid, gets put into the rownames rather than being included as one of the variables. Putting it back into the data frame is of course simple if you decide you need to do this.
    “In God we trust. All others must bring data.”
    ~W. Edwards Deming

  5. The Following User Says Thank You to Jake For This Useful Post:

    bryangoodrich (08-11-2012)

  6. #5
    Probably A Mammal
    Points: 14,507, Level: 78
    Level completed: 15%, Points required for next Level: 343
    bryangoodrich's Avatar
    Location
    Sacramento, California, United States
    Posts
    1,956
    Thanks
    223
    Thanked 419 Times in 387 Posts

    Re: Reshape and Aggregation - Long to Wide Format with Summaries

    I like that solution. It's what I was going to try once Dason reminded me of table for counting categorical variables. Looks pretty nice. I will need to bind it to the GEOID since I'm going to push this into a SQLite (spatialite) database where I put my TIGER shapefiles (and subsets for my business area). Then I can join the table by the GEOID and start doing maps based on these characteristics. I'll just have to come up with my own summary function that aggregates the fields the way I want them aggregated. If I do some conditional logic, I may be able to break it into two steps: if factor/character do table, else take mean/median. This way I won't have to know what I'm summarizing. It just gets done! This calls for running apply on the by function

    Code: 
    by(df, df$geoid, function(x) {apply(x, 2, function(col) {if (is.character(col)) table(col) else mean(col)}})
    Only problem I see above is that apply seems to turn 'x' into a matrix, which can only be one data type, which in this case forces everything to be character ... hmmm

    This is a failed attempt

    Code: 
    summarize <- function(x) {if (is.character(x) | is.factor(x)) table(x) else mean(x)}
    do.call('rbind.data.frame', by(df[-1], df$geoid, function(geoid) {do.call('cbind', lapply(geoid, summarize)}))

  7. #6
    RotParaTon
    Points: 46,287, Level: 100
    Level completed: 0%, Points required for next Level: 0
    Awards:
    Discussion EnderPosting AwardFrequent PosterCommunity AwardMaster Tagger
    Dason's Avatar
    Location
    Ames, IA
    Posts
    9,083
    Thanks
    211
    Thanked 1,609 Times in 1,379 Posts

    Re: Reshape and Aggregation - Long to Wide Format with Summaries

    I haven't timed it or even know if it's practical for your actual situation (since you have to construct each column for race separately) but data.table might be quite quick?

    Code: 
    dat <- structure(list(geoid = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 
                                              2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("06", "07", "08"), class = "factor"), 
                          income = c(1000, 1200, 1500, 2000, 800, 3000, 1200, 2500, 
                                     2000, 800, 100, 4200, 1500, 1000, 800),
                          race = structure(c(3L, 2L, 1L, 3L, 3L, 3L, 2L, 1L, 3L, 2L, 3L, 2L, 1L, 1L, 3L), .Label = c("asian", "black", "white"), class = "factor"), 
                          familysize = structure(c(2L, 1L, 1L, 3L, 1L, 2L, 1L, 4L, 3L, 2L, 1L, 1L, 2L, 3L, 1L), .Label = c("1", "2", "3", "5"), class = "numeric")), 
                     .Names = c("geoid", "income", "race", "familysize"), 
                     row.names = c(NA, -15L), 
                     class = "data.frame")
    
    library(data.table)
    data_t = data.table(dat)
    ans = data_t[,list(meanfamily = mean(familysize), 
                       meanincome = mean(income), 
                       white = sum(race=="white"), 
                       asian = sum(race == "asian"),
                       black = sum(race == "black")), , by = c('geoid')]
    "His programming is malfunctioning. It begins! Get your weapons, he's going to become a killbot!!!" - bryangoodrich

  8. #7
    Probably A Mammal
    Points: 14,507, Level: 78
    Level completed: 15%, Points required for next Level: 343
    bryangoodrich's Avatar
    Location
    Sacramento, California, United States
    Posts
    1,956
    Thanks
    223
    Thanked 419 Times in 387 Posts

    Re: Reshape and Aggregation - Long to Wide Format with Summaries

    That data.table syntax is pretty nice, but this still requires a lot of coding and knowing the levels in your factors. As I said, I have about a dozen characteristics, most of them factors, some with a variety of levels. That would mean I'd have to sit there and code out maybe 30 sum statements, which isn't feasible, especially if I want to do any recoding. My approach above finds the relevant information, but it isn't in the good format that Jake provided in his method. The problem is his approach does each field as its own vector and concatenates them, while mine treats them more tabular and returns such an object. That's because I used cbind. If I replace it with c, I get the desired information for each geoid, but the field names are like race.asian, race.black, etc.

    I guess what I could do is add a post processing to remove that portion of the name ... hmmm

    Code: 
    names(x) <- gsub(".*\\.", "", names(x))
    This works at renaming the fields, but the outer do.call (rbind) doesn't respect the names that by maintains.

  9. #8
    RotParaTon
    Points: 46,287, Level: 100
    Level completed: 0%, Points required for next Level: 0
    Awards:
    Discussion EnderPosting AwardFrequent PosterCommunity AwardMaster Tagger
    Dason's Avatar
    Location
    Ames, IA
    Posts
    9,083
    Thanks
    211
    Thanked 1,609 Times in 1,379 Posts

    Re: Reshape and Aggregation - Long to Wide Format with Summaries

    I don't know how much of a speed gain data.table even gets in a situation like this (if any) but you could probably construct the relevant data.table statement programmatically and use eval. But that's not very clean and probably too much work especially if you already have some working solutions.

    I don't use data.table but I'm wondering if there is a way to essentially construct those columns automatically...
    "His programming is malfunctioning. It begins! Get your weapons, he's going to become a killbot!!!" - bryangoodrich

  10. #9
    Probably A Mammal
    Points: 14,507, Level: 78
    Level completed: 15%, Points required for next Level: 343
    bryangoodrich's Avatar
    Location
    Sacramento, California, United States
    Posts
    1,956
    Thanks
    223
    Thanked 419 Times in 387 Posts

    Re: Reshape and Aggregation - Long to Wide Format with Summaries

    Actually, I take that back. It's the data.frame version of rbind I was using that screws it up (**** you data.frame and your renaming my stuff!). This works beautifully and automates the complexity of the data summarization.

    Code: 
    # Read in the prototyping data
    df = data.frame(scan(what = list(geoid = "", income = 0, race = "", familysize = 0)))
    06	1000	white	2
    06	1200	black	1
    06	1500	asian	1
    06	2000	white	3
    06	800	white	1
    07	3000	white	2
    07	1200	black	1
    07	2500	asian	5
    07	2000	white	3
    07	800	black	2
    08	100	white	1
    08	4200	black	1
    08	1500	asian	2
    08	1000	asian	3
    08	800	white	1
    
    
    # Summarizes a field based on its class
    summarize <- function(x) {if (is.character(x) | is.factor(x)) table(x) else mean(x)}
    
    # Applies the field summary to each field in the geoid frame passed here, but returns a properly named vector
    f <- function(geoid) {
      x = do.call('c', lapply(geoid, summarize))
      names(x) <- gsub(".*\\.", "", names(x))
      return(x)
    }
    do.call('rbind', by(df[-1], df$geoid, f))
    The result?

    Code: 
       income asian black white familysize
    06   1300     1     1     3        1.6
    07   1900     1     2     2        2.6
    08   1520     2     1     2        1.6
    Yes, I need to give a better name to the 'f' function, but I'm not creative this morning lol

    Note: This has the problem that the GEOID are in the rownames, but this is an easy fix

    Code: 
    x <- do.call('rbind', by(df[-1], df$geoid, f))
    x <- transform(x, geoid = rownames(x))  # return as factor
    ta-da!

  11. #10
    FormerlyKnownAsRaptor
    Points: 24,442, Level: 95
    Level completed: 10%, Points required for next Level: 908
    Awards:
    Activity Award
    trinker's Avatar
    Location
    Buffalo, NY
    Posts
    3,174
    Thanks
    883
    Thanked 552 Times in 500 Posts

    Re: Reshape and Aggregation - Long to Wide Format with Summaries

    If you're looking for pure speed I think the split lapply approach will always win out. That basically means taking jake's answer and doing a few tweaks:

    Code: 
    do.call(rbind, lapply(split(dat, dat$geoid), function(x){
      c(meanincome = mean(x$income),
        meanfamily = mean(x$familysize),
        table(x$race))
      }))
    I microbenchmarked it using dason's two base solutions and also jake's and my own (shamelessly stolen from jake) as well as bryangoodrich's latest.

    "If you torture the data long enough it will eventually confess."
    -Ronald Harry Coase -

  12. The Following User Says Thank You to trinker For This Useful Post:

    Jake (08-12-2012)

  13. #11
    Probably A Mammal
    Points: 14,507, Level: 78
    Level completed: 15%, Points required for next Level: 343
    bryangoodrich's Avatar
    Location
    Sacramento, California, United States
    Posts
    1,956
    Thanks
    223
    Thanked 419 Times in 387 Posts

    Re: Reshape and Aggregation - Long to Wide Format with Summaries


    That doesn't account for how long it takes to change the code when the data set changes. Mine adjusts automagically!


    i.e., my solution applies to any tabular data set!

    Now I should add an options variable for the numeric fields just in case I want to do a different summary (e.g., median). Even better, let the user pass that function, just in case maybe I want to calculate something's geometric mean, instead. Even better better, let the user pass more than one function and have it return each of those summaries. Something like

    Code: 
    tabulate(. ~ geoid, df, list(mean, median, max))
    That would be a general solution. What summaries might one want to also do for factors in this case? I'm thinking maybe proportions. Have it return the count divided by the respective geoid nrow value. If one has a vector of each geoid nrow (not hard to calculate: tapply(df[, 1], df$geoid, length)), one could also convert back and forth afterwards.

+ Reply to Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts








Advertise on Talk Stats