Grab Disfigured Data

bryangoodrich

Probably A Mammal
#1
Here's a practical challenge for my fellow R-loving TS'ers. I'm going to tackle it tomorrow, and I have some ideas. I thought it could be fruitful to see how others might approach it, though.

I need to grab some unemployment data, and frankly I'd like to keep all of it (and manipulate it accordingly once it's in R), but all I really need is the FIPS (state and county combined) and the unemployment level and rate (or at least the level; I'm going to do a dot map in ArcGIS). I also need the year, as I'm going to combine a decade's worth of these files.

All of the files follow the format of this example.


I'm thinking of putting together a widths vector (tedious, but once it is done, it's done!) and using read.fwf with url while looping through the vector 90:99 (the years of my study), pasting together the URL for each grab and combining the files into one large decade file. I could do something with scan or readLines, as they are more primitive (low-level), but I think read.fwf is sufficient, if slow as all hell. Otherwise, something like readLines with strsplit and some fancy regex should do the job, but it would definitely be a bit more complicated!

What say you?
 

trinker

ggplot2orBust
#2
read.fwf is the approach I'd take with this much data, unless each file uses a different vector of fixed widths. Counting each one would be a pain. I bet there's a way to make a fixed-width counter in R by pasting in one row from the data and having R count from the beginning of one column to the next. Then again, some data, including yours, is both right- and left-justified. Also, you'll need to clean up the commas with gsub. In that case, maybe use textConnection() to get the text into read.table().
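For instance, a tiny sketch of that gsub + textConnection route (the two sample lines are made up for illustration, not the real BLS layout):

Code:
# Strip the commas out of the raw lines, then hand the text to read.table.
# In real use you'd read the FIPS columns as character to keep leading zeros.
txt <- c("01   001   16,875   15,853   1,022   6.1",
         "01   003   46,773   44,492   2,281   4.9")
tc  <- textConnection(gsub(",", "", txt))
dat <- read.table(tc)
close(tc)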

In all seriousness, if you only need 10 years and you have to go through all this crap, another approach would be to save all 10 as text files, then open Excel and click File -> Open -> text file; there's an import wizard that is very similar to the import wizard in SPSS. I just did the file you provided, and it took me just over 4 minutes to save to text, rename columns, import into Excel, save as csv, and import into R with read.csv. This may actually be a faster approach considering the comma cleanup, column width counting, and column renaming. It may be faster to do it by hand than to futz around with this unless you plan on needing this again.

My thoughts.

Just out of curiosity, if you do loop through with the read.fwf approach, how will you approach the odd column name structure?
 

bryangoodrich

Probably A Mammal
#3
I'd skip the first 6 lines and use my own column names :p

All the data is fixed width, and read.fwf should clear out the white space (at least, I think there's a parameter for that). I was already reading in one line tonight to figure out the positions I would need to use to get the bare minimum data, but I know there's a regex approach I could use to clean up the data if I wanted to make it complex. Basically, the only field that gives problems is the name of the county (because it has spaces, a comma, etc.). However, everything else is separated by more than one space. Thus, an expression that treats anything separated by a single space as one field and splits on runs of two or more spaces would do the trick.
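Something like this minimal sketch, for instance (the two sample lines are made up to mimic the layout, not copied from the real file):

Code:
# Split each line on runs of two or more spaces, so the single spaces
# inside the county name are left alone.
lines  <- c("CN01001  01  001  Autauga County, AL  1990  16,875  15,853  1,022  6.1",
            "CN01003  01  003  Baldwin County, AL  1990  46,773  44,492  2,281  4.9")
fields <- strsplit(lines, "\\s{2,}")
do.call(rbind, fields)  # a character matrix with the county name intact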

It does seem easier to just determine the widths and use read.fwf. I've never used textConnection or read.table before, so I wouldn't even know how to approach it from that angle. There are zipped-up Excel files I could grab, and I was thinking I could write a JScript or VBScript routine to go through and write each one out to csv for easier importing, but that would require too much learning to get this done. I want to finish up my analysis; this is just sort of a last check, and it would prove to be a useful map to finish off my report.

I envision this process should be as simple as

Code:
# Set up variables
widths   <- someWidthsToFigureOut   # To divide fwf when read in
classes  <- someClassesToFigureOut  # To assign classes when read in
header   <- someNamesToFigureOut    # To assign names when read in
years    <- 90:99                   # Years of interest
path     <- "ftp://ftp.bls.gov/pub/special.requests/la/"  # Base URL
filename <- "laucnty"               # Series name for file
df.all <- NULL                      # Identify and invoke the final data frame

# Loop through files to read and union
for (i in years) {
  infile <- paste(path, filename, i, ".txt", sep = "")  # Set up the complete URL
  df <- read.fwf(url(infile), widths = widths, skip = 6,
                 colClasses = classes, col.names = header)

  df <- transform(df,
    someColumn = as.numeric(gsub(",", "", someColumn))  # Repeat for all the relevant columns
  );  # end transform

  df.all <- rbind(df.all, df)                           # Append it to the bottom of the final data frame
}  # end for
Then I need to check whether I kept my county-level NCHS mortality data (three mortality endpoints: all-cause, endocrine, and suicide), clean that up, and then merge it to these by FIPS and year to create a county-level data set for each endpoint. Oh, the fun of data management!
 

trinker

ggplot2orBust
#4
Bryan, also, does fixed width [read.fwf] work when the columns are both left- and right-justified?

Bryan, this is pretty interesting to me. When you have usable code that works, would you mind posting it so I could try it out as well? I didn't know you could even skip lines; I thought you could only do header = TRUE or header = FALSE.

Not really a deep intellectual problem, but a fun problem I'm sure most of us will encounter eventually.
 

bryangoodrich

Probably A Mammal
#5
Oh yeah, for most of the input functions you can tell it to skip a certain number of lines, read in only a certain number of lines, etc. When reading in large data sets, for instance, you don't want to try reading it all in at once. Usually you'll do it in chunks with something like this (a runnable sketch with placeholder file names):

Code:
con <- file("bigfile.txt", open = "r")        # placeholder input file
out <- file("bigfile_clean.txt", open = "w")  # placeholder output file
repeat {
  block <- readLines(con, n = 10000)          # read in block
  if (length(block) == 0) break               # stop at EOF
  block <- gsub(",", "", block)               # process block (e.g., strip commas)
  writeLines(block, out)                      # write out new block
}
close(con); close(out)
cleaned <- read.table("bigfile_clean.txt")    # read in cleaned data
This is why I don't particularly like read.fwf. I don't think its algorithm is robust. It basically takes your stuff, manipulates it, and writes it out, which, all in R, is very slow. Then, when it's done, it uses read.table to read it back in. I'm actually thinking of constructing my own C fwf2csv command-line utility. I've done one in Python, and I think it went pretty quick. Still, something low-level would be the fastest. Frankly, if I knew Perl, it would probably be the best choice. In R, the best choice would be something low-level like scan or readLines. For data that can fit comfortably in memory, just read it into the environment, clean it up, and put it together into a final data set. That would bypass the read-write-read of read.fwf. I read in a 250,000 x 74 data set; it took 10 minutes! While it was more convenient than my Python script, it clearly was not fast lol
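For example, a minimal sketch of that low-level route (the file name, positions, and column names here are all placeholders, not the real BLS layout):

Code:
# Pull the raw lines into memory, carve out each field by position with
# substring, and build the data frame directly -- no intermediate file.
raw    <- readLines("somefile.txt")[-(1:6)]  # placeholder file; drop 6 header lines
starts <- c(1, 10, 24)                       # illustrative field start positions
stops  <- c(9, 23, 40)                       # illustrative field stop positions
fields <- lapply(seq_along(starts), function(j)
  gsub("^ +| +$", "", substring(raw, starts[j], stops[j])))  # trim padding
names(fields) <- c("id", "name", "value")    # placeholder column names
df <- as.data.frame(fields, stringsAsFactors = FALSE)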

As for justified data, it shouldn't matter when you strip out white spaces as I demonstrate below. You can have something like

Code:
|some left justified stuff      |      some centered stuff      |        some right justified stuff|
The spans between the "|...|" delimiters have well-defined widths. The stuff inside should get compressed, so you return, say,

Code:
some left justified stuff,some centered stuff,some right justified stuff
I suggest making some test data to try it out yourself. Here's some data at widths = c(5, 10, 23).

Code:
10       AAA                     frank
20       BBB                       joe
30       CCC                     chris
40       DDD                       bob
50       EEE                    sandra
This should read it in fine after you copy the above,

Code:
x <- read.fwf("clipboard", widths = c(5, 10, 23), strip.white = TRUE)  # strip.white is a 'read.table' parameter
This can be fun stuff. It's basically what I've learned over the past 3 years of doing data manipulation. I'm not skilled enough to do the GLM and time-series type models we do research with. It's kind of why I want to get another job where I can do actual research and improve those skills! I'm kind of up against a ceiling here, so to speak. This school project has rekindled my desire for research, as I find it fun not just to grab and manipulate data, but to actually do something with it and produce a meaningful visualization out of it. That's what statistics is about! Right now, I'm just a **** technician lol
 

bryangoodrich

Probably A Mammal
#6
Right now the code below should work, but for some reason it only kept the last year. It finished in around 1.5 minutes, though. I created the widths by copying the first line into MS Word and turning on the column counter in the bottom panel (status bar). Shrink the font so everything fits on one line and put your cursor at the right-justified position of each column. However, this is 1 greater than the true position, which is why you'll see the "... - 1" on the first line. That gives positions, but I need the distances between the positions. Therefore, I use a quick for loop and replacement algorithm to get the job done. I'm not happy, syntax-wise, with my increment. I prefer "i in seq(widths)", but then I end up with an empty 10th bin in my widths vector. I could put in a line to remove it, but since it should come out of the algorithm, I altered the increment to "i in 1:(length(widths)-1)". I find it hideous, but I can't think of a better way to express the same semantics.
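For what it's worth, diff() on the cumulative positions expresses the same position-to-width conversion without the loop (a sketch using the same positions as in the script below):

Code:
positions <- c(9, 14, 22, 75, 79, 93, 106, 117, 126) - 1  # right-edge positions from Word, minus 1
widths    <- diff(c(0, positions))                        # distances between successive positions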

Note, as I found out, don't put together your quoted vectors in Word: it uses its **** non-ASCII quotes! I had to convert them afterward. Better to just stick to Notepad lol

Code:
widths   <- c(9, 14, 22, 75, 79, 93, 106, 117, 126) - 1
for (i in 1:(length(widths)-1))  # Each distance is equal to its position minus the sum of its predecessors
  widths[i+1] <- widths[i+1] - sum(widths[1:i])
classes  <- c(rep("factor", 5), rep("character", 3), "numeric")
header   <- c("series_id", "sFIPS", "cFIPS", "name", "year", "labor", "emp", "unemp", "unrate")
years    <- 90:99      
path     <- "ftp://ftp.bls.gov/pub/special.requests/la/" 
filename <- "laucnty"  
df.all <- NULL         

for (i in years) {
  infile <- paste(path, filename, i, ".txt", sep = "")
  df <- read.fwf(url(infile), widths = widths, skip = 6, col.names = header,
                 colClasses = classes, strip.white = TRUE)

  df <- transform(df,
    FIPS  = factor(paste(sFIPS, cFIPS, sep = "")),
    labor = as.numeric(gsub(",", "", labor)),
    emp   = as.numeric(gsub(",", "", emp)),
    unemp = as.numeric(gsub(",", "", unemp))
  );  # end transform

  df.all <- rbind(df.all, df)  
}  # end for
I don't know why my aggregation method is failing. The df.all is identified before the loop, so it shouldn't be created anew within the scope of the loop. It works in this contrived example:

Code:
y <- NULL
x <- data.frame(bob = rnorm(10), joe = runif(10))
(y <- rbind(y, x))
(y <- rbind(y, x))
Yet, this isn't happening in my loop. I was thinking maybe I need to identify it first as a data frame, since maybe the dynamic typing invoked within the loop would create a new variable (same identifier) within the scope of the loop, but (1) scope doesn't matter here, since it exists afterward anyway, and (2) my contrived example above has the same properties.

I'll edit the above script once I find out the problem.
 

bryangoodrich

Probably A Mammal
#7
Never mind, I'm an idiot. I was looking at df and not df.all when I was done lol. The script above works.

Result

Code:
df <- df.all[-c(1:4)]
df <- df[c(6, 1:5)]
head(df)
#    FIPS year labor   emp unemp unrate
# 1 01001 1990 16875 15853  1022    6.1
# 2 01003 1990 46773 44492  2281    4.9
# 3 01005 1990 11458 10619   839    7.3
# 4 01007 1990  7408  6776   632    8.5
# 5 01009 1990 19130 18001  1129    5.9
# 6 01011 1990  4381  3869   512   11.7
Though I might do, as I did with my state-level data, transform(df, unrate = unemp / labor * 100) to get more precision than their rounded results.
 

trinker

ggplot2orBust
#8
I myself have never had to work on such a large problem; however, I have a situation involving a state test bank data set stored in multiple locations that this would be very applicable to. Thanks for sharing. The import seemed to work for me, so I'm not sure why it didn't for you.

EDIT: I see it did work for you; you just tried to recall the wrong thing :)
 

bryangoodrich

Probably A Mammal
#9
If you want to follow along, you can also get the public mortality data from the CDC's WONDER web portal. The problem is that low counts will be excluded, which in aggregate will make a big difference! However, what I'm going to do is look at a dot map of where deaths occur in the counties, so low counts will just not show up anyway. The way this works is that deaths are plotted as one point per, say, 100 deaths in a county, placed randomly within the county boundary. The counties won't be shown, though, only states. This way, we'll see visual clustering where there are a good number of deaths within a state. The code below reads in the exported WONDER data I collected at the county level for all-cause, endocrine (whole group), and suicide (under external causes of death > self-inflicted) mortality.

While this is not the cleanest, it might prove useful to you in your own cases. For instance, defining your colClasses to drop some columns from the read, and then using that to define your column names, since col.names needs to match up with the classes (an NA name where the class is "NULL" is appropriate; though I don't know why that's not just built into read.table!).

Code:
classes <- c("NULL", "factor", "NULL", "NULL", "factor", "numeric", rep("NULL", 3))
header  <- classes
header[which(classes == "NULL")] <- NA
header[which(!classes == "NULL")] <- c("year", "FIPS", "deaths")

ACM <- read.delim("ACM_county.txt", colClasses = classes, col.names = header)
ECM <- read.delim("ECM_county.txt", colClasses = classes, col.names = header)
SUI <- read.delim("SUI_county.txt", colClasses = classes, col.names = header)

# Select the required years
ACM <- subset(ACM, year %in% factor(1990:1999))
ECM <- subset(ECM, year %in% factor(1990:1999))
SUI <- subset(SUI, year %in% factor(1990:1999))

# Drop unused levels
ACM <- transform(ACM, year = factor(year))
ECM <- transform(ECM, year = factor(year))
SUI <- transform(SUI, year = factor(year))

# Combine with BLS data
ACM <- merge(ACM, df, by = c("FIPS", "year"))
ECM <- merge(ECM, df, by = c("FIPS", "year"))
SUI <- merge(SUI, df, by = c("FIPS", "year"))

# Export data for mapping
write.csv(ACM, file = "ACM_county.csv", row.names = FALSE)
write.csv(ECM, file = "ECM_county.csv", row.names = FALSE)
write.csv(SUI, file = "SUI_county.csv", row.names = FALSE)
Of course, if I were doing a larger project, I'd keep all these data in respective folders or use some other heuristic that would allow me to loop through the directories in such a way as to do this in a few lines of code. Since this is small enough, that level of organization and succinctness is unnecessary. The factor of 3 going on here is trivial.
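For example, a quick sketch of that loop version under the same assumptions as the code above (the three endpoints only differ by their file prefix; classes, header, and df are the objects defined earlier):

Code:
for (ep in c("ACM", "ECM", "SUI")) {
  dat <- read.delim(paste(ep, "_county.txt", sep = ""),
                    colClasses = classes, col.names = header)
  dat <- subset(dat, year %in% factor(1990:1999))               # select the required years
  dat <- transform(dat, year = factor(year))                    # drop unused levels
  dat <- merge(dat, df, by = c("FIPS", "year"))                 # combine with BLS data
  write.csv(dat, paste(ep, "_county.csv", sep = ""), row.names = FALSE)  # export for mapping
}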
 

trinker

ggplot2orBust
#10
Bryan,

It took me about 56 seconds using system.time on a Windows 7 machine. Pretty impressive. Last night I told you it took me about 4.5 minutes to retrieve, clean, and import a file for just one year. That would have been roughly 45 minutes to import everything by hand. I'm trying to approach this pragmatically.

First question: Did it take you more than 45 minutes to write the code and work the kinks out to import the data?

First point: Your method is less likely to result in cleaning errors.

Second question: Do you work on the file in R's memory, or do you save the massive data frame of 32,180 rows to a .csv, .txt, etc. as soon as you import it (which isn't massive to some, but to someone who's lucky to work with 300 observations, it's massive)?


I'm trying to get a sense of your workflow, which, as you've pointed out, you're becoming efficient at because that's all you get to do :)
 

bryangoodrich

Probably A Mammal
#11
I don't know what my time frame was, because I've been at work for like 2 hours, did some GIS stuff, went to get coffee, played around on here and Facebook. To quote Office Space, I think I only did 15 minutes of actual work lol
 

bryangoodrich

Probably A Mammal
#12
But I know for a fact it didn't take me 45 minutes to figure out my widths, come up with my algorithms, etc. I already had my mortality data collected, and it took me, what, 10 minutes now to finish processing it. The BLS stuff went pretty quick once I realized I was an idiot lol
 

trinker

ggplot2orBust
#13
In all seriousness, a fixed-width finder would come in pretty handy: you paste in a few lines from the text file, and R counts the spaces, maybe takes the median or mean of the number of white spaces between columns (rounded, of course), and prints out the widths vector (thanks for showing me that fwf is a bit looser than I thought in its specifications). Whether this would pay off depends on how often you do fwf.

One could use textConnection to cut and paste the lines, similarly to what I have in my ascii function that I keep in my .First library:

Code:
ascii <- function(x, header = TRUE, ...) {
    name <- textConnection(x)         # treat the pasted text as a file
    DF <- read.table(name, header, ...)
    close(name)
    on.exit(closeAllConnections())    # make sure nothing is left open
    DF
}
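Along those lines, here's a rough sketch of the finder idea: paste in a few sample lines and let R treat any character position that is blank in every line as a column gap (the function name is made up, and a fully right-justified first column or stray single-space gaps would need more care):

Code:
findWidths <- function(lines) {
  lines <- sub(" +$", "", lines)                                  # drop trailing spaces
  lines <- formatC(lines, width = max(nchar(lines)), flag = "-")  # pad lines to equal length
  chars <- do.call(rbind, strsplit(lines, ""))                    # one character per cell
  blank <- apply(chars == " ", 2, all)                            # positions blank in every line
  ends  <- which(blank & !c(blank[-1], FALSE))                    # last blank position of each gap
  diff(c(0, ends, ncol(chars)))                                   # widths to feed read.fwf
}
# e.g., findWidths(readLines("laucnty90.txt")[7:10]), then read.fwf(..., strip.white = TRUE)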
 

bryangoodrich

Probably A Mammal
#14
Well, from my experience dealing with a lot of demographic and epidemiological data, the government loves to use fixed-width formats! If you deal with vital statistics, they're pretty much always fixed width. Part of the reason is tradition: it was efficient back in the day. Another reason is security: if you look at 74 columns of data, each a jumbled string of stuff, it is pretty much meaningless to you without a codebook. Therefore, having a good handle on fixed-width files is useful when dealing with large government data that tends to be in that format. It's also handy because a lot of the time, like with the BLS data, it comes "as presented" in a fixed-width-like format, as in today's example (it wasn't really fixed width, but it was right-justified text that we could treat as if it were a fixed-width file).

Ugh, I need to get better at Python programming with ArcGIS. I remembered I have no way of managing the join of my data with my shapefiles to extract certain dates, so I updated my earlier code to write out two sets: one for 1992 and one for 1996. I then had to go through and join 6 different files to 6 copies of the same county boundary TIGER file, and then export that data because I want to have them in their own shapefiles. With a mix of Python and R, I could have streamlined this, from my R code to Python doing the join and export (Python can invoke R), making it all one fluid process. I know it can be done; I just don't know how to do it yet!
 

bryangoodrich

Probably A Mammal
#15
Trinker,

If you want more practice with reading data, Duncan Temple-Lang has an assignment that does just that. It isn't too complicated, but it does require you to think about how to go about grabbing the data, parsing it, and combining it. Data are not always in a nice tabular format. Often we have to parse out information by indicators (e.g., in "name:value" tags) or ignore header/footer information that comes along with the data. For example, the NCHS data through WONDER is exported with lots of footer information about the query; while I usually don't have so much that I need to batch-process it out, I usually end up combing through it and just avoid copying it in Excel. The point is that an efficient approach would be a converter function (which I should write) that at least extracts the relevant data, avoids the totals, and does something with the "unreliable" fields that come along with it (ruining what should be numeric fields!).
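A rough sketch of what such a converter might look like, assuming (hypothetically) a tab-delimited export where totals and query notes are flagged in a "Notes" column and "Unreliable" shows up in what should be numeric count fields; the column names and flags are placeholders, not the actual WONDER layout:

Code:
cleanWONDER <- function(file) {
  x <- read.delim(file, stringsAsFactors = FALSE, fill = TRUE)
  x <- x[is.na(x$Notes) | x$Notes == "", ]             # drop totals and footer/query notes (assumed flag column)
  x$Deaths <- suppressWarnings(as.numeric(x$Deaths))   # "Unreliable" and friends become NA
  x
}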

If you want to work on that assignment with me, I'd be down. I'm not in the class, but I'm going to follow his work so that, in case I ever get to UCD, I'll be ready! lol