Data Management Tools and Utilities


Probably A Mammal
While this will undoubtedly be a Linux-centric discussion, I'm sure, I thought I'd keep it general enough to include anything Windows might have that is useful. I'm curious to know what tools are available out there for data management and manipulation. Often we rely on doing things in our statistical software (e.g., SAS or R) or manually manipulating data sets in applications (e.g., Excel or Access). It goes without saying that other programmatic languages exist to aid in these tasks (e.g., Perl and Python). I'm interested in compiling a list of "utilities" that can also be of use. I have mentioned some of them before, but there exist quite a few, and I'm only talking about *nix based tools! What else have you guys encountered? Eventually, I plan to put this list with hyperlink resources on my blog for reference.

To name a few in Linux:
head -- Displays first few lines of data
tail -- Displays last few lines of data
sort -- obvious
uniq -- remove/keep duplicate records; can tabulate (count) them
cut  -- maintain specified columns; can field delimit
grep -- regular expression awesomeness
awk  -- powerful tool for formatting files
sed  -- powerful tool with concise syntax for editing streams of data
More basic functions exist to gather information
ls    -- list files and directories; use '-lh' flag for user-friendly reading
wc    -- word count displays the triple: lines, words, characters
paste -- put the contents of two files together, separated by a tab
join  -- like SQL join ('merge'), it matches data into a single table based on first field
As already stated, one could throw in two of the more powerful languages: perl and python. These are your Swiss army knife when it comes to data manipulation. Of course, one could revert to R, but it suffers in certain areas for which the use of these other utilities may prove more efficient, at least in preprocessing the data for further manipulation and analysis. Work flow can be made even more complex (complicated?) when you consider how these can interact (e.g., calling R from Python or calling system utilities from within R).

I welcome any suggestions for what Windows might have to offer. Honestly, the best Windows resource I can think of is its VBA, JScript or VBScript. A lot of these *nix tools have ports to Windows in some way. Some are direct--e.g., gawk for windows, Perl, and Python. Others are indirect--e.g., Cygwin.

The real challenge is tying it all together. In Linux, the shell (terminal/prompt) is very powerful. It's shell scripting languages are powerful. The inclusion of C coding and built-in compilers (I'm looking at you gcc) make it even more powerful. For instance, say I want to break up a fixed-width file into a tab delimited file. I could create a shell script that takes in a sequence of numbers that specify the field widths. It then loops through using these as the basis for a series of cut commands that along with paste put them together separated by a tab, ultimately being written to the standard output or redirected to a file. Of course, one could also create a small awk program to do the same thing (I'll post my version on my website once I create the shell scripting interface to make it more convenient for multi-file use). There's also sed and Perl equivalents to this process. I wrote a Python version, myself. R has a function that is very inefficient for large files, and I actually wonder how efficient I can make it using more basic R functions.

In the final analysis, what is important is what tools you know. I don't use sed because, frankly, I don't know it. Whatever I can do in it, I can probably do as efficiently in awk, which is much more easy to pick up given its very C-esque syntax. Nevertheless, some tools are easy to pick up and easy to use. Being aware of them may help you find easier solutions to problems you face.