Stata strategy needed: which datasets have variable X?



bgor
01-24-2011, 11:22 AM
Hello everyone,

I'm working with a large number of large datasets (roughly 150), and I need some strategies for having Stata work out automatically whether a certain set of variables exists across these datasets.

Basically, I want to write a .do (or .ado) file that opens each dataset, looks for a list of variables, and writes a 0 or 1 to a spreadsheet showing which variables exist in which datasets. It SOUNDS simple enough, but I have no clue how to do something like this in practice, or whether it's even possible.

Any help would be greatly appreciated!

bukharin
01-25-2011, 01:22 AM
One command that may be useful is:


describe using filename, varlist

This describes the contents of "filename" without loading the data into memory, and the list of variables it contains is returned in r(varlist).

So you could then do, for an ugly example:


describe using filename, varlist
if strpos(r(varlist), "myvar")!=0 {
    display "filename contains the variable myvar"
}
else {
    display "filename does not contain the variable myvar"
}
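
One caveat with the strpos() test above: it matches substrings, so searching for "price" would also report a hit in a file that only has a variable called "price2". A minimal sketch of a whole-name check, using Stata's macro list operators, might look like this. The names filename and myvar are the same placeholders as in the post; the local macros vars, target and hit are made up for illustration.

```stata
* Sketch only: filename and myvar are placeholders, as in the example above
quietly describe using filename, varlist
local vars `r(varlist)'
local target "myvar"

* hit is 1 if every name in target is a whole word in vars, else 0
local hit : list target in vars

if `hit' {
    display "filename contains the variable myvar"
}
else {
    display "filename does not contain the variable myvar"
}
```

Because target can hold several names, the same test can also check whether a file contains all of a set of variables at once.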

Here's an example from the built-in auto dataset (I changed paths to the system path just to make the example filename shorter):



. describe using auto.dta, varlist

Contains data                                 1978 Automobile Data
  obs:            74                          13 Apr 2009 17:45
 vars:            12
 size:         3,478
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
make            str18  %-18s                  Make and Model
price           int    %8.0gc                 Price
mpg             int    %8.0g                  Mileage (mpg)
rep78           int    %8.0g                  Repair Record 1978
headroom        float  %6.1f                  Headroom (in.)
trunk           int    %8.0g                  Trunk space (cu. ft.)
weight          int    %8.0gc                 Weight (lbs.)
length          int    %8.0g                  Length (in.)
turn            int    %8.0g                  Turn Circle (ft.)
displacement    int    %8.0g                  Displacement (cu. in.)
gear_ratio      float  %6.2f                  Gear Ratio
foreign         byte   %8.0g       origin     Car type
-------------------------------------------------------------------------------
Sorted by:  foreign

. if strpos(r(varlist), "price")!=0 {
. display "auto.dta contains the variable price"
auto.dta contains the variable price
. }

. else {
. display "auto.dta does not contain the variable price"
. }

. if strpos(r(varlist), "fakevar")!=0 {
. display "auto.dta contains the variable fakevar"
. }

. else {
. display "auto.dta does not contain the variable fakevar"
auto.dta does not contain the variable fakevar
. }

.


Of course the examples are ugly but they're just to show how to use -describe- and r(varlist) to determine if a variable's present. You could probably combine the gist of the above code with a list of file names and some binary indicators...

Good luck!

bukharin
01-25-2011, 01:32 AM
Actually it looks like you just want a list of variables for each dataset. That's easier. Just create a Stata dataset with a string variable called "filename" containing, in each row, a file name.
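
If typing 150 file names by hand is unappealing, the "filename" dataset could probably be built from a directory listing. The sketch below assumes all the .dta files sit in the current working directory and that there is at least one of them; the local macro names files, n, i and f are made up for illustration.

```stata
* Sketch only: assumes the .dta files are in the current working directory
clear
local files : dir "." files "*.dta"
local n : word count `files'

* One observation per file (set obs fails if no files were found)
set obs `n'

* Start as str1; replace lengthens the variable as needed
generate str1 filename = ""
local i = 0
foreach f of local files {
    local ++i
    quietly replace filename = "`f'" in `i'
}
```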

Then run the following code:


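* Assumes the dataset in memory has a string variable filename, one file name per row
gen str vars=""
levelsof filename, local(datafiles)
foreach fname of local datafiles {
    * Quote the file name so that paths containing spaces also work
    quietly describe using "`fname'", varlist
    quietly replace vars=r(varlist) if filename=="`fname'"
}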
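
To get closer to the 0/1 spreadsheet asked for in the original question, the two ideas in this thread could be combined roughly as follows. This is a sketch under stated assumptions, not a tested solution: the variable names in the local macro wanted are hypothetical placeholders, varcheck.csv is a made-up output name, and the dataset in memory is assumed to have one row per file with the file name in the string variable filename.

```stata
* Sketch only: age, income and educ are hypothetical variable names
local wanted "age income educ"

* One 0/1 indicator per wanted variable, initialised to 0
foreach v of local wanted {
    generate byte has_`v' = 0
}

* Read each file's variable list from disk without loading the data
forvalues i = 1/`=_N' {
    local fname = filename[`i']
    quietly describe using "`fname'", varlist
    local found `r(varlist)'
    foreach v of local wanted {
        * Whole-name match against the list of variable names
        local hit : list v in found
        quietly replace has_`v' = `hit' in `i'
    }
}

* Write a comma-separated file that a spreadsheet program can open
outsheet filename has_* using varcheck.csv, comma replace
```

The result has one row per dataset and one has_ column per variable. In newer versions of Stata, export delimited does the same job as outsheet.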