Stata strategy needed: which datasets have variable X?

bgor

New Member
#1
Hello everyone,

I'm working with roughly 150 large datasets, and I need a strategy for having Stata check automatically whether a given set of variables exists in each of them.

I want to write a .do (or .ado) file that opens each dataset, looks for a list of variables, and writes a 0 or 1 to a spreadsheet showing which variables exist in which datasets. The idea sounds simple enough, but I have no clue how to do this in practice, or whether it's even possible.

Any help would be greatly appreciated!
 

bukharin

RoboStataRaptor
#2
One command that may be useful is:

Code:
describe using filename, varlist
This describes the contents of "filename" and stores a list of the variables it contains in the returned result r(varlist).

So you could then do, for an ugly example:

Code:
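describe using filename, varlist
* Pad both strings with spaces so only whole names match;
* a bare strpos() search for "myvar" would also match "myvar2".
if strpos(" " + r(varlist) + " ", " myvar ") != 0 {
    display "filename contains the variable myvar"
}
else {
    display "filename does not contain the variable myvar"
}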
Here's an example from the built-in auto dataset (I changed paths to the system path just to make the example filename shorter):

Code:
. describe using auto.dta, varlist

Contains data                                 1978 Automobile Data
  obs:            74                          13 Apr 2009 17:45
 vars:            12                          
 size:         3,478                          
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
make            str18  %-18s                  Make and Model
price           int    %8.0gc                 Price
mpg             int    %8.0g                  Mileage (mpg)
rep78           int    %8.0g                  Repair Record 1978
headroom        float  %6.1f                  Headroom (in.)
trunk           int    %8.0g                  Trunk space (cu. ft.)
weight          int    %8.0gc                 Weight (lbs.)
length          int    %8.0g                  Length (in.)
turn            int    %8.0g                  Turn Circle (ft.)
displacement    int    %8.0g                  Displacement (cu. in.)
gear_ratio      float  %6.2f                  Gear Ratio
foreign         byte   %8.0g       origin     Car type
-------------------------------------------------------------------------------
Sorted by:  foreign  

. if strpos(r(varlist), "price")!=0 {
.         display "auto.dta contains the variable price"
auto.dta contains the variable price
. }

. else {
.         display "auto.dta does not contain the variable price"
. }

. if strpos(r(varlist), "fakevar")!=0 {
.         display "auto.dta contains the variable fakevar"
. }

. else {
.         display "auto.dta does not contain the variable fakevar"
auto.dta does not contain the variable fakevar
. }

.
Of course the examples are ugly, but they just show how to use -describe- and r(varlist) to determine whether a variable is present. You could probably combine the gist of the above code with a list of file names and some binary indicators...
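One way that combination might look, as a rough sketch rather than a finished solution: loop over the files, read r(varlist) for each, and post a 0/1 row per file and variable. The two demo files, the variable list in `wanted`, and the output name varcheck.csv are all made up for illustration. It also assumes a Stata version that has -export delimited-; on older versions -outsheet- would be the substitute.

```stata
* Build two tiny demo files so the sketch runs anywhere
* (hypothetical stand-ins for the real datasets)
sysuse auto, clear
keep make price mpg
save demo_a.dta, replace
sysuse auto, clear
keep make weight
save demo_b.dta, replace

* Variables to look for (hypothetical list)
local wanted price weight fakevar

* All matching files in the current folder
local files : dir "." files "demo_*.dta"

* Collect one row per file and variable in a temporary results file
tempname results
tempfile found
postfile `results' str244 filename str32 varname byte present using `found'

foreach f of local files {
    quietly describe using "`f'", varlist
    local have `r(varlist)'
    foreach v of local wanted {
        * 1 if v is an element of the variable list, 0 otherwise
        local present : list v in have
        post `results' ("`f'") ("`v'") (`present')
    }
}
postclose `results'

* One row per file, one 0/1 column per variable
use `found', clear
reshape wide present, i(filename) j(varname) string
export delimited using "varcheck.csv", replace
```

The -list- macro function compares whole names, so it avoids the substring matches that a plain strpos() search can produce.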

Good luck!
 

bukharin

RoboStataRaptor
#3
Actually it looks like you just want a list of variables for each dataset. That's easier. Just create a Stata dataset with a string variable called "filename" containing, in each row, a file name.

Then run the following code:

Code:
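* start with a minimal string variable; -replace- widens it as needed
gen str1 vars = ""
levelsof filename, local(datafiles)
foreach fname of local datafiles {
    * quote the file name so paths containing spaces work
    quietly describe using "`fname'", varlist
    quietly replace vars = r(varlist) if filename == "`fname'"
}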
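If you do want the 0/1 columns from the original question, the `vars` string can be turned into indicators afterwards. The following is only a sketch: the two rows entered with -input- are toy data standing in for the result of the loop above, and the `wanted` list, the `has_` prefix, and the workbook name are hypothetical. It assumes a Stata version that has -export excel-.

```stata
* Toy stand-in for the filename/vars dataset built by the loop above
clear
input str20 filename str40 vars
"a.dta" "make price mpg"
"b.dta" "make weight mpg2"
end

* Variables to look for (hypothetical list)
local wanted price mpg fakevar

foreach v of local wanted {
    * Pad with spaces so that "mpg" does not match "mpg2"
    gen byte has_`v' = strpos(" " + vars + " ", " `v' ") > 0
}

* Write the file names and indicators to a spreadsheet
export excel filename has_* using "varcheck.xlsx", firstrow(variables) replace
```

With the toy data, a.dta gets a 1 for price and mpg, while b.dta gets 0 throughout because mpg2 is a different variable.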