Stata strategy needed: which datasets have variable X?
Hello everyone,
I'm working with a large number of large datasets (roughly 150), and I need a strategy for having Stata check automatically whether a given set of variables exists in each of them.
I want to write a .do (or .ado) file that opens each dataset, looks for a list of variables, and writes a 0 or 1 to a spreadsheet to show which variables exist in which datasets. The idea sounds simple enough, but I have no clue how to do this in practice, or whether it's even possible.
Any help would be greatly appreciated!
bukharin
01-25-2011, 01:22 AM
One command that may be useful is:
describe using filename, varlist
This describes the contents of "filename" and stores a list of the variables it contains in the returned result r(varlist).
So you could then do, for an ugly example:
describe using filename, varlist
if strpos(r(varlist), "myvar")!=0 {
    display "filename contains the variable myvar"
}
else {
    display "filename does not contain the variable myvar"
}
Here's an example from the built-in auto dataset (I changed paths to the system path just to make the example filename shorter):
. describe using auto.dta, varlist
Contains data                                 1978 Automobile Data
  obs:            74                          13 Apr 2009 17:45
 vars:            12
 size:         3,478
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
make            str18  %-18s                  Make and Model
price           int    %8.0gc                 Price
mpg             int    %8.0g                  Mileage (mpg)
rep78           int    %8.0g                  Repair Record 1978
headroom        float  %6.1f                  Headroom (in.)
trunk           int    %8.0g                  Trunk space (cu. ft.)
weight          int    %8.0gc                 Weight (lbs.)
length          int    %8.0g                  Length (in.)
turn            int    %8.0g                  Turn Circle (ft.)
displacement    int    %8.0g                  Displacement (cu. in.)
gear_ratio      float  %6.2f                  Gear Ratio
foreign         byte   %8.0g       origin     Car type
-------------------------------------------------------------------------------
Sorted by:  foreign
. if strpos(r(varlist), "price")!=0 {
. display "auto.dta contains the variable price"
auto.dta contains the variable price
. }
. else {
. display "auto.dta does not contain the variable price"
. }
. if strpos(r(varlist), "fakevar")!=0 {
. display "auto.dta contains the variable fakevar"
. }
. else {
. display "auto.dta does not contain the variable fakevar"
auto.dta does not contain the variable fakevar
. }
.
Of course the examples are ugly but they're just to show how to use -describe- and r(varlist) to determine if a variable's present. You could probably combine the gist of the above code with a list of file names and some binary indicators...
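As a rough sketch of that combination (not tested against your files): the code below loops over every .dta file in a folder, checks each variable in a list, and saves a table of 0/1 indicators. The folder name "data", the variable list in the local macro wanted, and the output names var_inventory.dta and var_inventory.csv are all placeholders I made up for illustration. It also tests membership with the macro list operator "in" rather than strpos(), because strpos() matches substrings, so "price" would also be reported as present in a dataset that only has "price2".

```stata
* Hypothetical inputs: change the folder and the variable list to suit your project
local datadir "data"
local wanted  price mpg fakevar

* Collect one row per (file, variable) pair in a results dataset
tempname results
postfile `results' str244 filename str32 varname byte has_ ///
    using "var_inventory.dta", replace

* All .dta files in the folder
local files : dir "`datadir'" files "*.dta"

foreach f of local files {
    * Read the variable names without loading the data
    quietly describe using "`datadir'/`f'", varlist
    local vars `r(varlist)'

    foreach v of local wanted {
        * 1 if `v' is an element of `vars', 0 otherwise (whole-name match)
        local present : list v in vars
        post `results' ("`f'") ("`v'") (`present')
    }
}
postclose `results'

* One row per file, one 0/1 column per variable (has_price, has_mpg, ...)
use "var_inventory.dta", clear
reshape wide has_, i(filename) j(varname) string
export delimited using "var_inventory.csv", replace
```

Because -describe using- reads only each file's header, this should stay quick even with 150 large datasets. -export delimited- needs Stata 13 or later; on older versions -outsheet- does a similar job.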
Good luck!
bukharin
01-25-2011, 01:32 AM
Actually it looks like you just want a list of variables for each dataset. That's easier. Just create a Stata dataset with a string variable called "filename" containing, in each row, a file name.
Then run the following code:
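* vars will hold the space-separated variable list for each file
gen str vars=""
levelsof filename, local(datafiles)
foreach fname of local datafiles {
    * quotes around `fname' allow file names that contain spaces
    quietly describe using "`fname'", varlist
    quietly replace vars=r(varlist) if filename=="`fname'"
}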
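If you want the 0/1 spreadsheet from the original question rather than the raw lists, one possible follow-up is sketched below. It assumes the dataset built above is still in memory and that vars holds each file's complete, untruncated list. The variable names in the local macro wanted and the output file var_inventory.csv are placeholders.

```stata
* Hypothetical list of variables to look for
local wanted price mpg fakevar

foreach v of local wanted {
    * Pad both strings with spaces so that "price" does not match "price2"
    gen byte has_`v' = strpos(" " + vars + " ", " `v' ") > 0
}

* Keep only the file names and the indicators, then write the spreadsheet
drop vars
export delimited using "var_inventory.csv", replace
```

-export delimited- needs Stata 13 or later; on older versions -outsheet- does a similar job.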