Organizing/managing data analysis

yedi

New Member
#1
I have limited formal training in data analysis and have learned mostly from books, websites, etc. It has now become very important to make my data analysis organized and standardized. I use SPSS (for data entry, recoding, labeling, etc.) and Stata for analysis. I am not a well-organized person, so I always have trouble keeping track of the changes I make (new variables computed, recoding, etc.). I usually keep a master data file untouched and make copies as I proceed. I name each file with the study name and date (e.g., CVD Jan 10, 09), and after making major changes or additions (e.g., creating new variables) I make a copy of the data and name it CVD Jan 13, 2009. So by the time I publish I have at least 10 to 20 versions of the data set, and I lose track of the changes I made. I also end up with multiple do-files and syntax files. I have seen people working with just one data file and making changes only in a syntax file. Is there a standard method for organizing analysis, naming files, keeping logs, etc.?
I would really appreciate it if you could point me to resources (books or websites) or share your techniques.

Many thanks
 
#2
One definition file

I believe every well-run statistics lab has its own standard for numbering versions. The critical thing is that, once you develop a standard, everyone who touches the files adheres to it.

As to the one file: yes, you do need to keep one file that does all your data manipulations. Basically, you keep your raw data file(s) and start your definition (control) file by reading in all the appropriate raw data, merging if necessary. You record all data manipulations in that file and end it by writing out a working data file. Whenever you run your definition file, you save over the working data file. So you keep all the old versions of your raw data and all the old versions of your definition file, but only one working data file.

Each definition file is built by adding to the last one. If you want the most recent data, you simply run the latest version of the definition file, which will automatically read the most recent raw data file(s) and apply all your recodes. If you ever need to go back to an old version of the data, you run the appropriate old version of the definition file. It will automatically read the appropriate raw data files and re-create the old working data file.

Also, if multiple people are working on the data, designate one “master keeper of the definition file”. Anyone else who runs recodes or other manipulations runs them on the working data file and sends the code to the “keeper” to add to the definition file. That way everyone works off the same version of the recoded data and has access to everyone’s recodes.

Also, remember to comment everything in your definition file.
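
To make the workflow concrete, here is a minimal sketch of what a definition file does. It is written in Python purely for illustration; in practice this would be SPSS syntax or a Stata do-file. The file paths, variable names (age, weight_kg, height_m) and cut-offs are all hypothetical.

```python
import csv
from pathlib import Path

# Hypothetical locations, for illustration only.
RAW_FILE = Path("raw/cvd_raw.csv")           # raw data: never edited by hand
WORKING_FILE = Path("work/cvd_working.csv")  # working data: overwritten on every run


def recode_row(row):
    """Apply every data manipulation in order, each with a dated comment."""
    out = dict(row)
    # 2009-01-10: age group from age in years (cut-off is made up)
    out["age_group"] = "under_50" if int(out["age"]) < 50 else "50_plus"
    # 2009-01-13: body mass index from weight (kg) and height (m)
    out["bmi"] = round(float(out["weight_kg"]) / float(out["height_m"]) ** 2, 1)
    return out


def build_working_file(raw_path, working_path):
    """Read the raw data, apply all recodes, and save over the working file."""
    with open(raw_path, newline="") as f:
        rows = [recode_row(r) for r in csv.DictReader(f)]
    if not rows:
        raise ValueError("raw data file contains no rows")
    Path(working_path).parent.mkdir(parents=True, exist_ok=True)
    # Mode "w" replaces any existing working file, so only one copy ever exists.
    with open(working_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return rows


# Typical use: build_working_file(RAW_FILE, WORKING_FILE)
```

Running it again after adding a recode simply regenerates the working file, so the raw file plus the definition file are all you need to keep under version control.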
 

yedi

New Member
#3
Thanks a lot John,
By definition file, I guess you mean the SPSS syntax file with all the commands used to modify the data. So when I add a new variable, do I append the command to the bottom of the definition file (with a note on the date and comments)?
 
#4
Syntax File

Yes,

Put the new code almost at the end, just before the final save. The last statement will be the same Save As that you used in the preceding definition file. I say "definition file" rather than "syntax file" because I also use syntax files to do analysis, and I keep my analysis files separate from the definition file.
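
As a rough illustration of why the order matters, here is a small Python sketch (not SPSS syntax) in which each step stands for one dated block of the definition file. The step names, variables and cut-offs are hypothetical. New steps go at the end of the list, and the save would come after all of them.

```python
# Each function stands for one dated, commented block in the definition file.
# All variable names and cut-offs are hypothetical.

def add_age_group(row):
    # 2009-01-10: age group from age in years
    row["age_group"] = "under_50" if int(row["age"]) < 50 else "50_plus"


def add_bmi(row):
    # 2009-01-13: BMI from weight (kg) and height (m)
    row["bmi"] = round(float(row["weight_kg"]) / float(row["height_m"]) ** 2, 1)


def add_obese_flag(row):
    # 2009-01-20: newest step, appended last because it relies on bmi above
    row["obese"] = row["bmi"] >= 30


# Order of the definition file: oldest step first, newest step last.
STEPS = [add_age_group, add_bmi, add_obese_flag]


def apply_steps(row, steps=STEPS):
    """Run the steps in order on a copy of the raw row."""
    out = dict(row)
    for step in steps:
        step(out)
    return out


example = {"age": "45", "weight_kg": "95", "height_m": "1.70"}
current = apply_steps(example)               # latest definition file
previous = apply_steps(example, STEPS[:-1])  # the version before the last step
```

Here `previous` has no `obese` variable, which mirrors re-running an older version of the definition file to re-create an older working file.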