Need ideas to modernize using SAS my data cleaning approach

mht

New Member
#1
Hi,
I have been given various large datasets that require a lot of time and effort cleaning incorrect codes (from a scanner that did not read the data correctly). I have completed the first dataset and feel that the route that I chose to get the corrections made were much too manual and there must be a better way to do this using SAS. Here's what I have done:

1. Ran freq distributions on all variables in dataset;
2. Highlighted with yellow pen all problem codes in every variable;
3. Ran proc print for studyid numbers based on each problem code by every variable;
4. Manually entered all sorted by studyid numbers and problem codes into new Excel document;
5. Manually placed line dividers between studyid numbers in Excel document for ease when in paper files;
6. Printed Excel document;
7. Went to paper questionnaire files to determine correct codes and noted on paper Excel document;
8. Manually entered into Excel document the correct code (new variable/new column);
9. Programmed in SAS the correction lines for each studyid number using if/then/do commands;
10. Re-ran freq distributions on all variables to ensure all problem codes were taken care of.

Step #7 is a manual, labor-intensive, time-consuming task. Just trying to get some ideas on how to automate this process in a better way. Thanks & sorry if this was placed in the wrong group. I work completely in SAS (and Excel).
 

mht

New Member
#3
No, not really. Most are dates of first usage/onset and the year is off by a digit or so. No real patterns noted, unfortunately.
 

hlsmith

Less is more. Stay pure. Stay poor.
#4
Excellent data management skills are completely under-appreciated. If you can find a way to do this in SAS then your actions will be saved in SAS code and you won't have to remember what you did or do it again in another dataset.

Do you have to note in Excel, couldn't you change them in SAS (save code), then export dataset as an updated version into an Excel file.
 

noetsi

Fortran must die
#5
To start with a really dumb comment perhaps. If you know this is being caused by a faulty scanner, it seems like fixing the scanners would be the solution. Obviously that would only help in the future.

How exactly do you spot the errors - that is what gives them away. My guess is that the way to deal with this would be to create a macro (although I am not good enough at macros to help there). For this to work the macro would have to know what to look for.