subsetting data with criteria involving multiple variables/combinations of variables

#1
Hi,

I am trying to create a subset of data of “high intermediate risk” group of entries as defined by a criteria based on age and number of risk factors.

It is:
Age over 70 plus at least 1 risk factor
Age 50-70 plus at least 2 risk factors
Age under 50 plus all three risk factors

The risk factors are:
1. Having a “Grade” of 2-3
2. Presence of invasion
3. Tumor extension is more than 50%


Age (variable: “AGE”; code 001 = age 1, etc)
Grade (variable: “GRADE”; code 2 = grade 2, 3 = grade 3)
Invasive (variable: “INVASION” where code 1 = invasion is present)
Tumor Extension more than 50% (variable: CS_EXTENSION where codes 130, 133, and 135 all mean more than 50% tumor extension criteria is met)

Can anyone help to create a subset that would "capture" all of these entries in a data set? I have not been having any luck so far.

Thanks so much,
kungfupanda
 

hlsmith

Less is more. Stay pure. Stay poor.
#2
Re: subsetting data with criteria involving multiple variables/combinations of variab

I sucked at data steps, my recommendation would be the longest "where" step ever (dragon warrior size) .
 
#3
Re: subsetting data with criteria involving multiple variables/combinations of variab

Try this:
Code:
data HiRisk;
	drop NumRiskFactors;
	set All;
	NumRiskFactors = 0;
	if (GRADE in (2, 3)) then
		NumRiskFactors = NumRiskFactors+1;
	if (INVASION = 1) then
		NumRiskFactors = NumRiskFactors+1;
	if (CS_EXTENSION in (130, 133, 135)) then
		NumRiskFactors = NumRiskFactors+1;
	if (((AGE >= 70) & (NumRiskFactors >= 1))
		| ((50 <= AGE < 70) & (NumRiskFactors >= 2))
		| ((AGE < 50) & (NumRiskFactors >= 3)));
run;
The above code first counts the number of risk factors present and then keeps only those observations where the combination of age and number of risk factors meets one of the high risk criteria, outputting the results to dataset “HiRisk”.

There are more compact ways of doing the above but I have used a more verbose approach that is easier to follow. If you want to know how many risk factors are present, just remove the “drop NumRiskFactors;” directive from the above code.
 
Last edited:
#4
Re: subsetting data with criteria involving multiple variables/combinations of variab

Thank-you very much! That worked perfectly and was easy for me to follow.
 
#5
Re: subsetting data with criteria involving multiple variables/combinations of variab

Hi,

This is a simpler variation on my first question on this thread:

I have another dataset I'm trying to create that is based on a combination of variables and I have so far been unsuccessful in creating a syntax to select for what I want to include.
I want the following patients:
STAGE "1" and GRADE 3
STAGE "1a" and GRADE 3
STAGE "1b" and GRADE 2-3
STAGE "1c" and GRADE 1-3 (all grades)

so far I've tried
data subset
set alldata
if STAGE = "1" & GRADE = 3;
if STAGE = "1A" & GRADE = 3;
if STAGE = "1B" & GRADE >= 2;
if STAGE = "1C";
run;

and have gotten errors. Am I going about this wrong- should I be deleting instead, or making multiple smaller datasets and merging? I thought the best way would be to select for the patients I want but so far have not gotten this to work in SAS.

Thanks!
 
#6
Re: subsetting data with criteria involving multiple variables/combinations of variab

think I may have answered my own question- need OR statements in the data step right?

apologies.
 
#7
Re: subsetting data with criteria involving multiple variables/combinations of variab

Yes, you’ll need to separate the individual “if” conditions with an “or”. As it stands, the four sequential “if” conditions combine as though they were joined by “and”, and are therefore mutually exclusive: It is logically impossible to have an observation where both (STAGE = "1A") and (STAGE = "1B") are true. Ergo, the construct you seek is:
Code:
	if (((STAGE = "1") & (GRADE = 3))
		| ((STAGE = "1A") & (GRADE = 3));
		| ((STAGE = "1B") & (GRADE >= 2));
		| (STAGE = "1C"));
 

hlsmith

Less is more. Stay pure. Stay poor.
#8
Re: subsetting data with criteria involving multiple variables/combinations of variab

Con-Tester,


Can you talk about using the pipe here ("|")? Is it just functioning as an "OR"? I have not used it much in SAS, say just in interaction terms with an @2.
 
#9
Re: subsetting data with criteria involving multiple variables/combinations of variab

The grammar, syntax and vocabulary of SAS’s scripting language is a hodgepodge of elements and conventions taken from various computer programming languages, chiefly FORTRAN, BASIC, C and Pascal. The pipe character “|” functions as a logical “or” operator, the circumflex “^” as a logical “exclusive or (xor)”, the tilde “~” as a logical “not”, and the ampersand “&” as a logical “and”. This notation was taken from the C language but in each case one could write the operators out in alphabetic characters, i.e. “or”, “and”, “xor” and “not”, to achieve the same thing, just as there are word-like mnemonics in SAS for comparison operators, e.g., “ne” for “~=” (not equal to) or “ge” for “>=” (greater or equal than), etc. These comparison mnemonics were taken from the FORTRAN language.

Personally, I have a preference for the shorter notation that uses the non-alphabetic symbols. It improves readability for me over the mnemonics because operators are immediately distinguishable from variables and function names. It may also be because I spent several years working as a programmer and the notations are more familiar.

Oh, and using two pipe characters together “||” acts as a string concatenation operator.
 

hlsmith

Less is more. Stay pure. Stay poor.
#10
Re: subsetting data with criteria involving multiple variables/combinations of variab

Con-Tester,


I know it is not propoer etiquette to hijack a thread, but I have a quick question and I know you are good at SAS data management code.


I simulated then bootstrapped some data. Now I have one long data set with say 10 bootstraps (replicates) in it. I want to collapse it down to counts per bootstrap sample. So say I have three variables:


Replicate (which subsample of bootstrap 1-10)
Exposure (binary either 1 or 0)
Outcome (binary either 1 or 0)


Each subsample has 100 observations and with 10 bootstrap samples total resulting in a 1000 observations. I want this collapsed into say:


Replicate exp0_out0 exp1_out0 exp0_out1 exp1_out1
1 25 25 20 30
2 30 20 22 28
.
.
.
.
.
.
.
.
10 5 50 15 30


Do you have any recommendations?
 
#11
Re: subsetting data with criteria involving multiple variables/combinations of variab

There may be a single procedure in SAS that will do what you require but I’m not aware of any such. There are several ways to solve the problem, but here’s how I would do it:
Code:
proc	freq data = SampleData noprint;
	tables	Replicate*Exposure*Outcome
		/out = f0 (drop = PERCENT);
run;

data	f1 (rename = (COUNT = exp0_out0))
	f2 (rename = (COUNT = exp1_out0))
	f3 (rename = (COUNT = exp0_out1))
	f4 (rename = (COUNT = exp1_out1));
	set	f0;
	keep	Replicate
		COUNT;
	label	COUNT = " ";
	if ((Exposure = 0) & (Outcome = 0)) then
		output f1;
	else if ((Exposure = 1) & (Outcome = 0)) then
		output f2;
	else if ((Exposure = 0) & (Outcome = 1)) then
		output f3;
	else if ((Exposure = 1) & (Outcome = 1)) then
 		output f4;
run;

data MyFreqTable;
	merge	f1-f4;
	by	Replicate;
run;

proc delete data = f0 f1 f2 f3 f4;
run;
This approach works fine where discrete, categorical and binary variables are in play and the number of different values for these is small.

EDIT: …or you could simply use this if you just want a printout:
Code:
proc	tabulate data = SampleData;
	class	Replicate
		Exposure
		Outcome;
	tables	Replicate,
		Exposure*Outcome;
run;
 
Last edited: