Generating one variable from a multiple-answer question

belfagor71

New Member
I'll try to explain what my problem is.
I am cleaning data and I have one question where multiple ticks are allowed.
The question I am dealing with now is about the way in which water is purified.
There are 6 variables: 1 of these is "I don't do anything to purify the water", the other 5 being "I do purify water in this way...".

Since more people do not purify the water than do, and this is relevant information, I would like to create one variable where 1 stands for "I do purify the water" and 2 for "I do not purify the water".

However, since multiple ticks were allowed, if I simply do

gen x = .
replace x = 1 if (pur_a == 1 | pur_b == 2 | pur_c == 3)
the frequency that I get is actually lower than the number of answers. So basically what Stata does is assign only one answer per person and not more than one. This is why I get a lower frequency.

Is there a way to work this out and get the actual frequency?

I hope I made the point clear... it is a bit of a headache indeed!

Hope to hear from you guys!

belfagor

belfagor71

New Member
OK, I'll copy something here. Let me know if it is enough or if you need more.

   How many |
      women |
     purify |
drink.water |
 by boiling |
        it? |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        336      100.00      100.00
------------+-----------------------------------
      Total |        336      100.00

   How many |
      women |
     purify |
drink.water |
    through |
  chlorine? |      Freq.     Percent        Cum.
------------+-----------------------------------
          2 |         21      100.00      100.00
------------+-----------------------------------
      Total |         21      100.00

And so on. If I tab each of the 5 purification variables and sum the frequencies, I get 960.
However, if I do:
gen x =.

replace x = 1 if (pur_boiling == 1 | pur_chlorine == 2 | pur_filter == 3 | pur_water_ftr == 4 | pur_other == 6)

replace x = 2 if pur_nothing == 5

. tab x

          x |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        839       33.82       33.82
          2 |      1,642       66.18      100.00
------------+-----------------------------------
      Total |      2,481      100.00

The number that I get for x = 1 is 839, which is lower than 960.
The thing is, I want to show the % of people (the majority) who do not purify the water in comparison with the people who do purify their water.
Because of that, it is important to have the precise figure and not one which is lower.
However, since the question, as it was put in the survey, allowed for multiple ticks, I guess that what Stata does is consider only one answer per respondent.
Is there a way to sort this out and have a variable with two values, one with the exact number of people who do purify their water and the other with the number of people who do not purify it?

Thanks a lot!!

bukharin

RoboStataRaptor
Is there a way to sort this out and have a variable with two values, one with the exact number of people who do purify their water and the other with the number of people who do not purify it?
Perhaps I'm not understanding correctly, but this seems to be a simple binary outcome - either they purify their water or they don't. So I think your code is fine. You get 960 because some women use more than one method for purifying their water: they are counted once per method in the sum of the separate tabulations, but only once in x. You should double-check that nobody replied that they don't purify water, but also checked one or more of the purification methods.
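One way to see the difference between ticks and respondents is to count how many methods each person ticked. The following is only a sketch on made-up toy data: the variable n_methods and the four respondents are hypothetical, and only two of the pur_* variables from the thread are used to keep it short.

```stata
* Toy data (hypothetical): one row per respondent, coded as in the thread
clear
input id pur_boiling pur_chlorine pur_nothing
1 1 . .
2 1 2 .
3 . . 5
4 . 2 5
end

* Number of purification methods ticked by each respondent
* (a comparison such as pur_boiling == 1 evaluates to 1 if true, 0 otherwise)
gen n_methods = (pur_boiling == 1) + (pur_chlorine == 2)

* Total number of ticks: what summing the separate tabulations gives
quietly summarize n_methods
display "total ticks: " r(sum)

* Number of respondents with at least one tick: what x == 1 counts
count if n_methods > 0
```

In the toy data respondent 2 ticked two methods, so the total number of ticks is larger than the number of respondents who purify, just as 960 is larger than 839.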

belfagor71

New Member
You should double-check that nobody replied that they don't purify water, but also checked one or more of the purification methods.
Hmm... what do you mean exactly?

bukharin

RoboStataRaptor
Code:
gen x =.
replace x = 1 if (pur_boiling == 1 | pur_chlorine == 2 | pur_filter == 3 | pur_water_ftr == 4 | pur_other == 6)
count if x==1 & pur_nothing==5 // this should return 0
replace x = 2 if pur_nothing == 5

belfagor71

New Member
OK, so if I type
count if x==1 & pur_nothing==5
I get 34, which means, I guess, that there are 34 people who do purify and who do not purify water at the same time, right?
So what could I do now? Perhaps just remove these 34 so that I get the number who do not purify the water at all?

bukharin

RoboStataRaptor
OK, so if I type
count if x==1 & pur_nothing==5
I get 34, which means, I guess, that there are 34 people who do purify and who do not purify water at the same time, right?
Correct! Welcome to the world of data cleaning...

What to do next is entirely up to you. Best practice would be to go back to the forms and make sure the data's been entered into the computer correctly. Assuming they've been entered correctly, you could ignore those respondents (as you've suggested) or you could assume that if they specified any purifying technique, they should have an x of 1 (you could implement that by swapping the order of your two replace commands).
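The two options could be sketched side by side as follows. This is an illustration on made-up toy data, not a recommendation of either rule: the names any_method, x_drop and x_any are hypothetical, and only two of the pur_* variables are used.

```stata
* Toy data (hypothetical): respondent 4 ticked both chlorine and "nothing"
clear
input id pur_boiling pur_chlorine pur_nothing
1 1 . .
2 . 2 .
3 . . 5
4 . 2 5
end

* Flag respondents who ticked any purification method
gen byte any_method = (pur_boiling == 1 | pur_chlorine == 2)

* Option A: ignore contradictory respondents (x_drop stays missing for them)
gen x_drop = .
replace x_drop = 1 if any_method == 1 & pur_nothing != 5
replace x_drop = 2 if any_method == 0 & pur_nothing == 5

* Option B: any ticked method overrides the "nothing" tick
gen x_any = .
replace x_any = 2 if pur_nothing == 5
replace x_any = 1 if any_method == 1

list id any_method pur_nothing x_drop x_any
```

Respondent 4 ends up with a missing x_drop under option A and with x_any equal to 1 under option B; the other respondents get the same value either way.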

belfagor71

New Member
Assuming they've been entered correctly
Yes, they were, as the survey itself did allow for multiple ticks.

you could assume that if they specified any purifying technique, they should have an x of 1 (you could implement that by swapping the order of your two replace commands).
So you mean by doing
gen x =.
replace x = 1 if (pur_boiling ~= 1 & pur_chlorine ~= 2 & pur_filter ~= 3 & pur_water_ftr ~= 4 & pur_other ~= 6) & pur_nothing == 5

Indeed, if I do:
tab pur_nothing
I get a freq of 1642.
However, if I type the above command, I get 1608 real changes made.
So I assume 1608 is the freq of people who do not purify water at all.
Makes sense?

After all, even if Stata assigns only one answer to each respondent, that is still OK for me. As long as I know that a person purifies water, it matters little to me in which way. The most important thing for me is to know how many people do not do anything at all, and I guess the command above sorted that out.

...If so, I guess my issue was sorted! Thanks sooo much! It's incredible to see how helpful it is to share these questions here, bukharin

bukharin

RoboStataRaptor
Yes, they were, as the survey itself did allow for multiple ticks.
That doesn't necessarily mean it's been entered into the computer correctly - but I take your point.

So you mean by doing
gen x =.
replace x = 1 if (pur_boiling ~= 1 & pur_chlorine ~= 2 & pur_filter ~= 3 & pur_water_ftr ~= 4 & pur_other ~= 6) & pur_nothing == 5
Actually I just meant that instead of:
Code:
gen x =.
replace x = 1 if (pur_boiling == 1 | pur_chlorine == 2 | pur_filter == 3 | pur_water_ftr == 4 | pur_other == 6)
replace x = 2 if pur_nothing == 5
You could simply swap the last 2 lines around like this:
Code:
gen x =.
replace x = 2 if pur_nothing == 5
replace x = 1 if (pur_boiling == 1 | pur_chlorine == 2 | pur_filter == 3 | pur_water_ftr == 4 | pur_other == 6)
This way you're giving precedence to the purifying techniques, i.e. if they specify any technique, this overrides the response that they don't purify.

Indeed, if I do:
tab pur_nothing
I get a freq of 1642.
However, if I type the above command, I get 1608 real changes made.
So I assume 1608 is the freq of people who do not purify water at all.
Makes sense?
Yes, and 1642 - 1608 = 34, which fits with the 34 you got from the earlier -count- command. (Always good to check these things in more than one way if possible.)
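A further cross-check, sketched here on made-up toy data, is a two-way tabulation of "ticked nothing" against "ticked any method". The indicator names nothing and any_method are hypothetical, and only two of the pur_* variables are used.

```stata
* Toy data (hypothetical): three respondents ticked "nothing",
* one of whom also ticked chlorine
clear
input id pur_boiling pur_chlorine pur_nothing
1 1 . .
2 . . 5
3 . 2 5
4 . . 5
end

* 0/1 indicators for the two kinds of answer
gen byte any_method = (pur_boiling == 1 | pur_chlorine == 2)
gen byte nothing    = (pur_nothing == 5)

* The cell nothing == 1, any_method == 1 holds the contradictory respondents
tabulate nothing any_method

* Respondents who ticked "nothing" and no method at all
count if nothing == 1 & any_method == 0
```

In the real data the row total for nothing == 1 should correspond to the 1642, the contradictory cell to the 34, and the remaining cell to the 1608.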

...If so, I guess my issue was sorted! Thanks sooo much! It's incredible to see how helpful it is to share these questions here, bukharin
No worries

belfagor71

New Member
Code:
gen x =.
replace x = 2 if pur_nothing == 5
replace x = 1 if (pur_boiling == 1 | pur_chlorine == 2 | pur_filter == 3 | pur_water_ftr == 4 | pur_other == 6)
OK, that's quite important, as I did not get this before.
So by typing replace x = 2 if pur_nothing == 5 as the first replace command I am actually telling Stata: consider only those women who do not purify water at all. So if a person has one tick for pur_chlorine and one tick for pur_nothing, Stata will not consider the answers of this respondent, right? It will consider only those who had only the tick for pur_nothing.

In contrast, if I type this as the second command, Stata will pick up those women who do purify the water and those who don't.

That's good to know: I thought that the order in which you wrote the commands did not matter at all.
But it does matter!
I admit, though, that a graphic example of how Stata reads and picks up the values would really help...

bukharin

RoboStataRaptor
The order definitely matters; otherwise there would be no way of making complex changes to a dataset! What the commands are doing is:

gen x =.
-> first, generate a new variable called x, with a missing value for all observations (i.e. in this case all women)

replace x = 2 if pur_nothing == 5
-> then, change x from "missing" to 2 if the woman ticked the box saying she didn't purify. You can also think of this command as saying "in the subset of women who ticked the box saying they didn't purify, change x to 2"

replace x = 1 if (pur_boiling == 1 | pur_chlorine == 2 | pur_filter == 3 | pur_water_ftr == 4 | pur_other == 6)
-> finally, change x from whatever it is (either missing or 2) to 1 if the woman ticked any of those other boxes. You can also think of this command as saying "in the subset of women who ticked one of these other boxes, change x to 1"

It's easy enough to concoct your own examples to see what's happening. Or, just take a small subset of your dataset. Run the commands one at a time and after each one look at the data to see if it's changed in the way you expected. This is a time-consuming but invaluable process.
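A concocted example of that kind might look like the sketch below. The three respondents are made up, only pur_chlorine and pur_nothing are used, and the names x_a and x_b are hypothetical; the point is only to watch the values change after each command.

```stata
* Toy data (hypothetical): respondent 3 ticked both a method and "nothing"
clear
input id pur_chlorine pur_nothing
1 2 .
2 . 5
3 2 5
end

* Original order: the "nothing" tick is applied last, so it wins
gen x_a = .
replace x_a = 1 if pur_chlorine == 2
list
replace x_a = 2 if pur_nothing == 5
list

* Swapped order: the purification method is applied last, so it wins
gen x_b = .
replace x_b = 2 if pur_nothing == 5
list
replace x_b = 1 if pur_chlorine == 2
list
```

Respondents 1 and 2 get the same value under both orders. Only respondent 3, who ticked both boxes, differs: the later replace overwrites the earlier one, giving x_a of 2 and x_b of 1.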