Conversion of data from numerical to groups for Chi2 test in R Cmdr - PLEASE help

#1
Hi :)
I want to see whether the frequencies of specific genotypes are associated with BMI (body mass index). I believe that I need to perform a Chi2 test, where BMI is divided into groups (25-30 = overweight, >30 = obese). Is that true?

My problem is that I can't figure out how to convert my numerical BMI data into these two groups.
If someone could help I would be really grateful, preferably in R Commander, but regular R is also okay :)

Thank you!
 

bryangoodrich

Probably A Mammal
#2
The easiest way to convert numerical to factor data types is to use the "cut" function. See "?cut" for details, but it is pretty intuitive. You want to cut your numerical data based on some break points and possibly provide meaningful labels for the factors. The breaks are a vector specifying the start and end points of your bin. See help for details about which side of the bin is open or closed (i.e., do you want a left open or right open interval: [a, b) or (a, b]). For three bins you will need to specify four break points. Think about that if you don't get it. You can also specify an integer for how many bins you want and let R decide how to break up your numeric variable. If it can, it will do it evenly, but that isn't always the case. See the examples below.

Code:
(x <- 1:10)
#   [1]  1  2  3  4  5  6  7  8  9 10

cut(x, c(0, 4, 8, 12), LETTERS[1:3])
#   [1] A A A A B B B B C C
#  Levels: A B C

cut(x, c(0, 4, 8, 12))
#   [1] (0,4]  (0,4]  (0,4]  (0,4]  (4,8]  (4,8]  (4,8]  (4,8]  (8,12] (8,12]
#  Levels: (0,4] (4,8] (8,12]

cut(x, 3)
#   [1] (0.991,4] (0.991,4] (0.991,4] (4,7]     (4,7]     (4,7]     (4,7]     (7,10]   
#   [9] (7,10]    (7,10]   
#  Levels: (0.991,4] (4,7] (7,10]

cut(x, 2, c("Low", "High"))
#   [1] Low  Low  Low  Low  Low  High High High High High
#  Levels: Low High

data.frame(x, y = cut(x, 5, LETTERS[1:5]))
#      x y
#  1   1 A
#  2   2 A
#  3   3 B
#  4   4 B
#  5   5 C
#  6   6 C
#  7   7 D
#  8   8 D
#  9   9 E
#  10 10 E
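
The open/closed side of each bin mentioned above is controlled by the "right" argument of cut. As a small sketch using the same x and break points as in the examples (the variable names right_closed and left_closed are just for illustration), compare where the value 4 ends up:

```r
x <- 1:10

# Default right = TRUE: intervals are (a, b], so 4 falls in the first bin.
right_closed <- cut(x, c(0, 4, 8, 12))

# right = FALSE: intervals are [a, b), so 4 moves to the second bin.
left_closed <- cut(x, c(0, 4, 8, 12), right = FALSE)

# Side-by-side view of both groupings.
data.frame(x, right_closed, left_closed)
```

Values that sit exactly on a break point (4 and 8 here) are the only ones that change bins, which is why the choice matters for cutoffs like BMI 25 or 30.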
 
#3
Thank you, this is very helpful.

How do I specify my variable as x? I have tried writing (x <- BMI1) since BMI1 is the name of the variable that I wish divided into groups, but R says Error: object "BMI1" not found?

I wish to divide my data into two groups: 25-30 and >30. What is the endpoint of my last bin supposed to be? cut(x, c(25, 30, ?), LETTERS[1:2])

Thanks again :)
 

bryangoodrich

Probably A Mammal
#4
You can check what variables exist in your current working environment with the "ls" function. If you start a new session, you will not have those from a previous session unless you saved your last session. You may need to read your data back in or check its spelling (case matters in R). You should know the range of your data, and you can check this in a number of ways: "summary" and "range" are two good methods. The "max" and "min" functions also come to mind. Suppose BMI1 is a vector of randomly chosen values between 10 and 40. Then you can do something like the following with it:

Code:
(BMI1 = sample(10:40, 10, replace = TRUE))
#  [1] 24 32 19 29 30 12 23 30 33 16

summary(BMI1)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#    12.0    20.0    26.5    24.8    30.0    33.0 

range(BMI1)
# [1] 12 33

max(BMI1)
# [1] 33
data.frame(BMI1, Bins = cut(BMI1, c(25, 30, max(BMI1)), right = FALSE))
#    BMI1    Bins
# 1    24    <NA>
# 2    32 [30,33)
# 3    19    <NA>
# 4    29 [25,30)
# 5    30 [30,33)
# 6    12    <NA>
# 7    23    <NA>
# 8    30 [30,33)
# 9    33    <NA>
# 10   16    <NA>
## Some items not grouped. Instead, do this:

data.frame(BMI1, Bins = cut(BMI1, c(0, 25, 30, Inf), labels = c("Excluded", "Overweight", "Obese"), right = FALSE))
#    BMI1       Bins
# 1    24   Excluded
# 2    32      Obese
# 3    19   Excluded
# 4    29 Overweight
# 5    30      Obese
# 6    12   Excluded
# 7    23   Excluded
# 8    30      Obese
# 9    33      Obese
# 10   16   Excluded
Notice that R has a definition for infinity (Inf). We can use it to specify an open-ended range like "> 30." You also want to give the values in your dataset some label if they fall outside the range of interest, rather than leaving them as NA. I labeled those that fall under 25 as "Excluded" so that you can subset your data afterward to drop them. Notice also that I included the argument "right = FALSE." Try it with and without that argument and notice the difference. The default is "right = TRUE", which produces intervals like (a, b]. We don't want our exclusion range to be (0, 25], we want it to be [0, 25), which is why you must include the argument. Also notice that in the first attempt the max value (33) wasn't included, since the ranges were right-open and the last bin was [30, 33). If you don't use Inf, you can instead use something like "max(BMI1)+1" as the last break point. That "+1" buffer will make sure that values of 33 get included.
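
The subsetting step mentioned above could look something like the following. This is only a sketch: the data frame d and the result kept are made-up names, the BMI values are the same made-up ones from the example, and the group labels follow the grouping in the original question (25-30 overweight, 30 and up obese).

```r
# Same made-up values as in the example above.
BMI1 <- c(24, 32, 19, 29, 30, 12, 23, 30, 33, 16)
d <- data.frame(BMI1)

# Inf as the last break keeps the top value (33) without a "+1" buffer.
d$Bins <- cut(d$BMI1, c(0, 25, 30, Inf),
              labels = c("Excluded", "Overweight", "Obese"),
              right = FALSE)

# Keep only the two groups of interest, then drop the now-empty level
# so it does not show up as a zero count in later tables.
kept <- droplevels(subset(d, Bins != "Excluded"))

# Counts per remaining group.
table(kept$Bins)
```

Without droplevels, "Excluded" would still be listed as a level with a count of zero, which tends to get in the way of a later contingency table.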
 
#5
Thank you for your help, this is very nice of you.
My whole dataset is named Louise2 and contains these variables:
> ls(Louise2)
[1] "Age" "BMI1" "BMI2"
[4] "Family" "Gender" "Height"
[7] "Partner" "rs10521303" "rs10521304"
[10] "rs11872992" "rs1477196" "rs17817288"
[13] "rs8093815" "rs9939609" "Weight_Day1"
[16] "Weight_Day2" "Weight_Loss_Kg" "Weight_Loss_Percent"

I wish to divide only the BMI1 variable into groups, but R won't recognize it, only the whole dataset. How do I tell R that it is supposed to load and change only the BMI1 variable?
I have tried writing data.frame(Louise2, Bins = cut(Louise2, c(25, 30, max(Louise2)), right = FALSE)), but since there are non-numerical variables in my dataset this doesn't work.
 

bryangoodrich

Probably A Mammal
#6
The 'cut' function only works on numerical data. It makes no sense to cut the set {1, 2, 3, 'A', 4, 5} into (0, 3] and (3, 5], you know? So you need to pass it the single numeric variable, not the whole dataset. If all those variables are within a dataset, you can access one of them in a number of ways. I suggest reading the "An Introduction to R" manual (see Help > Manuals > ...). You can call it by name: df$var. You can call it by index: df[, "var"] or (for data frames only, not matrices) df[["var"]]. You can call it by position: df[, 3] or df[[3]]. You can use the 'with' function: with(df, cut(var, ...)). The 'with' function is basically a shorthand for using 'attach', such as "attach(df) ... cut(var, ...) ... detach(df)". If it's a single command you need access to a variable for, I would not use attach, though. Now, you need to replace the variable or create a new one with your cut. To replace it: df$BMI <- cut(df$BMI, ...). To create a new variable: df <- transform(df, gBMI = cut(BMI, ...)). Note that transform returns a new data frame, so you have to assign the result back.
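
Putting that together for this case, here is a rough sketch. The data below are randomly generated stand-ins, not the real Louise2 dataset: only the column names BMI1 and rs9939609 come from the thread, while the genotype codes "AA", "AT", "TT", the new column name BMI_group, and the break points are assumptions for illustration.

```r
# Made-up stand-in for Louise2; only two of its columns are mocked.
set.seed(1)
Louise2 <- data.frame(
  BMI1      = runif(200, min = 25, max = 40),
  rs9939609 = sample(c("AA", "AT", "TT"), 200, replace = TRUE)
)

# Refer to the column through the data frame, not as a bare name.
Louise2$BMI_group <- cut(Louise2$BMI1, c(25, 30, Inf),
                         labels = c("Overweight", "Obese"),
                         right = FALSE)

# The same cut using with(), which looks up BMI1 inside Louise2.
via_with <- with(Louise2, cut(BMI1, c(25, 30, Inf),
                              labels = c("Overweight", "Obese"),
                              right = FALSE))

# Cross-tabulate genotype against BMI group, then run the Chi2 test.
tab <- table(Louise2$rs9939609, Louise2$BMI_group)
result <- chisq.test(tab)
result
```

The table passed to chisq.test has one row per genotype and one column per BMI group. Whether a Chi2 test is the right choice for the real data depends on things like the cell counts, so treat this as a starting point only.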

Hope those help.