Principal Component Analysis in R Help

#1
I am a beginner with R. I have read several guides but am still stuck on this:

I have data in an excel csv file, on which I want to run PCA.
I'm not sure how the prcomp formula works. The help page states:
prcomp(x, retx = TRUE, center = TRUE, scale. = FALSE,
tol = NULL, ...)

What is x referring to? I tried putting the file name in for x, but I get the following error:
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric

What kind of numeric value do I need to put in for x?

Potentially helpful information: my data sheet has around 48 columns and over 7000 rows. I have converted the csv file into a matrix in R.

Thanks in advance for all your help
 

bugman

Super Moderator
#2
x is the name of your data frame or matrix (i.e. the name you gave the object when you read your file into R).
 
#3
I put in the name of my matrix, but I got the error:
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric

Not sure what to do with this.
 

Lazar

Phineas Packard
#5
This suggests that not all of the variables in your matrix are numeric. You most likely want to run PCA on only a subset of the variables (I guess), so you will need to pass prcomp a subsetted dataset. For example, say you have a dataset called myData and you want to do PCA on variables 3 to 100. You can use
Code:
prcomp(myData[,3:100])
When subsetting data.frames and matrices, row selection goes to the left of the comma in the square brackets and column selection to the right, i.e. myData[rows,columns]. So myData[,3:100] says: take all rows but only columns 3 to 100.
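A minimal sketch of that subsetting, using a made-up data frame (this myData and its column names are hypothetical, not the poster's data). It also shows picking columns by type instead of by position, which is an alternative to hard-coding a range such as 3:100.

```r
# Toy data frame (hypothetical): one character column, three numeric columns
myData <- data.frame(
  id = c("a", "b", "c", "d", "e"),
  v1 = c(1.0, 2.1, 2.9, 4.2, 5.1),
  v2 = c(2.0, 1.5, 3.5, 3.0, 4.5),
  v3 = c(0.5, 0.7, 0.2, 0.9, 0.4),
  stringsAsFactors = FALSE
)

# All rows, columns 2 to 4 only (the numeric ones)
pca_by_position <- prcomp(myData[, 2:4])

# Alternative: keep whichever columns are numeric
numeric_cols <- sapply(myData, is.numeric)
pca_by_type <- prcomp(myData[, numeric_cols])

summary(pca_by_position)
```

Both calls see the same three columns here, so they should give the same components.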
 
#7
All variables are numeric; I have checked.
My data sheet does have row names, however. Is it possible that R is reading them as data? If so, how can I fix this?
 

Lazar

Phineas Packard
#8
OK. I am pretty sure you have not read your data into R correctly. Can you post your whole script? I ask because this output:
Code:
> head(inferno)
[,1]
[1,] "Genes.csv"
is most certainly not what you want. My guess is that you did:
Code:
inferno <- "Genes.csv"
NOPE. That only stores the file name as a character string; it does not read the file.

You want:
Code:
inferno <- read.csv("Genes.csv")
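To make the difference concrete, here is a small sketch. It writes a tiny made-up CSV to a temporary file so that it runs on its own; the file contents and the names csv_path, not_data and genes are illustrative assumptions, not the real Genes.csv.

```r
# Write a tiny made-up CSV so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "name,t0,t1,t2",
  "g1,0,0.08,0.38",
  "g2,0,0.77,1.72",
  "g3,0,0.00,0.00"
), csv_path)

# Assigning the path only stores a character string, not the data
not_data <- csv_path
class(not_data)   # "character"

# read.csv() parses the file into a data.frame
genes <- read.csv(csv_path, header = TRUE)
class(genes)      # "data.frame"
dim(genes)        # 3 rows, 4 columns
head(genes)
```

Checking class() and head() right after reading is a quick way to confirm that the object holds the table rather than the file name.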
 
#9
Thanks! That seems to have been my error. However, the prcomp function still gives an error stating that 'x' must be numeric.

After entering what you suggested, this is part of the output (it was very long):

Gene.Name X0.min X2.min X3.min
1 78SDA 0 0.07768191 0.3793334
2 SDFK 0 0.77090604 1.7159830
3 SF56 0 0.00000000 0.0000000
4 89SFA 0 0.00000000 0.0000000
5 AFJK2 0 0.00000000 0.0000000
6 SUP23 0 0.00000000 0.0000000
 

Lazar

Phineas Packard
#10
Well, not all of the variables above are numeric. 78SDA is not numeric; it is a character string. See my first post on how to subset only the variables you need.
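One possible way to do this, sketched on made-up data shaped like the printout above (the gene labels and values are illustrative, and expr is a hypothetical name): move the gene names into the row names and drop the character column before calling prcomp.

```r
# Made-up data shaped like the printout above (values are illustrative)
inferno <- data.frame(
  Gene.Name = c("g1", "g2", "g3", "g4"),
  X0.min = c(0, 0, 0, 0),
  X2.min = c(0.08, 0.77, 0.00, 0.31),
  X3.min = c(0.38, 1.72, 0.00, 0.95),
  stringsAsFactors = FALSE
)

# Keep the gene names as row names, then drop the character column
rownames(inferno) <- inferno$Gene.Name
expr <- inferno[, -1]          # negative index: everything except column 1

all(sapply(expr, is.numeric))  # check that every remaining column is numeric
pca <- prcomp(expr)
rownames(pca$x)                # scores stay labelled by gene
```

read.csv() also accepts row.names = 1, which treats the first column as row names at read time, so the names are never stored as a data column.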
 
#11
This is what I get when I omit the first column (which contains the gene names):

> prcomp(infernos[,2:49])
Error in infernos[, 2:49] : subscript out of bounds
 

Lazar

Phineas Packard
#14
What do you get when you type in

Code:
str(inferno) # str means show me the structure of an object
and

Code:
dim(inferno) # dim means give me the dimensions of an object: rows first, then columns
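As a rough illustration of why the dimensions matter, the sketch below uses a small hypothetical matrix m (not the real data) and asks for one column more than it has, which reproduces the "subscript out of bounds" message.

```r
# A 2 x 3 numeric matrix standing in for the real data (hypothetical)
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)

dim(m)    # rows first, then columns
ncol(m)   # number of columns, i.e. the largest valid column index

# Asking for columns 2 to 4 of a 3-column matrix fails;
# tryCatch() captures the error message instead of stopping
msg <- tryCatch(
  { m[, 2:4]; "no error" },
  error = function(e) conditionMessage(e)
)
msg

# Deriving the range from ncol() avoids hard-coding the upper bound
m[, 2:ncol(m)]
```

If the index range in a call like x[, 2:49] exceeds what dim(x) reports, that is one likely source of the error.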
 
#15
The dimensions of my data, according to dim(inferno), are 7000 by 48.
I deleted the gene name column so that my data would not contain any characters besides the header, which I set to TRUE.

I am still getting the error: Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
when I run prcomp :/
 

Lazar

Phineas Packard
#18
Surely that is not all the output? For example:
Code:
> str(iris)
'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
See, after $ Sepal.Length: there is the value 'num'. That means the variable is numeric.

BUT for $ Species: it says 'Factor', so this variable is not numeric.
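The same check can be done programmatically. This is a sketch on the built-in iris data; is_num and iris_pca are names chosen here for illustration.

```r
# iris ships with R; Species is a factor, the other four columns are numeric
is_num <- sapply(iris, is.numeric)
is_num

# prcomp(iris) would fail with "'x' must be numeric" because of Species;
# dropping the non-numeric column lets it run
iris_pca <- prcomp(iris[, is_num], center = TRUE, scale. = TRUE)
summary(iris_pca)
```

Running sapply(inferno, is.numeric) in the same way would flag any column that str() reports as something other than num or int.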
 
#19
> str(inferno)
'data.frame': 7000 obs. of 48 variables:
$ X0.min : int 0 0 0 0 0 0 0 0 0 0 ...
$ X2.min : num 0.0777 0.7709 0 0 0 ...
$ X3.min : num 0.379 1.716 0 0 0 ...
$ X4.min : num 0 1.79 0 0 0 ...

The rest of the variables are all "num". Only the first one is "int".
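For what it is worth, an "int" column counts as numeric in R, so on its own it should not trigger the error. The sketch below uses a small hypothetical data frame (toy) that mimics the str() output above; it does not explain the remaining error, it only shows that a mix of int and num columns is accepted.

```r
# Hypothetical columns mimicking the str() output: one int, two num
toy <- data.frame(
  X0.min = c(0L, 0L, 0L, 0L),
  X2.min = c(0.0777, 0.7709, 0, 0),
  X3.min = c(0.379, 1.716, 0, 0)
)

sapply(toy, class)        # "integer" "numeric" "numeric"
sapply(toy, is.numeric)   # all TRUE: integer columns count as numeric

# Unscaled PCA runs on this mix of int and num columns
toy_pca <- prcomp(toy)
toy_pca$sdev
```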