Averaging rows from a matrix

#1
I'm fairly new to using R, and wonder if someone could give me some help subsetting large data matrices in specific ways.

I've attached a data file. It measures the percent of light reflected off of plants at 2048 different wavelengths. The first column is wavelength, which ranges from 339.99 to 1024.14 nm in approximately 0.3 nm steps. The first row contains labels for the field plots that were sampled; these are mostly numbers but sometimes characters (WB for white board, which is a control).

This sample data file has 2048 rows (wavelengths) by 10 columns (field plots), plus a column and row with labels. The real data files have several hundred columns (field plots).

I am interested in seeing how the values of certain wavelengths differ between samples (columns). For example, the parameter called the Water Index (WI) describes the drought tolerance of different plants. Water Index = reflectance at 970 nm / reflectance at 900 nm.
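(For example, with made-up reflectances of 45% at 970 nm and 50% at 900 nm, WI = 45/50 = 0.9.)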

I would like to be able to ask R to do things like:
- Average the ____ values closest to ____ nm for each sample, and save this as a vector.
- Average the values of all rows between ____ and ____ nm, and save this as a vector.

Specifically, this would be:
- Average the 3 values (rows) closest to 970 nm and save this as a vector called 970_avg3.
- Average the 5 values (rows) closest to 970 nm and save this as a vector called 970_avg5.
- Average the 10 values (rows) closest to 970 nm and save this as a vector called 970_avg10.

But also I would want to be able to look at discrete ranges, like
- Average all rows that are >969 & <971 nm, and save this as a vector called 970

What's most important, though, is that I can easily change the target wavelength, so instead of looking for 970 nm I could just as easily ask for 500, 700, 1025, or any other number.

Thanks in advance for your help.

Sarah
 

Jake

Cookie Scientist
#2
Assume your data frame is called "dat":
Code:
# For each column: average the nRows values in that column closest to nm
sapply(dat, function(x, nRows=5, nm=970) mean(x[rank((x-nm)^2) <= nRows]))
# For each column: average that column's values strictly between low and high
sapply(dat, function(x, low=969, high=971) mean(x[x > low & x < high]))
P.S. The names you suggested for the vectors are illegal in R (object names cannot start with numbers).
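A quick usage sketch (untested, same assumed "dat"): the defaults can be overridden to target any wavelength, which covers your last requirement:
Code:
# Average the 10 values closest to 700 nm in each column
sapply(dat, function(x, nRows, nm) mean(x[rank((x - nm)^2) <= nRows]), nRows = 10, nm = 700)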
 
#3
Thanks for the code, but I have some problems using it:

In this data set, the wavelengths >520 and <522 nm are in rows 490-495. You can see the reflectance values are about 5-10% for the different samples.

# Wavelength WB X1 X22 X23 X44 X45 X66 X67 X88 X89
490 520.0646 100.09985 7.036944 8.104620 5.582466 6.486475 6.695148 7.103472 7.408403 7.291578 6.781109
491 520.4239 100.08833 7.059483 8.120658 5.600673 6.513491 6.733979 7.120234 7.449979 7.340751 6.814166
492 520.7831 100.17361 7.101332 8.163069 5.637773 6.563469 6.770562 7.160970 7.515217 7.398073 6.871015
493 521.1423 100.14723 7.135492 8.207646 5.667622 6.580534 6.789372 7.185225 7.565259 7.425673 6.913337
494 521.5014 100.10268 7.184171 8.255644 5.717025 6.614362 6.824372 7.234381 7.605078 7.447946 6.953581
495 521.8605 100.02624 7.224637 8.293369 5.764591 6.666955 6.876897 7.281836 7.641654 7.473445 7.002897

Here, I am trying to look at the average value for all rows >520 and <522:

sapply(dat, function(x, low=520, high=522) mean(x[x > low & x < high]))

But, I get NaN's instead of an average:
Wavelength WB X1 X22 X23 X44 X45 X66 X67 X88 X89
520.9627 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

When I try the other bit of code, to average the 5 rows nearest 520 nm:
sapply(dat, function(x, nRows=5, Wavelength=520) mean(x[rank((x-Wavelength)^2) <= nRows]))

Wavelength WB X1 X22 X23 X44 X45 X66 X67 X88 X89
520.0646 215.5135 535.5746 550.4109 431.5777 512.0359 501.6085 444.1466 452.9682 540.3812 497.8203

I also get the same results when I try:
sapply(dat, function(Wavelength, low=520, high=522) mean(Wavelength[Wavelength > low & Wavelength < high]))


These values aren't anywhere near the 5-10% I know they should be.

Any ideas? They are greatly appreciated.
 

Jake

Cookie Scientist
#4
The functions both return averages for each column. From what I saw, you didn't say anything about how you might want to combine these column-wise averages, so they are not combined. Honestly, when you say "Average the ____ values closest to ____ nm for each sample, and save this as a vector," it is not totally clear which values in the matrix you want us to average over. I assumed it was the values in each column.

So in the first case, only column 1 (Wavelength) contains any values between 520 and 522, so only column 1 returns a non-NaN value; the rest are NaNs. This is exactly what we expect to happen given the snippet of data you showed me. For each column, we can't compute the average of the rows with values between 520 and 522 if that column doesn't contain any such values.
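To make that concrete, here is a toy sketch (made-up numbers) reproducing the NaN behavior:
Code:
# Only the Wavelength column contains values between 520 and 522, so every
# other column selects zero values, and mean() of an empty vector is NaN.
toy <- data.frame(Wavelength = c(520.1, 520.5, 521.9), X1 = c(7.0, 7.1, 7.2))
sapply(toy, function(x) mean(x[x > 520 & x < 522]))
#> Wavelength         X1
#>   520.8333        NaN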

The second example also appears to be working exactly as expected, or at least exactly as I expected. You don't actually say what is wrong in this case except for something about something not being 5%-10% different from something else (???). At the top of your post when you show some example rows and say "You can see these values are about 5-10% for the different samples," I have no idea what you're talking about here. I see a whole ton of numbers and it's not clear what things you are saying are 5%-10% different from what other things.

In your third example, all you did was take the code from the first example and rename the "x" variable to "Wavelength." It is entirely unclear why you think this should change the output in any way; the name of a function argument has no effect on what the function computes...

If these functions are not behaving the way you thought they should, then clearly there has been a misunderstanding. Perhaps you can clarify your expectations. Maybe you can give an example with a much smaller dataset: show the input data, compute the expected results by hand, and then show that output to us, so we can clearly see the expected mapping between inputs and outputs.

P.S. I do appreciate your attempt to provide reproducible example data; however, I did not and will not download a random Excel file posted on an internet forum... perhaps you can use dput() to post any example datasets (LINK). And if you do, make them much smaller! There is no need to provide an unreadably large dataset just to show what computations you want done.
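For instance, a minimal sketch of sharing a small slice of the data (assuming "dat" holds the attached data):
Code:
# dput() prints an object as runnable R code that others can paste
# directly into their own R session
dput(head(dat[, 1:3], 3))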

EDIT

I see that you created another thread in the R forum describing the same exact problem that you have here. Please don't needlessly duplicate threads... you can continue describing your problem in the same thread.

Your post in the other thread makes it clear that you want to compute means of rows for each column based only on the values found in the first column (not based on the values in each particular column). Okay. Here are slight modifications of the previous code (untested) that should do this:
Code:
# For each sample column: average the nRows values whose wavelength
# (column 1 of dat) is closest to nm
sapply(dat[,-1], function(x, nRows=5, nm=970) mean(x[rank((dat[,1]-nm)^2) <= nRows]))
# For each sample column: average the values whose wavelength (column 1
# of dat) falls strictly between low and high
sapply(dat[,-1], function(x, low=969, high=971) mean(x[dat[,1] > low & dat[,1] < high]))
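As a follow-up sketch (untested, with a hypothetical wrapper name), the first of these can be wrapped into a reusable function and used to compute the Water Index from post #1, with R-legal object names:
Code:
# Hypothetical helper: for each sample column, average the nRows values whose
# wavelength (first column of dat) is closest to the target wavelength nm.
avg_nearest <- function(dat, nm, nRows = 5) {
  sapply(dat[, -1], function(x) mean(x[rank((dat[, 1] - nm)^2) <= nRows]))
}

# Water Index = reflectance at 970 nm / reflectance at 900 nm
wi <- avg_nearest(dat, 970) / avg_nearest(dat, 900)

# Object names cannot start with a number, so use e.g. avg970_3 instead of 970_avg3
avg970_3 <- avg_nearest(dat, 970, nRows = 3)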
 
#5
I'm really disappointed in this accusatory response. I thought this was going to be a supportive, encouraging forum for new users and was really excited yesterday to see that someone was trying to help me out. I'm pursuing a graduate degree in plant genetics, not statistics, and am trained in different software. I feel ashamed that I spoke so highly of this forum to my colleagues today, and hope I can find another resource where users exercise better etiquette in the future.
 

Jake

Cookie Scientist
#6
I guess you missed the part of the thread where a random stranger on the Web provided you with a solution to your highly specific problem, within a day of your asking, for free. You're welcome.
 

trinker

ggplot2orBust
#8
Yeah, don't read into text-based messages the way you would a conversation in person. You're missing all the other modalities of communication, and reading into text just leads to, well, disappointment, I guess. Instead, focus on the problem, take the free advice that's given, and do your best to get better.

Being questioned and challenged is great for learning. Jake has questioned me in similar fashion several times. It's not an attack; it's an intellectual challenge that, if taken up, will cause you to grow. I promise you no one puts that much time into a response (or, in this case, Jake's multiple responses) if they want you to fail at what you're doing.

Because modality, embodiment, communication, and emotion are part of my research agenda, here's a related paper on reading emotion into emails: http://www.socsci.uci.edu/ssarc/internship/webdocs/session03/02-ByronArticle.pdf

sarahgrogan said:
I'm pursuing a graduate degree in plant genetics, not statistics, and am trained in different software.
I'm a humble elementary teacher pursuing a degree in literacy, and I too have been exposed to stats and R. Both are difficult but have been amazing tools in my work studying student learning. In my field, and even more so in yours, statistics and the content are inseparable.

So let's put this behind us, learn from it, and move forward. We're happy to have you here, and your expertise in your field makes us more diverse as a learning community. People here want to help; they want to teach and learn from others. Please continue to ask questions, and perhaps answer a few other people's questions as well.
 
#9
sarahgrogan said:
I'm really disappointed in this accusatory response. I thought this was going to be a supportive, encouraging forum for new users and was really excited yesterday to see that someone was trying to help me out. I'm pursuing a graduate degree in plant genetics, not statistics, and am trained in different software. I feel ashamed that I spoke so highly of this forum to my colleagues today, and hope I can find another resource where users exercise better etiquette in the future.
Jake's comments are pure gold. He puts 2-3 hours into many of his comments. If I were you, I would click the Thanks button at the lower-left corner of each of his comments to officially thank him for the time he spent on you. You were lucky today for the very good help you received... for free. :)

[temp post Dason] :D
 

Dason

Ambassador to the humans
#10
Jake's comments are pure gold. He puts 2-3 hours into many of his comments. If I were you, I would click the Thanks button at the lower-left corner of each of his comments to officially thank him for the time he spent on you. You were lucky today for the very good help you received... for free. :)

[temp post Dason] :D
Now it's a forever post :D