Today I Learned: ____

trinker

ggplot2orBust
Wow that is super freaking useful. How did I not know about this?

PS Post that one you have about glm in the share your code, that extends it more. (unless it words for glm too)

EDIT: It appears it works on glm too
 

trinker

ggplot2orBust
I learned that ave may be more useful than I thought. In the past I thought that ave only did mean but it can be supplied any function and doesn't aggregate but gives each observation that function score (they call it block wise). I did a poor job of explaining so here's an example:

Code:
with(warpbreaks, data.frame(tension, breaks,
    mean_breaks=ave(breaks, tension, FUN=sd)))
Which yields:
Code:
   tension breaks mean_breaks
1        L     26   16.446487
2        L     30   16.446487
3        L     54   16.446487
4        L     25   16.446487
5        L     70   16.446487
6        L     52   16.446487
7        L     51   16.446487
8        L     26   16.446487
9        L     67   16.446487
10       M     18    9.121009
11       M     21    9.121009
12       M     29    9.121009
13       M     17    9.121009
14       M     12    9.121009
15       M     18    9.121009
16       M     35    9.121009
17       M     30    9.121009
18       M     36    9.121009
19       H     36    8.352527
20       H     21    8.352527
21       H     24    8.352527
22       H     18    8.352527
23       H     10    8.352527
24       H     43    8.352527
25       H     28    8.352527
26       H     15    8.352527
27       H     26    8.352527
28       L     27   16.446487
29       L     14   16.446487
30       L     29   16.446487
31       L     19   16.446487
32       L     29   16.446487
33       L     31   16.446487
34       L     41   16.446487
35       L     20   16.446487
36       L     44   16.446487
37       M     42    9.121009
38       M     26    9.121009
39       M     19    9.121009
40       M     16    9.121009
41       M     39    9.121009
42       M     28    9.121009
43       M     21    9.121009
44       M     39    9.121009
45       M     29    9.121009
46       H     20    8.352527
47       H     21    8.352527
48       H     24    8.352527
49       H     17    8.352527
50       H     13    8.352527
51       H     15    8.352527
52       H     15    8.352527
53       H     16    8.352527
54       H     28    8.352527
 

bryangoodrich

Probably A Mammal
Looks interesting, but does it just return a plain vector as if you were to unlist a by(...) equivalent? That's what it looks like you just did to me, and the format (parameters) look identical.
 

Dason

Ambassador to the humans
But a by would only return a single row for each combination of the group factors. Ave returns as many rows as there are in each of the grouping factors. Just give it a try and you'll see what I mean.
 

bryangoodrich

Probably A Mammal
Okay, yeah. It's sort of doing an aggregation (or by) of breaks ~ tension and then merging that to the data set by tension

Code:
merge(warpbreaks, aggregate(breaks ~ tension, warpbreaks, sd), by = "tension")
Outputs

Code:
   tension breaks.x wool  breaks.y
1        H       21    A  8.352527
2        H       24    A  8.352527
3        H       18    A  8.352527
4        H       10    A  8.352527
5        H       43    A  8.352527
6        H       28    A  8.352527
7        H       15    A  8.352527
8        H       26    A  8.352527
9        H       36    A  8.352527
10       H       20    B  8.352527
11       H       21    B  8.352527
12       H       24    B  8.352527
13       H       17    B  8.352527
14       H       13    B  8.352527
15       H       15    B  8.352527
16       H       15    B  8.352527
17       H       16    B  8.352527
18       H       28    B  8.352527
19       L       26    A 16.446487
20       L       30    A 16.446487
21       L       54    A 16.446487
22       L       25    A 16.446487
23       L       70    A 16.446487
24       L       52    A 16.446487
25       L       51    A 16.446487
26       L       26    A 16.446487
27       L       67    A 16.446487
28       L       44    B 16.446487
29       L       27    B 16.446487
30       L       14    B 16.446487
31       L       29    B 16.446487
32       L       19    B 16.446487
33       L       29    B 16.446487
34       L       31    B 16.446487
35       L       41    B 16.446487
36       L       20    B 16.446487
37       M       21    A  9.121009
38       M       29    A  9.121009
39       M       17    A  9.121009
40       M       12    A  9.121009
41       M       18    A  9.121009
42       M       42    B  9.121009
43       M       26    B  9.121009
44       M       19    B  9.121009
45       M       16    B  9.121009
46       M       18    A  9.121009
47       M       35    A  9.121009
48       M       30    A  9.121009
49       M       36    A  9.121009
50       M       29    B  9.121009
51       M       39    B  9.121009
52       M       28    B  9.121009
53       M       21    B  9.121009
54       M       39    B  9.121009
 

bryangoodrich

Probably A Mammal
Seriously, I had no idea you could do this!

I was revising my ALSM project. Chapter 8 deals with qualitative predictors. The approach I used previously was terrible, largely due to my ignorance 2+ years ago. I wanted to know if there's a better way to color code and plot regression lines at different factor levels. There is!

The color coding is ridiculously easy. Previously I did two separate plots by subsetting and coloring each point overlay. That is as cumbersome as it sounds! Solution:

Code:
plot(Y ~ X1, df, col = X2)
That's it! Obviously, I understand how this works now: color will assign values based on the equally sized vector supplied. In this case, it's the factor in X2 that is either 0 or 1 and so my points are colored appropriately. If I wanted to be more clever, I could run X2 as an index on a color vector or subset thereof to choose which colors I want to select from that vector. The way I did it above gives me black and red, which is sufficient. For more complex schemes, I'd define a 'pal' (palette) variable that holds the colors I want to use, and then use X2 as an index inline. Not much different from what I initially suggested, other than I'm separating the subset of color choices from the plotting call itself.

Now, when it came to drawing the two lines for each level, I turned to abline of course. Initially I went the manually route: define abline(intercept, slope) based on the coefficients in my model. So it would be something like

Code:
coefs = coef(fit)
abline(coefs[1], coefs[2]); abline(coefs[1] + coefs[3], coefs[2])
This makes sense because the slope does not change, only the level with respect to the intercept: it's the intercept plus the factor level coefficient when that factor is present.

I wanted to know if there is a more direct way to handle this. I stumbled across this blog on a quick Google search.

The alternative to my approach above is to fit separate regressions for each of the data subsets. Then I can just do abline on each of those fitted models. It's much easier, but requires more models. Then it finishes with a way to do both models at once!

Code:
lm(Y ~ X2/X1 - 1, df)
This gives me direct intercepts and slope coefficients for each factor level in X2. The results are stupid easy to set up using the manual approach before, but now making direct reference to the relevant coefficient (i.e., no more summation is required).

Code:
# Populate the required data
df <- structure(list(Y = c(218L, 248L, 360L, 351L, 470L, 394L, 332L, 
321L, 410L, 260L, 241L, 331L, 275L, 425L, 367L, 140L, 277L, 384L, 
341L, 215L, 180L, 260L, 361L, 252L, 422L, 273L, 410L), X1 = c(100L, 
125L, 220L, 205L, 300L, 255L, 225L, 175L, 270L, 170L, 155L, 190L, 
140L, 290L, 265L, 105L, 215L, 270L, 255L, 175L, 135L, 200L, 275L, 
155L, 320L, 190L, 295L), X2 = structure(c(2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Line1", "Line2"), class = "factor")), .Names = c("Y", 
"X1", "X2"), row.names = c(NA, -27L), class = "data.frame")

# Define the models
fit  <- lm(Y ~ X1 * X2, df)
fit1 <- lm(Y ~ X1, subset(df, X2 == "Line1"))
fit2 <- lm(Y ~ X1, subset(df, X2 == "Line2"))
fitt <- lm(Y ~ X2/X1 -1, df)

par(mfrow = c(2,2))
# Subset Models Approach
plot  (Y ~ X1, subset(df, X2 == "Line1"), pch = 19, xlim=c(100, 350), ylim=c(100, 500)); abline(fit1)
points(Y ~ X1, subset(df, X2 == "Line2"), col = "red", pch = 19); abline(fit2, col = "red")

# Manual Approach
plot(Y ~ X1, df, col = X2, pch = 19, xlim=c(100, 350), ylim = c(100, 500))
abline(coef(fit)[[1]], coef(fit)[[2]])
abline(coef(fit)[[1]] + coef(fit)[[3]], coef(fit)[[2]] + coef(fit)[[4]], col = "red")

# New Approach
plot(Y ~ X1, df, col = X2, pch = 19, xlim=c(100, 350), ylim = c(100, 500))
abline(coef(fitt)[c(T, F)])                   # I could be explicit here, but odds and
abline(coef(fitt)[c(F, T)], col = "red")      # even refer to the 2 different models
As you can see, all the outputs are identical (and that can be checked by looking at the coefficients, too). I just thought that this was a very big improvement over my initial (manual) approach, that makes it much more difficult to handle complex model plotting. I think this new approach I learned both opens my eyes to the ways you can specify models in R and how it is then easier to handle plotting those results.
 

bryangoodrich

Probably A Mammal
Does anyone know what type of model that Y ~ X2/X1 -1 is or is called? I investigated it a little further. The model matrix makes sense.

Code:
model.matrix(fitt)
   X2Line1 X2Line2 X2Line1:X1 X2Line2:X1
1        0       1          0        100
2        0       1          0        125
3        0       1          0        220
4        0       1          0        205
5        0       1          0        300
6        0       1          0        255
7        0       1          0        225
8        0       1          0        175
9        0       1          0        270
10       0       1          0        170
11       0       1          0        155
12       0       1          0        190
13       0       1          0        140
14       0       1          0        290
15       0       1          0        265
16       1       0        105          0
17       1       0        215          0
18       1       0        270          0
19       1       0        255          0
20       1       0        175          0
21       1       0        135          0
22       1       0        200          0
23       1       0        275          0
24       1       0        155          0
25       1       0        320          0
26       1       0        190          0
27       1       0        295          0
attr(,"assign")
[1] 1 1 2 2
attr(,"contrasts")
attr(,"contrasts")$X2
[1] "contr.treatment"
The variable names suggest it's fitting a model Y ~ Line1 + Line2 + X1:Line1 + X1:Line2. To state that accurately, though, it's just the model frame above where Line1 and Line2 are just X2 w/o the intercept term. But I don't get the interaction term. I tried to recreate this with an interaction model as follows

Code:
fit <- lm(Y ~ X1*X2 - 1, df)
coefs(fit)
#         X1    X2Line1    X2Line2 X1:X2Line2 
#  1.3220488  7.5744646 97.9653278 -0.1766614
The X1 matches the fitt X2Line1:X1 interaction. The X2Line1 and X2Line2 match directly for obvious reasons. However, this model requires the summation of X1 and X1:X2Line2 to match the fitt X2Line2:X1 interaction.

Is there a way to recreate the Y ~ X2/X1 -1 model with the formula notation I'm familiar with? It seems like I need a way to specify "remove the intercept term" for the interaction X1:X2 analogous to how we do it with X2 alone--i.e., if the intercept were in there, then X2Line1 would just be the intercept value and X2Line2 would be that intercept amount less. So we remove the intercept so each factor level has a represented mean value in this model. The difference in coefficients I'm observing between the interaction model I defined above and the X2/X1 notation seems to be analogous in some ways to this. The X1 value is basically the "X2Line1:X1" and I want an X2Line2:X1. Instead, the interaction model has X1:X2Line2, which shows how much difference we have at the Line2 level from X1 at the Line1 level. Make sense? So in my mind, I'm perceiving this X2/X1 notation to basically be that "remove the intercept term" sort of maneuver. It lets the coefficients for those variables be fully represented, not represented as their difference (requiring summation). Make sense?

So my guess is there is no way to recreate the model using the +, *, or : notations I know you can use in formulas?
 

bryangoodrich

Probably A Mammal
Update: Looks like I solved my own question. I decided to explore the ANOVA of these models, which is quite telling. The equivalent model is to exclude the X1 variable entirely. Instead, we're modeling the factor X2 and its interaction with X1. This then exposes the 2 separate models for each factor level that I could have otherwise approached. This method gives it to me in one model with separate intercept terms contained therein. I'm not used to this nested stuff, so, it's still new to me, and TIL something else!

Code:
lm(Y ~ 0 + X2 + X2:X1, df)

# Call:
# lm(formula = Y ~ 0 + X2 + X2:X1, data = df)
# 
# Coefficients:
#    X2Line1     X2Line2  X2Line1:X1  X2Line2:X1  
#      7.574      97.965       1.322       1.145
I guess the next big thing I should learn in this regard, as one should learn when understanding any model, is when this model is appropriate.

Writing out the details of this model it makes sense how it works as containing the two separate models Y ~ X1 when X2 is Line1 and Y ~ X1 when X2 is Line2 (having different intercepts). The full expression is

\(Y = b_0 X_{2,1} + b_1 X_{2,2} + b_2 X_{2,1}X_1 + b_3 X_{2,2}X_1\)

So when \(X_2 = 1\) we get

\(Y = b_0 + b_2 X_1\)

and when \(X_2 = 2\) we get

\(Y = b_1 + b_3 X_1\)

The coefficients on the separate X2 levels are the separate intercept terms, and depending on which factor there is will determine which corresponding coefficient will remain on the appropriate X1. Neat.
 

Dason

Ambassador to the humans
TIL: "[<-" doesn't quite do what I thought it did. I just assumed that it combined assignment and indexing. Apparently that's only partially true and although it will modify the vector/matrix - the assignment is only temporary.

Typically this wouldn't cause any shock because R works copies but at some point the assignment needs to take place. For example:

Code:
> x <- 1:10
> "[<-"(x, 2, 0)
 [1]  1  0  3  4  5  6  7  8  9 10
> x # no change...
 [1]  1  2  3  4  5  6  7  8  9 10
> # I was expecting the same results as:
> x[2] <- 0
> x
 [1]  1  0  3  4  5  6  7  8  9 10
> x <- 1:10
> # This works though...
> "<-"("["(x, 2), 0)
> x
 [1]  1  0  3  4  5  6  7  8  9 10
> # But I'm guessing this is what's actually going
> # on when you do something like x[2] <- 0
> x <- 1:10
> "<-"(x, "[<-"(x, 2, 0))
> x
 [1]  1  0  3  4  5  6  7  8  9 10
 

bryangoodrich

Probably A Mammal
I'm surprised "[<-" makes sense, because you're trying to use it in a prefix fashion, but you're referencing two different functions. Odd.

If we think of assignment as A and index as B, then the notation makes perfect sense

A(B(x, 2), 0)

We index 'x' by 2. Thus, by function composition we pass that to the assignment statement, assigning that index of x the value of 0.

My guess is that "[<-" was sort of like calling a nested function. While the behavior took place for you to see, it was within the nested scope, and so nothing actually changed.

This made me think of trying "[<<-", to see if it would break out of the nested reference, but it says the function doesn't exist. So then "[<-" is actually its own function?? Still odd.
 

Dason

Ambassador to the humans
"[<-" is a function itself.

Code:
get("[<-")
get("[<-.data.frame")
although it seems like most of these "***<-" functions act this way:

Code:
> x <- 2
> get("class<-")(x, "hey")
[1] 2
attr(,"class")
[1] "hey"
> x
[1] 2
attr(,"class")
[1] "hey"
> "<-"(x, get("class<-")(x, "hey"))
> x
[1] 2
attr(,"class")
[1] "hey"
 

bryangoodrich

Probably A Mammal
It gets stranger ...

Code:
(x = seq(10))
#  [1]  1  2  3  4  5  6  7  8  9 10

"[<-"(x,2,0)  # No change
#  [1]  1  0  3  4  5  6  7  8  9 10

x
#  [1]  1  2  3  4  5  6  7  8  9 10

x[3] <- 0
x
#  [1]  1  2  0  4  5  6  7  8  9 10

"[<-"(x,2,0)  # Change!
#  [1]  1  0  0  4  5  6  7  8  9 10

x
#  [1]  1  0  0  4  5  6  7  8  9 10
Once you alter 'x' the "[<-" changes stick?? WTF ...
 

trinker

ggplot2orBust
TIL how to scrap the web on my own.

When i say scrape I mean go to multiple pages and extract information.

I pulled emoticons from a website. here's on page from that site: http://www.lingo2word.com/lists/emoticon_listH.html

This is a baby example as I'm thinking of including an emoticon data set with my qdap package. It's a baby example (as everything is straight forward and doesn't require the use of RCurl) and I'm sure BG and Dason could whip something up even better but it was my first crack at it.

Code:
y <- readLines(url("http://www.lingo2word.com/lists/"))  [COLOR="gray"]#get the list of all the site's web pages[/COLOR]
y <- strsplit(y[-c(1:8, 942:944)], '"')[COLOR="gray"]#eliminate rows that are nothing and split on "[/COLOR]
y <- do.call(rbind, y)                 [COLOR="gray"]#bind that stuff up[/COLOR]                      
a <- "emoticon"
b <- y[, 2][substring(y[, 2], 1, nchar(a))==a]  [COLOR="gray"]#get only the emoticon list[/COLOR] 

[COLOR="gray"]#b is the ending to all urls[/COLOR]                 

require(XML)
urla <- "http://www.lingo2word.com/lists/"    [COLOR="gray"]#begining of all the urls[/COLOR]         

z <- lapply(1:length(b), function(i) {        [COLOR="gray"]#scrape that data[/COLOR]         
        readHTMLTable(paste0(urla, b[i]), which=2)
    }
)


z2 <- lapply(z, function(x) x[, c(2, 4)])   [COLOR="gray"]#take only the 2nd and 4th item[/COLOR]         
d <- do.call(rbind, z2)                     [COLOR="gray"]#bind the dataframe[/COLOR]                                   
rownames(d) <- NULL                         [COLOR="gray"]#rename sequentialy[/COLOR]         
names(d) <-c("emoticon", "meaning")         [COLOR="gray"]#rename columns[/COLOR]         

d[, 1] <- gsub("^\\s+", "", d[, 1])         [COLOR="gray"]#find excessive white space[/COLOR]         
d[d[, 1]=="", 1] <- NA                      [COLOR="gray"]#if a cell is empty give it an NA[/COLOR]         
d <- na.omit(d)                             [COLOR="gray"]#Die missing values Die[/COLOR]  
d
 
Last edited:

Dason

Ambassador to the humans
TIL: Bioconductor's guidelines for package submission are about a million times more stringent than CRAN's.

Bioconductor:

Packages must satisfy the following checklist:
Pass R CMD build and R CMD check on all supported platforms (Windows, Macintosh, Linux) with no errors or warnings, using a recent R-devel. The result of R CMD build must be less than 2MB; R CMD check must complete within 5 minutes.
Contain a DESCRIPTION file with valid contact information, an informative title and description, correct license specification, appropriate biocViews terms, valid version number.
Contain a NAMESPACE that imports all symbols used in the package, and exports just those symbols the package author identifies as appropriate. Use of a NAMESPACE implies that appropriate packages are mentioned in the Imports: field of the DESCRIPTION file.
Contain a Sweave-style vignette that illustrates the major uses of the package. The vignette must be evaluated during package installation; a static vignette (e.g., PDF) is not acceptable.
Contain complete help pages. This includes accurate description of function parameter and return values, and meaningful examples.
Make use of appropriate existing packages (e.g., biomaRt, AnnotationDbi, Biostrings) and classes (e.g., ExpressionSet, AnnotatedDataFrame, RangedData, Rle, DNAStringSet) to avoid duplication of functionality available in other Bioconductor packages.
Contain no extraneous files (e.g., '.DS_Store', '.project', '.svn', etc.), files with invalid names (e.g., differing only in case), or code that cannot be distributed under the license specified by the author.
Packages must not already be available on CRAN.
Packages should have a descriptive name that is not already in use. See if it is by running biocLite("myPackageName"). You cannot have a package name that is case-insensitively equal to an existing package name in CRAN or Bioconductor.
Packages should also conform to the following:
Use S4 classes and methods.
Include an inst/NEWS file for providing users with information on package updates.

Authors receive initial feedback in 1 to 3 weeks. Packages will be checked for adherence to the Bioconductor package guidelines and the checklist below by a member of Bioconductor team. Package developers should consult the full Bioconductor Package Guidelines. Submission implies commitment to package maintenance across multiple release cycles.

And this is essentially what CRAN requires:
Does it pass R CMD check? K - you're good.
 

bryangoodrich

Probably A Mammal
0_0

Wow. That's about all I can say to that. At least you know you're getting a good product when it comes through Bioconductor! The Data Mining using R book we've talked about in the chatbox before makes use of their stuff.
 
Don't know where else to write this, logged on for first time in months. Just wanted to say thanks to anybody that helped me throughout my Msc statistics course. Names I remember from top of my head are Dason and bryan. Thanks a lot for your help with R and general stats :)
 

trinker

ggplot2orBust
Didn't know about the recursive argument to unlist. It's pretty nice for turning nested lists into one list (picked this up at SO):

Code:
A = data.frame( a = c(1:10), b = c(11:20) )
B = data.frame( a = c(101:110), b = c(111:120) )
C = data.frame( a = c(5:8), b = c(55:58) )

L = list(list(B,C),list(A),list(C,A),list(A,B,C),list(C))

unlist(L,recursive=F)