Today I Learned: ____


Ambassador to the humans
Oh I didn't know you could use negative indexing in head! Very nice.

Also - TIL: Vectorize is awesome. Seriously. How many times have you written a function and realized later that you didn't write it to be vector friendly but you wanted to use it in a way that requires it to be vector friendly? Usually I just rewrite the function more intelligently (which isn't a bad thing to do...) but you could also (apparently) just toss it into Vectorize. Awesome.
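A minimal sketch of the idea (the function here is made up for illustration):

```r
# A scalar-only helper: if() needs a length-1 condition, so this
# was never written to be vector friendly
describe <- function(x) if (x < 0) "negative" else "non-negative"

# Vectorize() wraps it with mapply() so it maps over whole vectors
vdescribe <- Vectorize(describe)
vdescribe(c(-1, 2, 0))
# [1] "negative"     "non-negative" "non-negative"
```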


Probably A Mammal
I also learned how to take a vector of widths and create start and end points based on those widths (good for manipulating fixed-width data). I think back to my C programming days: you would do a simple for loop and keep a running total of the positions. So if you have the string "Ilikecheeseandpie" and want to break it up by the widths c(1, 4, 6, 3, 3), you need the vector of start positions in the string, not the width of each word: c(1, 2, 6, 12, 15).

To generate this, just think about what we need to do. Start with the first location (1). Then add the first width to it (1 + 1 = 2). Then add the next width (2 + 4 = 6), and so on. In the procedural way I used (very slow in R), you have to keep track of these positions yourself; I would usually build two vectors, start and end. This example showed me that you only need to create the start positions: the end positions are just the starts plus the widths minus one (a vectorized operation, which adds a little speed). Creating the start vector, though, is so simple using cumsum, whose usefulness I never think about.

cumsum(c(1, widths))
generates all your start positions plus one extra value at the end (one past the final character), which you don't need. Thus, with the head trick above, we can tie it all together.

head(cumsum(c(1, widths)), -1)
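Tying it together on the example string (a small sketch; note end = start + width - 1):

```r
s      <- "Ilikecheeseandpie"
widths <- c(1, 4, 6, 3, 3)

starts <- head(cumsum(c(1, widths)), -1)  # 1  2  6 12 15
ends   <- starts + widths - 1             # 1  5 11 14 17

# substring() is vectorized over its start/end arguments
substring(s, starts, ends)
# [1] "I"      "like"   "cheese" "and"    "pie"
```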


Ambassador to the humans
Better yet:
> library(rbenchmark)  # benchmark() comes from the rbenchmark package
> x <- rnorm(100000)
> benchmark(head(x, -1), x[-length(x)])
           test replications elapsed relative user.self sys.self user.child sys.child
1   head(x, -1)          100    0.23 1.000000      0.24        0         NA        NA
2 x[-length(x)]          100    0.62 2.695652      0.63        0         NA        NA
I like the version using head more AND it appears to be faster. Double nice.


Probably A Mammal
I got similar results, but I also tested what happens if we exclude more than just the last element. Turns out, head's edge is lost! The two approaches stay pretty close regardless, and the gap doesn't change dramatically whether we're excluding 2 or 2000.

> benchmark(head(z, -2), z[1:(length(z)-2)], replications=1000)
                  test replications elapsed relative user.self sys.self user.child sys.child
1          head(z, -2)         1000    3.60 1.034483      2.98     0.61         NA        NA
2 z[1:(length(z) - 2)]         1000    3.48 1.000000      2.97     0.52         NA        NA

> benchmark(head(z, -1), z[-length(z)], replications=1000)
           test replications elapsed relative user.self sys.self user.child sys.child
1   head(z, -1)         1000    3.50 1.000000      2.97     0.52         NA        NA
2 z[-length(z)]         1000    3.77 1.077143      3.20     0.54         NA        NA


TIL that hash tables are extremely fast and not too hard to set up for looking up values (LINK)
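The linked approach isn't quoted here, but one common way to get a hash table in R is an environment (environments hash their names by default); a minimal sketch:

```r
h <- new.env(hash = TRUE)  # a hashed key/value store

h[["apple"]] <- 1
h[["pear"]]  <- 2
assign("plum", 3, envir = h)  # equivalent assignment syntax

h[["pear"]]                                  # look up a value
exists("plum", envir = h, inherits = FALSE)  # key membership test
ls(h)                                        # keys currently stored
```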

My R brain grew 10 sizes today and yesterday (still fits in a thimble).


Don't laugh because this is so simple, but I love it. Since I was knee high to a grasshopper I've been renaming a variable in a data frame by:
names(dataframe) <- c('var1', 'var2', 'var3', '')
or with a function from a package (reshape, I think).

I realized today that I can home in on the specific variable name with indexing and leave everything else alone.

names(mtcars)[names(mtcars) == 'hp'] <- ''
names(mtcars)[2] <- ''
The first, named version is longer to write but lets you change the name without knowing the variable's position in the data frame.

EDIT: I went to add this to my notes I keep on R (going on 180 pages worth now) and I found I've already got it there. One of the first things I learned. But this forum is for things we forgot and remembered too. :)


Probably A Mammal
OMG That looks awesome. I heard about this Google API, and when it was brought up in that video Trinker linked on that chatbox, I was like "I should definitely learn this." Now lo and behold, there is already an R package going that route.


Probably A Mammal
When I started reading up on the data.table package the other day (week?), they mentioned that the motivation for it was to improve upon the A[X] formulation. What is that? It's where you have a matrix A and you want to access a subset of it by specifying another matrix X. This is why with data.table objects you can do something like dt[J(...)] to access their contents, and supposedly quickly at that. (Supposedly, because we found out it didn't go as quickly as we thought on some other simple operations.)

I honestly had no idea how to even look up this A[X] thing, and Jake inadvertently showed me a way to take advantage of it by using double subsets: df[-1][df[-1] < 0] subsets the frame df, excluding its first column, with a boolean matrix marking the fields (in everything but the first column) that meet a given criterion. We're going to see something similar below, except I'm working entirely with numeric indices.

So, I decided to play around with matrices in R.

Take the 2D 5x5 matrix of position values

A <- matrix(c(seq(11, 51, 10), seq(12, 52, 10), seq(13, 53, 10),
              seq(14, 54, 10), seq(15, 55, 10)), 5, 5)
#      [,1] [,2] [,3] [,4] [,5]
# [1,]   11   12   13   14   15
# [2,]   21   22   23   24   25
# [3,]   31   32   33   34   35
# [4,]   41   42   43   44   45
# [5,]   51   52   53   54   55
Suppose I want to grab out that chunk

     [,2] [,3]
[2,]   22   23
[3,]   32   33
Then I need to define a matrix of position values. For a 2D object, I need a 2D matrix, one that lists the pairs for each position (i.e., the first column lists the row position and the second column lists the column position). It is like taking pairs of values

 2 2
 3 2
 2 3
 3 3
Now, this is not a particularly apt way to code it, but it illustrates the point

pairs <- c(
  c(2, 2),
  c(3, 2),
  c(2, 3),
  c(3, 3)
)
X <- matrix(pairs, 4, 2, byrow = TRUE)
#      [,1] [,2]
# [1,]    2    2
# [2,]    3    2
# [3,]    2    3
# [4,]    3    3

A[X]
# [1] 22 32 23 33
These are clearly the points we wanted, but they come back as a flat vector, so they need to be reshaped into the block we asked for.

matrix(A[X], 4/2, 2)  # Notice '4' is the number of points, and '2' is both the divisor and the number of columns. 
#      [,1] [,2]
# [1,]   22   23
# [2,]   32   33
I suspect there is a similar protocol for drilling into higher-dimensional R arrays, merely specifying ordered triples, etc. It makes me curious whether this can be extended to lists. They do not have the tabular form per se, but this is, in essence, nothing more than feeding points to a function to extract values from a given space. Here we are extracting matrix values; if it applies to lists, we can extract more complex things. That could pose problems, which is why I wonder if it works. Conceptually, it is intriguing to me. Maybe I'll learn more about that tomorrow!
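That suspicion checks out for arrays: an n-dimensional array takes an n-column index matrix. A quick sketch with a 3D array:

```r
A3 <- array(1:24, dim = c(2, 3, 4))  # 2 rows, 3 columns, 4 slices

# one ordered triple (row, col, slice) per row of X3
X3 <- rbind(c(1, 2, 1),   # A3[1, 2, 1]
            c(2, 3, 4))   # A3[2, 3, 4]

A3[X3]
# [1]  3 24
```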

In closing, let's get a bit more complicated. Suppose I want to grab the points around the center (33), but not including the center. Then I'm interested in

22 23 24
32    34
42 43 44
This requires 9 points with an empty slot for what is being excluded.

pairs <- c(2, 2,   3, 2,   4, 2,
           2, 3,  NA, NA,  4, 3,
           2, 4,   3, 4,   4, 4)

X <- matrix(pairs, 9, 2, byrow = TRUE)  # Same pattern as before: matrix(pairs, npts, dim, byrow = TRUE)
matrix(A[X], 9/3, 3)  # Pattern: matrix(A[X], npts/dim, dim)
#      [,1] [,2] [,3]
# [1,]   22   23   24
# [2,]   32   NA   34
# [3,]   42   43   44
Jake's boolean matrix inspired me to further investigate this A[X] operation, but I focused on position matrices themselves. I did that largely for pedagogical reasons. The fact is, it is probably not optimal or even convenient. However, when querying our table A, we may not have a condition to utilize that'll generate the boolean matrix we desire. The above situation could actually be generated in that approach pretty simply (simple in other regards)

X <- matrix(FALSE, 5, 5)  # an all-FALSE mask with the dimensions of A
X[2:4, 2:4] <- TRUE       # flag the 3x3 block we want
X[3, 3] <- NA             # mark the exclusion point
matrix(A[X], 3, 3)        # apply the matrix subset and reshape
This is certainly nicer than having to specify a bunch of points. We still need to know what we're querying from the table, and that is the real brunt of the issue: how to specify queries of our table A through X. I use the term query because it makes sense both in the SQL-esque context and in the fact that the query is representable as a boolean matrix that 'answers' it.

The point method requires knowing the exact points. The boolean method lends itself to more generalization (ranges of points) and has the benefit of letting us think in terms of A: we start X, as above, as an all-FALSE mask the same shape as A, populate it with TRUEs according to our criteria, and then mark any exclusion points.

In reality, though, X may simply be an offspring, if you will, of A. Take the example I referenced at the outset: I wanted to adjust all the negative points within a 10x4 matrix that were not in the first column. Thus, I exclude the first column and query the matrix: df[-1][df[-1] < 0]. That gives me all the relevant points, but there I was merely accessing the points to change them. Here I have investigated how to generate a subset that maintains the structure of the table.


Ambassador to the humans
Why did you construct the pairs by hand when something like this would work:
> A <- matrix(c(seq(11, 51, 10), seq(12, 52, 10), seq(13, 53, 10),
+               seq(14, 54, 10), seq(15, 55, 10)), 5, 5)
> A[2:3, 2:3]
     [,1] [,2]
[1,]   22   23
[2,]   32   33


Cookie Scientist
The matrix subsetting thing is interesting, but like Dason hinted, it only seems obviously useful if you need to grab elements from a matrix according to some complex rule. If you just need to grab a rectangular block or something simple like that, it's not clear that you save any time by constructing the second matrix of indices. But it's good to know.

What you and I briefly talked about was slightly different: what I called double subsetting, which we might more precisely describe as taking the subset of a structure that you just got by subsetting another structure. It follows fairly straightforwardly from the realization that subsetting a matrix A with something like A[2,] is equivalent to explicitly calling the "[" function by writing "["(A, 2, ). So there's no reason we couldn't take this return value and feed it to another subsetting call, just as we might with any other combination of functions. If we wanted the 3rd element of the 2nd row, we could do it in the latter fashion with "["("["(A, 2, ), 3), or more cleanly with A[2, ][3]. Obviously double subsetting is not really useful in this example, since we could have just written A[2, 3]; I chose this simple example only for didactic purposes.
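A quick runnable check of that equivalence:

```r
A <- matrix(1:9, 3, 3)

# subsetting is an ordinary function call under the hood
identical(`[`(A, 2, ), A[2, ])  # the two spellings agree

# so the return value can be subset again
A[2, ][3] == A[2, 3]            # third element of the second row
```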


Probably A Mammal
Why did you construct the pairs by hand when something like this would work:
Purely pedagogical.


While you were doing double subsetting, it is also a matrix subset of A, where A is my df[-1] frame: the condition df[-1] < 0 defines our matrix X in this A[X] subsetting. As I discuss at the end of my post, unlike using a matrix of positional pairs, we used an entire matrix the size of A to specify which points to keep, no different than subsetting a vector with a boolean vector or with its positions (say, using which on the boolean vector). It is fundamentally different, no doubt, but the difference is important and the analogy carries over.

What I wonder is how it carries over. The use of a position vector to pick points out of a vector is easily translated from a boolean vector: we merely return the indices at which the equal-size boolean vector is TRUE. This isn't quite the same with a boolean matrix in matrix subsetting. There are two ways I can conceive of it: either there is a translation like the one I did above, from a boolean matrix into a 2-column matrix of position pairs, or it uses the same technique as the boolean vector conversion, since matrices are ultimately nothing more than a vector with dimensions. The results are the same, of course, but the mechanics of the process are still interesting to uncover.
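As it happens, R exposes that translation directly: which(..., arr.ind = TRUE) turns a logical matrix into exactly that 2-column matrix of position pairs, and, consistent with the vector-with-dimensions view, a logical matrix index behaves like a logical index over the flattened matrix:

```r
A <- matrix(1:9, 3, 3)
X <- A > 6                       # logical mask, same shape as A

idx <- which(X, arr.ind = TRUE)  # (row, col) pairs where X is TRUE
idx
#      row col
# [1,]   1   3
# [2,]   2   3
# [3,]   3   3

identical(A[X], A[idx])           # both subsets give 7 8 9
identical(A[X], A[as.vector(X)])  # the mask acts on the flattened vector
```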

The benefits of this? None, really. Though, as I said, it is the inspiration for the approach the data.table package takes to handling tables, and in a way it does follow the database view: treat these objects like tables that we run queries against to extract other tables. It is similar to comparing how R handles data types with the way other systems do. Matlab and SAS/IML use matrices as their base objects whereas R builds on vectors, yet they're not that different fundamentally. Python can go either way depending on the library: its built-in list type is a general container, loosely like an R vector, but with the numpy/scipy libraries you get more formal vector- and matrix-like objects.



Often I want to display a table and one or more graphics all at once. I've found hackish ways to "plot" tables before, but today I came across the gplots package. Its textplot() function prints what you'd see in the console to a graphics device. I'm not saying it's beautiful, but it's very functional (with some effort it could look pretty nice, though not LaTeX nice). For my purposes this package provides a very quick way to plot text displays from the console. Very cool.

Try it out:
library(gplots)  # provides textplot() and plotmeans()

# show the alphabet as a matrix
textplot(matrix(letters[1:26], ncol = 2))

### Make a nice 4-way display with two plots and two text summaries
par(mfrow = c(2, 2))  # 2x2 layout for the four panels
plot( Sepal.Length ~ Species, data=iris, border="blue", col="cyan",
    main="Boxplot of Sepal Length by Species" )
plotmeans( Sepal.Length ~ Species, data=iris, barwidth=2, connect=FALSE,
    main="Means and 95% Confidence Intervals\nof Sepal Length by Species")
info <- sapply( split(iris$Sepal.Length, iris$Species),
    function(x) round(c(Mean=mean(x), SD=sd(x), N=gdata::nobs(x)),2) )
textplot( info, valign="top" )
title("Sepal Length by Species")
reg <- lm( Sepal.Length ~ Species, data=iris )
textplot( capture.output(summary(reg)), valign="top")
title("Regression of Sepal Length by Species")

### Show how to control text color
cols <- c("red", "green", "magenta", "forestgreen")
mat <- cbind(name=cols, t(col2rgb(cols)), hex=col2hex(cols))
textplot(mat, col.data = matrix(cols, nrow = length(cols), byrow = FALSE, ncol = 5))

### Show how to manually tune the character size
reg <- lm( Sepal.Length ~ Species, data=iris )
text <- capture.output(summary(reg))

# do the plot and capture the character size used
textplot(text, valign="top")
Example output: (screenshot attached in the original forum post)


Probably A Mammal

I like that. It isn't great, but often people will still just take a screenshot of a display like that to put with a graphic just so you can see the data or some output along with the image. This gives you control over that to automate it. I'll have to remember this, and sort of want to know how it does it!