How to identify (x,y) values from plots

#1
Hi,

I'm new with R and am using it within the program Alteryx. Specifically, I'm running a Forest model and am trying to interpret the "effect" of each important variable against my regressor individually.

I see I can plot the points via the partialPlot() function.
This is almost what I need, except I don't want the actual plots, I want the coordinates of the data in the plots, specifically the (x,y) parameters.

Here is my sample code to get the plots, I just need to know how to modify this to not only show me the plots, but also output the data in a table that I can use and manipulate downstream. I want to create my own plots in a different program.
Thanks in advance for any help!

Code:
library(randomForest)
data(airquality)
airquality <- na.omit(airquality)
set.seed(131)
ozone.rf <- randomForest(Ozone ~ ., airquality, importance=TRUE)
imp <- importance(ozone.rf)
impvar <- rownames(imp)[order(imp[, 1], decreasing=TRUE)]
op <- par(mfrow=c(2, 3))
for (i in seq_along(impvar)) {
partialPlot(ozone.rf, airquality, impvar[i], xlab=impvar[i],
main=paste("Partial Dependence on", impvar[i]),
ylim=c(30, 70))
}
par(op)
 
Last edited:

trinker

ggplot2orBust
#2
I can point you in the correct direction but it sounds like you're newer to R and need some understanding on how to look at and grab pieces from R objects. Now ozone.rf is actually a list with a special print method that makes it look pretty and trim. But you can access all its goodness.

1. Look at what the object really is with:
Code:
names(ozone.rf)
str(ozone.rf)
2. Index the pieces with:
http://cran.r-project.org/doc/manuals/R-intro.html#Lists

One last thing when you're posting code, dataframes or computer output it's helpful to wrap this information in code tags by:
  1. either clicking the pound (#) sign icon or
  2. wrap with [NOPARSE]
    Code:
    some code
    [/NOPARSE]

which produces:
Code:
some code
For more see this (LINK)
 
#3
Trinker - thanks for you reply. I've modified my original post with the Code tags. Thanks.

Unfortunately, I'm still lost on how to extract my desired data. The "str(ozone.rf)" function is very eye opening, but I don't see the data here that I need to extract or "index the pieces". Forgive my ignorance.

The data used to create the plots seems, to me at least, come from performing additional analysis on the "ozone.rf" object, therefore it wouldn't appear in the str(ozone.rf), right?

So, I may be going way off base here, but I modified my original code and created an object, "Test", which seems to output the data I desire, however only for the first most important variable. I need each important variable to have (x,y) coordinates, all in the same table/stream/output. I hope I'm making sense. I really do appreciate any and all help provided.

Code:
library(randomForest)
data(airquality)
airquality <- na.omit(airquality)
set.seed(131)
ozone.rf <- randomForest(Ozone ~ ., airquality, importance=TRUE)
imp <- importance(ozone.rf)
impvar <- rownames(imp)[order(imp[, 1], decreasing=TRUE)]
op <- par(mfrow=c(2, 3))
for (i in seq_along(impvar)) 
#modified code
Test <-
#/modified code
{partialPlot(ozone.rf, airquality, impvar[i], xlab=impvar[i],
main=paste("Partial Dependence on", impvar[i]),
ylim=c(30, 70))
}
par(op)
Then I run:
Code:
Str(Test)
 

trinker

ggplot2orBust
#4
I started grabbing what you're after but I don't actually know what you're after so here is the code I started messing with to get you started:

Code:
names(ozone.rf)
randomForest:::plot.randomForest
randomForest:::partialPlot.randomForest
ozone.rf[["y"]]


test <- 
function (x, pred.data, x.var, which.class, w, plot = TRUE, add = FALSE, 
    n.pt = min(length(unique(pred.data[, xname])), 51), rug = TRUE, 
    xlab = deparse(substitute(x.var)), ylab = "", main = paste("Partial Dependence on", 
        deparse(substitute(x.var))), ...) 
{

    classRF <- x$type != "regression"
    if (is.null(x$forest)) 
        stop("The randomForest object must contain the forest.\n")
    x.var <- substitute(x.var)
    xname <- if (is.character(x.var)) 
        x.var
    else {
        if (is.name(x.var)) 
            deparse(x.var)
        else {
            eval(x.var)
        }
    }
    xv <- pred.data[, xname]
    n <- nrow(pred.data)
    if (missing(w)) 
        w <- rep(1, n)
    if (classRF) {
        if (missing(which.class)) {
            focus <- 1
        }
        else {
            focus <- charmatch(which.class, colnames(x$votes))
            if (is.na(focus)) 
                stop(which.class, "is not one of the class labels.")
        }
    }
browser()
    if (is.factor(xv) && !is.ordered(xv)) {
        x.pt <- levels(xv)
        y.pt <- numeric(length(x.pt))
        for (i in seq(along = x.pt)) {
            x.data <- pred.data
            x.data[, xname] <- factor(rep(x.pt[i], n), levels = x.pt)
            if (classRF) {
                pr <- predict(x, x.data, type = "prob")
                y.pt[i] <- weighted.mean(log(ifelse(pr[, focus] > 
                  0, pr[, focus], .Machine$double.eps)) - rowMeans(log(ifelse(pr > 
                  0, pr, .Machine$double.eps))), w, na.rm = TRUE)
            }
            else y.pt[i] <- weighted.mean(predict(x, x.data), 
                w, na.rm = TRUE)
        }
        if (add) {
            points(1:length(x.pt), y.pt, type = "h", lwd = 2, 
                ...)
        }
        else {
            if (plot) 
                barplot(y.pt, width = rep(1, length(y.pt)), col = "blue", 
                  xlab = xlab, ylab = ylab, main = main, names.arg = x.pt, 
                  ...)
        }
    }
    else {
        if (is.ordered(xv)) 
            xv <- as.numeric(xv)
        x.pt <- seq(min(xv), max(xv), length = n.pt)
        y.pt <- numeric(length(x.pt))
        for (i in seq(along = x.pt)) {
            x.data <- pred.data
            x.data[, xname] <- rep(x.pt[i], n)
            if (classRF) {
                pr <- predict(x, x.data, type = "prob")
                y.pt[i] <- weighted.mean(log(ifelse(pr[, focus] == 
                  0, .Machine$double.eps, pr[, focus])) - rowMeans(log(ifelse(pr == 
                  0, .Machine$double.eps, pr))), w, na.rm = TRUE)
            }
            else {
                y.pt[i] <- weighted.mean(predict(x, x.data), 
                  w, na.rm = TRUE)
            }
        }
        if (add) {
            lines(x.pt, y.pt, ...)
        }
        else {
            if (plot) 
                plot(x.pt, y.pt, type = "l", xlab = xlab, ylab = ylab, 
                  main = main, ...)
        }
        if (rug && plot) {
            if (n.pt > 10) {
                rug(quantile(xv, seq(0.1, 0.9, by = 0.1)), side = 1)
            }
            else {
                rug(unique(xv, side = 1))
            }
        }
    }
    invisible(list(x = x.pt, y = y.pt))
}

for (i in seq_along(impvar)) {
    test(ozone.rf, airquality, impvar[i], xlab=impvar[i],
    main=paste("Partial Dependence on", impvar[i]),
    ylim=c(30, 70))
}
PS I mostly stopped messing because I didn't enjoy reading the code from the randomForest package much.
 
#5
Thanks for work, however, it still doesn't work for me.

I've attached a screen shot I took of the 5 most important variables and the plots derived from my original post's code - partialPlot(). It is essentially these plots that I would like to retrieve the data.


My poor attempt of trying to get this data resulted in this structure list: It's not the most reasonable output I'm looking for, but it in fact does have the x values and y values used to make the plot of the variable "Day" from my attached picture.
I just wish I had this same list for all the important variables, albeit in a more readable manner, but the structure below can work for me.


structure(list(x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
30, 31),

y = c(48.1147595475563, 42.8428368796405, 41.5167727226263,
41.4925832903369, 41.3023780565317, 41.313065386719, 41.4437245484782,
42.2206116141152, 42.5068853262889, 41.3705298631566, 40.6281081801673,
39.2997257584599, 39.1239926610768, 39.2053767767714, 39.1037795438242,
39.5093202423149, 39.9498305474252, 39.4948548573811, 39.5126724892987,
39.4687776488539, 39.0677724150487, 39.0955472648985, 40.5740589658852,
43.3316752250015, 49.3108668472757, 46.1778781943371, 45.6596944006533,
45.8560040102629, 46.1934238471739, 47.9718691549192, 48.1891804844306
)), .Names = c("x", "y"))
 

trinker

ggplot2orBust
#7
This is a hack but try:

Code:
out <- list()
for (i in seq_along(impvar)) {
    Test <- NULL
    Test <- partialPlot(ozone.rf, airquality, impvar[i], xlab=impvar[i],
        main=paste("Partial Dependence on", impvar[i]), ylim=c(30, 70))
 
    out[[impvar[i]]] <- Test
}
 
#9
Your latest code works! Thank you! The out table has all the (x,y) values for all the important variables! I really appreciate your help.

However, typing out[[]] results in an error for me.
Error in out[[]] : invalid subscript type 'symbol'
 

trinker

ggplot2orBust
#10
Well that's because I didn't put in the pieces between the brackets. This is called indexing a list and I'd recommend you read up on it:

http://cran.r-project.org/doc/manuals/R-intro.html#Lists

This is where the names and str functions are helpful. You can use a numeric or character index.

This should get you started.
Play with

Code:
names(out)
out[["Temp"]]
out[[1]]
out[[c(1, 1)]]
out[[c(1, 2)]]
out[["Temp"]][["y"]]
out[[c("Temp", "y")]]
Yields
Code:
> names(out)
[1] "Temp"    "Wind"    "Solar.R" "Month"   "Day"    
> out[["Temp"]]
$x
 [1] 57.00000 58.05263 59.10526 60.15789 61.21053 62.26316 63.31579 64.36842
 [9] 65.42105 66.47368 67.52632 68.57895 69.63158 70.68421 71.73684 72.78947
[17] 73.84211 74.89474 75.94737 77.00000 78.05263 79.10526 80.15789 81.21053
[25] 82.26316 83.31579 84.36842 85.42105 86.47368 87.52632 88.57895 89.63158
[33] 90.68421 91.73684 92.78947 93.84211 94.89474 95.94737 97.00000

$y
 [1] 30.29440 30.49024 30.47367 30.78293 30.81787 30.90595 30.90549 31.01139
 [9] 31.32632 31.42744 31.38043 31.24794 31.28699 31.37103 31.39030 31.18292
[17] 31.29159 31.54717 31.83141 32.25970 37.96850 41.38459 40.91690 41.73817
[25] 40.76564 48.21323 52.71842 54.50170 54.33639 58.71413 62.53354 62.69456
[33] 62.78035 62.73559 62.59925 62.72851 62.65659 62.36114 62.32743

> out[[1]]
$x
 [1] 57.00000 58.05263 59.10526 60.15789 61.21053 62.26316 63.31579 64.36842
 [9] 65.42105 66.47368 67.52632 68.57895 69.63158 70.68421 71.73684 72.78947
[17] 73.84211 74.89474 75.94737 77.00000 78.05263 79.10526 80.15789 81.21053
[25] 82.26316 83.31579 84.36842 85.42105 86.47368 87.52632 88.57895 89.63158
[33] 90.68421 91.73684 92.78947 93.84211 94.89474 95.94737 97.00000

$y
 [1] 30.29440 30.49024 30.47367 30.78293 30.81787 30.90595 30.90549 31.01139
 [9] 31.32632 31.42744 31.38043 31.24794 31.28699 31.37103 31.39030 31.18292
[17] 31.29159 31.54717 31.83141 32.25970 37.96850 41.38459 40.91690 41.73817
[25] 40.76564 48.21323 52.71842 54.50170 54.33639 58.71413 62.53354 62.69456
[33] 62.78035 62.73559 62.59925 62.72851 62.65659 62.36114 62.32743

> out[[c(1, 1)]]
 [1] 57.00000 58.05263 59.10526 60.15789 61.21053 62.26316 63.31579 64.36842
 [9] 65.42105 66.47368 67.52632 68.57895 69.63158 70.68421 71.73684 72.78947
[17] 73.84211 74.89474 75.94737 77.00000 78.05263 79.10526 80.15789 81.21053
[25] 82.26316 83.31579 84.36842 85.42105 86.47368 87.52632 88.57895 89.63158
[33] 90.68421 91.73684 92.78947 93.84211 94.89474 95.94737 97.00000
> out[[c(1, 2)]]
 [1] 30.29440 30.49024 30.47367 30.78293 30.81787 30.90595 30.90549 31.01139
 [9] 31.32632 31.42744 31.38043 31.24794 31.28699 31.37103 31.39030 31.18292
[17] 31.29159 31.54717 31.83141 32.25970 37.96850 41.38459 40.91690 41.73817
[25] 40.76564 48.21323 52.71842 54.50170 54.33639 58.71413 62.53354 62.69456
[33] 62.78035 62.73559 62.59925 62.72851 62.65659 62.36114 62.32743
> out[["Temp"]][["y"]]
 [1] 30.29440 30.49024 30.47367 30.78293 30.81787 30.90595 30.90549 31.01139
 [9] 31.32632 31.42744 31.38043 31.24794 31.28699 31.37103 31.39030 31.18292
[17] 31.29159 31.54717 31.83141 32.25970 37.96850 41.38459 40.91690 41.73817
[25] 40.76564 48.21323 52.71842 54.50170 54.33639 58.71413 62.53354 62.69456
[33] 62.78035 62.73559 62.59925 62.72851 62.65659 62.36114 62.32743
> out[[c("Temp", "y")]]
 [1] 30.29440 30.49024 30.47367 30.78293 30.81787 30.90595 30.90549 31.01139
 [9] 31.32632 31.42744 31.38043 31.24794 31.28699 31.37103 31.39030 31.18292
[17] 31.29159 31.54717 31.83141 32.25970 37.96850 41.38459 40.91690 41.73817
[25] 40.76564 48.21323 52.71842 54.50170 54.33639 58.71413 62.53354 62.69456
[33] 62.78035 62.73559 62.59925 62.72851 62.65659 62.36114 62.32743
It's ok to be a beginner but reading the CRAN manuals will help get you where you want to be as an R user (particularly the first 2).