Bookclub: Data Mining with R

bryangoodrich

Probably A Mammal
#1
Here I want to begin a book club for talking about Data Mining with R. I know a few of us here at TS own this book or have looked at it. The format for the club is ongoing: you participate at will. If you don't have the book, check it out this summer and participate! The point is to have a repository of information, questions, and discussion on the contents. Questions may be theoretical, which anyone can answer, or specific to the book, which only those with a copy can help with. In any case, I hope this minimal format will produce more participation than we've seen in the past (you slackers!).

You can get the data from the website linked above, but better yet, just use their package:

Code:
install.packages("DMwR")
This gives you the data sets

Code:
algae                   Training data for predicting algae blooms
test.algae (testAlgae)  Testing data for predicting algae blooms
algae.sols (algaeSols)  The solutions for the test data set for
                        predicting algae blooms
GSPC                    A set of daily quotes for SP500
sales                   A data set with sale transaction reports
This covers the first three case studies, but not the last (microarray samples). For that one, you have to run this once:

Code:
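# biocLite() has been retired; current Bioconductor releases install through BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("ALL")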
Then you can access the data

Code:
library(Biobase)
library(ALL)
data(ALL)
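
Once everything is installed, a quick sanity check could look like the sketch below. It assumes the DMwR package installed cleanly on your version of R; the `have_dmwr` flag is just a name made up here so the snippet skips quietly otherwise. It only uses base functions to peek at the training data.

```r
# Sketch: peek at the algae training data, assuming DMwR is installed.
# have_dmwr is a made-up helper flag, not part of the package.
have_dmwr <- requireNamespace("DMwR", quietly = TRUE)

if (have_dmwr) {
  data(algae, package = "DMwR")  # loads the data frame `algae` into the workspace
  str(algae)                     # column names and types
  print(summary(algae))          # per-column summaries, including NA counts
} else {
  message("DMwR is not installed; run install.packages(\"DMwR\") first.")
}
```

The same pattern should work for the other data sets listed above by swapping in their names.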
 
#3
Hi

Have any of you worked through chapter 3 - Predicting Stock Markets?

It's a really good introduction to many useful R functions for predicting and testing. However, the final section leaves you a bit lost. Have any of you worked out how to obtain the predicted signal for today's (or most recent) data point? Would like to hear from you.

Regards,

Laurits
 

rogojel

TS Contributor
#4
Hi,
any chance of resurrecting this thread? I know I am a bit late, but I just started to work through the book and it would be great to have a place to discuss it.

Happy new year!
rogojel
 

hlsmith

Not a robit
#5
Rogojel, I am super bored and uninspired right now. I am late to your post, but if you want to work through the book I am game, provided the book is free or I can find access to it!
 

hlsmith

Not a robit
#8
Alright, I just downloaded the book. For full disclosure, I don't use R. I totally get most of the syntax, but I know none of the basic data management code. I propose that I spend this week reading the first intro chapter and getting up to speed. Then next week we can start going through the real content and exchanging questions and code.


Do you use R, or have you used it before?
 

bryangoodrich

Probably A Mammal
#9
I totally forgot about this thread. I think I have an ebook version of this, if I'm not mistaken. I wouldn't mind getting some book club stuff going, though! Unfortunately, I'm also in the midst of undertaking a **** ton of training (Microsoft SQL Server [MCSA], MongoDB, HBase, C# programming). I need to buff up my IT skills and get certified.
 
#11
Sounds good. I typically use SAS, so I just wanted to make sure you wouldn't be holding my hand through everything and that we may have comparable learning curves and questions.


I will start tackling the second chapter next week and we can discuss our progress.


Rogojel - I always imagined you were a quality engineer or worked in the production industry. Is this correct?
 

rogojel

TS Contributor
#12
hi hlsmith,
yes, more or less. I am an ex-physicist turned ex-C++ programmer working in six sigma consulting with a long-time interest in biology :) and stochastic processes.
 

rogojel

TS Contributor
#13
Hi,
looking at the algae dataset I wrote this function to make visualization easier:

Code:
# Define a function to draw density plots, faceted by speed and season.
# Relies on a data frame named RawDat existing in the global environment.
plot.multiple=function(DV){
  require(ggplot2)
  qplot(DV, geom="density", data=RawDat, fill=size, alpha=0.5,
        facets=speed~season)
}
Not as general as it could be but makes it easy to look at different variables.
 

bryangoodrich

Probably A Mammal
#17
I'd recommend avoiding qplot. It was kind of a bridge function for people wanting something like base plotting but to use the ggplot paradigm. Better to just use ggplot the way it was intended, like this!

Code:
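# Returns the ggplot object invisibly.
# x:   data frame with season, size and speed columns (e.g. algae)
# var: name of the numeric column to plot; the "mxPH" default is only a placeholder
plot.multiple <- function(x, var = "mxPH", alpha = 1, plot = TRUE)
{
  require(ggplot2)
  p <- ggplot(x) +
    aes(x = .data[[var]], fill = size) +
    geom_density(alpha = alpha) +
    facet_grid(speed ~ season) +  # same layout as qplot's facets = speed ~ season
    xlab(var)

  if (plot) print(p)  # plots p; to only return p just set plot = FALSE
  invisible(p)  # returns object if assigned, otherwise dumps into the aether
}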
I didn't test this at all so I could be full of ****. I'd also recommend adding a theme_bw(), at least I prefer it. You can add it in the function or let the user do it: e.g.,

Code:
library(ggplot2)
p <- plot.multiple(whatever, alpha = 0.5, plot = FALSE)
p + theme_bw()
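
For anyone without the data handy, here is a self-contained sketch on made-up numbers. Everything in it is an assumption for illustration: the `toy` data frame only borrows the algae column names (season, speed, size, mxPH, mnO2), the values are random, and `density_by_group` is a hypothetical helper name. It takes the column name as a string, so looking at different variables is just a loop.

```r
library(ggplot2)

# Toy stand-in for the algae data: random values, borrowed column names only.
set.seed(1)
toy <- data.frame(
  season = rep(c("spring", "summer", "autumn", "winter"), each = 30),
  speed  = rep(c("low", "medium", "high"), times = 40),
  size   = rep(c("small", "large"), times = 60),
  mxPH   = rnorm(120, mean = 8, sd = 0.5),
  mnO2   = rnorm(120, mean = 9, sd = 2)
)

# Hypothetical helper: density of one numeric column, filled by size,
# with one panel per speed (rows) and season (columns).
density_by_group <- function(df, var, alpha = 0.5) {
  ggplot(df, aes(x = .data[[var]], fill = size)) +
    geom_density(alpha = alpha) +
    facet_grid(speed ~ season) +
    xlab(var) +
    theme_bw()
}

# One plot object per variable of interest, named by column.
vars  <- c("mxPH", "mnO2")
plots <- lapply(vars, function(v) density_by_group(toy, v))
names(plots) <- vars

print(plots[["mxPH"]])  # draw one of them
```

Passing the column as a string keeps the function independent of any particular global data frame, which is the main thing the original version was missing.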