# Bookclub: Data Mining with R

#### bryangoodrich

##### Probably A Mammal
Here I want to begin a book club for talking about Data Mining with R. I know a few of us here at TS own this book or have looked at it. The format for the club is ongoing: you participate at will. If you don't have the book, check it out this summer and participate! The point is to have a repository of information, questions, and discussion on the contents. Questions may be theoretical, which anyone can answer, or specific to the book, which only those with a copy can help with. In any case, I hope this minimal format will produce more participation than we've seen in the past (you slackers!).

You can get the data from the website linked above, but better yet, just use their package

Code:
install.packages("DMwR")
This gives you the data sets

Code:
algae                   Training data for predicting algae blooms
test.algae (testAlgae)  Testing data for predicting algae blooms
algae.sols (algaeSols)  The solutions for the test data set for predicting algae blooms
GSPC                    A set of daily quotes for SP500
sales                   A data set with sale transaction reports
This covers the main 3 cases, but not the last (microarray samples). Instead, you have to run this once

Code:
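# Original instructions (older Bioconductor releases):
source("http://bioconductor.org/biocLite.R")
biocLite()
biocLite("ALL")
# biocLite has since been retired; on current Bioconductor releases use instead:
# install.packages("BiocManager"); BiocManager::install("ALL")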
Then you can access the data

Code:
library(Biobase)
library(ALL)
data(ALL)
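
Once the package is installed, pulling one of its data sets into the session could look roughly like the sketch below. The helper name `load_algae` is made up for illustration, and the code assumes DMwR can still be installed from your repository; if it is missing, the helper just reports that and returns NULL.

```r
# Sketch: load the algae training data if DMwR is available.
# load_algae is a hypothetical helper, not part of DMwR.
load_algae <- function() {
  if (!requireNamespace("DMwR", quietly = TRUE)) {
    message("DMwR is not installed; run install.packages(\"DMwR\") first")
    return(NULL)
  }
  env <- new.env()
  data("algae", package = "DMwR", envir = env)  # load into a private environment
  env$algae
}

algae <- load_algae()
if (!is.null(algae)) {
  str(algae)              # column names and types
  print(summary(algae))   # per-column summaries, including NA counts
}
```

The same pattern should work for the other data sets listed above by swapping the name passed to `data()`.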

#### vinux

##### Dark Knight
I am in. I could do something productive in the Finance case study.

#### laurits

##### New Member
Hi

Have any of you worked through chapter 3 - Predicting Stock Markets?

It's a really good introduction to many useful R functions for predicting and testing. However, the final section leaves you a bit lost. Have any of you worked out how to obtain the predicted signal for today's (or most recent) data point? Would like to hear from you.

Regards,

Laurits

#### rogojel

##### TS Contributor
Hi,
any chance of resurrecting this thread? I know I am a bit late, but I just started to work through the book and it would be great to have a place to discuss it.

Happy new year!
rogojel

#### hlsmith

##### Less is more. Stay pure. Stay poor.
Rogojel I am super bored right now and uninspired. I am late to your post but if you want to work through the book I am game given the book is free or I can find access to it!

#### rogojel

##### TS Contributor
Great! How do we start? I've never done something like this. I am definitely interested in the algae data BTW.

regards

#### hlsmith

##### Less is more. Stay pure. Stay poor.
Let me see if I can get access to the book? Is it free online? I am going to check a library.

#### hlsmith

##### Less is more. Stay pure. Stay poor.
Alright, I just downloaded the book. For full disclosure, I don't use R. I totally get most of the syntax, but I know none of the basic data management code. I propose that I spend this week reading the first intro chapter and getting up to speed. Then next week we can start going through the real content and exchanging questions and code.

Do you use R, or have you used it before?

#### bryangoodrich

##### Probably A Mammal
I totally forgot about this thread. I think I have an ebook version of this, if I'm not mistaken. I wouldn't mind getting some book club stuff going, though! Unfortunately, I'm also in the midst of undertaking a **** ton of training (Microsoft SQL Server [MCSA], MongoDB, HBase, C# programming). I need to buff up my IT skills and get certified.

#### hlsmith

##### Less is more. Stay pure. Stay poor.
Sounds good. I typically use SAS, so I just wanted to make sure you weren't holding my hand through everything and that we may have comparable learning curves and questions.

I will start tackling the second chapter next week and we can discuss our progress.

Rogojel - I always imagined you were a quality engineer or were in the production industry. Is this correct?

#### rogojel

##### TS Contributor
hi hlsmith,
yes, more or less. I am an ex-physicist turned ex-C++ programmer working in six sigma consulting with a long-time interest in biology and stochastic processes.

#### rogojel

##### TS Contributor
Hi,
looking at the algae dataset I wrote this function to make visualization easier:

Code:
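# define function to plot multiple density plots
# DV: the variable to plot, e.g. RawDat$mxPH
# RawDat: presumably the algae data frame, already in the workspace
plot.multiple = function(DV) {
  require(ggplot2)
  # I(0.5) sets a constant transparency instead of mapping alpha to a legend
  qplot(DV, geom = "density", data = RawDat, fill = size, alpha = I(0.5),
        facets = speed ~ season)
}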
Not as general as it could be but makes it easy to look at different variables.

#### melrick

##### New Member
Hi,

I am trying to do Turf Analysis using R. Unfortunately, I could not get the output.

Can anyone help me out?

#### rogojel

##### TS Contributor
Hi,
what is Turf analysis ? Are you using R Studio?
regards

#### hlsmith

##### Less is more. Stay pure. Stay poor.
rogojel,

Got a little busy, hopefully I will start tackling the book next week!

#### bryangoodrich

##### Probably A Mammal
I'd recommend avoiding qplot. It was kind of a bridge function for people wanting something like base plotting but to use the ggplot paradigm. Better to just use ggplot the way it was intended, like this!

Code:
# returns ggplot plotting object invisibly
# x: data frame with size, speed and season columns; var: name of the column to plot
plot.multiple <- function(x, var, alpha = 1, plot = TRUE)
{
  require(ggplot2)
  p <- ggplot(x, aes(x = .data[[var]], fill = .data[["size"]])) +
    geom_density(alpha = alpha) +
    facet_grid(speed ~ season)  # grid layout, matching facets = speed ~ season in qplot

  if (plot) print(p)  # plots p; to only return p just set plot = FALSE
  invisible(p)  # returns object if assigned, otherwise dumps into the aether
}
I didn't test this at all so I could be full of ****. I'd also recommend adding a theme_bw(), at least I prefer it. You can add it in the function or let the user do it: e.g.,

Code:
library(ggplot2)
# "mxPH" is just an example column name
p <- plot.multiple(whatever, "mxPH", alpha = 0.5, plot = FALSE)
p + theme_bw()
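
For anyone who wants to try the ggplot approach without the book's data, here is a minimal sketch on made-up data. The `toy` data frame and the helper name `density_by_group` are assumptions for illustration only; the values are random and merely mimic the season, size and speed factors of the algae data.

```r
library(ggplot2)

# Toy stand-in for the algae data: random values, 10 rows per
# season/size/speed combination so every density can be estimated.
set.seed(1)
toy <- expand.grid(
  season = c("spring", "summer", "autumn", "winter"),
  size   = c("small", "medium", "large"),
  speed  = c("low", "medium", "high"),
  rep    = 1:10
)
toy$mxPH <- rnorm(nrow(toy), mean = 8, sd = 0.6)

# Hypothetical helper: density of one column, filled by size,
# with one panel per speed/season pair.
density_by_group <- function(data, var, alpha = 0.5) {
  ggplot(data, aes(x = .data[[var]], fill = .data[["size"]])) +
    geom_density(alpha = alpha) +
    facet_grid(speed ~ season)
}

# The theme is added by the caller, as suggested above.
p <- density_by_group(toy, "mxPH") + theme_bw()
print(p)
```

Passing the column name as a string keeps the helper usable for any numeric column, which is the generality the original qplot version was missing.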