Could anyone recommend an interactive application for mass data analysis?

#1
Hi, everybody
I need to conduct a large amount of data analysis on database. Could anyone recommend an interactive application for data analysis?
The requirements are:
1. Able to cope with the unexpected requirement rapidly.
2. Able to perform further computations on results interactively (base on the mass data).
3. Easy to confront even a large amount of complex computations
What would you great expert recommend?
Thanks in advance.
 
#4
thank you, Dason
I spend a whole day to study R, It's powerful, I think It just seems like what I want. And I find A extend tools of R, named RExcel. I have a question about it, can you give me some tips? My question is:
Can I refer a result by cell name in RExcel? What I mean is: when I write "c(1,2)" or other format code in cell A1, Can I get a furter result in A2 by write "sum(A1)". Please notice that I want refer c(1,2) by cell name A1. I konw the traditional usage is: write "c(1,2)->M" in A1, and "sum(M)" in A2, but This traditional usage is not what I want.
thanks for your further replay.
 

Dason

Ambassador to the humans
#5
I've never used RExcel. I just do everything within R. If you're looking for a nicer platform to run R in I can recommend RStudio.
 
#6
hi, Dason
Forget RExcel, It is useless.
I have studied R for all my day since you had suggested it to me. I have learned how to use split, aggregate and so on.
I met a problem, I think It maybe easy to you, Please give me a hand. I have spend much time to compute these simple comutation, but I stick at group and relative position computation.
Please check for me how to write these steps in R code.
--------------------------------------
original data a data frame, named "result"
product year value
1 2009 1104
2 2009 608
3 2009 606
1 2010 1504
2 2010 508
3 2010 1312

I want to compute the product whose annual sales values are all among the top 2. so I need to do
step1: split data by year
2009
product year value
1 2009 1104
2 2009 608
3 2009 606
2010
product year value
1 2010 1504
2 2010 508
3 2010 1312

step2: compute rank in every group
2009
product year value rank
1 2009 1104 1
2 2009 608 2
3 2009 606 3
2010
product year value rank
1 2010 1504 1
2 2010 508 3
3 2010 1312 2

step3: filter top2 in every group
2009
product year value rank
1 2009 1104 1
2 2009 608 2
2010
product year value rank
1 2010 1504 1
3 2010 1312 2

step4: compute intersection of every group
product
1

-----------------------------------------------------
My second problem
the original data is based on step1 of form2:
2009
product year value
1 2009 1104
2 2009 608
3 2009 606
2010
product year value
1 2010 1504
2 2010 508
3 2010 1312

I want to get relative difference in every group (It's a simple sample, has no actual business means, in business I want to solve link relative ratio problem)
that means: 0, 608-1104,606-608 and 0, 508-1504, 1312-508. I know in SQL It should use lag and lead function, but How to do the same thing in R.
What I want in result should be:
2009
product year value relative
1 2009 1104 0
2 2009 608 -494
3 2009 606 -2
2010
product year value relative
1 2010 1504 0
2 2010 508 -996
3 2010 1312 804
 
Last edited:

Lazar

Phineas Packard
#7
Let this get you started:
Code:
myData <- data.frame(product = rep(1:3, 2), year=rep(2009:2010, each=3),
                     value=c(1104,608,606,1504,508,1312))

myData$R <- do.call(c, tapply(myData$value, myData$year, 
                                  function(x) rank(-x)) )

myDataTops <- myData[which(myData$R<3),]
myDataTops <- split(myDataTops, myDataTops$year)

intersect(myDataTops[[1]]$product, myDataTops[[2]]$product)
Part 2:
Code:
f1 <- function(x){
                  y <- rep(0, length(x))
                  y[c(2:length(x))] <- diff(x)
                  y
                  }

myData$relative <- do.call(c,tapply(myData$value, myData$year,f1) )
 
#8
thank you Lazar
I have run the code and got correct result, I have studied every function you provided ,It spent me 5 hours, but I felt happy.
you awsome!
It's nearly perfect, but I want to improve it on like steps in my last post, I checked many document and find nothing.please give me some tips.
1. how to tread intersect of uncertain years.
Actually the number of years is uncertain, It maybe many years, not just 2 years. so I cant just use myDataTops[[1]] and myDataTops[[2]] in intersect function.
Maybe I should write a function myself, using "for" to access every year? But It seems stupid. Is there a easy way to do this?
2. how to split data at first , then to compute relative, (your computation is not base on a split date) (It seems more easy to understand, and is my original requirement). what I mean is:
split data at first:
Code:
       myData <- data.frame(product = rep(1:3, 2), year=rep(2009:2010, each=3),value=c(1104,608,606,1504,508,1312))
       myData$relative<-0
       split(myData,myData$year)
We can got temporary result:
Code:
       $`2009`
  product year value relative
1       1 2009  1104        0
2       2 2009   608        0
3       3 2009   606        0

$`2010`
  product year value relative
4       1 2010  1504        0
5       2 2010   508        0
6       3 2010  1312        0
what I want is how to write code to change it to below?
Code:
       $`2009`
  product year value relative
1       1 2009  1104        0
2       2 2009   608        -496
3       3 2009   606        -2

$`2010`
  product year value relative
4       1 2010  1504        0
5       2 2010   508        -996
6       3 2010  1312        804
3.how to split data at first then to rank ,then to filter,then to intersect.
a. split data at first (just like question 2),we got
Code:
       $`2009`
  product year value R
1       1 2009  1104        0
2       2 2009   608        0
3       3 2009   606        0

$`2010`
  product year value R
4       1 2010  1504        0
5       2 2010   508        0
6       3 2010  1312        0
b. how to change it to:
Code:
       $`2009`
  product year value R
1       1 2009  1104        1
2       2 2009   608        2
3       3 2009   606        3

$`2010`
  product year value R
4       1 2010  1504        1
5       2 2010   508        3
6       3 2010  1312        2
c. then how to filter top 2 to
Code:
       $`2009`
  product year value R
1       1 2009  1104        1
2       2 2009   608        2

$`2010`
  product year value R
4       1 2010  1504        1
6       3 2010  1312        2
d. then how to compute intersect to:
1

Let this get you started:
Code:
myData <- data.frame(product = rep(1:3, 2), year=rep(2009:2010, each=3),
                     value=c(1104,608,606,1504,508,1312))

myData$R <- do.call(c, tapply(myData$value, myData$year, 
                                  function(x) rank(-x)) )

myDataTops <- myData[which(myData$R<3),]
myDataTops <- split(myDataTops, myDataTops$year)

intersect(myDataTops[[1]]$product, myDataTops[[2]]$product)
Part 2:
Code:
f1 <- function(x){
                  y <- rep(0, length(x))
                  y[c(2:length(x))] <- diff(x)
                  y
                  }

myData$relative <- do.call(c,tapply(myData$value, myData$year,f1) )
 
Last edited:

Lazar

Phineas Packard
#9
I will try to get to all your questions but just quickly, I decided against spliting the data at first because I thought in your case it was easier to just work with the whole data set. It is still identical to what you wanted in the first place all that you need to do is split the data at the end if you so desire. i.e. run all the script I wrote and then
myDataList <- split(myData, myData$year).


this gives:
Code:
$`2009`
  product year value R relative
1       1 2009  1104 1        0
2       2 2009   608 2     -496
3       3 2009   606 3       -2

$`2010`
  product year value R relative
4       1 2010  1504 1        0
5       2 2010   508 3     -996
6       3 2010  1312 2      804
Which is what you want right?
 

Lazar

Phineas Packard
#10
1. how to tread intersect of uncertain years.
Actually the number of years is uncertain, It maybe many years, not just 2 years. so I cant just use myDataTops[[1]] and myDataTops[[2]] in intersect function.
Do you mean all the combinations? In other words the intersect between 2009 and 2010, 2009 and 2011, and 2010 and 2011? If this is the case then:
Code:
#Gives all the possible combinations between the years
combs <- expand.grid(1:length(myDataTops), 1:length(myDataTops))
#Gets rid of comparisons with own year
combs <- combs[!(combs[,1] == combs[,2]),]
#Gets rid of duplicate combinations
combs <-   combs[!duplicated(t(apply(combs, 1, sort))), ]

sapply(1:nrow(combs), function(x){intersect(myDataTops[[combs[x,1]]]$product, 
                                                    myDataTops[[combs[x,2]]]$product)
                                                   }
          )
Will give you what you want but the labelling is not really ideal of the output is not really clear.
 
#11
thanks for quick reply.
No, what I mean is not all the possible combinations.
I mean there maybe many years, e.g. 2008,2009,2010,2012
if I use intersect , I must code as below:
result1<-intersect(myDataTops[[1]]$product, myDataTops[[2]]$product)
result2<-intersect(result1, myDataTops[[3]]$product)
result3<-intersect(result2, myDataTops[[4]]$product)

to avoid this handword, I must use a for loop to tread random multiple years ( I have not studied loop statement)
But loop statement seems a stupid way, there should be a single function or easy way to tread with thses situation, shouldn't it?

Do you mean all the combinations? In other words the intersect between 2009 and 2010, 2009 and 2011, and 2010 and 2011? If this is the case then:
Code:
#Gives all the possible combinations between the years
combs <- expand.grid(1:length(myDataTops), 1:length(myDataTops))
#Gets rid of comparisons with own year
combs <- combs[!(combs[,1] == combs[,2]),]
#Gets rid of duplicate combinations
combs <-   combs[!duplicated(t(apply(combs, 1, sort))), ]

sapply(1:nrow(combs), function(x){intersect(myDataTops[[combs[x,1]]]$product, 
                                                    myDataTops[[combs[x,2]]]$product)
                                                   }
          )
Will give you what you want but the labelling is not really ideal of the output is not really clear.
 
#12
hi, Lazar
thanks for your help. I think I didn't describe clearly and I make some wrong description in my reply, so you completely misread my intentions.
The result you posted is just what I want. But I didn't just want a final result, I want a natural( I mean thinking things directly and simplely ) and easy way to analyse data for a analyst. I think these steps is more natural ("Part" 1 in your reply):
1. split data to some groups by year.
2. get rank in each group. ( for further analysis, I can run computations like max, min based on these step).
3. get top 2 in each group. ( so I can observe importent data by group, It's more clear than a non-splited data).
4. get intersect of every group. here I solved problem:the product whose annual sales values are all among the top 2.

next, I want to compute relative difference of every record in each year( "part2" in your reply) , I found splited data still there in step1 above , so I can make a further compution based on step1 above( need not rewrite it, It seems call "code reuse"?). the nature thought is to directly add a relative value in each group based on step1 above , and not to start from non-splited original data.
did I make it clear? Sorry for my poor expression ability.


I will try to get to all your questions but just quickly, I decided against spliting the data at first because I thought in your case it was easier to just work with the whole data set. It is still identical to what you wanted in the first place all that you need to do is split the data at the end if you so desire. i.e. run all the script I wrote and then
myDataList <- split(myData, myData$year).


this gives:
Code:
$`2009`
  product year value R relative
1       1 2009  1104 1        0
2       2 2009   608 2     -496
3       3 2009   606 3       -2

$`2010`
  product year value R relative
4       1 2010  1504 1        0
5       2 2010   508 3     -996
6       3 2010  1312 2      804
Which is what you want right?
 

Lazar

Phineas Packard
#13
You can do it your way but it is very hard to effeciently define new varaibles when you have your dataframes in a list. My understanding is that it requires a for loop to do so. Surely the most natural way to analyses the data is that which gives you your answer efficently?
 
#14
Thanks for your professional attitude
The most natural way to analyses the data is ideal, but I shuold give it up if It's very hard to write the code. If I change the steps as below, Can It make the code more easy than my original steps?
1. copy the original data: myDataRank<-myData
2. add a 0 value column (named R): myDataRank$R<-0
3. split data by year: split(myDataRank,myData$year)
4. get rank in each group, assign rank to column R.
5. get top 2 in each group
6. get intersect of every group. It may as well do not use a loop here.
please give some tips how to write code for step 4,5,6

next
1. copy the original data: myDataRelative<-myData
2. add a 0 value column (named relative): myDataRelative$relative<-0
3. split data by year: split(myDataRank,myData$year)
4. compute relative difference,assign it to column relative.
please give some tips how to write code for step 4

The more close to natural, the better, if code is not very hard.
You may take notice of that is my way.
I'm sorry to bring your some trouble, hopes not bother you.


You can do it your way but it is very hard to effeciently define new varaibles when you have your dataframes in a list. My understanding is that it requires a for loop to do so. Surely the most natural way to analyses the data is that which gives you your answer efficently?