I am new to big data and I am trying to decide on a learning project to do.

I have an imaginary "new" widget design out in the field being used on a trial basis. I get failure reports that include:

Location of failure: zone 1 to 8

Observed breakage: 1 to 10 (crack, burn, chip, ...)

Hours in operation

Type of fuel used: 1 to 3

...

and so forth. There are 10 features in total.

The Y I am trying to predict is Root Cause (one of a set of options, 1 to 5), which has historically been determined by experts.

I have ~1500 observations (feature vectors), and the goal is to use this set to train and test a model.

Question:

In the absence of any requirement to choose one method over another, what would be the best ML approach to predict Root Cause from past observations? Random forest, neural network, ...?

I would appreciate any recommendation, along with the reason why a given method would be better. My goal is to complete this using 2 methods, mainly for comparison and learning. I will be using R.
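Whichever two methods end up being compared, it helps to score both on the same hold-out split and against a trivial baseline. The sketch below is only an illustration in plain Python (the same idea carries over to R): the data are randomly generated stand-ins, and the helper names `stratified_split` and `majority_baseline_accuracy` are hypothetical, not part of any library.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Split row indices so each class keeps roughly the same share in train and test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_fraction))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

def majority_baseline_accuracy(labels, train_idx, test_idx):
    """Accuracy of always predicting the most common class seen in training."""
    most_common = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
    hits = sum(labels[i] == most_common for i in test_idx)
    return hits / len(test_idx)

# Made-up stand-in for the real data: 1500 root-cause labels in 1..5,
# drawn with unequal (assumed) class frequencies.
data_rng = random.Random(42)
y = [data_rng.choices([1, 2, 3, 4, 5], weights=[5, 3, 2, 1, 1])[0]
     for _ in range(1500)]

train_idx, test_idx = stratified_split(y)
baseline = majority_baseline_accuracy(y, train_idx, test_idx)
print(len(train_idx), len(test_idx), round(baseline, 3))
```

Any model (random forest, neural network, ...) trained on `train_idx` should beat `baseline` on `test_idx` before its accuracy is worth comparing with another model's.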

Thanks

Mike

I want to compute the correlation between two variables that share a common term (difference scores).

Originally there are three variables: X, Y and Z. The two variables I want to correlate are computed as:

First variable: X - Y

Second variable: Y - Z

I know that the correlation will be spurious, and I wonder whether it is possible to correct for this in some way. For theoretical reasons, I need the correlation (or some other measure of association) between variables that share this common term.
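To illustrate how large the built-in component can be, here is a small simulation sketch. It assumes, purely for illustration, that X, Y and Z are independent standard normal variables; in that case Cov(X - Y, Y - Z) = -Var(Y), so the expected correlation is -0.5 even though the three variables are unrelated. The `pearson` helper is written out by hand and is not taken from any library.

```python
import math
import random

def pearson(a, b):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

# Assumption: X, Y, Z are independent with equal variance (simulated data).
rng = random.Random(1)
n = 20000
X = [rng.gauss(0, 1) for _ in range(n)]
Y = [rng.gauss(0, 1) for _ in range(n)]
Z = [rng.gauss(0, 1) for _ in range(n)]

first = [x - y for x, y in zip(X, Y)]    # first variable:  X - Y
second = [y - z for y, z in zip(Y, Z)]   # second variable: Y - Z

r_raw = pearson(X, Y)          # near 0: the raw variables are unrelated
r_diff = pearson(first, second)  # near -0.5: induced by the shared Y term
print(round(r_raw, 2), round(r_diff, 2))
```

With unequal variances the induced value changes, because it depends on how large Var(Y) is relative to Var(X) and Var(Z).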

I will also add that what interests me is not the value of the correlation coefficient itself; I want to compare the relative strength of the correlation between these variables across different conditions. That is why I thought I could transform the r values to z values using Fisher's transformation: I assumed that even if the correlation coefficients are inflated, it would still be possible to compare their relative strength.

I would be grateful for any help.
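For reference, the Fisher comparison I have in mind can be sketched as below. This assumes the two conditions are independent samples; the function names and the example numbers (r = -0.6 and r = -0.3 with 50 cases each) are made up for illustration.

```python
import math

def fisher_z(r):
    """Fisher's transformation: z = atanh(r) = 0.5 * ln((1 + r) / (1 - r))."""
    return math.atanh(r)

def compare_independent_correlations(r1, n1, r2, n2):
    """Two-sided z-test for the difference between two independent correlations."""
    z1, z2 = fisher_z(r1), fisher_z(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))  # standard error of z1 - z2
    z_stat = (z1 - z2) / se
    p_value = math.erfc(abs(z_stat) / math.sqrt(2))  # two-sided normal p-value
    return z_stat, p_value

# Hypothetical correlations from two conditions (not real results).
z_stat, p_value = compare_independent_correlations(-0.6, 50, -0.3, 50)
print(round(z_stat, 2), round(p_value, 3))
```

The test itself does not remove the shared-term inflation; it only compares the two coefficients as they are.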

I have a question. I have a dataset of a few customer segments, their customers, the products they bought, and the margin at the moment of each transaction. I want to find out whether my segmentation works: do I get better margins in segment A than in segment B for similar products? What is the best statistical approach I can take?

Thank you for the help! :)
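To make the comparison concrete, one possible starting point (an assumption on my part, not necessarily the best approach) is to pair the segments by product and test the per-product margin differences with a sign-flip permutation test. The transactions, segment labels, product IDs and helper names below are all made up for illustration.

```python
import random
from collections import defaultdict
from statistics import mean

def paired_margin_differences(transactions, seg_a="A", seg_b="B"):
    """Per-product mean margin of seg_a minus seg_b, for products sold in both."""
    margins = defaultdict(lambda: defaultdict(list))
    for segment, product, margin in transactions:
        margins[product][segment].append(margin)
    return {
        product: mean(by_seg[seg_a]) - mean(by_seg[seg_b])
        for product, by_seg in margins.items()
        if by_seg.get(seg_a) and by_seg.get(seg_b)
    }

def sign_flip_p_value(diffs, n_perm=10000, seed=0):
    """Two-sided permutation p-value for 'the mean paired difference is zero'."""
    rng = random.Random(seed)
    observed = abs(mean(diffs))
    extreme = 0
    for _ in range(n_perm):
        # Under the null hypothesis each difference is equally likely to be + or -.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)

# Hypothetical (segment, product, margin) records; p4 is sold in one segment only.
transactions = [
    ("A", "p1", 0.30), ("A", "p1", 0.34), ("B", "p1", 0.25),
    ("A", "p2", 0.22), ("B", "p2", 0.20), ("B", "p2", 0.18),
    ("A", "p3", 0.40), ("B", "p3", 0.41),
    ("A", "p4", 0.15),
]

diffs = paired_margin_differences(transactions)
p_value = sign_flip_p_value(list(diffs.values()))
print(sorted(diffs), round(p_value, 2))
```

Pairing by product keeps the product mix from driving the result; with only a handful of shared products, as in this toy data, the test has very little power.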