Meanings of a set of data

#1
Hi,

My name is ngungo, which means innocent or ignorant, and I am new here. I am a retired Computer Process Control Engineer. I took a few statistics courses in school and applied some of it at work a long time ago, but I have forgotten all about it. I guess I can hit the books again to refresh, but I need some hints. Be warned that I don't even know how or what to ask.

Here is my problem. I have been given a huge set of data tabulated in rows and columns. I need to understand it, read it, comprehend it. I need it to tell me something. What can I do? What procedures or formulas? I heard there are statistics tools that work like a black box: you just feed in the data and they tell you something. Does such a thing exist? By the way, I am pretty good at numerical programming.

I don't know what else to ask, so maybe you can ask me questions and guide me in the right direction, please. Thanks in advance.

--ngungo
 
#2
I spent all day today learning more about a subject I have no clue whatsoever about. SAS looks impressive, but the tool I finally chose was the R language, because I am a programmer and it's cheap. I'd appreciate it if someone said something.
 
#3
I downloaded R-2.15.0 and the R-intro book. The R language is very interesting. I just finished Appendix A: A sample session. It's been a productive Sunday.
 

Dason

Ambassador to the humans
#5
R is awesome and I'm glad to hear you got it all installed and running. I'm not really sure what you meant by your last post though. If you have specific R questions feel free to post a new thread asking about them.
 

noetsi

Fortran must die
#6
Before you use a method or statistical software to answer this question, you have to decide why you are interested in the data. What is your research question? It is very rare, and probably impossible, to just look at data and decide something with it unless you have some purpose in mind before you start.

As someone pointed out to me, if you have actual variables in your rows or columns, one easy place to start is with EFA. Exploratory factor analysis.
 
#7
@Dason: I sure will. Thanks.

@noetsi:
A few weeks ago I met an old colleague and we had some beer. I complained to her that my life was dull and that I missed the good ol' days at the company. A week ago I heard from her again, offering a small commission. She threw a good-sized table of numbers at me and asked me to make sense of it. She did not know that I don't know jack about statistics, though she knew I was in the process control field, and I did not volunteer to tell. :)

At the moment, I am trying to get acquainted with R and to relearn introductory statistics. I am also trying to find out what exploratory factor analysis is about. Thanks for the hints. I got my life back.
 
#8
“I have given a huge set of data that tabulated in rows and columns.”
How many observations (rows) and how many variables (columns) do you have?

As a first check of the data, I normally compute for every variable the number of observations (n), the mean, the standard deviation, and the minimum and maximum.

Is the “n” reasonable? And the mean and standard deviation? Looking at the min and max sometimes shows you whether there are any incorrect values. You might know that some variables must be within a certain range; larger than zero, for example. Sometimes missing values are coded as “999” or something similar. Sometimes (from Excel, I believe) missing values are coded as zero (0). There are all kinds of cleaning and checking that need to be done.
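A minimal R sketch of this first check. The data frame `dat`, its column names, and the use of 999 as a missing-value code are all made up for illustration; the real table would come from `read.table()` or similar.

```r
# Hypothetical example data (not the actual table from this thread).
dat <- data.frame(
  temp = c(20.1, 21.4, 999, 19.8, 22.0),  # 999 is an assumed missing-value code
  flow = c(3.2, 0, 2.9, 3.1, 3.4)         # a 0 may be real or a disguised missing value
)

# Recode the assumed missing-value code to NA before summarising.
dat$temp[dat$temp == 999] <- NA

# n (non-missing), mean, standard deviation, min and max for every column.
check <- sapply(dat, function(x) c(
  n    = sum(!is.na(x)),
  mean = mean(x, na.rm = TRUE),
  sd   = sd(x, na.rm = TRUE),
  min  = min(x, na.rm = TRUE),
  max  = max(x, na.rm = TRUE)
))
print(round(check, 2))
```

The result is a small matrix with one column per variable, which makes an out-of-range minimum or maximum easy to spot. Whether a zero is a real measurement or a missing value is something only the data owner can confirm.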

This (data quality checking) is often forgotten. I believe that many (bad) decisions in companies are based on simply incorrect data.

(If you suspect that a few variables are incorrect you need to go back to the client and ask for corrections.)

Once you have a rough idea of the level (the means) of your data, it is good to look at the distributions of the variables. Boxplots and histograms are useful.
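As a rough sketch, base R can draw both plots with `hist()` and `boxplot()`. The data here are simulated and the variable names `a` and `b` are placeholders only.

```r
# Simulated data for illustration: one roughly symmetric and one skewed variable.
set.seed(1)
dat <- data.frame(
  a = rnorm(200, mean = 50, sd = 5),
  b = rexp(200, rate = 0.1)
)

# Two plots side by side; restore the graphics settings afterwards.
op <- par(mfrow = c(1, 2))
h <- hist(dat$b, main = "Histogram of b", xlab = "b")
boxplot(dat, main = "Boxplots of all variables")
par(op)
```

The histogram shows the shape of a single variable, while the side-by-side boxplots make it easy to compare spread and outliers across variables that share a similar scale.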

You will soon be overwhelmed by the amount of output (if there are more than five variables). You will need to save your results and document which code created which output. (Save the date of creation.)
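One possible way to do this in R is to put the creation date into the file name. This is only a sketch: the `results` table, the "results" directory, and the file-name pattern are assumptions, and `tempdir()` is used so that the example does not touch a real project folder.

```r
# Hypothetical results table to be saved.
results <- data.frame(variable = c("temp", "flow"), mean = c(20.8, 3.15))

# Date stamp such as "2012-06-10" for the file name.
stamp <- format(Sys.Date(), "%Y-%m-%d")

# Assumed output directory; in a real project this would be a fixed folder.
out_dir <- file.path(tempdir(), "results")
dir.create(out_dir, showWarnings = FALSE)

out_file <- file.path(out_dir, paste0("summary_", stamp, ".csv"))
write.csv(results, out_file, row.names = FALSE)
```

A comment at the top of the script naming the output file it produces closes the loop in the other direction.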

Then you can look at relations: is one variable related to another? Scatter plots are good. You can also calculate correlation matrices between variables. (Most of the time I use the Pearson product-moment correlation, the “usual one”.) Don’t try to print too many at a time, but you may have space on the page for 10 by 10 variables.
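A short sketch of both steps in base R, using simulated data in which `y` is deliberately built from `x`. The variable names are placeholders; `cor()` uses the Pearson method by default.

```r
# Simulated data: y depends on x, z is unrelated noise.
set.seed(42)
x <- rnorm(100)
dat <- data.frame(
  x = x,
  y = 2 * x + rnorm(100, sd = 0.5),
  z = rnorm(100)
)

# Scatter plot of one pair, then all pairs at once.
plot(dat$x, dat$y, xlab = "x", ylab = "y")
pairs(dat)

# Pearson correlation matrix; pairwise deletion tolerates missing values.
cm <- cor(dat, use = "pairwise.complete.obs")
print(round(cm, 2))
```

With a couple dozen variables, subsetting the columns first (for example `cor(dat[, 1:10])`) keeps the printed matrix readable.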

Later on, you can look at whether the means differ between groups (females and males, for example).
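For instance, group means can be computed with `aggregate()` or `tapply()`. The `sex` and `score` columns below are invented purely to show the call.

```r
# Hypothetical data: one classification variable and one numeric variable.
dat <- data.frame(
  sex   = c("female", "male", "female", "male", "female", "male"),
  score = c(7.1, 6.4, 7.8, 6.9, 7.3, 6.2)
)

# Mean score per group, returned as a data frame.
group_means <- aggregate(score ~ sex, data = dat, FUN = mean)
print(group_means)

# The same means as a named vector.
print(tapply(dat$score, dat$sex, mean))

# One boxplot per group for a visual comparison.
boxplot(score ~ sex, data = dat)
```

This only describes the difference; whether it is larger than chance would suggest is a separate question for a formal test.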

Your study is exploratory. You want to explore what kind of information might be hidden in the data. Almost all investigations are exploratory to some extent. Very few investigations have just a single “hypothesis” to test, of the “What is your research question?” type.

I would suggest that you use exploratory factor analysis only as a last option. In fact, I suggest you not use it at all. One reason is that if you don’t know what it is, your listeners probably don’t know it either. Another reason is that it is statistically controversial. Some statisticians recommend not using it at all.



There are many information pages about how to learn R. On this site, in the R/Splus forum, look at: Info for R users.

“I got my life back.”
Welcome back! :) And welcome back here and tell us about your improvements!
 
#9
@GretaGarbo: What a treasure! Thanks so much.

Your response is exactly what I was looking for. :) Procedure, procedure, procedure. :) As I said, my discipline is engineering and computer science, hence procedure. My understanding is much clearer now; last night I made a list of tasks and will proceed accordingly. To answer your question, the table consists of no fewer than a couple dozen variables and thousands of observations. It will be months of work. It's fantastic.

Reading your advice, unless you tell me otherwise, it seems I just need to be proficient in R with file reading and writing; the functions for min, max, mean, and standard deviation; boxplots and histograms; and later on the Pearson product-moment correlation. Except for the Pearson thing, on which I hope you will give me some more hints later, I think I just need to install R, an R editor, and an R graphics package. I have set aside 10 days for this and for reading the statistics book. That makes it due on Father's Day weekend. What a fantastic gift.

Thanks so much!
 

noetsi

Fortran must die
#11
Awesome. I would be lost without running data.
 
#12
1. The Schaum's book came
2. Re-installed R per G. Jay Kerns

3. Installed NppToR:
+ I've been a fan of and using Notepad++ since forever :)
+ Kicked the tire and like it. Nice integration.
 
#16
Thanks Dason!
RStudio is much nicer. I am glad I mentioned NppToR, since that is how I came to know about RStudio.

PROGRESS:
1. The Schaum's book came
2. Re-installed R per G. Jay Kerns
3. Installed NppToR:
+ I've been a fan of and using Notepad++ since forever
+ Kicked the tire and like it. Nice integration.


4. Installed RStudio, uninstalled NppToR.
 
#17
“the table consists of no less than couple dozens of variables and thousands of observations.“


So, you have “thousands of observations”. That’s good! Then you can rely on “large sample statistics” (e.g. that estimates tend to be approximately normally distributed). It is actually more difficult if you have just seven observations. Still, your number of observations is not so large (millions of rows, say) that the data become cumbersome to handle.
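A small simulation can make the large-sample point concrete. This is only an illustration under assumed settings: the data are drawn from a skewed exponential distribution, and the sample sizes 7 and 1000 and the 2000 repetitions are arbitrary choices.

```r
set.seed(123)

# Sample means from many repeated samples of a skewed distribution (true mean = 1).
means_small <- replicate(2000, mean(rexp(7)))     # only seven observations per sample
means_large <- replicate(2000, mean(rexp(1000)))  # a thousand observations per sample

# Compare the two sampling distributions side by side.
op <- par(mfrow = c(1, 2))
hist(means_small, main = "n = 7", xlab = "sample mean")
hist(means_large, main = "n = 1000", xlab = "sample mean")
par(op)

# The large-sample means are far less spread out around the true mean.
print(c(sd_small = sd(means_small), sd_large = sd(means_large)))
```

With n = 7 the histogram of means is still visibly skewed; with n = 1000 it is narrow and close to bell-shaped, which is what makes the usual normal-based methods reasonable.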

And how many variables are there? How many of them hold values for which it is meaningful to calculate means (like chemical concentration or customer satisfaction), and how many are classification variables (like female/male, or young/middle/older)?
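In R, one quick way to sort the columns into these two kinds is to test each with `is.numeric()`. The data frame below is hypothetical, and note that a classification variable coded as numbers (1/2 for male/female, say) would wrongly show up as numeric.

```r
# Hypothetical table with one measurement and two classification variables.
dat <- data.frame(
  conc = c(1.2, 3.4, 2.2),
  sex  = factor(c("f", "m", "f")),
  age  = c("young", "older", "middle"),
  stringsAsFactors = FALSE
)

# TRUE for columns where a mean is at least computable.
is_num <- sapply(dat, is.numeric)

print(names(dat)[is_num])    # measurement-type variables
print(names(dat)[!is_num])   # classification-type variables

# Frequency table for each classification variable.
print(lapply(dat[!is_num], table))
```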

“It will be months of work.“
I think you can do many of these things faster. Installation is quite fast.

I think it is good to set aside space on the disk for a rich directory structure. What you do now in the first week might seem almost childish after a while, when you have learnt more. I find it normal that code is re-used and improved, and sometimes a new version can be put in a new directory. It is also important to document results and record the dates when things were created. Comment the code to say why you are doing each step. Since you, ngungo, are very experienced, I am sure you are aware of many of these things, but for inexperienced readers it might be worth saying.
 
#18
Thank you very much GretaGarbo. :)

I have a few obligations today: making breakfast for the whole family, doing laundry, checking the transmission oil, watching the French Open. But I need to get going with R. Last night was the first time I wrote a few lines of code on my own. An eager but frustrated newbie, like a Chinese soul lost in Mexico City.

I have a few questions about R though.
Code:
> num <- read.table(file, quote="\"")
> View(num)
> num
  V1 V2 V3 V4 V5
1  1  2  3  4  5
2  6  7  8  9 10
3 11 12 13 14 15
4 16 17 18 19 20
> num[1]
  V1
1  1
2  6
3 11
4 16
> num[,1]
[1]  1  6 11 16
> row2 <- c(num[2,1],num[2,2],num[2,3],num[2,4],num[2,5])
> row2
[1]  6  7  8  9 10
1. How do I extract a row without the concatenate function c()?
2. How do I add a new column to num?
3. How do I add a new row to num?
 
#19
This is what I could come up with for adding a column and a row to an existing data frame. Comments, please.
Code:
> num <- read.table(file, quote="\"")
> View(num)
> num
  V1 V2 V3 V4 V5
1  1  2  3  4  5
2  6  7  8  9 10
3 11 12 13 14 15
4 16 17 18 19 20
> num <- data.frame(
+   v1=c(num[,1],1), 
+   v2=c(num[,2],1), 
+   v3=c(num[,3],1), 
+   v4=c(num[,4],1), 
+   v5=c(num[,5],1), 
+   v6=c(num[,4]+num[,5],1)
+ )
> num
  v1 v2 v3 v4 v5 v6
1  1  2  3  4  5  9
2  6  7  8  9 10 19
3 11 12 13 14 15 29
4 16 17 18 19 20 39
5  1  1  1  1  1  1
 

Dason

Ambassador to the humans
#20
Code:
# Rebuild the 4 x 5 example table (columns are named V1..V5)
num <- as.data.frame(matrix(1:20, ncol = 5, byrow = TRUE))

# Add a column
num[,6] <- seq(9, 39, 10)

# Add a row
num[5,] <- 1
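For the first question (extracting a row without `c()`), a sketch along the same lines, together with `rbind()` and `$` as alternative ways to add a row and a column. The names `row2_df` and `row2` are just illustrative.

```r
# Same 4 x 5 example table as above.
num <- as.data.frame(matrix(1:20, ncol = 5, byrow = TRUE))

# Extract row 2: as a one-row data frame, or flattened to a plain vector.
row2_df <- num[2, ]
row2    <- unlist(num[2, ], use.names = FALSE)
print(row2)

# Add a column computed from existing columns.
num$V6 <- num$V4 + num$V5

# Add a row of ones with rbind(); the vector needs one value per column.
num <- rbind(num, rep(1, ncol(num)))
print(num)
```

`num[2, ]` keeps the data-frame structure (and column names), which is usually what you want; `unlist()` is only needed when a bare vector is required.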