Handling very large (21 million rows) arrays in R

sak

New Member
#1
I have an array of size 21,415,715 x 18 which I am trying to import through an ODBC database connection, but R is not able to load it.

How should I handle it? Or is R not good enough for such massive data?
 

trinker

ggplot2orBust
#2
sak,

1) I've seen you get on here many times and post questions about R. Someone usually solves your problem, but you never let us know if the solution worked. That means future people searching the same topic can't tell whether the suggested solution actually worked, and the searchability and self-help value of these threads is diminished.

2) Many of your questions, including this one, are easily searchable through Rseek.org, which you are not ignorant of because I've informed you of it in previous posts. Many people work painstakingly to compile and organize this information to empower R users to help themselves. Please take the time to attempt to answer your own question first.

How should I handle it? Or is R not good enough for such massive data?
3) This statement is loaded and has a combative feel to it. A lot of people have freely invested, and continue to freely invest, a lot of time in this open source program that you seem to use frequently. This kind of statement lacks sensitivity and can be pretty offensive. I myself have not contributed anything to [R] but am grateful to those who have and who continue to develop the program and libraries.

Perhaps a better way of stating this would have been:

"I'm not sure if R is able to handle this large of a data set. I've searched and not been able to find an answer. Does anyone have an idea of how R may be able to handle such a large data set?"

Though I do not have the need to work with such a large database, I think the RSQLite package may help you. I have not tried it myself; I just found it with an Rseek search. I've heard of others handling large data sets by not actually importing the whole set into memory, instead keeping the data outside of R and pulling in just the pieces that are needed, roughly as sketched below.
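
A rough sketch of that idea with RSQLite (I have not run this against data of your size, and the file, table, and column names are made up for illustration): keep the full table in an on-disk SQLite file and query only the slice you need into memory.

# Keep the big table in an on-disk SQLite database and pull only the
# rows/columns a given analysis needs into memory.
# File, table, and column names below are hypothetical.
library(RSQLite)

con <- dbConnect(SQLite(), dbname = "bigdata.sqlite")

# One-time import, e.g. from a CSV export of the ODBC source:
# dbWriteTable(con, "big_table", "bigdata.csv", header = TRUE)

# Later, fetch just what you need for a particular analysis:
subset_df <- dbGetQuery(con, "SELECT x1, x2, y FROM big_table WHERE grp = 'A'")

dbDisconnect(con)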
 

TheEcologist

Global Moderator
#3
In general R relies on your machine's memory to handle data (or objects). That is normally no problem, but with particularly large data sets R is at a disadvantage compared to software from the mainframe era such as SAS.
Luckily people have run into this issue before and there are solutions, like the filehash package, which stores large data sets on the hard disk. Another way (since you are already using ODBC) is to let MySQL handle the data set and send only the output to R.
I found this manual very useful.
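
For later readers, a minimal sketch of the filehash route (database and key names are made up, and the small data frame is just a stand-in for a real slice): objects live in a file-backed database on disk and are pulled into memory only when fetched.

# filehash: store objects in a file-backed database on disk and fetch
# them individually instead of holding everything in RAM.
# Database and key names are made up for illustration.
library(filehash)

dbCreate("bigdata_db")                 # create the on-disk database (once)
db <- dbInit("bigdata_db")             # connect to it

one_chunk <- data.frame(x = rnorm(10), y = rnorm(10))  # stand-in for a real slice
dbInsert(db, "chunk1", one_chunk)      # write a piece to disk
piece <- dbFetch(db, "chunk1")         # read it back only when needed
dbList(db)                             # see which keys are stored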

Hope this helps.
 

TheEcologist

Global Moderator
#4
trinker,

I did not see your post in time, but yes you are correct.

Sak, you have been kindly warned by trinker; please stick to the guidelines.

We are happy to help but please take note of what trinker says.
 

sak

New Member
#5
Sorry guys!

99.9% of the time I have done thorough research on Google before I come here. I come here for a better answer, or when I find a thread which says that a solution does not exist. But next time I will mention what I found before posting my query. Fair enough.

Plus, I am not just looking for a way to do something, but rather for why I should do it in a certain way. I love the way experts shed light on simple stuff and argue about the different ways things can be done. Again, I will mention that as well.

For this large-array problem, I came across a thread which basically said we can't do it, since the maximum integer in R is 2^31 - 1 and R indexes the elements of an array with integers. So, it is not the memory that is the problem but the way R does things.
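
As a quick sanity check of that claim against this array's dimensions (assuming the usual 2^31 - 1 limit on vector length in 32-bit-indexed R):

.Machine$integer.max                    # 2147483647, i.e. 2^31 - 1
21415715 * 18                           # 385482870 total elements
21415715 * 18 < .Machine$integer.max    # TRUE: under the indexing limit,
                                        # so available RAM looks like the
                                        # more immediate bottleneck here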
 

Dason

Ambassador to the humans
#6
Like the others have said - R has some issues with large data because it relies on the machine's memory, but there are workarounds. Sometimes you can get away without needing some of those workarounds, though, depending on what you plan on doing with the data. So what do you need to load such a large dataset for? What do you plan on doing with it?
 

sak

New Member
#7
I need to run a regression and do some other time series work that is not available in SQL.
 

bryangoodrich

Probably A Mammal
#8
From my experience and readings, you usually don't need to deal with a large dataset as a whole in your analysis. I'm curious about the filehash package Ecologist mentioned, since I've never run across it before. Nevertheless, what do you need to do with the large dataset? Are you literally just running a regression on it like Y ~ X1 + X2 + ... ? Or are you breaking the dataset up? Usually the latter is the case, in which case appropriate SQL should help you get the data you require for the analysis, and you can load more manageable content. If not, then you might consider managing the regression operations manually, as sketched below. The matrix algebra is rather easy in R, and with a bit of creativity this can let you do it in a manageable way. I think we'd have to know your workflow in a bit more detail, though.
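
To make the manual route concrete, here is a rough sketch of fitting OLS without loading all 21 million rows at once: accumulate X'X and X'y chunk by chunk from the database and then solve the normal equations. The DSN, table, and column names are hypothetical, and the LIMIT/OFFSET paging assumes a backend such as MySQL; the resulting estimates are the same ones lm() would give on the full data.

# Chunked OLS: accumulate X'X and X'y over pieces of the table pulled
# through ODBC, then solve the normal equations at the end.
# DSN, table, and column names are hypothetical.
library(RODBC)

ch <- odbcConnect("myDSN")      # hypothetical DSN for the existing ODBC source
chunk_size <- 500000L
offset <- 0L
XtX <- 0
Xty <- 0

repeat {
  sql <- sprintf("SELECT y, x1, x2 FROM big_table LIMIT %d OFFSET %d",
                 chunk_size, offset)
  chunk <- sqlQuery(ch, sql)
  if (!is.data.frame(chunk) || nrow(chunk) == 0) break
  X <- cbind(1, as.matrix(chunk[, c("x1", "x2")]))  # add intercept column
  XtX <- XtX + crossprod(X)                         # running X'X
  Xty <- Xty + crossprod(X, chunk$y)                # running X'y
  offset <- offset + chunk_size
}
odbcClose(ch)

beta_hat <- solve(XtX, Xty)   # same coefficients lm() would return on the full data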
 

bryangoodrich

Probably A Mammal
#9
Hm, so if I understand what I read (quickly) correctly, filehash basically points to a database on disk and lets you operate on its contents in the environment (or as an environment) as if it were an object in memory. That is awesome! There's also a filehashSQLite package that does the same thing but uses SQLite as the database backend. I think I might use that, since I want to become proficient with SQLite, and I think its technology is just really neat and convenient.
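
If that reading is right, the usage would look roughly like this (a sketch with made-up names, using plain filehash; filehashSQLite should look much the same with SQLite as the backend):

# Treat a filehash database like an environment: objects stay on disk
# and are only read when accessed. Names are made up for illustration.
library(filehash)

dbCreate("analysis_db")
db <- dbInit("analysis_db")
dbInsert(db, "prices", data.frame(t = 1:5, p = rnorm(5)))

env <- db2env(db)    # expose the database contents as an environment
head(env$prices)     # accessing 'prices' pulls it from disk on demand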