Managing very large datasets

#1
I am currently working with a large dataset of around 30 million observations (around 30GB), and Stata has become very slow at running even simple commands. For example, it took one hour to combine two parts of this dataset with the -append- command. Any idea why it is so slow, or how to set up Stata so that it uses memory more efficiently? I'm new to this forum, but I hope someone can help me out!
 

bukharin

#2
Stata keeps its entire dataset in memory. This is usually an advantage (makes it run very fast), but occasionally a disadvantage when your dataset is too big to fit into memory - then it uses virtual memory and runs very slowly.
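
To check whether this is what's happening, you can look at how big your dataset is and how much memory Stata has allocated. A quick check, where mydata.dta is a placeholder filename:

use mydata, clear
describe, short  // reports observations, variables, and dataset size
memory           // reports how much memory Stata has allocated and used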

The first step is to reduce the size of your dataset(s):
- -drop- any variables you don't need for your analysis. If you will be running more than one analysis, create a subset of your data for each analysis so that each one contains only the variables needed for that specific analysis
- when you're combining datasets, make sure they have the same variable names. For example, if one has "id" and the other has "ID", you'll end up with 2 variables instead of 1, wasting space. Prepare the datasets prior to combining them so that each dataset contains exactly what you need (and nothing more) in exactly the same format
- convert any categorical string variables to numeric, then -drop- the original string variable. For example, a string variable containing "Male" or "Female" takes up 6 bytes, but it only needs to be stored as a numeric variable taking up one byte
- -compress- the datasets to store each variable in its smallest type (eg a binary variable can be stored as a byte, it doesn't need to be stored as a float; see -help compress-). A combined sketch of these steps follows this list
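
A minimal sketch of those steps, assuming a file called mydata.dta with placeholder variables id, sex, and income:

use id sex income using mydata, clear  // load only the variables you need
encode sex, generate(sexn)             // numeric version of the string variable
drop sex                               // drop the original string variable
compress                               // shrink each variable to its smallest type
save mydata_small, replace

-encode- attaches the original strings as value labels, so no information is lost, and -compress- will then shrink the new variable down to a byte.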

Using the above tips can often reduce your dataset's size several-fold. If that's not enough you could also consider:
- analysing a subset of your data - do you really need all 30 million observations? (see the sketch below)
- adding more RAM to your computer - but you'll probably need to be running a 64-bit operating system to use it
- using software other than Stata - SAS is arguably the best alternative for analysing a massive dataset. SAS will also be slow, but unlike Stata it won't grind to a halt with a huge dataset.
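
For the subsetting option, two possibilities, where mydata.dta and the year variable are placeholders:

use if year >= 2000 using mydata, clear  // read only the observations you need
* or, once a subset fits in memory, draw a random sample:
sample 10                                // keep a random 10% of observations

Filtering inside -use- is handy because the excluded observations never have to sit in memory in the first place.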
 
#3
Hello,

I am busy with my master's thesis, and at the moment I am gathering the data I need from CRSP and the Compustat/CRSP merged database. I am using data from January 1973 to December 2013. I would like to merge those datasets, but I am having problems doing so.

The CRSP file is around 700MB and the Compustat-CRSP merged file around 350MB.

When I try to merge those datasets, Stata keeps saying:

op. sys. refuses to provide memory
Stata's data-storage memory manager has already allocated 1664m bytes and it just attempted to allocate another
16m bytes. The operating system said no. Perhaps you are running another memory-consuming task and the command
will work later when the task completes. Perhaps you are on a multiuser system that is especially busy and the
command will work later when activity quiets down. Perhaps a system administrator has put a limit on what you
can allocate; see help memory. Or perhaps that's all the memory your computer can allocate to Stata.
r(909);

I am doing this on my laptop. Maybe my laptop is too slow or something, so I tried the computers at the university too, but the same message pops up. I tried to downsize the dataset a bit by removing the SIC codes that I don't need and by aligning most of the variable names, as mentioned by bukharin in the post above.

Maybe I should first convert the monthly CRSP observations to quarterly Compustat-CRSP observations and remove some of the monthly observations so the data takes up less space? Something like the sketch below is what I have in mind. Yanisomari (the starter of this thread) is talking about 30GB, and I am only talking about approximately 1050MB. With a smaller dataset (2012 to 2013) the merge does work.
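
A rough sketch of that monthly-to-quarterly conversion, where crsp_monthly, ccm_quarterly, permno, mdate (a monthly %tm date), and ret are placeholder names, and averaging returns stands in for whatever aggregation is actually appropriate:

use crsp_monthly, clear
gen qdate = qofd(dofm(mdate))          // convert monthly date to quarterly date
format qdate %tq
collapse (mean) ret, by(permno qdate)  // one observation per firm-quarter
save crsp_quarterly, replace
merge 1:1 permno qdate using ccm_quarterly  // merge on the firm-quarter key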

I just do not know what to do.

Kind regards,

Robert