"Which file should be deleted?" using statistical theory


Let's say I'm a poor disk provider. I allow people to upload files to my computer
and access them remotely. Since I'm poor, I don't have the budget to upgrade my disk.

So I set up a database that records the number of accesses and the last access date for every file.

One day, my disk runs out of space, and I have no option but to delete users' files.
Which file should I delete so that I "most likely" won't be blamed?

What statistical model should I base my decision on?
Do I need to record more information, such as the date of every access to each file?

It's reasonable to assume that the number of times a file is accessed in a given interval of time (day, week, whatever) follows the Poisson distribution. http://en.wikipedia.org/wiki/Poisson_distribution

The distribution is defined by a single parameter, the expected rate lambda. The maximum likelihood estimate (MLE) of lambda is the average observed rate. So, for each file, you can keep two counters:

num = total number of times the file has been accessed
denom = number of days that the file has been in existence

At the end of each day, increase num by the number of times the file was accessed that day and increase denom by 1. Estimated lambda = num / denom.

The simplest thing to do is to delete the file with the lowest lambda.
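The two counters and the "lowest lambda" rule can be sketched in Python roughly as follows. This is only a minimal illustration: the names FileStats, end_of_day and pick_victim are made up for the example, and treating a file with no recorded days as having an infinite rate (so it is never picked) is an assumption, not part of the scheme above.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Per-file counters (hypothetical structure for illustration)."""
    name: str
    num: int = 0    # total number of times the file has been accessed
    denom: int = 0  # number of days the file has been in existence

    def end_of_day(self, accesses_today: int) -> None:
        # Run once per day: add today's accesses and count one more day.
        self.num += accesses_today
        self.denom += 1

    @property
    def rate(self) -> float:
        # Estimated lambda = num / denom (accesses per day).
        # Assumption: a file with no full day recorded is never picked.
        return self.num / self.denom if self.denom else float("inf")


def pick_victim(files):
    """Return the file with the lowest estimated lambda."""
    return min(files, key=lambda f: f.rate)


busy = FileStats("busy.dat")
idle = FileStats("idle.dat")
for a, b in [(3, 0), (0, 1), (2, 0)]:
    busy.end_of_day(a)
    idle.end_of_day(b)

victim = pick_victim([busy, idle])
print(victim.name, victim.rate)
```

Here busy.dat ends up with 5 accesses over 3 days and idle.dat with 1 access over 3 days, so idle.dat is the one selected.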

During the first days of a file's existence, the lambda estimate is based on very few observations and will not be accurate. See the Wikipedia article for confidence interval estimation. So you should not delete any file that is younger than some number of days (say, 30 or 50).
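To see why young files are a problem, here is a rough sketch that computes a one-sided upper confidence bound on lambda from the two counters, plus a minimum-age filter. It finds the bound by bisection on the Poisson CDF using only the standard library. The function names, the 95% level and the 30-day cutoff are choices made for the example; the Wikipedia article gives an equivalent closed form in terms of chi-squared quantiles.

```python
import math


def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu), summed in log space for stability."""
    if mu <= 0:
        return 1.0
    return sum(
        math.exp(i * math.log(mu) - mu - math.lgamma(i + 1))
        for i in range(k + 1)
    )


def rate_upper_bound(num, denom, alpha=0.05):
    """One-sided (1 - alpha) upper bound on lambda, in accesses per day.

    Finds the total mean mu at which seeing `num` or fewer accesses has
    probability alpha, then divides by the number of days observed.
    """
    hi = 1.0
    while poisson_cdf(num, hi) > alpha:  # grow until the root is bracketed
        hi *= 2
    lo = 0.0
    for _ in range(60):                  # bisection; the CDF decreases in mu
        mid = (lo + hi) / 2
        if poisson_cdf(num, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi / denom


def old_enough(files, min_age_days=30):
    """Keep only (name, num, denom) records at least min_age_days old."""
    return [f for f in files if f[2] >= min_age_days]


# A file never accessed in 3 days vs. one never accessed in 30 days:
print(rate_upper_bound(0, 3), rate_upper_bound(0, 30))
print(old_enough([("new.txt", 0, 3), ("old.txt", 0, 30)]))
```

Both files have an estimated lambda of 0, but the bound for the 3-day-old file is ten times wider than for the 30-day-old one, which is the reason for not deleting very young files.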


I don't think how often a file is read is the only thing that needs to be taken into account.

Consider these two files:

Now... forgetting for a moment how stupid it would be to keep a masterpassword list as a txt file openly accessible to anybody who wanted to read it... I could see the masterpassword file having a much higher importance than the throwaway file, even though the masterpassword list is probably never accessed: the one time you wanted to access it would be quite important. So I think some sort of importance would need to be associated with the files as well.


Another thing to take into account is the priority assigned to the user. For example (a business example, although not one I know first-hand), a file belonging to a senior executive would be protected from deletion ahead of a file belonging to someone with less seniority in the system, regardless of the number of times the file was accessed, etc. The same is true of an emergency provider.
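One way to read "regardless of the number of times" is as a lexicographic ordering: compare by user priority first, then by file importance, and only use the access rate to break ties. The sketch below assumes that ordering. The fields owner_priority and importance, and the idea that they are small integers where higher means more protected, are hypothetical and would have to be recorded somewhere in the database.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    owner_priority: int  # hypothetical: higher = more senior or critical user
    importance: int      # hypothetical: higher = more important file
    rate: float          # estimated accesses per day (num / denom)


def pick_victim(files):
    """Lowest priority first, then lowest importance, then lowest rate."""
    return min(files, key=lambda f: (f.owner_priority, f.importance, f.rate))


files = [
    Candidate("exec_report.doc", owner_priority=2, importance=0, rate=0.0),
    Candidate("masterpasswords.txt", owner_priority=0, importance=5, rate=0.0),
    Candidate("throwaway.txt", owner_priority=0, importance=0, rate=4.0),
]
print(pick_victim(files).name)
```

With these made-up values the throwaway file is chosen even though it is read far more often than the other two, because priority and importance are compared before the rate.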