# "which file should be deleted ?" using statistics theory

#### statbz

##### New Member
Hi,

I have a lot of files on my computer. Since I'm poor, I don't have the budget to upgrade my disk.

Thus, I set up a database that records the number of accesses and the last access date for every file.

One day my disk is running out of space, and I have no option but to delete some of the users' files.
Which file should I delete so that I "most likely" won't be blamed?

What statistical model should I use in order to make my decision?
Do I need to record more information, such as the date of every access to each file?

thanks!

#### jessica01

##### New Member
It's reasonable to assume that the number of times a file is accessed in a given interval of time (day, week, whatever) follows the Poisson distribution. http://en.wikipedia.org/wiki/Poisson_distribution

The distribution is defined by a single parameter, the expected rate lambda. The maximum likelihood estimator of lambda is the average rate. So, for each file, you can keep two counters:

num = total number of times the file has been accessed
denom = number of days that the file has been in existence

At the end of each day, increase num by the number of times the file was accessed that day and increase denom by 1. Estimated lambda = num / denom.

The simplest thing to do is to delete the file with the lowest lambda.

For the first however many days of a file's existence, the lambda estimate will be noisy. See the Wikipedia article for confidence interval estimation. So you should not delete any files that are younger than some number of days (say, 30 or 50).
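The counters and decision rule described above can be sketched in a few lines of Python. This is a hypothetical illustration, not part of the original post; the class and function names (`FileStats`, `pick_file_to_delete`) and the 30-day minimum age are assumptions chosen to match the description:

```python
from dataclasses import dataclass

@dataclass
class FileStats:
    """Per-file counters for estimating the Poisson access rate."""
    name: str
    num: int = 0    # total number of times the file has been accessed
    denom: int = 0  # number of days the file has been in existence

    def end_of_day(self, accesses_today: int) -> None:
        # Run once per day: add today's access count and age the file by one day.
        self.num += accesses_today
        self.denom += 1

    @property
    def lam(self) -> float:
        # MLE of the Poisson rate lambda: average accesses per day.
        # A brand-new file (denom == 0) gets +inf so it is never preferred.
        return self.num / self.denom if self.denom else float("inf")

def pick_file_to_delete(files, min_age_days: int = 30):
    """Return the file with the lowest estimated lambda, considering only
    files old enough for the estimate to be trusted; None if no candidate."""
    eligible = [f for f in files if f.denom >= min_age_days]
    return min(eligible, key=lambda f: f.lam) if eligible else None
```

For example, a file accessed twice a day for 40 days ends up with lambda = 2.0, while an untouched 40-day-old file has lambda = 0.0 and becomes the deletion candidate.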

#### Dason

I don't think how often a file is read is the only thing that needs to be taken into account.

Consider these two files:
ThrowAwayFileUsedForJottingDownQuickThoughtsButIsn'tVeryImportant.txt