I'm working on a small project to tell whether a "random" number was generated by a human or by a machine.
The method I'm using is to count repeated digit sequences, with the sequence lengths I check depending on the length of the number entered.
The sequence lengths I check are 2, 3, 4, 5, 6 and 7 digits. Beyond that it gets pretty ridiculous.
What I've come up with so far:
Say the number entered is 100 digits long, so Len = 100.
When counting 2-digit duplicates, the possible values are [00 ~ 99], which is 100 items.
A number of length 100 contains only 99 overlapping 2-digit sequences.
What we need to do now is find the duplicates among these 99 sequences by comparing each one against all the others.
To do that, we have (99 * (99 - 1)) / 2 = 4851 unique pairs to compare.
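To make the counting step concrete, here is a minimal Python sketch of how I understand it. The function name count_matching_pairs is made up for illustration, and it assumes the sequences are taken with a sliding window (which is what gives 99 two-digit sequences from 100 digits). Instead of comparing all 4851 pairs one by one, it groups equal sequences and counts the pairs within each group, which gives the same total.

```python
from collections import Counter


def count_matching_pairs(digits: str, k: int) -> int:
    """Count pairs of positions whose k-digit windows are identical.

    Assumes overlapping (sliding) windows, so a string of length n
    yields n - k + 1 windows.
    """
    windows = [digits[i:i + k] for i in range(len(digits) - k + 1)]
    counts = Counter(windows)
    # A window value seen c times contributes c * (c - 1) / 2 matching pairs.
    return sum(c * (c - 1) // 2 for c in counts.values())
```

For example, "1212" has the windows 12, 21, 12, so there is one matching pair for k = 2.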
Now, I did find a formula that gives the expected (average) number of matches for any given Length and Items. For this example we need:
Length = 100
Number of Digits (Digits) = 2
Items = 10^Digits = 10^2 = 100
Edit length (E.L) = Length - (Digits - 1) = 100 - 1 = 99
Formula is: (E.L * (E.L - 1)) / (2 * Items) = (99 * 98) / (2 * 100) = 48.51
This means that in any 100 truly random digits, there should be on average 48.51 matching pairs of 2-digit sequences.
If we apply this to 3 digits, the formula gives
(98 * 97) / (2 * 1000) = 4.753 matches,
and for 4 digits it gives (97 * 96) / (2 * 10000) = 0.4656 matches.
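The formula is easy to check in code. This is only a sketch; the function name expected_matches and its parameter names are mine, not from any library.

```python
def expected_matches(length: int, digits: int) -> float:
    """Expected number of matching pairs of `digits`-long sequences
    in `length` uniformly random decimal digits."""
    edit_length = length - (digits - 1)   # number of overlapping sequences
    items = 10 ** digits                  # number of possible sequences
    return edit_length * (edit_length - 1) / (2 * items)


if __name__ == "__main__":
    for d in (2, 3, 4, 5):
        print(d, expected_matches(100, d))
```

For Length = 100 this reproduces the averages quoted in this post: 48.51, 4.753, 0.4656 and 0.0456 for 2, 3, 4 and 5 digits.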
I've run multiple simulations and they agree with those numbers. However, I'm trying to work out the maximum values a simulation can plausibly give, and I'm not that good with statistics.
So with a sample size of 100 I got the following for 2 digits:
Mean: 48.63
Std. Dev: 6.43
Mean ± 1 std. dev. range: [42.2 - 55.1]
Sample Min: 36
Sample Max: 67
Now obviously I got the min and max from the simulation, and 100 samples isn't enough. Is there a formula that will give me a reasonable range to expect?
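For reference, this is roughly the kind of simulation I mean, as a minimal Python sketch rather than my exact code. The names simulate and empirical_range, the seed and the 0.5% tail cut-off are arbitrary choices for illustration. Instead of the raw min and max it reads off empirical percentiles, which should be less sensitive to how many samples are run, but it is still a simulated range and not the closed formula I'm asking about.

```python
import random
from collections import Counter
from statistics import mean, stdev


def count_matching_pairs(digits: str, k: int) -> int:
    """Count pairs of identical overlapping k-digit windows."""
    counts = Counter(digits[i:i + k] for i in range(len(digits) - k + 1))
    return sum(c * (c - 1) // 2 for c in counts.values())


def simulate(length: int = 100, k: int = 2, trials: int = 10_000, seed: int = 0):
    """Return the sorted match counts of `trials` random digit strings."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    results = []
    for _ in range(trials):
        digits = "".join(rng.choices("0123456789", k=length))
        results.append(count_matching_pairs(digits, k))
    return sorted(results)


def empirical_range(sorted_results, tail: float = 0.005):
    """Cut off a fraction `tail` of the simulated counts at each end."""
    n = len(sorted_results)
    low = sorted_results[int(tail * n)]
    high = sorted_results[min(n - 1, int((1 - tail) * n))]
    return low, high


if __name__ == "__main__":
    results = simulate()
    print("mean:", mean(results), "std dev:", stdev(results))
    print("central 99% range:", empirical_range(results))
```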
When I asked random people to write down a random 100-digit number, I got:
- 85 matches for the 2-digit check vs 48.51 average
- 20 matches for the 3-digit check vs 4.75 average
- 3 matches for the 4-digit check vs 0.47 average
- 1 match for the 5-digit check vs 0.0456 average
Statistically speaking, these results would be very rare for truly random digits, so I can conclude that a human entered them.
What I'm stuck on is figuring out the maximums for the truly random case, and as an additional step I also need to calculate a score for how far off a sample is from what random numbers give.
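To show what I mean by a score, here are two candidates as a rough Python sketch. I'm not sure either is the right choice. The first is a plain z-score using the mean and standard deviation from above; it assumes the match count is roughly normal, which looks doubtful for the longer sequences where the average is well below 1. The second is the fraction of simulated counts that are at least as large as the observed one. Both function names are made up for illustration.

```python
def z_score(observed: float, expected: float, std_dev: float) -> float:
    """Number of standard deviations the observed count is above the expected one."""
    return (observed - expected) / std_dev


def empirical_p_value(observed: int, simulated_counts) -> float:
    """Fraction of simulated counts that are >= the observed count.

    The +1 in numerator and denominator keeps the estimate from being
    exactly zero when no simulated count reaches the observed value.
    """
    hits = sum(1 for c in simulated_counts if c >= observed)
    return (hits + 1) / (len(simulated_counts) + 1)


if __name__ == "__main__":
    # 2-digit check: 85 observed vs 48.51 expected, simulated std dev 6.43
    print(z_score(85, 48.51, 6.43))
```

With the 2-digit numbers from this post, the z-score comes out a bit under 5.7 standard deviations above the average.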