Hi everyone
My statistics skills have gotten a bit rusty, so please forgive me if I do not formulate everything properly.
I am currently facing the following problem: I am working on a central data base for a number of 10 shops. Each shop has 200 clients records.
These 2000 (200 records * 10 shops) client records do belong to 1000 unique clients, implying an average number of records per client of 2. I do know that every shop only has unique client records, so there is no duplication within the 200 records that each shop has.
We will now go and onboard one shop after another to the central data base and I am trying to show how the probability of "drawing" a duplicate changes over time.
My first attempt looks as follows, but I'm not sure whether the math on it is correct:
1st shop onboarding:
Probability of the records being new to data base: 100%
Probability of the records being a duplicate: 0%
--> unique records on average: 200
--> duplicate records on average: 0
2nd shop onboarding:
Probability of the records being new to data base: (1000-200)/1000=80%
Probability of the records being a duplicate: 1-80%=20%
--> unique records on average: 200+160 = 360
--> duplicate records on average: 0 + 40 = 40
Variance = (20% * 80% )/1000 = .00016
3nd shop onboarding:
Probability of the records being new to data base: (1000-360)/1000=64%
Probability of the records being a duplicate: 1-64%=36%
--> unique records on average: 200+160+128= 488
--> duplicate records on average: 0 + 40 + 72= 112
Variance = (64% * 36% )/1000 = .00023
and so on...
I am sure this isn't exactly a tricky problem. However, I would highly appreciate if someone could have a quick look and let me know whether I'm on the right track with this or if there is any major mistakes in my calculations.
Many thanks in advance.
Best
Handrix
My statistics skills have gotten a bit rusty, so please forgive me if I do not formulate everything properly.
I am currently facing the following problem: I am working on a central data base for a number of 10 shops. Each shop has 200 clients records.
These 2000 (200 records * 10 shops) client records do belong to 1000 unique clients, implying an average number of records per client of 2. I do know that every shop only has unique client records, so there is no duplication within the 200 records that each shop has.
We will now go and onboard one shop after another to the central data base and I am trying to show how the probability of "drawing" a duplicate changes over time.
My first attempt looks as follows, but I'm not sure whether the math on it is correct:
1st shop onboarding:
Probability of the records being new to data base: 100%
Probability of the records being a duplicate: 0%
--> unique records on average: 200
--> duplicate records on average: 0
2nd shop onboarding:
Probability of the records being new to data base: (1000-200)/1000=80%
Probability of the records being a duplicate: 1-80%=20%
--> unique records on average: 200+160 = 360
--> duplicate records on average: 0 + 40 = 40
Variance = (20% * 80% )/1000 = .00016
3nd shop onboarding:
Probability of the records being new to data base: (1000-360)/1000=64%
Probability of the records being a duplicate: 1-64%=36%
--> unique records on average: 200+160+128= 488
--> duplicate records on average: 0 + 40 + 72= 112
Variance = (64% * 36% )/1000 = .00023
and so on...
I am sure this isn't exactly a tricky problem. However, I would highly appreciate if someone could have a quick look and let me know whether I'm on the right track with this or if there is any major mistakes in my calculations.
Many thanks in advance.
Best
Handrix