Random Sampling - Missing Something Easy and I know it

#1
Hello everyone, I was asked to look over some data that came across my desk and I was stumped as to why it came out the way that it did.

So the dev team was sampling data from two sources and came across increasing percentages in overlapping data. The objective was to find overlapping followers on a social media platform between unrelated entities. The full follower list was pulled for each of Entities A, B, C, and D, then random samples were drawn from each list and compared pairwise for overlap. There was some panic when the results came in, and I am hoping you guys can offer some insight.

What I have struggled to wrap my head around is the steady increase in percent of overlap when going up from 12.5 to 25 to 50 percent. I'm sure there is something I am missing here but I thought I would come ask because I think there should be a reason for this besides doubling the random sample size of both groups. Data is below and please let me know if you have questions, I'm just looking for a reason for the overlap and steady, predictable increase.

+ ----- + -------------------------------------------- + -------------- + -------------------------------------------------------------- +
| 12.5% | Entity A (884549), Entity B (293088) | Count: 4966 | Percent of Entity A followers that follow Entity B 0.57% |
| 25% | Entity A (1769098), Entity B (586176) | Count: 19995 | Percent of Entity A followers that follow Entity B 1.14% |
| 50% | Entity A (3538196), Entity B (1172352) | Count: 80623 | Percent of Entity A followers that follow Entity B 2.28% |
| 100% | Entity A (7076391), Entity B (2344704) | Count: 321584 | Percent of Entity A followers that follow Entity B 4.55% |
+ ----- + -------------------------------------------- + -------------- + -------------------------------------------------------------- +
| 12.5% | Entity B (293088), Entity C (4590578) | Count: 23558 | Percent of Entity B followers that follow Entity C 8.04% |
| 25% | Entity B (586176), Entity C (9181155) | Count: 94900 | Percent of Entity B followers that follow Entity C 16.19% |
| 50% | Entity B (1172352), Entity C (18362309) | Count: 378686 | Percent of Entity B followers that follow Entity C 32.31% |
| 100% | Entity B (2344704), Entity C (36724618) | Count: 1509755 | Percent of Entity B followers that follow Entity C 64.4% |
+ ----- + -------------------------------------------- + -------------- + -------------------------------------------------------------- +
| 12.5% | Entity A (884549), Entity C (4590578) | Count: 33842 | Percent of Entity A followers that follow Entity C 3.83% |
| 25% | Entity A (1769098), Entity C (9181155) | Count: 135480 | Percent of Entity A followers that follow Entity C 7.66% |
| 50% | Entity A (3538196), Entity C (18362309) | Count: 540018 | Percent of Entity A followers that follow Entity C 15.27% |
| 100% | Entity A (7076391), Entity C (36724618) | Count: 2161725 | Percent of Entity A followers that follow Entity C 30.55% |
+ ----- + -------------------------------------------- + -------------- + -------------------------------------------------------------- +
| 12.5% | Entity D (225004), Entity A (884549) | Count: 2094 | Percent of Entity D followers that follow Entity A 0.94% |
| 25% | Entity D (450007), Entity A (1769098) | Count: 8094 | Percent of Entity D followers that follow Entity A 1.8% |
| 50% | Entity D (900014), Entity A (3538196) | Count: 32388 | Percent of Entity D followers that follow Entity A 3.6% |
| 100% | Entity D (1800027), Entity A (7076391) | Count: 129686 | Percent of Entity D followers that follow Entity A 7.21% |
+ ----- + -------------------------------------------- + -------------- + -------------------------------------------------------------- +
| 12.5% | Entity D (225004), Entity C (4590578) | Count: 15801 | Percent of Entity D followers that follow Entity C 7.03% |
| 25% | Entity D (450007), Entity C (9181155) | Count: 62519 | Percent of Entity D followers that follow Entity C 13.9% |
| 50% | Entity D (900014), Entity C (18362309) | Count: 251367 | Percent of Entity D followers that follow Entity C 27.93% |
| 100% | Entity D (1800027), Entity C (36724618) | Count: 1004919 | Percent of Entity D followers that follow Entity C 55.83% |
+ ----- + -------------------------------------------- + -------------- + -------------------------------------------------------------- +
| 12.5% | Entity D (225004), Entity B (293088) | Count: 688 | Percent of Entity D followers that follow Entity B 0.31% |
| 25% | Entity D (450007), Entity B (586176) | Count: 2852 | Percent of Entity D followers that follow Entity B 0.64% |
| 50% | Entity D (900014), Entity B (1172352) | Count: 11177 | Percent of Entity D followers that follow Entity B 1.25% |
| 100% | Entity D (1800027), Entity B (2344704) | Count: 44611 | Percent of Entity D followers that follow Entity B 2.48% |
+ ----- + -------------------------------------------- + -------------- + -------------------------------------------------------------- +
 
#4
Ah, sorry about that.

1) percent - the percent on the left hand side is the random sample drawn from each group, i.e. 12.5% from Entity A followers & 12.5% from Entity B followers.

2) overlap - the overlap is the percentage of followers from the first Entity that also follow the second Entity. For example, if I follow Dason on Instagram and I also follow noetsi on Instagram, that is the overlap we are looking for.
 

Dason

Ambassador to the humans
#5
Can you explain the sampling process a bit more? And are you calculating overlap solely based on the samples?

I think I see what is going on, and if I understand correctly then what you're seeing makes perfect sense.
 
#6
The sampling is done by taking a random selection from each list that amounts to the 12.5, 25, 50 percent numbers that you see. And yes, the overlap is computed only from the samples: followers in the Entity A sample are matched against the Entity B sample, with each sample drawn independently from its own group.

To clarify - 50% amounts to half of Entity A's followers and half of Entity B's followers.
 

Dason

Ambassador to the humans
#7
Ok. Yeah, it was what I thought. So what you're seeing makes perfect sense. So I guess my question is: what are you looking to do? Estimate things like the percent of A's followers that also follow B, but without collecting ALL the data? Because we can derive estimators based on the samples you have, but it isn't as simple as using the percent overlap observed in the sample (as you noticed).
 

Dason

Ambassador to the humans
#8
Actually thinking about it briefly a simple correction is just to divide by the proportion that you sampled. For example in that first line of data take .0057/.125 to get .0456 which is pretty close to your result when taking all the data.
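
To make that arithmetic concrete, here is a minimal sketch that applies the same division to all four Entity A / Entity B rows from the table in post #1. The function name `corrected_overlap` and the dictionary layout are just illustrative choices, and the correction assumes the samples really are random.

```python
# Observed in-sample overlap (as fractions) for Entity A vs. Entity B,
# copied from the table in post #1 and keyed by the sampled fraction.
observed = {0.125: 0.0057, 0.25: 0.0114, 0.50: 0.0228, 1.0: 0.0455}


def corrected_overlap(sample_overlap, sampled_fraction):
    """Scale the in-sample overlap up by the fraction that was sampled.

    Illustrative helper only; it assumes both samples are random and
    drawn at the same rate.
    """
    return sample_overlap / sampled_fraction


estimates = {p: corrected_overlap(o, p) for p, o in observed.items()}

for p, est in estimates.items():
    print(f"sampled {p:.1%}: corrected overlap = {est:.2%}")
```

Every corrected value lands close to the 4.55% seen with the full data, which is what you would hope for if the correction is reasonable.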

There is probably a slightly better approach but that should work. I'll write more tomorrow if I remember...
 
#9
Yes we are trying to estimate the overlapping followers from different groups. I want to make sure I can explain why it is happening but for some reason this just isn't meshing with me.
 
#10
The problem that our dev team is running into is that they would like to not have to pull in 100% of either Entity's followers. I am not sure what a reasonable number of followers would be for us to pull in, in order to estimate the overlap of followers from one entity to another. A primary problem is that when we send an API request for 50% of an Entity's followers, the API returns them in a specific order (I don't want to get into that), so that 50% is not representative of the whole group and is not a truly random sample. This is why we have started pulling 100% of the followers. I am thinking this will be our only option, but I am open to any help or advice.
 

Dason

Ambassador to the humans
#11
Well, if the sample isn't random then all bets are off. With that said, the simple estimator I mentioned seems to fit decently enough on your data. Do you know how the API chooses the order in which it returns the results?
 

Dason

Ambassador to the humans
#12
To understand why you see what you see I think it helps to look at a simple example. Let's pretend Entity 1 has four followers and Entity 2 has four followers.

Entity 1's followers: A, B, C, D
Entity 2's followers: A, B, E, F

Looking at the percent of Entity 1's followers that also follow Entity 2 we see that A and B both do while C and D don't so there is a 50% overlap.

Let's look at the case where we sample 25% from each entity. These are the possible samples we could choose. Since the samples will just be a single follower from each entity the format will be (entity 1 follower), (entity 2 follower)

A,A
A,B
A,E
A,F
B,A
B,B
B,E
B,F
C,A
C,B
C,E
C,F
D,A
D,B
D,E
D,F

Notice that there are 16 possible samples and only in 2 of them (A,A and B,B) did we get a match. That means the probability of observing any overlap is 2/16. The actual result would be either 0% or 100%, but our "expected overlap" would be 12.5% (2/16), which is the true 50% overlap multiplied by the 25% sampling rate. The reason is that, regardless of which of entity 1's followers end up in the sample, by only sampling 25% of entity 2's followers we *expect* to only get about... well... 25% of their followers. Imagine that your sample of entity 1 contained *all* of the people that also follow entity 2: you still might not identify them as entity 2 followers *if* they weren't chosen in your entity 2 sample.
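
If you want to check the enumeration by machine, here is a small sketch that lists the same 16 pairs and counts the matches. The variable names are made up for illustration; the follower lists are the ones from the toy example above.

```python
from fractions import Fraction
from itertools import product

entity1 = ["A", "B", "C", "D"]
entity2 = ["A", "B", "E", "F"]

# True overlap: share of entity 1's followers who also follow entity 2.
true_overlap = Fraction(len(set(entity1) & set(entity2)), len(entity1))

# A 25% sample of a four-follower list is a single follower, so each
# possible pair of samples is one (entity 1 follower, entity 2 follower).
pairs = list(product(entity1, entity2))

# A sample pair shows overlap only when both draws are the same person.
matches = [(a, b) for a, b in pairs if a == b]

# Each pair is equally likely, so the expected in-sample overlap is the
# share of pairs that match.
expected_overlap = Fraction(len(matches), len(pairs))

print("possible samples:", len(pairs))
print("matching samples:", matches)
print("expected overlap:", expected_overlap)
print("true overlap:    ", true_overlap)
```

The expected overlap comes out as the true overlap times the 25% sampling rate, which is the shrinkage seen in the real data.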

Note that even though these samples give sample overlap proportions (0% and 100%) far away from both the true value (50%) and the expected value (12.5%), as the sample sizes increase the sample proportions should get closer to the *expected values* (which, as we saw, underestimate the true overlap).
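
A quick simulation can illustrate that point with bigger lists. This is only a sketch under made-up assumptions: the follower IDs, list sizes, the 50% true overlap, and the helper name `sample_overlap` are all hypothetical, and the samples are drawn truly at random (which may not match what the API gives you).

```python
import random


def sample_overlap(followers1, followers2, fraction, rng):
    """In-sample overlap: share of the entity 1 sample that also shows up
    in the entity 2 sample, with both samples drawn independently."""
    s1 = rng.sample(followers1, int(len(followers1) * fraction))
    s2 = set(rng.sample(followers2, int(len(followers2) * fraction)))
    return sum(1 for f in s1 if f in s2) / len(s1)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical follower IDs: entity 1 has 0..19999, entity 2 has
# 10000..29999, so half of entity 1's followers also follow entity 2.
followers1 = list(range(0, 20000))
followers2 = list(range(10000, 30000))
true_overlap = 0.5

results = {}
for fraction in (0.125, 0.25, 0.5, 1.0):
    results[fraction] = sample_overlap(followers1, followers2, fraction, rng)
    print(
        f"sampled {fraction:.1%}: observed {results[fraction]:.2%}, "
        f"expected about {fraction * true_overlap:.2%}"
    )
```

With lists this size the observed overlap should sit close to (sampled fraction) x (true overlap) at each level, roughly doubling each time the sampling rate doubles, just like in the table.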

The reason the estimator I specified has a chance of working (when the sample is random, although it looks ok for your use case) is that the method you're using to sample really gets you an estimate of the percent overlap with a certain proportion of the second entity's followers. It's just like if I wanted to know how many free throws you could make when you take 10 from one side of the court and 10 from the other: if I only let you take the 10 shots from one side, you might assume that you would make the same number of shots on the other side, and your estimate for the total would just be (shots made on the first side)*2.
 