
Thread: Random Sampling - Missing Something Easy and I know it

  1. #1

    Hello everyone, I was asked to look over some data that came across my desk and I was stumped as to why it came out the way that it did.

    So the dev team was sampling data from two sources and came across increasing percentages in overlapping data. The objective was to find overlapping followers on a social media platform for two unrelated entities. The entire group of followers was taken and then samples were taken from Entities A, B, C, D and compared for overlap. There was some panic when the results came about and I am hoping you guys can offer some insight.

    What I have struggled to wrap my head around is the steady increase in the percent of overlap when going up from 12.5 to 25 to 50 percent samples. I'm sure there is something I am missing here, but I thought I would ask because I think there should be a reason for this beyond doubling the random sample size of both groups. The data is below; please let me know if you have questions. I'm just looking for a reason for the overlap and its steady, predictable increase.

    [QUOTE]+ ----- + -------------------------------------------- + -------------- + -------------------------------------------------------------- +
    | 12.5% | Entity A (884549), Entity B (293088) | Count: 4966 | Percent of Entity A followers that follow Entity B 0.57% |
    | 25% | Entity A (1769098), Entity B (586176) | Count: 19995 | Percent of Entity A followers that follow Entity B 1.14% |
    | 50% | Entity A (3538196), Entity B (1172352) | Count: 80623 | Percent of Entity A followers that follow Entity B 2.28% |
    | 100% | Entity A (7076391), Entity B (2344704) | Count: 321584 | Percent of Entity A followers that follow Entity B 4.55% |
    + ----- + -------------------------------------------- + -------------- + -------------------------------------------------------------- +
    | 12.5% | Entity B (293088), Entity C (4590578) | Count: 23558 | Percent of Entity B followers that follow Entity C 8.04% |
    | 25% | Entity B (586176), Entity C (9181155) | Count: 94900 | Percent of Entity B followers that follow Entity C 16.19% |
    | 50% | Entity B (1172352), Entity C (18362309) | Count: 378686 | Percent of Entity B followers that follow Entity C 32.31% |
    | 100% | Entity B (2344704), Entity C (36724618) | Count: 1509755 | Percent of Entity B followers that follow Entity C 64.4% |
    + ----- + -------------------------------------------- + -------------- + -------------------------------------------------------------- +
    | 12.5% | Entity A (884549), Entity C (4590578) | Count: 33842 | Percent of Entity A followers that follow Entity C 3.83% |
    | 25% | Entity A (1769098), Entity C (9181155) | Count: 135480 | Percent of Entity A followers that follow Entity C 7.66% |
    | 50% | Entity A (3538196), Entity C (18362309) | Count: 540018 | Percent of Entity A followers that follow Entity C 15.27% |
    | 100% | Entity A (7076391), Entity C (36724618) | Count: 2161725 | Percent of Entity A followers that follow Entity C 30.55% |
    + ----- + -------------------------------------------- + -------------- + -------------------------------------------------------------- +
    | 12.5% | Entity D (225004), Entity A (884549) | Count: 2094 | Percent of Entity D followers that follow Entity A 0.94% |
    | 25% | Entity D (450007), Entity A (1769098) | Count: 8094 | Percent of Entity D followers that follow Entity A 1.8% |
    | 50% | Entity D (900014), Entity A (3538196) | Count: 32388 | Percent of Entity D followers that follow Entity A 3.6% |
    | 100% | Entity D (1800027), Entity A (7076391) | Count: 129686 | Percent of Entity D followers that follow Entity A 7.21% |
    + ----- + -------------------------------------------- + -------------- + -------------------------------------------------------------- +
    | 12.5% | Entity D (225004), Entity C (4590578) | Count: 15801 | Percent of Entity D followers that follow Entity C 7.03% |
    | 25% | Entity D (450007), Entity C (9181155) | Count: 62519 | Percent of Entity D followers that follow Entity C 13.9% |
    | 50% | Entity D (900014), Entity C (18362309) | Count: 251367 | Percent of Entity D followers that follow Entity C 27.93% |
    | 100% | Entity D (1800027), Entity C (36724618) | Count: 1004919 | Percent of Entity D followers that follow Entity C 55.83% |
    + ----- + -------------------------------------------- + -------------- + -------------------------------------------------------------- +
    | 12.5% | Entity D (225004), Entity B (293088) | Count: 688 | Percent of Entity D followers that follow Entity B 0.31% |
    | 25% | Entity D (450007), Entity B (586176) | Count: 2852 | Percent of Entity D followers that follow Entity B 0.64% |
    | 50% | Entity D (900014), Entity B (1172352) | Count: 11177 | Percent of Entity D followers that follow Entity B 1.25% |
    | 100% | Entity D (1800027), Entity B (2344704) | Count: 44611 | Percent of Entity D followers that follow Entity B 2.48% |
    + ----- + -------------------------------------------- + -------------- + -------------------------------------------------------------- +[/QUOTE]

  2. #2
    Dason

    Please explain your data. What is the percentage representing?
    I don't have emotions and sometimes that makes me very sad.

  3. #3
    noetsi

    And what do you mean by overlap?
    "Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995

  4. #4

    Ah, sorry about that.

    1) percent - the percentage in the left-hand column is the size of the random sample drawn from each group, e.g. 12.5% of Entity A's followers and 12.5% of Entity B's followers.

    2) overlap - the overlap is the percentage of followers of the first Entity that also follow the second Entity, e.g. I follow Dason on Instagram and I also follow noetsi on Instagram. That overlap is what we are looking for.

  5. #5
    Dason

    Can you explain the sampling process a bit more? And are you calculating overlap solely based on the samples?

    I think I see what is going on, and if I understand correctly then what you're seeing makes perfect sense.

  6. #6

    The sampling is done by taking a random selection from each list that amounts to the 12.5, 25, or 50 percent figures you see. And yes, the overlap is computed only from the samples: we count the followers in the Entity A sample that also appear in the Entity B sample, with the two samples drawn independently from each group.

    To clarify - 50% amounts to half of Entity A's followers and half of Entity B's followers.
    Last edited by txsnowman; 07-24-2017 at 07:42 PM. Reason: more clarification

  7. #7
    Dason

    Ok. Yeah, it was what I thought, so what you're seeing makes perfect sense. So I guess my question is: what are you looking to do? Estimate things like the percent of A's followers that also follow B, but without collecting ALL the data? Because we can derive estimators based on the samples you have, but it isn't as simple as using the percent overlap observed in the sample (as you noticed).

  8. #8
    Dason

    Actually, thinking about it briefly, a simple correction is just to divide by the proportion that you sampled. For example, in the first line of data (the 12.5% samples of A and B), take 0.0057/0.125 to get 0.0456, which is pretty close to the 4.55% you get when taking all the data.

    There is probably a slightly better approach but that should work. I'll write more tomorrow if I remember...
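
    As a rough sketch of this correction in Python (the function name `corrected_overlap` and the row layout are made up for illustration; the numbers are the Entity A vs. Entity B rows from the table in post #1), assuming both samples use the same fraction and are drawn at random:

```python
def corrected_overlap(observed_fraction: float, sample_fraction: float) -> float:
    """Scale the overlap seen between two independent samples up to a
    full-population estimate by dividing by the fraction sampled."""
    if not 0 < sample_fraction <= 1:
        raise ValueError("sample_fraction must be in (0, 1]")
    return observed_fraction / sample_fraction


# (sample fraction, matching count, size of the Entity A sample),
# copied from the Entity A vs. Entity B rows of the posted table.
rows = [
    (0.125, 4966, 884549),
    (0.25, 19995, 1769098),
    (0.5, 80623, 3538196),
    (1.0, 321584, 7076391),
]

estimates = [corrected_overlap(count / n_a, p) for p, count, n_a in rows]

for (p, _, _), estimate in zip(rows, estimates):
    print(f"{p:.1%} sample -> estimated full overlap {estimate:.2%}")
```

    Every row should land near the same full-data figure, which is the sense in which the simple estimator "works" on this data.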

  9. #9

    Yes we are trying to estimate the overlapping followers from different groups. I want to make sure I can explain why it is happening but for some reason this just isn't meshing with me.

  10. #10

    The problem our dev team is running into is that they would like to avoid pulling in 100% of either Entity's followers. I am not sure how many followers we would reasonably need to pull in to estimate the overlap of followers from one entity to another. A primary problem is that when we send an API request for, say, 50% of an Entity's followers, the API returns them in a specific order (I don't want to get into that), so that 50% is not representative of the whole group and is not a truly random sample. This is why we have started pulling 100% of the followers. I am thinking this will be our only option, but I am open to any help or advice.

  11. #11
    Dason

    Well, if the sample isn't random then all bets are off. With that said, the simple estimator I mentioned seems to fit decently enough on your data. Do you know how the API chooses the order in which it returns the results?

  12. #12
    Dason

    To understand why you see what you see I think it helps to look at a simple example. Let's pretend Entity 1 has four followers and Entity 2 has four followers.

    Entity 1's followers: A, B, C, D
    Entity 2's followers: A, B, E, F

    Looking at the percent of Entity 1's followers that also follow Entity 2 we see that A and B both do while C and D don't so there is a 50% overlap.

    Let's look at the case where we sample 25% from each entity. Since each sample will be just a single follower from each entity, the possible samples are listed below in the format (entity 1 follower),(entity 2 follower):

    A,A
    A,B
    A,E
    A,F
    B,A
    B,B
    B,E
    B,F
    C,A
    C,B
    C,E
    C,F
    D,A
    D,B
    D,E
    D,F

    Notice that there are 16 possible samples and only 2 of them (A,A and B,B) contain a match. That means the probability of observing any overlap is 2/16. The actual result in any one sample would be either 0% or 100%, but our "expected overlap" would be 12.5% (2/16), which is the true 50% overlap multiplied by the 25% we sampled. The reason is that, regardless of the sample of entity 1's followers, by only sampling 25% of entity 2's followers we *expect* to only get about... well... 25% of their followers. Imagine that in your sample of entity 1 you got *all* of the people that also followed entity 2 - you still might not identify them as entity 2 followers *if* they weren't chosen in your entity 2 sample.
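
    The enumeration above can be checked mechanically. This is only a small sketch of the toy example (the variable names are made up here), listing every possible pair of one-follower samples and counting the matches:

```python
from itertools import product

entity1 = ["A", "B", "C", "D"]
entity2 = ["A", "B", "E", "F"]

# True overlap: share of entity 1's followers that also follow entity 2.
true_overlap = len(set(entity1) & set(entity2)) / len(entity1)

# A 25% sample from each entity is one follower from each list,
# so every possible pair of samples is one element of the product.
pairs = list(product(entity1, entity2))
matches = [pair for pair in pairs if pair[0] == pair[1]]

# Each pair is equally likely, so the expected sample overlap is the
# share of pairs in which the two sampled followers are the same person.
expected_overlap = len(matches) / len(pairs)

print(f"true overlap:     {true_overlap:.1%}")
print(f"possible samples: {len(pairs)}, with a match: {len(matches)}")
print(f"expected overlap: {expected_overlap:.1%}")
```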

    Note that even though these samples give sample overlap proportions (0% and 100%) far away from both the true value (50%) and the expected value (12.5%), as the sample sizes increase the sample proportions should get closer to the *expected values* (which, as we saw, underestimate the true overlap).
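
    To see this with larger groups, here is a hedged simulation sketch. The follower IDs are invented (two hypothetical lists of 20,000 followers where exactly half of entity 1's followers also follow entity 2), and the helper name `sampled_overlap` is an assumption for illustration, not anything from the original data pipeline:

```python
import random


def sampled_overlap(followers_1, followers_2, fraction, rng):
    """Share of the entity 1 sample that also shows up in the entity 2 sample,
    with both samples drawn independently at the same fraction."""
    sample_1 = rng.sample(followers_1, int(len(followers_1) * fraction))
    sample_2 = set(rng.sample(followers_2, int(len(followers_2) * fraction)))
    return sum(follower in sample_2 for follower in sample_1) / len(sample_1)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical follower IDs: 10000..19999 follow both entities,
# so the true overlap for entity 1 is 50%.
followers_1 = list(range(0, 20_000))
followers_2 = list(range(10_000, 30_000))
true_overlap = 0.5

observed = {}
for fraction in (0.125, 0.25, 0.5):
    observed[fraction] = sampled_overlap(followers_1, followers_2, fraction, rng)
    print(
        f"{fraction:.1%} samples: observed {observed[fraction]:.2%}, "
        f"fraction * true overlap = {fraction * true_overlap:.2%}, "
        f"corrected {observed[fraction] / fraction:.2%}"
    )
```

    With groups this size the observed proportions should sit close to (fraction sampled) x (true overlap) rather than the true overlap itself, the same pattern as in the posted table.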

    The reason the estimator I specified has a chance of working (when the sample is random, although it looks ok for your use case) is that the method you're using to sample really gets you an estimate of the percent overlap with a certain proportion of the second entity's followers. It's like if I wanted to know how many free throws you could make when you take 10 from one side of the court and 10 from the other... if I only let you take 10 shots from one side, you might assume you would make the same number of shots on the other side, and your estimate for the total would just be (shots made on the first side)*2.
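
    The same argument can be written as a short derivation. The symbols are introduced here for illustration and are not from the thread: N_1 is the number of followers of the first entity, K is the number of those who also follow the second entity, and p is the fraction sampled from each entity, with the two samples drawn independently and at random.

```latex
\begin{align*}
\mathbb{E}[\text{matches in the samples}] &= p \cdot p \cdot K = p^{2} K \\
\text{sample overlap} &\approx \frac{p^{2} K}{p\, N_1} = p \cdot \frac{K}{N_1} \\
\frac{\text{sample overlap}}{p} &\approx \frac{K}{N_1}
\end{align*}
```

    A shared follower only counts as a match if they land in both samples, which happens with probability p times p. That is consistent with the table in post #1: each time the sample fraction doubles, the matching count roughly quadruples (4966, 19995, 80623, 321584) while the percent overlap only doubles.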
    Last edited by Dason; 07-25-2017 at 11:16 AM.

  13. The Following User Says Thank You to Dason For This Useful Post:

    txsnowman (07-25-2017)

  14. #13

    Thanks Dason, appreciated.
