Hi everyone,
I was wondering what model to pick when all the predictors are binomial distributions.
I have a website that's being visited through several (mutually exclusive) channels. I know how many people enter a page from a certain channel. Then, they can go on to purchase and payment. However, not everyone makes it to the end; most (95%) exit. I know how many people make a purchase in total but not by channel, and I want to know the contribution of each channel.
The numbers of purchases are too low to use linear regression, so I want to use some count model. The only thing I can find is negative binomial regression, e.g., package pscl in R. However, then I'm modelling the exits instead of the visits, and that seems to be a rather round-about way. This is how I model the exits right now:

Code:
df$abandon <- df$buy_pag_visits - df$buyers
model.nb <- glm.nb(abandon ~ channel1 + channel2 + channel3, link = sqrt)  # sqrt seems to work better than log

When I search for "binomial regression" I get stuff on logistic regression, where the outcome is binary. That's not what I want.
Is there such a thing as "binomial regression" instead of "negative binomial regression"? It seems like such a logical thing but the only thing I see is the negative variant.
By the way, I tried approximating by linear regression. It works okayish, but it has obvious problems with the many zeroes. And some of the coefficients become negative, which is obviously impossible.
Last edited by Junes; 08-02-2017 at 03:54 PM.
Another follow-up question: I have reason to believe that the p of one of the binomial distributions has undergone a sudden shift as a result of my actions. How would I model this? I'm a bit lost here.
A bunch of binary variables predicting purchase yes/no would be logistic regression. You say "count", but each encounter can only have a single outcome, so if the trials are independent you have a series of Bernoulli trials, and thus logistic regression. If the trials are not independent (some of the encounters are the same people at different times), can you track this? If so, you possibly have multilevel logistic regression, with purchase status clustered within person.
If you can convince me this is count data, I can tell you about those options. Though, given your description, I don't think this is the case.
P.S. Can each purchase only have one channel? If so, you may not need to use any sophisticated methods.
Stop cowardice, ban guns!
Do you really mean that the predictors are "binomial", i.e., they follow a binomial distribution and thus are counts from 0 to N? Or do you instead mean that they are "binary" or "categorical" or something else?
If they are truly "binomial" then it's really not clear that anything special needs to be done. If they are categorical then they just need to be appropriately dummy- or contrast-coded.
Obvious problems? Why? What problems? And why do you think the coefficients can't be negative just because the predictors are binomial (or binary or categorical or whatever they are)?
In God we trust. All others must bring data.
~W. Edwards Deming
Thanks for the answers, guys.
Yes, but I don't have 0/1 as my outcome. This is a snippet of my data:
byr = my dependent variable (number of buyers). Furthermore, I have 57% zeroes in my dependent variable; it is very right-skewed.

Code:
date      byr  ch1  ch2  ch3  ch4  ch5  ch6  ch7  ch8  ch9
10-12-16    0   48   13    3    0    7   17    0    2    0
10-13-16    0   50   16    3    0    6   13    0    0    3
10-14-16    0   70   29    2    0    9   23    0    0    1
10-15-16    1   82   44    2    0    6   24    0    2    1
10-16-16   11  136   78    5    0   14   32    0    0    3
10-17-16    4  125   49    1    0    9   24    0    0    3
10-18-16    4  108   52    2    0    9   20    0    0    1
10-19-16    2  124   63    2    0    7   21    0    1    4
10-20-16    5  120   55    6    0   13   26    0    0    4
10-21-16   14  137   66   11    0    3   28    0    1    3
I want to predict buyers by modelling the channels as B(n, p), where the p for each channel is constant over days and to be determined, and n is the number in the table.
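For concreteness, here is a small simulation sketch of exactly this generative model (written in Python rather than R, purely for illustration; the two channels and their rates are made up): daily buyers are a sum of per-channel Binomial(n_i, p_i) draws, with only the daily total observed. Since E[buyers | n] = p1*n1 + p2*n2 with no intercept, least squares through the origin recovers the per-channel p's from aggregate data alone.

```python
import random

random.seed(42)

# Hypothetical example with two channels; true per-channel conversion rates.
p_true = [0.04, 0.10]
days = 2000

N, Y = [], []
for _ in range(days):
    n = [random.randint(20, 150), random.randint(5, 60)]   # daily visits per channel
    # Buyers from channel i ~ Binomial(n_i, p_i); only the daily total is observed.
    total = sum(1 for i in range(2) for _ in range(n[i]) if random.random() < p_true[i])
    N.append(n)
    Y.append(total)

# Least squares through the origin: solve the 2x2 normal equations (X'X) p = X'y.
a = sum(n[0] * n[0] for n in N)
b = sum(n[0] * n[1] for n in N)
c = sum(n[1] * n[1] for n in N)
u = sum(n[0] * y for n, y in zip(N, Y))
v = sum(n[1] * y for n, y in zip(N, Y))
det = a * c - b * b
p_hat = [(c * u - b * v) / det, (a * v - b * u) / det]
print(p_hat)  # estimates land near p_true
```

This is only a sketch of the identification argument, not an endorsement of plain least squares for the real data (the thread's discussion of count models and non-constant variance still applies).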
Originally Posted by hlsmith
"You are saying count, but each encounter can only have a single outcome, so you have a series of Bernoulli trials (if independent), then logistic regression. If the trials are not independent, some of the encounters are for the same people at different times, can you track this, and if so, you possibly have multilevel logistic regression, with purchase status clustered in person."
They are independent, thankfully. Well, probably not 100%, but close enough.

Originally Posted by hlsmith
"P.S., Can each purchase only have one channel, if so, you may not need to use any sophisticated methods."
Yes, pretty much. There is some cross-talk, but not much.

Originally Posted by Jake
"Do you really mean that the predictors are 'binomial', i.e., they follow a binomial distribution and thus are counts from 0 to N? Or do you instead mean that they are 'binary' or 'categorical' or something else?"
Yes, I think I should model them as binomial variables. I'm assuming a fixed probability of each user converting for each channel. I have n users on a given day. That sounds to me like B(n, p), but I'm not 100% sure.

Originally Posted by Jake
"Obvious problems? Why? What problems? And why do you think the coefficients can't be negative just because the predictors are binomial (or binary or categorical or whatever they are)?"
Oh, sorry, I was talking about linear regression there. I did a linear regression and it did reasonably well, but it had trouble with the edge cases. If coefficients are negative, that means conceptually that a channel leads to a loss of sales, which is not possible. So a linear regression can only be a rough approximation, I think.
Last edited by Junes; 08-03-2017 at 09:52 AM.
Perhaps you also need to present a case example for an observation, to help us understand the data-generating process.
Is Y a count variable because you only have aggregated data for the day? So each purchase is actually a single yes/no, but what you have is the daily total? Can you link actual channels to the outcome, or do you only have separate counts for the outcome and the channels?
I still need a little more info, but zero inflated Poisson regression may be an option.
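As a quick sketch of what "zero inflation" means here (Python for illustration; the inflation probability and Poisson mean are made-up numbers, not estimates from this data): with some probability the day produces a structural zero, otherwise an ordinary Poisson count, so the observed zero rate exceeds what a plain Poisson with the same mean component would give.

```python
import math
import random

random.seed(1)

def poisson(lam):
    # Knuth's method: count uniforms multiplied until the product drops below exp(-lam)
    L, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= random.random()
        if prod < L:
            return k
        k += 1

pi, lam, n = 0.5, 2.0, 20000  # hypothetical inflation probability and Poisson mean
draws = [0 if random.random() < pi else poisson(lam) for _ in range(n)]

observed_zero_rate = draws.count(0) / n
plain_poisson_zero = math.exp(-lam)          # P(0) under plain Poisson(lam)
zip_zero = pi + (1 - pi) * math.exp(-lam)    # P(0) under the zero-inflated model
print(observed_zero_rate, plain_poisson_zero, zip_zero)
```

The simulated zero rate matches the zero-inflated formula and is far above the plain-Poisson zero probability, which is the pattern a ZIP model is designed for.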
Basically I still don't see what the problem is with a model like

Code:
glm.nb(byr ~ channel1 + channel2 + channel3)
Originally Posted by Junes
"Is there such a thing as 'binomial regression' instead of 'negative binomial regression'?"
Yes. Binomial regression is regression where the outcome is binomial. So two examples of binomial regression are logistic regression and probit regression, which are just binomial regression with logit or probit link functions, respectively.
Note that virtually all non-Normal variants of regression refer to the distribution of the outcome, not the predictor. Because regression already doesn't make distributional assumptions about the predictors, we don't generally need exotic forms of regression for handling predictors with particular types of distributions. This is partly why I'm a bit skeptical that you really need what you think you need.
Now because your outcome is a count with low values, it makes sense that negative binomial regression would be appropriate. But of course this has nothing to do with the predictors. The question is whether, in addition to the negative binomial assumption for the outcome, you also need to do something special because of the distributions of your predictors.
Originally Posted by Junes
"If coefficients are negative, that means conceptually that a channel leads to a loss of sales, which is not possible."
Why not? It seems easy to imagine that there is some unmeasured confounding factor that, when active on a particular day, leads respondents to be more likely to visit the site through a particular channel, but that also makes respondents less likely to ultimately buy something. I don't see why a channel having a negative coefficient is a logical impossibility.
Also, note that even if the simple/total effect of a channel really is impossible to be negative (although I'm not really convinced), that doesn't preclude the multiple regression coefficient from being negative, since that estimates the effect while holding constant the other predictors.
Originally Posted by Junes
"I want to predict buyers by modelling the channels as B(n, p), where the p for each channel is constant over days and to be determined, and n is the number in the table."
This doesn't really make sense to me. You want to predict buyers by modeling something other than buyers? Suppose you did apply some binomial model to your predictors. How then do you want to use those binomial models to predict buyers? Do you want to then use the estimated values of n and/or p as predictors in the negative binomial regression of buyers?
Originally Posted by Junes
"I'm assuming a fixed probability of each user converting for each channel."
What does "converting" mean here?
Thanks again for the replies. This is helping me clarify my thinking so much.
Originally Posted by Jake
"What does 'converting' mean here?"
Sorry, marketing speak. "To convert" is for a user to perform an action of value to the website owner. Hence, you have conversions, conversion rate, etc.
Originally Posted by Jake
"Note that virtually all non-Normal variants of regression refer to the distribution of the outcome, not the predictor. Because regression already doesn't make distributional assumptions about the predictors, we don't generally need exotic forms of regression for handling predictors with particular types of distributions. Now because your outcome is a count with low values, it makes sense that negative binomial regression would be appropriate. But of course this has nothing to do with the predictors."
Ah, I see. This makes sense. But I'm still a bit puzzled why I need to model the thing I'm not interested in (leaving the channel). Isn't there a model for the thing I'm interested in (conversions)? I realize I can model one and calculate the other (as I'm doing now), but it seems a bit round-about.
Originally Posted by Jake
"Why not? It seems easy to imagine that there is some unmeasured confounding factor that, when active on a particular day, leads respondents to be more likely to visit the site through a particular channel, but that also makes respondents less likely to ultimately buy something. I don't see why a channel having a negative coefficient is a logical impossibility. Also, note that even if the simple/total effect of a channel really is impossible to be negative (although I'm not really convinced), that doesn't preclude the multiple regression coefficient from being negative, since that estimates the effect while holding constant the other predictors."
You're absolutely right. I was thinking about it causally; it didn't make sense to me that the presence of a user on a channel would make a purchase less likely. But yes, say a channel is used a lot by spammers and that slows down the site: it would make overall sales go down. Still, in my case I find it hard to see people present on a channel leading to a net loss of sales. I'm trying hard to think of realistic confounding factors that would do that. Maybe something like the day of the week: the purchases are tickets for a museum, so I can imagine day of week being a confounder for both channel use (e.g., more through ads on certain days) and ticket sales.
Last edited by Junes; 08-04-2017 at 02:31 AM.
Thanks hlsmith. So, the data are about website channels. There is one for organic search (people enter through Google), one for direct use (people enter website directly), several for different advertising campaigns. People are assigned one channel (I'm not 100% sure I didn't make a data error somewhere but the overlap between the channels should be minimal). Also, I captured most of the traffic with these channels. There's probably some minor one I missed, but not much.
Now what happens is that a user criss-crosses the website, lands on the ticket page (my data) - and then I lose them. They go on to another website to complete the sale, and the museum owner didn't set up the tracking properly. Or they leave the website altogether; I simply don't know. People have to make an account to buy tickets, which is something they are loath to do in this day and age. Only 4.1% of the people landing on the ticket page make a transaction (though that number may be artificially lowered by fraudulent spam, see below).
As you say, I only have the aggregated result per day - both in total number of transactions and total number of tickets sold (each transaction concerns an unknown number of tickets sold).
Originally Posted by hlsmith
"Can you link actual channels to the outcome or do you have counts for outcome and channels?"
The latter.
Originally Posted by hlsmith
"I still need a little more info, but zero inflated Poisson regression may be an option."
Thanks, going to check that out. I will also post a bit later what I've got now using the negative binomial.
By the way, this project has changed from attribution to possible fraud detection. I have serious suspicions one of the channels is not "real" traffic. Possibly due to a part of the traffic being bots or from people just clicking without being interested in going to the museum. This is called "click fraud": advertising agencies charge you for traffic to your website that's actually worthless. There are even so-called "click farms" in places like Bangladesh where people do nothing but click websites all day for shady companies.
I have some other, quite serious evidence from Google Analytics, but that's pretty restricted to some minor aspect of this fraud - it would also help if I can show that the coefficient for the advertising channel is very low.
Originally Posted by Jake
"Basically I still don't see what the problem is with a model like glm.nb(byr ~ channel1 + channel2 + channel3)"
Ah wait, I missed this. I can do this? I was under the impression that with negative binomial I need to model the people not making a purchase, so that the negative binomial is the distribution of the number of non-purchases occurring before a purchase?
So you are saying I could also model the distribution of purchases before a non-purchase? Even though a purchase is very rare (4%)?
Last edited by Junes; 08-04-2017 at 04:18 AM.
So, I've uploaded the data for anyone who wants to take a look.
The variables in the file:

Code:
date          = date (European notation)
tickets       = tickets sold
buyers        = number of transactions
exits         = number of people leaving the ticket page and also the website
ticket_page   = number of people entering the ticket page
organic       = organic search (Google)
adwords_name  = AdWords channel, searched by the name of the museum
adwords_other = AdWords channel, searched by something other than the name
referral      = traffic from another website
direct        = direct traffic (e.g., typed in the website URL)
adv_banner    = suspect campaign, via banner
adv_page      = suspect campaign, via information page
adv_search    = suspect campaign, via search results

My R code:

Code:
df <- read.csv("attribution.txt")
pairs(df[, 2:ncol(df)])

The distribution of the suspect campaign channels is not strange. However, you need a lot more visitors to get the same effect. That tells me some of the traffic may be polluted with bots. Note that the most important and expensive one, adv_banner, is only active for a few weeks per year, hence the many zeroes.
I'm using the square-root link function because it gives a much better fit. I understand that the curve is less steep (which fits the data), but I'm not sure I know what it does conceptually.

Code:
model <- glm.nb(buyers ~ direct + organic + adwords_name + adwords_other + referral + adv_banner + adv_search + adv_page, data = df, link = sqrt)
summary(model)
plot(df$buyers, predict(model, type = "response"))
Summary:
Code:
Call:
glm.nb(formula = buyers ~ direct + organic + adwords_name + adwords_other +
    referral + adv_banner + adv_search + adv_page, data = df,
    link = sqrt, init.theta = 0.8022468254)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.0358  -1.0030  -0.7609   0.1350   3.0869

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)    0.425533   0.065654   6.481 9.09e-11 ***
direct        -0.005586   0.008421  -0.663   0.5072
organic        0.030517   0.005451   5.599 2.16e-08 ***
adwords_name   0.012561   0.023402   0.537   0.5915
adwords_other  0.006752   0.026892   0.251   0.8017
referral       0.023910   0.011101   2.154   0.0313 *
adv_banner     0.007145   0.005959   1.199   0.2306
adv_search     0.004567   0.032850   0.139   0.8894
adv_page      -0.049142   0.050922  -0.965   0.3345
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(0.8022) family taken to be 1)

    Null deviance: 535.10  on 334  degrees of freedom
Residual deviance: 323.05  on 326  degrees of freedom
AIC: 1065.4

Number of Fisher Scoring iterations: 1

              Theta:  0.802
          Std. Err.:  0.144
 2 x log-likelihood:  -1045.427
The fit is not great, but it works.
So the adv coefficients are really low. For the banner especially, the standard error is also relatively low. The upper limit of its confidence interval is 0.0188 (95%) or 0.0225 (99%). That seems low compared to referral, which is probably the best direct comparison. It seems suspicious, taken together with some other information that's not in this data.
Is it a problem that my predictors are somewhat correlated (mean 0.33, range 0 to 0.80)?
I also realized I have the number of people exiting the ticket page, rather than just entering it. When people exit, they either leave altogether or go to the page that takes care of the transaction. The number of exits correlates somewhat better with the number of transactions, but I'm not sure how to integrate it into the model.
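On the square-root link used above: with link=sqrt the model says sqrt(E[buyers]) equals the linear predictor, so each coefficient adds to the square root of the expected count, and the mean grows quadratically in the linear predictor rather than exponentially as under a log link. A small numeric illustration (Python for convenience; the intercept and slope are made-up, not the fitted values from the summary):

```python
import math

# Hypothetical coefficients on the square-root scale: intercept 0.4, slope 0.03 per visit
b0, b1 = 0.4, 0.03

def mu_sqrt(x):
    # sqrt link: E[y] = (b0 + b1*x)^2, i.e. the linear predictor is sqrt of the mean
    return (b0 + b1 * x) ** 2

def mu_log(x):
    # log link for contrast: E[y] = exp(b0 + b1*x)
    return math.exp(b0 + b1 * x)

for x in (0, 50, 100, 150):
    print(x, round(mu_sqrt(x), 2), round(mu_log(x), 2))
```

For large visit counts the log link explodes exponentially while the sqrt link stays polynomial, which is one way to read "the curve is less steep" in the post above.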
So your data is still unclear to me. If the people could only take one channel, you would have dummy codes for the independent variables. Say channel were age group: you can't be 0-10 years old and also 11-20 years old.

If you only have a list of counts, and it is not linked to the dependent variable, how is this relationship connected? So in post #5, in the first data row of your data frame, there were 0 purchases and 48 people entering the site through channel 1. Now for row #4, there was one purchase and 82 people entering from channel #1, 44 from ch #2, ..., 1 from ch #9. How do you know where the purchase came from, given your data?

If I know 5% of cars are red, 80% blue, and 15% white, and there are 10 car accidents, I am working with two unlinked pieces of meta-data. I could build a probability model (given that car accidents are random and don't involve other cars) and figure out the probability of a blue car crashing, but that is based on prevalence; it doesn't tell me about, say, how all blue cars are driven by inexperienced drivers so that actually 90% of car accidents involve a blue car.

Do you see my repeated inquiry yet? I don't get how, with two counts, you can answer your question unless you know subject-level data: obs 1 used ch #1 and didn't buy anything; obs 2 used ch #1 and bought something; ...; obs 999 used ch #9 and didn't buy anything. And given that structure, channels should be dummy variables.
Yes, but here it's a count.
Originally Posted by hlsmith
"If you only have a list of counts, and it is not linked to the dependent variable, how is this relationship connected? How do you know where the purchase came from given your data?"
Well, I don't know yet. That's what I want to find out by modelling.
Originally Posted by hlsmith
"I don't get how, with two counts, you can answer your question unless you know subject-level data. And given that structure channels should be dummy variables."
Well, for an extreme example:

Code:
byr  ch1  ch2
  2  999    2
  1    0    1
  3   20    0

Given these data, where would you say byr comes from? Now imagine the same idea, but where the difference is less clear and you have more observations.
Yeah, I don't see why not.
No... "negative binomial regression" just means that we assume the outcome follows a negative binomial distribution. For practical purposes in a regression context, you can just think of this as a Poisson distribution with a flexible dispersion parameter (rather than a fixed mean/variance relationship), so that it can better handle overdispersion relative to Poisson regression. The "negative" does not mean that you have to model non-events, or anything like that.
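This "Poisson with a flexible dispersion parameter" description corresponds to the standard Poisson-Gamma mixture construction of the negative binomial. A small simulation sketch (Python for illustration; theta = 0.8 loosely echoes the dispersion fitted earlier in the thread, while mu = 3 is made up) shows the variance far exceeding the mean, unlike a plain Poisson where they are equal:

```python
import math
import random

random.seed(7)

def poisson(lam):
    # Knuth's method for a Poisson draw with mean lam
    L, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= random.random()
        if prod < L:
            return k
        k += 1

theta, mu, n = 0.8, 3.0, 20000
# Poisson-Gamma mixture: lam ~ Gamma(shape=theta, scale=mu/theta), then Y ~ Poisson(lam).
# This yields a negative binomial with mean mu and variance mu + mu**2 / theta.
draws = [poisson(random.gammavariate(theta, mu / theta)) for _ in range(n)]

m = sum(draws) / n
v = sum((d - m) ** 2 for d in draws) / (n - 1)
print(m, v)  # sample variance should sit near mu + mu**2/theta, well above the mean
```

The small theta (strong overdispersion) is exactly what the fitted Theta of about 0.8 in the model summary is estimating.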
Yep. In fact, the fact that you have a very low rate of occurrences in your outcome is the main reason why I think it's worth it to use a count model (like NB or Poisson regression) at all. Classical/Normal regression usually works pretty much fine when the occurrence rate is high enough, but not when the occurrence rate is low, because you run into things like negative fitted values (which really are impossible) and because homogeneity of error variance is almost certainly violated in such cases.
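The variance problem mentioned here can be checked with a quick simulation (Python; the 2% conversion rate and visit counts are hypothetical): if daily buyers are Binomial(visits, p), the residual variance around the mean grows with the visit count, violating the constant-variance assumption of classical regression.

```python
import random

random.seed(3)

# Hypothetical rare-event data: daily buyers ~ Binomial(visits, 0.02)
days = 2000
data = [(n, sum(random.random() < 0.02 for _ in range(n)))
        for n in (random.randint(10, 200) for _ in range(days))]

# Residual spread around the true mean 0.02*n, split at the median visit count
data.sort()
low, high = data[:days // 2], data[days // 2:]

def resid_var(rows):
    r = [y - 0.02 * n for n, y in rows]
    m = sum(r) / len(r)
    return sum((ri - m) ** 2 for ri in r) / (len(r) - 1)

print(resid_var(low), resid_var(high))  # variance grows with visit count
```

This follows directly from the binomial variance n*p*(1-p): busier days have noisier counts, so ordinary least squares weights the observations incorrectly even before the negative-fitted-values problem appears.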
It seems like these data are sums of events for each day, but it is not known which channel each buyer came from. So it is like multilevel data with only level-2 information and no information about the individuals. Then there is a risk of Simpson's paradox or the ecological fallacy (as in the classic study by Robinson, with data from the 1930s about illiteracy and "race": the data were summed by state or something, and gave much higher correlation coefficients than would have been achieved with individual data).
I guess that these data are a good start, but maybe it is possible to get individual data later.