# Model for regression with multiple binomial predictors?

#### Junes

##### New Member
Hi everyone,

I was wondering what model to pick when all the predictors are binomially distributed.

I have a website that's being visited through several (exclusive) channels. I know how many people enter a page from a certain channel. Then, they can go on to purchase and payment. However, not everyone makes it to the end; most (95%) exit. I know how many people make a purchase in total but not by channel, and I want to know the contribution of each channel.

The numbers of purchases are too low to use linear regression, so I want to use some count model. The only thing I can find is negative binomial regression, e.g. package pscl in R. However, then I'm modelling the exits instead of the visits and that seems to be a rather roundabout way. This is how I model the exits right now:

Code:
library(MASS)  # provides glm.nb
df$abandon <- df$buy_pag_visits - df$buyers
model.nb <- glm.nb(abandon ~ channel1 + channel2 + channel3, data = df, link = sqrt)  # sqrt seems to work better than log

When I search for "binomial regression" I get stuff on logistic regression, where the outcome is binary. That's not what I want. Is there such a thing as "binomial regression" instead of "negative binomial regression"? It seems like such a logical thing but the only thing I see is the negative variant.

By the way, I tried approximating by linear regression. It works okayish, but it has obvious problems with the many zeroes. And some of the coefficients become negative, which is obviously impossible.

#### Junes

##### New Member
Another follow-up question: I have reason to believe that the p of one of the binomial distributions has undergone a sudden shift as a result of my actions. How would I model this? I'm a bit lost here.

#### hlsmith

##### Omega Contributor
A bunch of binary variables predicting purchase yes/no would be logistic regression. You are saying count, but each encounter can only have a single outcome, so you have a series of Bernoulli trials (if independent), and then logistic regression applies. If the trials are not independent, because some of the encounters are for the same people at different times, can you track this? If so, you possibly have multilevel logistic regression, with purchase status clustered in person.

If you can convince me this is count data, I can tell you about those options. Though, given your description, I don't think this is the case.

P.S. Can each purchase only have one channel? If so, you may not need to use any sophisticated methods.

#### Jake

##### Cookie Scientist
Do you really mean that the predictors are "binomial", i.e., they follow a binomial distribution and thus are counts from 0 to N? Or do you instead mean that they are "binary" or "categorical" or something else?

If they are truly "binomial" then it's really not clear that anything special needs to be done.
If they are categorical then they just need to be appropriately dummy- or contrast-coded.

It works okayish, but it has obvious problems with the many zeroes. And some of the coefficients become negative, which is obviously impossible.
Obvious problems? Why? What problems? And why do you think the coefficients can't be negative just because the predictors are binomial (or binary or categorical or whatever they are)?

#### Junes

##### New Member
Thanks for the answers, guys.

A bunch of binary variables predicting purchase yes/no would be logistic regression.
Yes, but I don't have 0/1 as my outcome. This is a snippet of my data:

Code:
date      byr  ch1  ch2  ch3  ch4  ch5  ch6  ch7  ch8  ch9
10-12-16    0   48   13    3    0    7   17    0    2    0
10-13-16    0   50   16    3    0    6   13    0    0    3
10-14-16    0   70   29    2    0    9   23    0    0    1
10-15-16    1   82   44    2    0    6   24    0    2    1
10-16-16   11  136   78    5    0   14   32    0    0    3
10-17-16    4  125   49    1    0    9   24    0    0    3
10-18-16    4  108   52    2    0    9   20    0    0    1
10-19-16    2  124   63    2    0    7   21    0    1    4
10-20-16    5  120   55    6    0   13   26    0    0    4
10-21-16   14  137   66   11    0    3   28    0    1    3

byr = my dependent variable (no of buyers). Furthermore, I have 57% zeroes in my dependent variable. It is very right-skewed.

I want to predict buyers by modelling the channels as B(n, p), where the p for each channel is constant over days and to be determined, and n is the number in the table.

You are saying count, but each encounter can only have a single outcome, so you have a series of Bernoulli trials (if independent), then logistic regression. If the trials are not independent, some of the encounters are for the same people at different times, can you track this, and if so, you possibly have multilevel logistic regression, with purchase status clustered in person.
They are independent, thankfully. Well, probably not 100%, but close enough.

P.S., Can each purchase only have one channel, if so, you may not need to use any sophisticated methods.
Yes, pretty much. There is some cross-talk, not much.
Jake said:
Do you really mean that the predictors are "binomial", i.e., they follow a binomial distribution and thus are counts from 0 to N? Or do you instead mean that they are "binary" or "categorical" or something else?
Yes, I think I should model them as binomial variables. I'm assuming a fixed probability of each user converting for each channel. I have n users on a given day. That sounds to me like B(n, p), but I'm not 100% sure.

Obvious problems? Why? What problems? And why do you think the coefficients can't be negative just because the predictors are binomial (or binary or categorical or whatever they are)?
Oh, sorry, I was talking about linear regression there. I did a linear regression and it did reasonably well, but it had trouble with the edge cases. If coefficients are negative, that means conceptually that a channel leads to a loss of sales, which is not possible. So, a linear regression can only be a rough approximation, I think.

#### hlsmith

##### Omega Contributor
Perhaps you also need to present a case example for an observation - to help us understand the data generating function. Is Y a count variable because you have meta-data for the day? So it is actually single purchases y/n, but what you have is all of them for the day? Can you link actual channels to the outcome or do you have counts for outcome and channels?

I still need a little more info, but zero-inflated Poisson regression may be an option.

#### Jake

##### Cookie Scientist
Basically I still don't see what the problem is with a model like

Code:
glm.nb(byr ~ channel1 + channel2 + channel3)

Junes said:
Is there such a thing as "binomial regression" instead of "negative binomial regression"?
Yes. Binomial regression is regression where the outcome is binomial. So two examples of binomial regression are logistic regression and probit regression, which are just binomial regression with logit or probit link functions, respectively.
Note that virtually all non-Normal variants of regression refer to the distribution of the outcome, not the predictor. Because regression already doesn't make distributional assumptions about the predictors, we don't generally need exotic forms of regression for handling predictors with particular types of distributions. This is partly why I'm a bit skeptical that you really need what you think you need.

Now because your outcome is a count with low values, it makes sense that negative binomial regression would be appropriate. But of course this has nothing to do with the predictors. The question is whether, in addition to the negative binomial assumption for the outcome, you also need to do something special because of the distributions of your predictors.

Junes said:
If coefficients are negative that means conceptually that a channel leads to a loss of sales, which is not possible.
Why not? It seems easy to imagine that there is some unmeasured confounding factor that, when active on a particular day, leads respondents to be more likely to visit the site through a particular channel, but that also makes respondents less likely to ultimately buy something. I don't see why a channel having a negative coefficient is a logical impossibility.

Also, note that even if it really is impossible for the simple/total effect of a channel to be negative (although I'm not really convinced), that doesn't preclude the multiple regression coefficient from being negative, since that estimates the effect while holding constant the other predictors.

Junes said:
I want to predict buyers by modelling the channels as B(n, p), where the p for each channel is constant over days and to be determined, and n is the number in the table.
This doesn't really make sense to me. You want to predict buyers by modeling something other than buyers? Suppose you did apply some binomial model to your predictors. How then are you wanting to use those binomial models to predict buyers?
Do you want to then use those estimated values of $$n$$ and/or $$p$$ as predictors in the negative binomial regression of buyers?

Junes said:
Yes, I think I should model them as binomial variables. I'm assuming a fixed probability of each user converting for each channel. I have n users on a given day. That sounds to me like B(n, p), but I'm not 100% sure.
What does "converting" mean here?

#### Junes

##### New Member
Thanks again for the replies. This is helping me clarify my thinking so much.

Jake said:
What does "converting" mean here?
Sorry, marketing speak. "To convert" is for a user to perform an action of value to the website owner. Hence, you have conversions, conversion rate, etc.

Note that virtually all non-Normal variants of regression refer to the distribution of the outcome, not the predictor. Because regression already doesn't make distributional assumptions about the predictors, we don't generally need exotic forms of regression for handling predictors with particular types of distributions. This is partly why I'm a bit skeptical that you really need what you think you need. Now because your outcome is a count with low values, it makes sense that negative binomial regression would be appropriate. But of course this has nothing to do with the predictors. The question is whether, in addition to the negative binomial assumption for the outcome, you also need to do something special because of the distributions of your predictors.
Ah, I see. This makes sense. But I'm still a bit puzzled why I need to be modelling the thing I'm not interested in (leaving the channel). Isn't there a model for the thing I'm interested in (conversions)? I realize I can model one and calculate the other (as I'm doing now), but it seems a bit roundabout?

Why not?
It seems easy to imagine that there is some unmeasured confounding factor that, when active on a particular day, leads respondents to be more likely to visit the site through a particular channel, but that also makes respondents less likely to ultimately buy something. I don't see why a channel having a negative coefficient is a logical impossibility. Also, note that even if the simple/total effect of a channel really is impossible to be negative (although I'm not really convinced), that doesn't preclude the multiple regression coefficient from being negative, since that estimates the effect while holding constant the other predictors.
You're absolutely right. I was thinking about it causally; it didn't make sense to me that the presence of a user on a channel would make a purchase less likely. But yes, say a channel is used a lot by spammers and that slows down the site, it would make overall sales go down.

But still, in my case, I find it hard to see people present on a channel leading to a net negative sale. I'm trying hard to think of realistic confounding factors that would do that. Maybe something like the day of week. The purchases are tickets for a museum, so I can imagine day of week being a confounding factor for channel use (e.g., more through ads on certain days) and ticket sales.

#### Junes

##### New Member
Perhaps you also need to present a case example for an observation - to help us understand the data generating function.
Thanks hlsmith.

So, the data are about website channels. There is one for organic search (people enter through Google), one for direct use (people enter the website directly), and several for different advertising campaigns. People are assigned one channel (I'm not 100% sure I didn't make a data error somewhere, but the overlap between the channels should be minimal). Also, I captured most of the traffic with these channels. There's probably some minor one I missed, but not much.
Now what happens is that a user criss-crosses on the website, lands on the ticket page (my data) - and then I lose them. They go on to another website to make the sale and the museum owner didn't set up the tracking properly. Or, they leave the website altogether. I simply don't know. People have to make an account to buy tickets, which is something they are loath to do in this day and age. Only 4.1% of the people landing on the ticket page make a transaction (though that number may be artificially lowered by fraudulent spam, see below).

As you say, I only have the aggregated result per day - both in total number of transactions and total number of tickets sold (each transaction concerns an unknown number of tickets sold).

Can you link actual channels to the outcome or do you have counts for outcome and channels?
The latter.

I still need a little more info, but zero inflated Poisson regression may be an option.
Thanks, going to check that out. I will also post a bit later what I have so far using the negative binomial.

By the way, this project has changed from attribution to possible fraud detection. I have serious suspicions that one of the channels is not "real" traffic, possibly because part of the traffic comes from bots or from people just clicking without being interested in going to the museum. This is called "click fraud": advertising agencies charge you for traffic to your website that's actually worthless. There are even so-called "click farms" in places like Bangladesh where people do nothing but click websites all day for shady companies.

I have some other, quite serious evidence from Google Analytics, but that's pretty restricted to some minor aspect of this fraud - it would also help if I can show that the coefficient for the advertising channel is very low.

#### Junes

##### New Member
Jake said:
Basically I still don't see what the problem is with a model like

Code:
glm.nb(byr ~ channel1 + channel2 + channel3)

Ah wait, I missed this. I can do this?
I was under the impression that with negative binomial I need to model the people not making a purchase? So that the negative binomial is the distribution of non-purchases to happen before a purchase?

So you are saying I could also model the distribution of purchases before a non-purchase? Even though a purchase is very rare (4%)?

#### Junes

##### New Member
So, I've uploaded the data for anyone who wants to take a look.

Code:
date (European notation)
tickets = tickets sold
buyers = number of transactions
exits = number of people leaving the ticket page and also the website
ticket_page = number of people entering the ticket page
organic = organic search (Google)
adwords_name = AdWords channel, searched by name museum
adwords_other = AdWords channel, searched by something other than name
referral = traffic from another website
direct = direct traffic (e.g., typed in website url)
adv_banner = suspect campaign, via banner
adv_page = suspect campaign, via information page
adv_search = suspect campaign, via search results

My R code:

Code:
library(MASS)  # provides glm.nb
df <- read.csv("attribution.txt")
pairs(df[, 2:ncol(df)])  # scatterplot matrix of all columns except the date

The distribution of the suspect campaign channels is not strange. However, you need a lot more visitors to get the same effect. That tells me some of the traffic may be polluted with bots. Note that the most important and expensive one, adv_banner, is only active for a few weeks per year, hence the many zeroes.

Code:
model <- glm.nb(buyers ~ direct + organic + adwords_name + adwords_other + referral + adv_banner + adv_search + adv_page, data = df, link = sqrt)
summary(model)
plot(df$buyers, predict(model, type = "response"))  # observed vs. fitted buyers
I'm using the square-root link function because it gives a much better fit. I understand that the curve is less steep (which fits the data), but I'm not sure I know what it does conceptually.
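On what the square-root link does conceptually: with `link = sqrt` the linear predictor models the square root of the mean, so the fitted mean is the square of the linear predictor, whereas the default log link makes the predictors act multiplicatively. A minimal sketch on simulated data; the columns `visits` and `buyers` and all numbers here are made up for illustration and are not taken from the data set above:

```r
library(MASS)  # glm.nb

set.seed(42)
n_days <- 300
# Hypothetical data: one channel's daily visits and an overdispersed buyer count
visits <- rpois(n_days, lambda = runif(n_days, 20, 200))
buyers <- rnbinom(n_days, mu = 0.04 * visits, size = 2)
sim <- data.frame(visits, buyers)

fit_log  <- glm.nb(buyers ~ visits, data = sim, link = log)
fit_sqrt <- glm.nb(buyers ~ visits, data = sim, link = sqrt)

# log link:  mu = exp(b0 + b1 * visits)  (multiplicative effects)
# sqrt link: mu = (b0 + b1 * visits)^2   (additive effects on the sqrt scale)
eta_sqrt <- predict(fit_sqrt, type = "link")
mu_sqrt  <- predict(fit_sqrt, type = "response")
all.equal(unname(eta_sqrt^2), unname(mu_sqrt))

# Compare the two links on the same data
AIC(fit_log, fit_sqrt)
```

Which link has the lower AIC depends on the data; the point of the sketch is only that coefficients from a sqrt-link fit live on the square-root scale and cannot be read as rates or as log rate ratios.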

Summary:

Code:
Call:
link = sqrt, init.theta = 0.8022468254)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-2.0358  -1.0030  -0.7609   0.1350   3.0869

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)    0.425533   0.065654   6.481 9.09e-11 ***
direct        -0.005586   0.008421  -0.663   0.5072
organic        0.030517   0.005451   5.599 2.16e-08 ***
referral       0.023910   0.011101   2.154   0.0313 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(0.8022) family taken to be 1)

Null deviance: 535.10  on 334  degrees of freedom
Residual deviance: 323.05  on 326  degrees of freedom
AIC: 1065.4

Number of Fisher Scoring iterations: 1

Theta:  0.802
Std. Err.:  0.144

2 x log-likelihood:  -1045.427

The fit is not great, but it works.

So the adv coefficients are really low. For the banner especially, the standard error is also relatively low. The upper limit of its confidence interval is 0.0188 (95%) or 0.0225 (99%). That seems low compared to referral, which is probably the best direct comparison. It seems suspicious, taken together with some other information that's not in this data.
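Upper limits like the ones quoted here can be computed as Wald intervals: the estimate plus or minus a normal quantile times the standard error. A minimal sketch, assuming simulated data with hypothetical columns `good` and `suspect`, since the actual data file is not reproduced in the thread:

```r
library(MASS)  # glm.nb

set.seed(7)
n_days <- 335
good    <- rpois(n_days, lambda = runif(n_days, 5, 60))
suspect <- rpois(n_days, lambda = runif(n_days, 0, 150))
# Assumed truth for this simulation only: `suspect` traffic never converts
buyers  <- rnbinom(n_days, mu = 0.5 + 0.05 * good, size = 1)
sim <- data.frame(good, suspect, buyers)

fit <- glm.nb(buyers ~ good + suspect, data = sim, link = sqrt)

est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]

# Wald interval: estimate +/- z * standard error
wald_ci <- function(level) {
  z <- qnorm(1 - (1 - level) / 2)
  cbind(lower = est - z * se, upper = est + z * se)
}
ci95 <- wald_ci(0.95)
ci99 <- wald_ci(0.99)
ci95
ci99
```

The limits are on the scale of the link function (here the square root of the mean), so they are best compared between channels within one model and not read as conversion rates.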

Is it a problem that my predictors are somewhat correlated (mean 0.33, range 0 to 0.80)?

I also realized I have the number of people exiting the ticket page, instead of just entering the page. When people exit, they either exit altogether or they go to the page that takes care of the transaction. The number of exits correlates somewhat better with the number of transactions. But I'm not sure how to integrate it in the model.

#### hlsmith

##### Omega Contributor
So your data is still unclear to me. If the people could only take one channel, well you would have dummy codes for the independent variables. Say channel was age group, you can't be 0-10 years old and also 11-20 years old.

If you only have a list of counts, and it is not linked to the dependent variable, how is this relationship connected? So in post #5, in the first data row of your data frame, there were 0 purchases and 48 people entering the site through channel 1. Now for row #4, there was one purchase and 82 people entering from channel #1, 44 from ch #2, ..., 1 from ch #9. How do you know where the purchase came from given your data? If I know 5% of cars are red, 80% blue, and 15% white, and there are 10 car accidents, I am working with 2 unlinked pieces of meta-data. I could create a probability model, assuming car accidents are random and not with other cars, and figure out the probability of a blue car crashing, but that is based on prevalence and doesn't tell me that, say, all blue cars are driven by inexperienced drivers and actually 90% of car accidents include a blue car. Do you see my repeated inquiry yet? I don't get how, if you have two counts, you can answer your question unless you know subject-specific level data: ob1 used ch#1 and didn't buy anything; ob2 used ch#1 and bought something, ..., obs 999 used ch#9 and didn't buy anything. And given that structure, channels should be dummy variables.

#### Junes

##### New Member
So your data is still unclear to me. If the people could only take one channel, well you would have dummy codes for the independent variables. Say channel was age group, you can't be 0-10 years old and also 11-20 years old.
Yes, but here it's a count.

If you only have a list of counts, and it is not linked to the dependent variable, how is this relationship connected? So in post #5, in the first data row of your data frame, there were 0 purchases and 48 people entering the site through channel 1. Now for row #4, there was one purchase and 82 people entering from channel #1, 44 from ch #2, ..., 1 from ch #9. How do you know where the purchase came from given your data?
Well, I don't know yet. That's what I want to find out by modelling.

I don't get how, if you have two counts, you can answer your question unless you know subject-specific level data: ob1 used ch#1 and didn't buy anything; ob2 used ch#1 and bought something, ..., obs 999 used ch#9 and didn't buy anything. And given that structure, channels should be dummy variables.
Well, for an extreme example:

Code:
byr  ch1  ch2
2    999  2
1    0    1
3    20   0
Given these data, where would you say byr comes from? Now imagine the same idea but where the difference is less clear, but where you have more observations.
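The intuition in this example can be written as a model: if each visitor from channel k buys independently with a fixed probability p_k, the expected number of buyers on a day is the sum over channels of p_k times that day's visitors from channel k. A minimal sketch with simulated data and made-up probabilities, fitted by maximum likelihood under a Poisson approximation; this illustrates the idea and is not something anyone in the thread ran:

```r
set.seed(1)
n_days <- 500
ch1 <- rpois(n_days, lambda = runif(n_days, 20, 200))
ch2 <- rpois(n_days, lambda = runif(n_days, 5, 100))
p_true <- c(ch1 = 0.02, ch2 = 0.10)  # made-up conversion probabilities

# Only the daily total is observed, not which channel each buyer came from
byr <- rbinom(n_days, ch1, p_true[["ch1"]]) + rbinom(n_days, ch2, p_true[["ch2"]])

# Poisson approximation with E[byr] = p1 * ch1 + p2 * ch2 (no intercept)
negloglik <- function(p) {
  mu <- p[1] * ch1 + p[2] * ch2
  -sum(dpois(byr, lambda = mu, log = TRUE))
}

# Analytic gradient of the negative log-likelihood
neggrad <- function(p) {
  mu <- p[1] * ch1 + p[2] * ch2
  -c(sum((byr / mu - 1) * ch1), sum((byr / mu - 1) * ch2))
}

# Box constraints keep each probability between 0 and 1
fit <- optim(c(0.05, 0.05), negloglik, gr = neggrad, method = "L-BFGS-B",
             lower = 1e-6, upper = 1,
             control = list(parscale = c(0.01, 0.01)))
p_hat <- setNames(fit$par, c("ch1", "ch2"))
round(p_hat, 3)
```

The per-channel rates are recoverable here only because the simulation holds each probability constant across days and the channel counts vary independently of one another; the estimates are only as good as those assumptions.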

#### Jake

Ah wait, I missed this. I can do this?
Yeah, I don't see why not.

I was under the impression that with negative binomial I need to model the people not making a purchase? So that the negative binomial is the distribution of non-purchases to happen before a purchase?
No... "negative binomial regression" just means that we assume the outcome follows a negative binomial distribution. For practical purposes in a regression context, you can just think of this as a Poisson distribution except with an extra dispersion parameter (rather than the variance being fixed equal to the mean), so that it can better handle overdispersion than Poisson regression can. The "negative" does not mean that you have to model non-events, or anything like that.
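To make the "Poisson with an extra dispersion parameter" description concrete, here is a small sketch on simulated overdispersed counts; the variable names and parameter values are invented for illustration:

```r
library(MASS)  # glm.nb

set.seed(123)
n <- 500
x  <- runif(n, 0, 2)
mu <- exp(0.2 + 0.5 * x)
# Negative binomial counts: variance = mu + mu^2 / size, which exceeds mu
y  <- rnbinom(n, mu = mu, size = 1)

fit_pois <- glm(y ~ x, family = poisson)
fit_nb   <- glm.nb(y ~ x)

# Pearson dispersion statistic: close to 1 if variance equals the mean
dispersion_pois <- sum(residuals(fit_pois, type = "pearson")^2) /
  df.residual(fit_pois)
dispersion_pois

fit_nb$theta           # estimated dispersion parameter (the true `size` is 1)
AIC(fit_pois, fit_nb)  # the NB model should fit these data better
```

Both models are fitted to the count of events itself; nothing in the negative binomial fit requires recoding the outcome as non-events.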

So you are saying I could also model the distribution of purchases before a non-purchase? Even though a purchase is very rare (4%)?
Yep. In fact, the very low rate of occurrences in your outcome is the main reason why I think it's worth using a count model (like NB or Poisson regression) at all. Classical/Normal regression usually works pretty much fine when the occurrence rate is high enough, but not when the occurrence rate is low, because you run into things like negative fitted values (which really are impossible) and because homogeneity of error variance is almost certainly violated in such cases.
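The negative-fitted-values problem is easy to reproduce. A minimal sketch with simulated low counts; the curved relationship between `visits` and `buyers` is an assumption chosen to make the effect visible, and both names are hypothetical:

```r
set.seed(99)
n_days <- 300
visits <- rpois(n_days, lambda = runif(n_days, 0, 100))
# Low counts with a sizeable share of zeros; the mean rises faster than linearly
buyers <- rpois(n_days, lambda = 0.0005 * visits^2)

fit_lm   <- lm(buyers ~ visits)
fit_pois <- glm(buyers ~ visits, family = poisson)

mean(buyers == 0)      # share of zero-count days
min(fitted(fit_lm))    # the linear model predicts negative counts on quiet days
min(fitted(fit_pois))  # the log link keeps every fitted mean positive
```

The spread of `buyers` also grows with its mean in data like these, which is the heterogeneity of error variance mentioned above.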

#### GretaGarbo

##### Human
It seems like these data are sums of events for each day, but it is not known which channel each buyer came from. So it would be like multi-level data, with only information on level 2 and no information about the individuals. Then there is a risk of Simpson's paradox or the ecological fallacy (like in the classic study by Robinson with data from the 1930s about illiteracy and "race". The data were summed by municipality or something, and gave much higher correlation coefficients than what would have been achieved by individual data).

I guess that these data are a good start, but maybe it is possible to get individual data later.

#### hlsmith

##### Omega Contributor
Greta, this is what I am trying to get at (ecological fallacy). There are 5 blue cars, 10 red cars, and 18 white cars at a police traffic stop, and 5 drivers were arrested for intoxication. You don't know which drivers were arrested. If it was random, then you can construct probabilities. But the fact that this whole question is being asked casts doubt on it merely being random. There are no numerators.

I got a bill at the restaurant for $100 and I ordered two pizzas, cola, and bread sticks. How much did each item cost? No idea. Now you do this over a bunch of restaurants, and you may get a little hint of a pattern, but the costs could be totally different at other restaurants (just like people using Google or a direct URL can't be assumed to be the same each day or across days). I think trying to use or justify a model still does not negate the underlying issue.

#### Junes

##### New Member
Thanks guys.

And guess what? That channel that didn't lead to sales? I've looked around Google Analytics all day and I've discovered an extensive bot network with servers all over the country. I just wrote up an 11-page report and sent it to my client. They've probably scammed my client out of thousands of euros worth of advertising, possibly tens of thousands.

Wow. And I discovered all this just out of curiosity, thinking how I could tackle this statistical problem. Just saying, statistics brings you places.

#### Junes

##### New Member
Greta, this is what I am trying to get at (ecological fallacy). There are 5 blue cars, 10 red cars, and 18 white cars at a police traffic stop, and 5 drivers were arrested for intoxication. You don't know which drivers were arrested. If it was random, then you can construct probabilities. But the fact that this whole question is being asked casts doubt on it merely being random. There are no numerators.

I got a bill at the restaurant for $100 and I ordered two pizzas, cola, and bread sticks. How much did each item cost? No idea. Now you do this over a bunch of restaurants, and you may get a little hint of a pattern, but the costs could be totally different at other restaurants (just like people using Google or a direct URL can't be assumed to be the same each day or across days). I think trying to use or justify a model still does not negate the underlying issue.
Ah, now I see what you mean. Yeah, I guess that's a fair point. I'm not sure how likely ecological fallacy type events are in this context, but that's certainly a possibility.