Hi everyone! I am examining a data set from an automobile manufacturer. In doing so, I want to investigate which discount types have had the greatest effect on sales. The overall goal is to find out for which car class which promotion type makes more sense. The dataset contains sales data for 9 different car series over 5 years.

I divide the nine car series into lower class, middle class and high class models. I assume that the different promotion types have different effects on car sales of different classes at the same discount rate. Since the car company always offers some form of discount, every car was sold with at least one discount - on average, there are even 3 discounts per car, and sometimes up to 8. I include each discount in the monthly sales figures in terms of its discount rate relative to the MRSP. However, I distinguish between three main types of promotion: a) direct cash discount, b) additional equipment or c) exclusive special models. All these promotions are reported by the manufacturer in the data with a direct equivalent in cash terms (for the cash discount the amount deducted from the MRSP, for b) and c) the added value the customer receives in the form of accessories or a special model).

So in principle I have thousands of sales records in this form for each month, I made up some sample data for better understanding:

The table is to be understood here exemplarily in such a way: 8 cars were sold in October (in reality, of course, the data set is much larger). Three middle class cars (2x M5, 1x M4), one upper class (M8) and three lower class (1x M1, 2x M2).

To find a suitable independent variable for the regression, I figured I could calculate the average discount rate per promotion type monthly. This would be a good indication of how high the discounts offered per month were for each promotion type.

In this case, it would result in the following discount rates for the month of October, which would be included in my regression:

However, I am very unsure if such a calculation of discount rates makes sense at all, or if a different approach would make more sense? According to my idea, the discount rates calculated in this way should represent a realistic picture of the average discounts offered per car class per promotion type per month. In addition, I also set dummy variables for seasonality, as much more sales occur in certain months of the year across all car classes and models. In addition, I also have a dummy variable that captures whether a particular car model came on the market recently (within 4 months), since new cars have higher sales figures. So this dummy variable refers only to a single car model at a time and not to the entire car class. In addition, I also want to find out whether discounts of the type "cash" had a different effect when they occurred in a higher number (e.g. 2x separate cash discounts of 2000 on one car vs. 1x discount of 4000, this occurred several times as different cash promotions at the same time took place and added up on a single car sale) . In this context, I suspect framing effects, i.e. that many small discounts have a stronger effect than one large discount of the same amount. That’s why I also calculated the avg. cash discounts per sale in the table above (number of cash discounts per car class divided by sales per car class per month).

My conceptual framework, on which my hypotheses are based, looks accordingly like this: the discount rate (independent variable X) has a positive influence on sales (Y). This relationship is moderated by the promotion type (moderator 1) and the framing (moderator 2 = how often the discount occurs on average) (two-way interaction). In addition, the moderator promotion type is moderated by the vehicle class (moderated moderation?): depending on the vehicle class, different promotion types have different effects on the relationship between discount rate and sales (three-way interaction).

Unfortunately, it now feels like I have a lot of ideas and would like to test a lot of things, but I am neither sure if this is possible with the existing data nor if I am approaching it correctly. I am struggling to set up a regression in R that tests my devised framework, so I have doubts about whether this makes any sense at all. I can't use the summed discount rate of all three promotion types as regressor, because I have to use the three discount rates individually as independent variables - that would lead to multicollinearity, is that right?

How can my regression take into account that newly introduced models (dummy “new model”) sell better, since the dummy only applies to the respective new model and not to the whole car class?

Is it even possible to do all this with one regression, or would it make more sense to do individual regressions per car class? When doing several separate regressions, is there a way to generalize them into a overall estimation with coefficients, or should I evaluate each regression and its coefficients by itself then?

I know that is a lot of questions. I would also already be very happy if someone could give me tips in the right direction. I've never had to evaluate something like this before and I'm completely lost - I hope I'm somewhat on the right track with my thoughts and the preparation of the data so far

PS: I use R for the data analysis

Best,

Lamadrama

I divide the nine car series into lower class, middle class and high class models. I assume that the different promotion types have different effects on car sales of different classes at the same discount rate. Since the car company always offers some form of discount, every car was sold with at least one discount - on average, there are even 3 discounts per car, and sometimes up to 8. I include each discount in the monthly sales figures in terms of its discount rate relative to the MRSP. However, I distinguish between three main types of promotion: a) direct cash discount, b) additional equipment or c) exclusive special models. All these promotions are reported by the manufacturer in the data with a direct equivalent in cash terms (for the cash discount the amount deducted from the MRSP, for b) and c) the added value the customer receives in the form of accessories or a special model).

So in principle I have thousands of sales records in this form for each month, I made up some sample data for better understanding:

**Sales in October (one car sold per unique Sale ID)**The table is to be understood here exemplarily in such a way: 8 cars were sold in October (in reality, of course, the data set is much larger). Three middle class cars (2x M5, 1x M4), one upper class (M8) and three lower class (1x M1, 2x M2).

To find a suitable independent variable for the regression, I figured I could calculate the average discount rate per promotion type monthly. This would be a good indication of how high the discounts offered per month were for each promotion type.

In this case, it would result in the following discount rates for the month of October, which would be included in my regression:

However, I am very unsure if such a calculation of discount rates makes sense at all, or if a different approach would make more sense? According to my idea, the discount rates calculated in this way should represent a realistic picture of the average discounts offered per car class per promotion type per month. In addition, I also set dummy variables for seasonality, as much more sales occur in certain months of the year across all car classes and models. In addition, I also have a dummy variable that captures whether a particular car model came on the market recently (within 4 months), since new cars have higher sales figures. So this dummy variable refers only to a single car model at a time and not to the entire car class. In addition, I also want to find out whether discounts of the type "cash" had a different effect when they occurred in a higher number (e.g. 2x separate cash discounts of 2000 on one car vs. 1x discount of 4000, this occurred several times as different cash promotions at the same time took place and added up on a single car sale) . In this context, I suspect framing effects, i.e. that many small discounts have a stronger effect than one large discount of the same amount. That’s why I also calculated the avg. cash discounts per sale in the table above (number of cash discounts per car class divided by sales per car class per month).

My conceptual framework, on which my hypotheses are based, looks accordingly like this: the discount rate (independent variable X) has a positive influence on sales (Y). This relationship is moderated by the promotion type (moderator 1) and the framing (moderator 2 = how often the discount occurs on average) (two-way interaction). In addition, the moderator promotion type is moderated by the vehicle class (moderated moderation?): depending on the vehicle class, different promotion types have different effects on the relationship between discount rate and sales (three-way interaction).

Unfortunately, it now feels like I have a lot of ideas and would like to test a lot of things, but I am neither sure if this is possible with the existing data nor if I am approaching it correctly. I am struggling to set up a regression in R that tests my devised framework, so I have doubts about whether this makes any sense at all. I can't use the summed discount rate of all three promotion types as regressor, because I have to use the three discount rates individually as independent variables - that would lead to multicollinearity, is that right?

How can my regression take into account that newly introduced models (dummy “new model”) sell better, since the dummy only applies to the respective new model and not to the whole car class?

Is it even possible to do all this with one regression, or would it make more sense to do individual regressions per car class? When doing several separate regressions, is there a way to generalize them into a overall estimation with coefficients, or should I evaluate each regression and its coefficients by itself then?

I know that is a lot of questions. I would also already be very happy if someone could give me tips in the right direction. I've never had to evaluate something like this before and I'm completely lost - I hope I'm somewhat on the right track with my thoughts and the preparation of the data so far

PS: I use R for the data analysis

Best,

Lamadrama

Last edited: