Analyzing Tagged Content: Is a Multiple Regression Analysis the Right Approach?

#1
I’m working with a relatively small data set that consists of several hundred social media posts, key engagement metrics and up to 10 content “tags” that describe the image of each post. We leveraged Google Vision API along with a manual review to construct the tags. I’ve linked to an example of what we’re working with here (http://imgur.com/vcZkWi9).

What I’m trying to do: I would like to leverage a statistically valid methodology to identify which one or more (in combination) of tags tend to perform the best across the data set. It’s easy enough to look at an individual tag and calculate the mean of the KPI, but any suggestions on how to evaluate combinations of tags that yield high performance? It wouldn’t necessarily need to be all tags in combination, but could be 3 out of the 10 perform the best.

What approach would you recommend to understand what tags are most closely associated with the highest mean KPI score? I’ve been debating whether a multiple regression analysis is best, but looking for some insight on this.
 

CowboyBear

Super Moderator
#2
What I’m trying to do: I would like to leverage a statistically valid methodology
Do you mean "use a statistically valid methodology"? Cf. "leverage".

to identify which one or more (in combination) of tags tend to perform the best across the data set. It’s easy enough to look at an individual tag and calculate the mean of the KPI, but any suggestions on how to evaluate combinations of tags that yield high performance? It wouldn’t necessarily need to be all tags in combination, but could be 3 out of the 10 perform the best.
I'm not sure this is doable given your dataset size. If you're interested in combinations of 3 tags at once, you're looking at 10! / (3! × (10 - 3)!) = 120 different combinations, so 119 regression slopes to estimate. Given that you only have several hundred data points to work from, you won't be able to get remotely precise estimates (e.g., consider that each 3-way combination will only come up a handful of times at best). If you really want to look at the combinations in this data-driven way, you will need thousands of data points.
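
To make the counting concrete, here is a rough sketch of the arithmetic in Python. It assumes a vocabulary of exactly 10 tags, and `N_POSTS = 300` is only a hypothetical stand-in for "several hundred" posts, not a figure from the actual data set. The last line is a deliberately optimistic case: it pretends every post carries exactly 3 tags and that posts are spread evenly over all 3-tag combinations.

```python
from math import comb

N_TAGS = 10    # assumed size of the tag vocabulary
N_POSTS = 300  # hypothetical stand-in for "several hundred" posts

# Number of distinct tag sets of each size k = 1, 2, 3.
counts = {k: comb(N_TAGS, k) for k in (1, 2, 3)}

# Total number of tag sets if singles, pairs, and triples are all considered.
total_up_to_3 = sum(counts.values())

# Optimistic case: every post has exactly 3 tags, spread evenly
# across all possible 3-tag combinations.
posts_per_triple = N_POSTS / counts[3]

print(counts)            # {1: 10, 2: 45, 3: 120}
print(total_up_to_3)     # 175
print(posts_per_triple)  # 2.5
```

Even under that optimistic assumption, each 3-tag combination would be backed by only two or three posts on average, and real tag usage is likely to be far less even than that.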