I've been helping a colleague with his research project on the trustworthiness of profile texts. He is analyzing text entries using the LIWC tool, which generates around 90 dimensions from text input (positive/negative affect, pronoun use, etc.). Most of these are measured at the ratio or interval level. We might not use all of them, but we will probably want to use somewhere between 30 and 70.

In this exploratory research, he wants to identify important predictors of perceived trustworthiness in profile text.

Due to sampling constraints we are likely to get no more than a few hundred (150-400) cases. The outcome is a human rating of trustworthiness. So we have a large number of predictors relative to cases (roughly somewhere between 1:3 and 1:10), which might be a problem. Since it's exploratory research I don't think statistical tests make a lot of sense, but I do want to avoid overly spurious results.

Now, my question is this: what kind of approach would be most useful for this problem? Preferably something not overly complex, as neither of us is a statistician. Ideally it should be doable in Stata, because that is what we're working with.

My current thinking is something like this:

1. Investigate and report bivariate correlations for all predictors with the outcome

2. Then build a linear regression model with some kind of variable selection, for instance forward selection, backward elimination, or LASSO (the last one might be a bit too complex).
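
To make the two steps concrete, here is a rough sketch on simulated data. It is written in Python rather than Stata purely for illustration, and everything in it is an assumption on my part: the data are made up (200 cases, 30 predictors, only the first two actually related to the outcome), the penalty value of 0.15 is arbitrary rather than tuned, and the LASSO is a bare-bones coordinate descent, not what any particular package does internally.

```python
import random
import statistics


def standardize(column):
    """Center a column to mean 0 and scale it to (population) SD 1."""
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)
    return [(v - mean) / sd for v in column]


def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; exactly zero if |value| <= penalty."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso(columns, y, penalty, sweeps=100):
    """LASSO by cyclic coordinate descent.

    Assumes every predictor column and y are standardized. Minimizes
    (1 / 2n) * sum(residual^2) + penalty * sum(|beta_j|).
    """
    n = len(y)
    beta = [0.0] * len(columns)
    residual = list(y)
    for _ in range(sweeps):
        for j, col in enumerate(columns):
            # Covariance of predictor j with the residual, adding back its own effect
            rho = sum(c * r for c, r in zip(col, residual)) / n + beta[j]
            new = soft_threshold(rho, penalty)
            if new != beta[j]:
                delta = new - beta[j]
                residual = [r - delta * c for r, c in zip(residual, col)]
                beta[j] = new
    return beta


# Simulated stand-in for the real data (hypothetical, for illustration only)
random.seed(1)
n_cases, n_predictors = 200, 30
raw = [[random.gauss(0, 1) for _ in range(n_cases)] for _ in range(n_predictors)]
y_raw = [0.8 * raw[0][i] - 0.6 * raw[1][i] + random.gauss(0, 1)
         for i in range(n_cases)]

X = [standardize(col) for col in raw]
y = standardize(y_raw)

# Step 1: bivariate correlation of each predictor with the outcome
correlations = [sum(a * b for a, b in zip(col, y)) / n_cases for col in X]

# Step 2: LASSO; predictors with a nonzero coefficient count as "selected"
beta = lasso(X, y, penalty=0.15)
selected = [j for j, b in enumerate(beta) if b != 0.0]

print("Largest |r|:", sorted(range(n_predictors),
                             key=lambda j: -abs(correlations[j]))[:5])
print("Selected by LASSO:", selected)
```

In practice the penalty would presumably be chosen by something like cross-validation rather than fixed by hand, and rerunning the selection on resampled data would show how stable the selected set is.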

Does that seem at all workable? I'm a bit worried about forward selection and backward elimination, since from what I've read they don't produce very stable results.

Any very different ideas would be welcome too. I greatly appreciate any input!

