Finding significant variables

#1
Hi
I have a data table consisting of normalized frequencies (out of 100) of certain linguistic features. Please see the sample below.
Feature                               NewsBlogs      MetroBlogs     Columns
Amplifiers                            0.183680309    0.185342163    0.168975265
Independent clause coordination       0.375209206    0.458896247    0.380954064
Causative adverbial subordinators     0.122236961    0.078101545    0.096996466
Demonstrative pronouns                0.406065178    0.358454746    0.384134276
Discourse particles                   0.043535051    0.050551876    0.031095406
Emphatics                             0.638847805    0.659161148    0.467985866
First person pronouns                 2.155909474    2.232891832    1.30409894
Hedges                                0.018630505    0.039933775    0.012226148
Indefinite pronouns                   0.061982639    0.030309051    0.040070671
it                                    1.081559995    0.995077263    1.221272085
There are 34 features in total and three columns, as shown above. I want to see which features (variables) have significant frequency differences across all three columns, or between any two of them. Is there a statistical test for this?
I am not sure whether ANOVA or a t-test is relevant here (I have heard of them often but do not understand how they work or when they apply).
Regards
 

Karabiner

TS Contributor
#2
I have a data table consisting of normalized frequencies (out of 100) of certain linguistic features.
I do not understand this, I'm afraid. What is the study about? What did you actually do, and how were the data collected? And what do you mean by normalized frequencies out of 100?

With kind regards

K.
 
#3
My apologies for not explaining.
I am studying English used in blogs and print media (newspapers etc.). As I said, there are 34 linguistic features for which I have collected frequencies and made a table.
A normalized frequency means a percentage in this context. So, for example, first person pronouns ("I") occur about 2.15 times per 100 words in News Blogs. Using percentages makes it easier to compare samples of different sizes (for example, News Blogs contain 100,000 words while Columns contain 40,000 words). [I know such normalization has disadvantages, but it is fine for my purpose.]
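For illustration, a minimal sketch of the normalization arithmetic (the raw count below is a made-up placeholder, not an actual figure from my corpus):

```python
# Normalized frequency = occurrences per 100 words of the corpus.
raw_count = 2156          # hypothetical count of "I" in News Blogs
corpus_size = 100000      # total words in News Blogs

normalized = raw_count / corpus_size * 100
print(normalized)         # 2.156 occurrences per 100 words
```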
Since the table is quite long (34 different variables), I want to reduce this number and select only the most significant variables. Here, significant means those whose values vary across all three (or any two) columns. I could of course do this manually, by reading each variable, comparing its values, and deciding whether a large difference is present. For example:
First person pronouns 2.155909474 2.232891832 1.30409894
It seems that first person pronouns ("I") have quite a high frequency in NewsBlogs and MetroBlogs (both are blogs written by people online), while the same variable has a low frequency in Columns (a column is an opinion piece published in a printed newspaper). As I said earlier, I could select such variables manually, but I wanted a statistical test to make the selection more reliable, consistent, and free of personal bias.
Hopefully I was able to explain my point of view.
Regards
 

Karabiner

TS Contributor
#4
Maybe you don't use percentages (BTW, is there a particular reason for displaying 10 decimal places?) but frequencies instead. For each corpus you know how many words have a particular feature and how many don't (something like: in corpus A, 38,500 words are not "I" and 400 words are "I"; in corpus B, 99,007 words are not "I" and 1,200 are "I"). You could then perform a 2x3 Chi² test of association for each feature, perhaps with a Bonferroni correction for multiple testing (34 features).
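For instance, a minimal sketch in Python (scipy) for one feature; the counts and corpus sizes here are made-up placeholders only:

```python
from scipy.stats import chi2_contingency

# Hypothetical raw counts of "I" and corpus sizes for the three corpora.
feature_counts = [2156, 400, 522]          # occurrences of "I" (made up)
corpus_sizes   = [100000, 18000, 40000]    # total words per corpus (made up)

# 2x3 table: row 1 = words that are the feature, row 2 = words that are not.
table = [feature_counts,
         [n - k for n, k in zip(corpus_sizes, feature_counts)]]

chi2, p, dof, expected = chi2_contingency(table)

# Bonferroni correction: compare p against alpha divided by the number of features tested.
alpha = 0.05 / 34
print(f"chi2 = {chi2:.2f}, p = {p:.3g}, significant after correction: {p < alpha}")
```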

Admittedly, I don't know how well Chi² works with huge numbers of observations and low relative frequencies of one characteristic.

With kind regards

K.
 
#5
Hi
Thanks a lot for the explanation.
Showing 10 decimal places was unintentional. For the time being I have reduced my data set to two columns (instead of three): News Blogs and Columns. I did the following to see whether some variables show more than usual variation.
Feature                        NewsBlogs    Columns    NewsBlogs/Columns*100
First person pronouns          2.156        1.304      165
Hedges                         0.019        0.012      152
Indefinite pronouns            0.062        0.040      155
Second person pronouns         0.579        0.157      368
Stranded preposition           0.094        0.059      160
Subordinator that deletion     0.168        0.100      167
WH-clauses                     0.066        0.039      168
As the formula in the last column shows, I divided one value by the other and multiplied by 100 to make it easier to read. Then I selected the ratios that were above 150 or below 50 (only one variable came close, at 53, but I have discarded it for the time being). I am not sure what this measure is called (maybe a coefficient), but I thought it would be a good starting point.
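A small sketch of this screening in Python (the frequencies are the rounded values from the table above, so the printed ratios can differ slightly from the ones shown there):

```python
# Normalized frequencies (per 100 words), rounded to three decimals.
data = {
    "First person pronouns":      (2.156, 1.304),
    "Hedges":                     (0.019, 0.012),
    "Indefinite pronouns":        (0.062, 0.040),
    "Second person pronouns":     (0.579, 0.157),
    "Stranded preposition":       (0.094, 0.059),
    "Subordinator that deletion": (0.168, 0.100),
    "WH-clauses":                 (0.066, 0.039),
}

for feature, (newsblogs, columns) in data.items():
    ratio = newsblogs / columns * 100      # NewsBlogs / Columns * 100
    if ratio > 150 or ratio < 50:          # my screening thresholds
        print(f"{feature}: {ratio:.0f}")
```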
Now, about chi-square: I think it does something like this as well. As per your suggested method, I would need the frequency of a word and the total number of words (or is it the total minus the variable's frequency?). I have the total number of words in each corpus, but the raw frequencies are not available; only the percentages are. I now need to explore chi-square and whether it can be applied to this situation.
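If I understand correctly, approximate raw counts can be recovered from the percentages and the corpus totals (count ≈ percentage / 100 × total words) and then fed into the test. A sketch for one feature, assuming the corpus sizes I mentioned (100,000 and 40,000 words):

```python
from scipy.stats import chi2_contingency

# Recover approximate raw counts from percentages and corpus sizes.
pct    = {"NewsBlogs": 2.156, "Columns": 1.304}     # per 100 words
totals = {"NewsBlogs": 100000, "Columns": 40000}    # total word counts

counts = {c: round(pct[c] / 100 * totals[c]) for c in pct}   # ~2156 and ~522

# 2x2 table: row 1 = feature, row 2 = all other words, per corpus.
table = [[counts[c] for c in pct],
         [totals[c] - counts[c] for c in pct]]

chi2, p, _, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, p = {p:.3g}")
```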
Regards