
Thread: Find out significant variables

  1. #1
    Points: 2,096, Level: 27
    Level completed: 64%, Points required for next Level: 54

    Posts
    12
    Thanks
    5
    Thanked 0 Times in 0 Posts

    Find out significant variables




    Hi
    I have a data table consisting of normalized frequencies (out of 100) of certain linguistic features. Please see the sample below.
    Feature                            NewsBlogs    MetroBlogs   Columns
    Amplifiers                         0.183680309  0.185342163  0.168975265
    Independent clause coordination    0.375209206  0.458896247  0.380954064
    Causative adverbial subordinators  0.122236961  0.078101545  0.096996466
    Demonstrative pronouns             0.406065178  0.358454746  0.384134276
    Discourse particles                0.043535051  0.050551876  0.031095406
    Emphatics                          0.638847805  0.659161148  0.467985866
    First person pronouns              2.155909474  2.232891832  1.30409894
    Hedges                             0.018630505  0.039933775  0.012226148
    Indefinite pronouns                0.061982639  0.030309051  0.040070671
    it                                 1.081559995  0.995077263  1.221272085
    There are 34 features in total and three columns as you can see above. I want to see which features (or variables) have significant frequency differences in all three or any two of these columns. Is there a statistical test available to do that?
    I am not sure whether ANOVA or a t-test is relevant (I have heard of them often but do not understand how they work or when they apply).
    Regards
    Last edited by true_friend; 05-16-2016 at 07:58 AM. Reason: typo

  2. #2
    TS Contributor
    Points: 17,749, Level: 84
    Level completed: 80%, Points required for next Level: 101
    Karabiner's Avatar
    Location
    FC Schalke 04, Germany
    Posts
    2,540
    Thanks
    56
    Thanked 640 Times in 602 Posts

    Re: Find out significant variables

    I have a data table consisting of normalized frequencies (out of 100) of certain linguistic features.
    I do not understand this, I'm afraid. What is the study about? And what did you actually do, how were the data collected? And, in addition, what do you mean by normalized frequencies out of 100?

    With kind regards

    K.

  3. #3
    Points: 2,096, Level: 27
    Level completed: 64%, Points required for next Level: 54

    Posts
    12
    Thanks
    5
    Thanked 0 Times in 0 Posts

    Re: Find out significant variables

    My apologies for not explaining.
    I am studying English used in blogs and print media (newspapers etc.). As I said, there are 34 linguistic features for which I have collected frequencies and made a table.
    A normalized frequency here means a percentage. For example, first person pronouns ("I") occur 2.15 times per 100 words in News Blogs. Using percentages makes it easier to compare data samples of different sizes (for example, News Blogs has 100,000 words while Columns has 40,000 words). [I know there are disadvantages to this kind of normalization, but that is OK for my purpose.]
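    The normalization described above can be sketched as follows. This is only an illustration: the function name and the raw counts are hypothetical, and the corpus sizes are the example sizes mentioned in the post.

```python
def normalized_frequency(count, total_words, per=100):
    """Occurrences of a feature per `per` words of running text."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return count / total_words * per


# Hypothetical raw counts, chosen only for illustration.
news_blogs = normalized_frequency(2156, 100_000)  # about 2.156 per 100 words
columns = normalized_frequency(522, 40_000)       # about 1.305 per 100 words
```

    Because both values are expressed per 100 words, they can be compared directly even though the two corpora differ in size.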
    Since the table is quite long (34 different variables), I want to reduce it to only the most significant variables. Here "significant" means the ones whose values vary across all three (or two) columns. I can of course do this manually by reading each variable, comparing its values, and deciding whether a large difference is present. For example:
    First person pronouns 2.155909474 2.232891832 1.30409894
    First person pronouns ("I") seem to have quite a high frequency in NewsBlogs and MetroBlogs (both are blogs written by people online), while the same variable has a low frequency in Columns (a column is an opinion piece published in a printed newspaper). As I said earlier, I can select other such variables manually, but I am looking for a statistical test to make the selection more reliable, consistent and free of personal bias.
    Hopefully I was able to explain my point of view.
    Regards

  4. #4
    TS Contributor
    Points: 17,749, Level: 84
    Level completed: 80%, Points required for next Level: 101
    Karabiner's Avatar
    Location
    FC Schalke 04, Germany
    Posts
    2,540
    Thanks
    56
    Thanked 640 Times in 602 Posts

    Re: Find out significant variables

    Maybe you should not use percentages (BTW, is there a particular reason for displaying 10 decimal places?) but raw frequencies instead. For each corpus you know how many words have a particular feature and how many don't (something like: in corpus A, 38500 words are not "I" and 400 words are "I"; in corpus B, 99007 words are not "I" and 1200 are "I"). You could then perform a 2x3 Chi² test of association for each feature, maybe with a Bonferroni correction for multiple testing (34 features).

    Admittedly, I don't know how well Chi² works with huge numbers of observations and low relative frequencies of one characteristic.
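    A minimal sketch of the per-feature test suggested above, in plain Python. Assumptions: the function names are made up for illustration, the counts for corpus A and B reuse the made-up numbers from this post, and the third corpus is entirely hypothetical. For a 2x3 table the test has 2 degrees of freedom, and in that special case the upper-tail p-value is exactly exp(-chi2/2), so no statistics library is needed.

```python
import math


def chi_square_2xk(feature_counts, corpus_sizes):
    """Pearson chi-square for a 2 x k table: feature vs. not-feature, by corpus.

    Returns (statistic, degrees_of_freedom), where df = k - 1.
    """
    if len(feature_counts) != len(corpus_sizes):
        raise ValueError("need one feature count per corpus")
    total_hits = sum(feature_counts)
    total_words = sum(corpus_sizes)
    total_other = total_words - total_hits
    if total_hits == 0 or total_other == 0:
        raise ValueError("expected counts would be zero")
    stat = 0.0
    for hits, size in zip(feature_counts, corpus_sizes):
        expected_hits = size * total_hits / total_words
        expected_other = size * total_other / total_words
        stat += (hits - expected_hits) ** 2 / expected_hits
        stat += ((size - hits) - expected_other) ** 2 / expected_other
    return stat, len(corpus_sizes) - 1


def p_value_df2(stat):
    """Upper-tail p-value for a chi-square statistic with exactly 2 df."""
    return math.exp(-stat / 2.0)


# Word counts for "I": corpus A and B as in the example above,
# corpus C is hypothetical.
hits = [400, 1200, 900]
sizes = [38500 + 400, 99007 + 1200, 60000]

stat, df = chi_square_2xk(hits, sizes)
p = p_value_df2(stat)

# Bonferroni: divide the usual 0.05 level by the number of features tested.
alpha = 0.05 / 34
significant = p < alpha
```

    The same call would be repeated once per feature, each time comparing the p-value with the corrected threshold.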

    With kind regards

    K.

  5. #5
    Points: 2,096, Level: 27
    Level completed: 64%, Points required for next Level: 54

    Posts
    12
    Thanks
    5
    Thanked 0 Times in 0 Posts

    Re: Find out significant variables


    Hi
    Thanks a lot for explanation.
    It was unintentional to show 10 decimal places. For the time being I have reduced my data set to two columns (instead of three), News Blogs and Columns. I did the following to see whether some variables show more than the usual variation.
    Feature                      NewsBlogs  Columns  NewsBlogs/Columns*100
    First person pronouns        2.156      1.304    165
    Hedges                       0.019      0.012    152
    Indefinite pronouns          0.062      0.040    155
    Second person pronouns       0.579      0.157    368
    Stranded preposition         0.094      0.059    160
    Subordinator that deletion   0.168      0.100    167
    WH-clauses                   0.066      0.039    168
    As the formula shows, I divided one by the other and multiplied by 100 to make it easier for me to read. Then I selected the values that were above 150 or below 50 (only one variable came close, at 53, but I have discarded it for the time being). I am not sure what this measure is called (maybe a coefficient), but I thought it would be a good starting point.
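    The screening step described above can be sketched like this. The function names are hypothetical; the values are the NewsBlogs and Columns figures from the first post, and the cutoffs 150 and 50 are the ones stated above.

```python
def ratio_percent(a, b):
    """Ratio of two normalized frequencies, expressed as a percentage."""
    return a / b * 100


def flag_large_differences(table, upper=150, lower=50):
    """Return {feature: rounded ratio} for ratios above `upper` or below `lower`."""
    flagged = {}
    for feature, (news_blogs, columns) in table.items():
        ratio = ratio_percent(news_blogs, columns)
        if ratio > upper or ratio < lower:
            flagged[feature] = round(ratio)
    return flagged


# (NewsBlogs, Columns) values taken from the table in the first post.
table = {
    "First person pronouns": (2.155909474, 1.30409894),
    "Hedges": (0.018630505, 0.012226148),
    "Indefinite pronouns": (0.061982639, 0.040070671),
    "Amplifiers": (0.183680309, 0.168975265),
    "it": (1.081559995, 1.221272085),
}

flagged = flag_large_differences(table)
```

    With these five rows, only the first three features pass the cutoffs; Amplifiers and "it" stay between 50 and 150 and are dropped.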
    Now about chi-square: I think it does something like this as well. With your suggested method I would need the frequency of a word and the total number of words (or is it the total minus the variable's frequency?). I have the total number of words in each corpus, but raw frequencies are not available; only percentages are. I think I now need to explore chi-square and whether it can be applied to this situation.
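    One possible way around the missing raw frequencies, sketched under assumptions: if the corpus sizes are known, approximate counts can be recovered from the percentages and fed into a 2x2 chi-square test. The function names are hypothetical, the corpus sizes are the example sizes mentioned earlier in the thread, and rounding the recovered counts introduces a small error. With 1 degree of freedom the upper-tail p-value equals erfc(sqrt(chi2/2)).

```python
import math


def approx_count(percent, total_words):
    """Recover an approximate raw count from a per-100-words frequency."""
    return round(percent / 100 * total_words)


def chi_square_2x2(hits_a, size_a, hits_b, size_b):
    """Pearson chi-square (no continuity correction) for feature vs. rest."""
    total_hits = hits_a + hits_b
    total_words = size_a + size_b
    total_other = total_words - total_hits
    if total_hits == 0 or total_other == 0:
        raise ValueError("expected counts would be zero")
    stat = 0.0
    for hits, size in ((hits_a, size_a), (hits_b, size_b)):
        expected_hits = size * total_hits / total_words
        expected_other = size * total_other / total_words
        stat += (hits - expected_hits) ** 2 / expected_hits
        stat += ((size - hits) - expected_other) ** 2 / expected_other
    return stat


def p_value_df1(stat):
    """Upper-tail p-value for a chi-square statistic with exactly 1 df."""
    return math.erfc(math.sqrt(stat / 2.0))


# First person pronouns; corpus sizes are the example sizes from the thread.
news_size, columns_size = 100_000, 40_000
news_hits = approx_count(2.156, news_size)
columns_hits = approx_count(1.304, columns_size)

stat = chi_square_2x2(news_hits, news_size, columns_hits, columns_size)
p = p_value_df1(stat)
```

    The "rest" cell for each corpus is the total number of words minus the feature's count, which is what the test needs alongside the count itself.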
    Regards
