Help with thesis research on social bookmarking


Hi, I'm writing my thesis right now and I must admit that I underestimated the complexity of the required statistics a bit. My description is quite extensive, so I apologize in advance, but I hope it enables someone to give me some advice on the key issues.

I'm researching free tagging in a social bookmarking system, and the units of analysis are the unique tags given to the URLs (i.e. a unique tag-URL combination is a unit). Tags are always given in the context of a bookmark (by a specific user), which in turn is assigned to a specific URL.
There are two different ways of assigning a bookmark (and the tags it contains) to a URL: either the entry form suggests the 5 most popular tags other users have used to describe the URL (they can be copied by clicking on them), or there are no suggestions or help at all. These two methods constitute the binary categorical variable I will use for comparison.

My hypothesis is that, in general, the tags used first to describe a URL will also be used much more frequently than those that "come later" (i.e. appear for the first time in later bookmarks for a specific URL). This is supported theoretically if you consider the laws of natural language and an unobservable "frequency in the minds", meaning that certain words are (independently between users) associated much more often with a certain content, which leads both to their earlier appearance as tags and to their more frequent use as tags.
Secondly, I assume that this relationship will be stronger if suggestions are present, meaning that the suggested (= most frequent) tags will be copied from the very start, resulting in an even higher frequency of use for the earlier tags.
In other words I'm trying to prove that the tags from earlier bookmarks influence the distribution of tags for a URL (by being used more often) when there are "most popular"-suggestions as compared to no suggestions, consequently distorting the collective description that users would give independently of each other.

The relationship between the independent variable "rank of first appearance" and the dependent variable "share of the tag's frequency relative to all tags of the URL" (which I used instead of the simple frequency) is not linear, but more or less follows a power law (an early rank means a very high frequency). This is indicated by the scatter plot and supported by the theoretical assumptions.
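To make the two variables concrete, here is a minimal sketch of how they could be computed for a single URL. It assumes the bookmarks are available as tag lists in chronological order, that "rank" means the order in which a unique tag first appears, and that a tag counts once per bookmark; the function name `rank_and_share` and the sample data are made up for illustration.

```python
from collections import Counter

def rank_and_share(bookmarks):
    """bookmarks: list of tag lists for ONE URL, in chronological order.

    Returns {tag: (first_rank, share)}, where first_rank is the 1-based
    order in which the tag first appeared and share is the tag's fraction
    of all tag assignments for the URL.
    """
    counts = Counter()
    first_rank = {}
    for tags in bookmarks:
        # dict.fromkeys drops duplicates within a bookmark but keeps order
        for tag in dict.fromkeys(tags):
            counts[tag] += 1
            if tag not in first_rank:
                first_rank[tag] = len(first_rank) + 1
    total = sum(counts.values())
    return {tag: (first_rank[tag], counts[tag] / total) for tag in counts}

# Hypothetical example: three bookmarks for one URL
example = [["python", "code"], ["python"], ["python", "tutorial", "code"]]
result = rank_and_share(example)
for tag, (rank, share) in result.items():
    print(tag, rank, round(share, 3))
```

If rank should instead be the index of the bookmark in which a tag first appears (so that several tags can share a rank), only the `first_rank` assignment would need to change.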

As I have only dealt with relatively simple linear and logistic regressions and their assumptions so far, I have run into a lot of problems that I'm trying hard to find answers to:

First, there are some basic problems that already arise when I compare the tags of both mode-of-entry categories for only one URL with sufficient tags assigned:
1. Non-linearity: As a remedy for non-linearity, I had to transform both variables to their natural logs, which gives a more or less linear relationship and normally distributed residuals (transforming only the dependent variable doesn't help). Still, I don't know whether I am overfitting the model and losing the real relationship between the variables.
2. Ratio as dependent variable: The dependent variable is the percentage of a unique tag's entries relative to all tags assigned to a URL, and I'm not sure what problems this causes. Usually I wouldn't use a linear regression model for counts/ratios, but rather (I guess) a Poisson regression, right? But since I log-transform the variable, I wonder whether this is still an issue.

3. Considering the above problems, is it still feasible to use a linear regression model on the transformed variables?
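A minimal sketch of the log-log fit described in points 1 to 3, assuming the data for each entry mode are lists of ranks and shares. On the log-log scale the slope is the power-law exponent, so the second hypothesis corresponds to a steeper (more negative) slope in the suggestion group; in a single pooled model this would be the interaction between log rank and a suggestion dummy. The function name and the synthetic numbers are illustrative only, not real results.

```python
import math

def loglog_fit(ranks, shares):
    """Ordinary least squares fit of log(share) = a + b * log(rank).

    Returns (a, b); b estimates the power-law exponent.
    """
    xs = [math.log(r) for r in ranks]
    ys = [math.log(s) for s in shares]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Synthetic, noise-free power laws (made-up data for illustration)
ranks = [1, 2, 3, 4, 5, 6, 7, 8]
shares_without = [0.5 * r ** -1.0 for r in ranks]
shares_with = [0.6 * r ** -1.5 for r in ranks]

a_without, b_without = loglog_fit(ranks, shares_without)
a_with, b_with = loglog_fit(ranks, shares_with)
print("slope without suggestions:", round(b_without, 3))
print("slope with suggestions:", round(b_with, 3))
```

This only compares point estimates; judging whether the two slopes differ significantly would still need standard errors, for example from a regression package.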

I do not have URLs with enough tags assigned both with and without suggestions to make a valid comparison, and such a comparison would have to be made separately for each URL (some don't even have both types of tags). I therefore chose to pool the tags to get a bigger sample, pretending they come from one big URL with suggestions and one without.

4. As my units (tags) come from different subgroups (URLs), I guess I should use a multilevel model. But I wonder whether that is really necessary if you assume that there is no logical reason for the effect of rank on frequency/share to differ between URLs as long as the tags were entered in the same way (with or without suggestions). Alternatively, would it be a remedy to only compare groups that contain the same URLs, i.e. the no-suggestion tags of a group of URLs compared with the tags of those same URLs entered with suggestions?
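The alternative mentioned in point 4 (comparing the same URLs under both entry modes) could be sketched as a simple two-stage approach: fit a log-log slope per URL and mode, then look at the per-URL slope differences. This is not a multilevel model, only a rough stand-in; the data layout, the names `slope` and `paired_slope_differences`, and the numbers are assumptions for illustration.

```python
import math
from statistics import mean

def slope(points):
    """OLS slope of log(share) on log(rank) for a list of (rank, share) pairs."""
    xs = [math.log(r) for r, _ in points]
    ys = [math.log(s) for _, s in points]
    mean_x, mean_y = mean(xs), mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def paired_slope_differences(data):
    """data: {url: {"with": [(rank, share), ...], "without": [...]}}.

    Returns {url: slope_with - slope_without} for URLs that have at least
    two tags in BOTH modes; all other URLs are skipped.
    """
    diffs = {}
    for url, modes in data.items():
        with_pts = modes.get("with", [])
        without_pts = modes.get("without", [])
        if len(with_pts) >= 2 and len(without_pts) >= 2:
            diffs[url] = slope(with_pts) - slope(without_pts)
    return diffs

# Made-up data: url_a has both modes, url_b only has tags without suggestions
data = {
    "url_a": {
        "with": [(r, 0.6 * r ** -1.5) for r in range(1, 6)],
        "without": [(r, 0.5 * r ** -1.0) for r in range(1, 6)],
    },
    "url_b": {"without": [(r, 0.4 * r ** -0.8) for r in range(1, 6)]},
}
diffs = paired_slope_differences(data)
print(diffs)
```

A negative difference means the curve is steeper with suggestions for that URL. The differences could then be summarized across URLs (for example with a paired test), which sidesteps pooling tags from different URLs.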

5. In my view it is not possible to compare tags from URLs that were assigned different numbers of bookmarks (and therefore different absolute numbers of tags), for various reasons (A. the percentage scores would be mixed, so a high share would be treated the same for a URL with only two tags as for one with 500 tags; B. the distributions of the URLs would be in different states of "development"; etc.). Therefore I assigned the URLs to groups (e.g. URLs with 30 to 40 BMs, 20 to 30, ...) and "cut back" those with more BMs, leaving out the BMs that were assigned to them later, so that I only compare regression lines between URLs with the same number of BMs. Is this statistically legitimate?
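The "cutting back" in point 5 can be sketched as follows, assuming each URL's bookmarks are stored in chronological order. The function name `truncate_urls` and the sample data are hypothetical.

```python
def truncate_urls(bookmarks_by_url, n):
    """Keep only URLs with at least n bookmarks and cut each to its first n.

    bookmarks_by_url: {url: [tag_list, ...]} with bookmarks in chronological order.
    """
    return {
        url: bookmarks[:n]
        for url, bookmarks in bookmarks_by_url.items()
        if len(bookmarks) >= n
    }

# Made-up data: only url_a and url_c have at least 3 bookmarks
bookmarks_by_url = {
    "url_a": [["web"], ["web", "tools"], ["tools"], ["web", "css"]],
    "url_b": [["news"], ["news", "daily"]],
    "url_c": [["maps"], ["maps", "travel"], ["travel"]],
}
truncated = truncate_urls(bookmarks_by_url, 3)
for url, bookmarks in truncated.items():
    print(url, len(bookmarks))
```

After this step every remaining URL has exactly `n` bookmarks, so shares are computed over comparable totals.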

OK, if anyone has read through all of this, I'm already thankful. Maybe you have an idea of what I'm trying to do and some suggestions come to mind. I'd appreciate any help.


In the process of developing a thesis it is very important to pay close attention to the thesis idea, which can help to create an outstanding piece of writing. The thesis idea is the purpose of the work: why it was created and what is expected of it. Your thesis is good and will be useful for many web browsers.
