Doublecheck: MMerge on Three Variables?

#1
Thanks for taking the time to read this!

The Situation
I want to merge two datasets together. For the sake of example, the datasets contain the following info..
Master: SpecificStockPicks, Player, CompanyName, TickerSymbol, StartMonth
Using: CompanyName, TickerSymbol, S&PMonth, StartMonth

Qualitatively, (DataSet2's) StartMonth is exactly the same as the S&PMonth (the month a company was in the S&P500).

In the Master, there are many players. And, each player could have chosen the same stock albeit at the same or at different times. In the Using, many companies show up in the S&P500 month after month (S&PMonth). By hand, I made sure that the Master's company names matched the Using's. However, the TickerSymbols may not match.

What I Did
After performing the following mmerge, I wind up with more observations in the master dataset than before - meaning I can't have done the merge correctly, right?
Code:
mmerge CompanyName StartMonth using "Using"
So, I decided to mmerge on three variables.
Code:
mmerge CompanyName StartMonth TickerSymbol using "Using"
After dropping some observations that I don't need, I wind up with exactly the same amount of observations. As I am inexperienced with Stata, I myself cannot confirm whether I have gotten the merge exactly right. That is, for every SpecificStockPick, does my Master data set now show whether it was in the S&P500 or not? Are there any possible problems?

What do you TalkStats frequenters think??
 

bukharin

RoboStataRaptor
#2
I am not sure I entirely understand what you're trying to do, but I can offer some general advice for mmerge:

- always specify the expected relationship between matching variables (eg type(n:n), type(1:n) etc)

- always tell it what you want it to do with non-matches (eg unm(master))

- if S&PMonth in the using dataset is supposed to match StartMonth in the master dataset, use the umatch() option of mmerge to specify this condition. I can't tell from your post whether StartMonth and S&PMonth in the using dataset are the same, and whether this is because you just copied S&PMonth for matching purposes

- I think you're saying that TickerSymbol may not necessarily match even if the records should be matches - in this case you should definitely not include TickerSymbol in your list of matching variables, because you're excluding some matches that should occur

So if I understand your question correctly, you may be after this kind of merge:
Code:
mmerge CompanyName StartMonth using "Using", type(1:1) unm(master) umatch(CompanyName S&PMonth)
 
#3
1) What exactly does "unm()"? So, if the data and using don't match, it "unm()" doesn't match? Wouldn't the same thing be accomplished by dropping the unmatched values in a subsequent command?

2) In the Using dataset, the S&PMonth and StartMonth are duplicates of each other. I did this so that Using's StartMonth would match to Master's StartMonth.

3) Yes, TickerSymbol may not match. But, if I only match on "CompanyFullName" and "StartMonth", then I get more observations than I had started with.
 

bukharin

RoboStataRaptor
#4
1. As documented in -help mmerge-, unm() tells -mmerge- what to do about unmatched observations.

If you say unm(master) it means that you keep every observation in the master dataset, even if it's not matched to an observation, but only keep matched observations from the using dataset. unm(using) is the reverse, and unm(both) means you don't throw out any observations. unm(both) is the default and means nothing will be deleted, and unm(none) means you only want matched observations.

It comes down to what you're trying to do. For example, if you have a whole bunch of observations in master, and you're "looking up" additional info from using, then you would want to match with type(n:1) and unm(master). You wouldn't want unm(both) because none of the unmatched values in the "lookup" table are relevant to you.

Yes, you can use unm(both) and then drop the unwanted observations, but this takes longer and uses more memory. It also exposes you to the risk of dropping the wrong observations by accident. I also personally find that forcing myself to specify unm() and type() makes me think more carefully about my data structure and what I'm actually trying to achieve.

2) I thought so. You don't need to do this; you can use the umatch option as per my example

3) That's because you're matching type(n:n) or type(1:n). If I understand your intention you need to use either type(n:1) or type(1:1)
 
#5
I appreciate the help!

After trying your method, the merge still isn't working. Generally, the unm() seem unnecessary. And the types are off. For the moment, I'm going to try to ensure that all the tickersymbols in both files are the same. THEN, I hope I can merge on three variables.

Here is specifically what Stata says:

Code:
. merge 1:1 companyfullname startmonth using "C:\Users\MightyMouse\Desktop\Kevin Desktop\Senior Year\
> Thesis\Data\S&P\S&P500_0324_Start.dta"
variables companyfullname startmonth do not uniquely identify observations in the master data
r(459);

. mmerge companyfullname startmonth using "C:\Users\MightyMouse\Desktop\Kevin Desktop\Senior Year\The
> sis\Data\S&P\S&P500_0324_Start.dta", type(1:n)
match-var in master should form a key
duplicate values in match-var(s)
__FREQ already defined
r(110);

. mmerge companyfullname startmonth using "C:\Users\MightyMouse\Desktop\Kevin Desktop\Senior Year\The
> sis\Data\S&P\S&P500_0324_Start.dta", type(1:1) unm("C:\Users\MightyMouse\Desktop\Kevin Desktop\Seni
> or Year\Thesis\Data\CAPS Kevin\Practice\MMerge_0322_03_Pre-TestData.dta") unm(companyfullname spmon
> th)
option unm() not allowed
r(198);

. mmerge companyfullname startmonth using "C:\Users\MightyMouse\Desktop\Kevin Desktop\Senior Year\The
> sis\Data\S&P\S&P500_0324_Start.dta", type(1:1) unm("C:\Users\MightyMouse\Desktop\Kevin Desktop\Seni
> or Year\Thesis\Data\CAPS Kevin\Practice\MMerge_0322_03_Pre-TestData.dta") unm(companyfullname)
option unm() not allowed
r(198);

. mmerge companyfullname startmonth using "C:\Users\MightyMouse\Desktop\Kevin Desktop\Senior Year\The
> sis\Data\S&P\S&P500_0324_Start.dta", type(1:1) unm("C:\Users\MightyMouse\Desktop\Kevin Desktop\Seni
> or Year\Thesis\Data\CAPS Kevin\Practice\MMerge_0322_03_Pre-TestData.dta")
C:\Users\MightyMouse\Desktop\Kevin Desktop\Senior Year\Thesis\Data\CAPS Kevin\Practice\MMerge_0322_03
> _Pre-TestData.dta invalid for unmatched(Both Master Using None)
r(198);

. mmerge companyfullname startmonth using "C:\Users\MightyMouse\Desktop\Kevin Desktop\Senior Year\The
> sis\Data\S&P\S&P500_0324_Start.dta", type(1:1) unm(Master)
Master invalid for unmatched(Both Master Using None)
r(198);
 

bukharin

RoboStataRaptor
#6
Try:
Code:
mmerge companyfullname startmonth using "C:\Users\MightyMouse\Desktop\Kevin Desktop\Senior Year\Thesis\Data\S&P\S&P500_0324_Start.dta", type(n:1) unm(master)
 
#7
Nope, but I think the triple merge on tickersymbol companyfullname startmonth worked. Is there anything that might go wrong?

Code:
. mmerge companyfullname startmonth using "C:\Users\MightyMouse\Desktop\K
> evin Desktop\Senior Year\Thesis\Data\S&P\S&P500_0324_Start.dta", type(n
> :1) unm(master)
match-var in using data should form a key
duplicate values in match-var(s)
Maybe you matched up types n:1 and 1:n
 

bukharin

RoboStataRaptor
#8
Well, the error message is telling you that in the using dataset, there are multiple records with identical companyname & startmonth. If I understand your data structure properly (which is seeming unlikely!!) then each tickersymbol should relate to a particular company, and therefore adding tickersymbol to your merging variables list shouldn't change the fact that you have multiple entries per company in the using dataset. Did it work using type(n:1) or type(n:n)? If it worked with type(n:1) then I suspect you have an error in your using dataset.

Without clearly understanding your data structure I couldn't really say whether there is any downside to your approach, except to repeat my general advice that if specifying the type of match doesn't work, it probably means that your data is not structured the way you think it is - which of course is a recipe for unwanted effects.
 
#9
Bukharin, I appreciate your willingness to continually help! =)

I've attached a sample of the data from the using and master. Hope you can help me out somehow =/ (Please download b/c I'm going to take it off in a couple of days).

On the triple merge, that's my intuition too. Now that the tickersymbol and companyfullnames match completely, I should be getting the same problem as I had before - yet, I don't when I am able to (n:n) merge? Here's the code from the triple merge. The only thing that seems a little off is that I've never seen "code==-1" before. What does that mean exactly?:

Code:
. mmerge  companyfullname startmonth tickersymbol using "C:\Users\S&P500_0321.dta"

-------------------------------------------------------------------------------
merge specs          |
       matching type | n:n
  mv's on match vars | none
  unmatched obs from | both
---------------------+---------------------------------------------------------
  master        file | C:\Users
                 obs | 3493902
                vars |     22
          match vars | companyfullname startmonth tickersymbol  (not a key)
  -------------------+---------------------------------------------------------
  using         file | C:\Users\S&P500_
> 0321.dta
                 obs |  41813
                vars |      8
          match vars | companyfullname startmonth tickersymbol  (not a key)
---------------------+---------------------------------------------------------
result          file | C:\Users
> Practice\0325_MMergePickPlayerTick.dta
                 obs | 3514611
                vars |     29  (including _merge)
         ------------+---------------------------------------------------------
              _merge |  14622  obs matchvar==missing in master data  (code==-1)
                     | 2430051  obs only in master data                (code==1)
                     |  20709  obs only in using data                 (code==2)
                     | 1049229  obs both in master and using data      (code==3)
-------------------------------------------------------------------------------
 

bukharin

RoboStataRaptor
#10
Thanks for the example data. I still don't quite understand why you want to merge the datasets - ie what precisely is it that you need from both datasets? This would drive the type of match you're after.

As I mentioned earlier, if n:1 works with tickersymbol but not without tickersymbol, it means you have multiple entries of companyname + startmonth in the using dataset with different tickersymbols. Whether or not this is correct I can't say (although I must say it sounds like a problem)
 
#11
Hey bukharin, thanks for your attention again.

I've check several times before and after a triple mmerge, and am confident that the data is what I need it to be.

Thanks for your help!