# Rookie with Stata - problems with merging two data sets

#### larip

##### New Member
Rookie with Stata - merging data sets, calculating deviation, and adding up

There is a problem to merge two data sets. Data set A contains 1.5 million daily observation of nearly 6000 companies as opposed to data set B that comprises industry classification scheme (i=1,....49) based on SIC codes ranging from 100 to 9999. The aim is to merge the data sets and put each observation into its appropriate classification basket based on their SIC codes. How do I do this on Stata?

I tried to the following: . merge 1:m siccd using filename but it didn't work. I have both files on csv. format and the classification scheme was built on Excel. When I open the data set B on Stata I have all information in one column whereas in Excel there are two colums "siccd" and "siccd49". Could this be a problem?

Thank you so much!

Last edited:

#### bukharin

##### RoboStataRaptor
If you tried something and it didn't work, it's helpful to show exactly what you tried and what the output was (and why it wasn't what you wanted).

In any case I would suggest the following:
1. Save each dataset in Stata format
2. Open dataset A
3: Merge in the info from dataset B:
merge m:1 siccd using filename, keep(match master)

I don't really understand what you mean about having all the information in one column in Stata vs 2 columns in Excel. Does the Stata dataset contain everything you want? If yes, that's good. If no, you may have a problem with your importing from Excel to Stata.

#### larip

##### New Member
This is what I am trying to do: for each observation in the master data, find the corresponding observation in the
using data.

Master Data (data set A)
Company 1, SIC 150
Company 2, SIC 4590
Company 3, SIC 3456
.....
Company 6000, SIC 8880

The Using data (data set B)
SIC 150, 1
SIC 4590, 2
SIC 3456, 2
....
SIC 8880, 17

(i=1,2,3,4,......,49)

The using data in Stata (information in the same column):
v1
siccd;siccd49
100;1
101;1
102;1
103;1
104;1
105;1
106;1
107;1
108;1
109;1
110;1

How can I move siccd49 to v2 in Stata? In excel siccd and siccd49 have their own colums but not in Stata. Thank you!

#### bukharin

##### RoboStataRaptor
OK I see now; the problem is that the transfer from Excel to Stata has put both variables (siccd and siccd49) into one variable called v1. The best solution would be to re-do your import from Excel into Stata so that you end up with 2 variables (siccd and siccd2).

However, you can extract the data from within Stata if you like. I'm not really an export at string manipulation but here's one kludgy approach:
gen siccd=real(substr(v1, 1, strpos(v1, ";")-1))
gen siccd49=real(substr(v1, strpos(v1, ";")+1, .))

(you'll also want to delete the observation, ie row, that contains "siccd;siccd49" instead of actual data)

#### larip

##### New Member
Thanks Bukharin! You have been so helpful so far! I managed to divided the data into two colums. However, I am not sure what's wrong now.

. merge 1:m siccd using siccd49
variable siccd does not uniquely identify observations in the master data
r(459);

Is there a problem with my master data or the using data?

#### larip

##### New Member
Thanks Bukharin! You have been so helpful so far! I managed to divided the data into two colums. However, I am not sure what's wrong now.

. merge 1:m siccd using siccd49
variable siccd does not uniquely identify observations in the master data
r(459);

Is there a problem with my master data or the using data?
I was able to resolve the problem by iteration. I used command merge m:1 siccd using siccd49. However, here's where the fun begins.

1. I want to calculate the daily firm-specific residual. Does this make sense:
by date sicff49, sort: egen newvar=mdev(ret)

I want to arrange observations by dates and siccff49 and then calculate the daily firm specific residual that is the difference between (the return on day s in month t of stock j that belongs to industry i) and (the value weighted return of industry i on day s in month t). Remember that sicff49 represents individual industries in this case (i=1,2,3,.....49).

2. Next I want to obtain the month-t residuals of stock j that belongs to industry i (i=1,2,3,....49).

#### bukharin

##### RoboStataRaptor
Yes, Stata can definitely do it. I'm not sure you egen command is correct though... it looks like you first have to calculate the "value weighted return of industry i on day s in month t". I would need to see a formula to help you calculate this. Once you've calculated it, the residual is obviously easy enough to calculate as the difference between the return and the value-weighted return.

#### larip

##### New Member
Okey! It doesn't sound too complicated. The only problem though is the fact that I have never really used Stata.. This is what I am trying to do:

E = the daily firm-specific residual
Rijst = the return on day s in month t of stock j that belongs to industry i
Rist = the value-weighted return of industry i on day s in month t
wijt-1 = the market capitalization at the end of the previous month to obtain the weights for stocks within the industry
IV = idiosyncratic volatility
j = an individual stock
i = an industry (i=1,2,3,…..49)

#### bukharin

##### RoboStataRaptor
Okay, so basically you want to calculate Eijst - but how do you do the weighting for Rist? (I'm not an econometrician, by a long short - sorry)

#### larip

##### New Member
Okay, so basically you want to calculate Eijst - but how do you do the weighting for Rist? (I'm not an econometrician, by a long short - sorry)
I use market capitalization as weights in Rist. I know the price and the number of shares outstanding on a daily basis, so I can calculate the market capitalization for individual stocks on day s in month t.

Last edited:

#### larip

##### New Member
I have gotten pretty far!

This is what I have done so far:
. merge m:1 siccd using siccd49

1.

. by date siccff49, sort: egen industryreturn=mean(ret)
• I used normal average since I couldn't figure out how to use weighted average
. gen firmresidual=ret-industryreturn

2.

. gen firmres2=firmresidual^2
. tostring date, generate(datestr)
. gen month=substr(datestr,5,2)
. destring month, replace
. by permno month, sort: egen IVj=total(firmres2)
• permno is a code in my dataset that specifies individual companies

From here onwards I have two steps left.

3.

Calculate the value-weighted average idiosyncratic volatility for each industry, where market capitalizations at the end of the previous month are used to obtain the weights within the industry.

. gen marketcap=prc*shrout
. by sicff49 month, sort: egen industry_cap=total(marketcap)
. by permno month, sort: gen IVindustry=total(IVj*IVj(_n-1)/industry_cap)

Can someone advise what's wrong with the final step? Is the approach I am pursuing right to calculate the weighted-average?

4.

Final step is to calculate the value-weighted average of the monthly industry IV to obtain IV across all firms (sicff49=1,2,3,.....49) in a given month t.

Thank you!

#### bukharin

##### RoboStataRaptor
by permno month, sort: gen IVindustry=total(IVj*IVj[_n-1]/industry_cap)