Erasing duplicate variables (by date) from a large dataset

Hi there,

I am university student doing my Honours in Economics this year, I am currently stuck in the data management process of my research project.

I have been provided a dataset by NZQA of over 3 milloin observations.

Each observation contains a student number, date of birth, date that the data was recorded and a number of other control variables.

Basicaly, my problem is that there are many instances where data for a student has been recorded 3 or 4 years in a row, meaning that there are 3 or 4 observations for that single student, however I only require one observation per student.

The observation that is most important to me is the most recent one (i.e. if data was recorded for a student in 2003, 2004, and 2005 I would like to drop the 2003 and 2004 observations for that student).

So in a nutshell, I want to erase all the duplicates of student number and be left with the most recent observation. I have spent a fair amount of time looking for a similar example but I have had no such luck, I am relatively new to Stata which is why I am struggling on this part of my research project.

Thanks in advance for any help or comments,


Does every student have data for every year, or just some years? (i.e is the most recent year the same for everyone?)
Thanks for the quick reply!

The data varies over about 10 years,so there might be data on one student from 2003, 2004,and 2005, while there might be data for a second student from 2006,2007,2008.

The observation that I would want from the 1st student would be that of year 2005, while for the second student I would require the observation from 2008.

Is this clear? I can give a more detailed explination of the data if you need a better idea about it.


by student (year), sort: keep if _n==_N
Hi there, thanks for the reply

If you see my reply to the above comment, not all students have the same years of data observations.

let me be more specific

The last years that observations were taken for my dataset was in 2010, however this wasn't for every student.

I believe that the command you provided will only keep observations of people for whom data was taken in 2010. So for people who had data collected in 2003,2004 and 2005, will not be kept in the data set, and I need to have them still.

Thanks for the help though,

If they each have exactly three years of data (do they?) this can be solved using:

bys studentnumber year: egen count=seq()
keep if count==3

Edit: Sorry just reread your first post and realised you said they had 3 or 4 years, this will only help if they all had the same number of years, I will have a think further.


. list, sepby(student ) noobs

  | student   year   x |
  |       1   2003   1 |
  |       1   2004   2 |
  |       1   2005   3 |
  |       2   2006   4 |
  |       2   2007   5 |
  |       2   2008   6 |

. by student (year), sort: keep if _n==_N
(4 observations deleted)

. list

     | student   year   x |
  1. |       1   2005   3 |
  2. |       2   2008   6 |
Oh yes [by student (year), sort: keep if _n==_N] does indeed work, my apologies.

The reason I was thrown off was because of a prior amendment that I had made to the data which included removing all people who were above a certain age. When I typed in your code last night and saw what was displayed in the dataset I had not taken into consideration the effect of my previous data manipulations.

Thanks for your help,