I am starting a nested case-control study using a pre-existing dataset. I want to look at patient factors associated with relapse of tuberculosis, and I need to select my control group, but I am not sure of the best approach. Would generating a random sample in Stata be correct, or would incidence-density sampling be the better method? I cannot find a helpful online guide on how to do this in Stata. Many thanks for your help.

How many cases/controls do you want to have? How many do you have to select from?

Do you want to do a matched case-control analysis? You'll have more power. You can match one-to-one or one-to-two or ...

Your controls will be people who contracted TB exactly once? Your cases will be people who contracted TB twice, with the event time being the date of 2nd diagnosis?

Assuming the above, if you are doing a matched analysis, I think you want controls who had the first diagnosis BEFORE the case had his first diagnosis. And, you want the control alive at the time the case had his second diagnosis.

That would be best if you are doing a matched analysis. Second best, if you can't find a control who has an earlier 1st diagnosis, I think you at least want controls with longer follow-up time than their cases. But make sure to analyze the effect of first diagnosis date.

If you are not doing a matched analysis, then you will have to adjust for first diagnosis date and follow-up time. For controls, follow-up time is the time from 1st diagnosis to last contact (i.e. death or loss to the study); for cases, it is the time from 1st diagnosis to 2nd diagnosis.
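
To make that follow-up definition concrete, here is a minimal sketch in Python rather than Stata. The `Patient` record and its field names (`first_dx`, `second_dx`, `last_contact`) are made up for illustration and are not from any real dataset; the dates are invented too.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Patient:
    # Hypothetical record layout, assumed for this sketch only.
    pid: int
    first_dx: date               # date of 1st TB diagnosis
    second_dx: Optional[date]    # date of 2nd diagnosis; None for controls
    last_contact: date           # death or loss to the study


def follow_up_days(p: Patient) -> int:
    """Days from 1st diagnosis to 2nd diagnosis (cases) or last contact (controls)."""
    end = p.second_dx if p.second_dx is not None else p.last_contact
    return (end - p.first_dx).days


# Invented example patients.
case = Patient(1, date(2010, 1, 1), date(2011, 1, 1), date(2015, 1, 1))
control = Patient(2, date(2009, 6, 1), None, date(2014, 6, 1))

print(follow_up_days(case))     # 365
print(follow_up_days(control))
```

In an unmatched analysis, this value and the first diagnosis date would both go into the model as adjustment variables.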

I have 120 cases and need to select controls from a cohort of about 4000. I was planning on matching 1:1 or 1:2 on the time since completion of the first TB treatment course. My controls are people who have had TB once and have not had a subsequent 2nd episode of infection. I have looked at other studies and guides online, and time (or incidence) density sampling for controls is recommended. I can find a lot about the background theory online but no straightforward guides on how to do it in Stata.

OK, you might want to match on things that are important but whose effect you are not interested in, e.g. age, sex, ethnicity, etc.

Do not match on things that you want to test.

Those are the basics. As for doing the matching, I think it's up to you to write the code yourself (btw I do not know Stata). I've written some code myself, but it's not the best. It takes a couple of passes, because it needs a lot of interaction rather than being a simple "submit the code, let it run, and it finishes by itself".

To get you started, I'd number each case (1 to 120), then do a loop for i=1 to 120. Inside the loop, you find all possible controls that can match to case i, e.g. controls with an earlier 1st diagnosis than the case who are still alive and in the cohort at the time of the case's 2nd diagnosis. This is pretty much the bare minimum to get you started.
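
The loop described above could be sketched roughly like this. It is in Python, not Stata, and it is only an illustration under assumptions: the `Patient` fields, the helper names `eligible_controls` and `sample_controls`, and the example dates are all invented. It applies only the two minimum rules mentioned (earlier 1st diagnosis, still in the cohort at the case's 2nd diagnosis) and draws controls only from people who never relapsed.

```python
import random
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass
class Patient:
    # Hypothetical record layout, assumed for this sketch only.
    pid: int
    first_dx: date
    second_dx: Optional[date]    # None means no relapse (a potential control)
    last_contact: date           # death or loss to the study


def eligible_controls(case: Patient, cohort: List[Patient]) -> List[Patient]:
    """Non-relapsed patients diagnosed before the case and still followed at its relapse."""
    return [
        p for p in cohort
        if p.second_dx is None
        and p.first_dx < case.first_dx
        and p.last_contact >= case.second_dx
    ]


def sample_controls(cases: List[Patient], cohort: List[Patient],
                    k: int, seed: int = 0) -> Dict[int, List[int]]:
    """For each case, draw up to k controls at random from its eligible pool."""
    rng = random.Random(seed)    # fixed seed so the draw can be reproduced
    matches: Dict[int, List[int]] = {}
    for case in cases:
        pool = eligible_controls(case, cohort)
        drawn = rng.sample(pool, min(k, len(pool)))
        matches[case.pid] = [p.pid for p in drawn]
    return matches


# Invented example cohort: one case and four potential controls.
cohort = [
    Patient(1, date(2010, 3, 1), date(2012, 3, 1), date(2015, 1, 1)),  # case
    Patient(2, date(2009, 1, 1), None, date(2014, 1, 1)),   # eligible
    Patient(3, date(2011, 1, 1), None, date(2016, 1, 1)),   # diagnosed after the case
    Patient(4, date(2008, 5, 1), None, date(2011, 6, 1)),   # left before the case relapsed
    Patient(5, date(2009, 9, 1), None, date(2013, 1, 1)),   # eligible
]
cases = [p for p in cohort if p.second_dx is not None]

eligible_ids = [p.pid for p in eligible_controls(cases[0], cohort)]
print(eligible_ids)              # [2, 5]
matches = sample_controls(cases, cohort, k=1)
print(matches)
```

Each case is sampled independently here, so the same control can be drawn for more than one case, and a case with a small pool can end up with fewer than k controls.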

Now you'll have possible controls for every case. But you'll have to check that you didn't "double dip", i.e. match one control to two different cases. How you handle such situations depends on your particular study. Also check that every case ends up with at least the one or two controls you need.
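
Those two checks can be sketched as follows, again in Python and again only as an illustration. It assumes the matching result is held as a dictionary from case number to a list of control IDs; that layout and the function names `reused_controls` and `under_matched` are invented for this example.

```python
from collections import defaultdict
from typing import Dict, List


def reused_controls(matches: Dict[int, List[int]]) -> Dict[int, List[int]]:
    """Return each control ID that is matched to more than one case, with those cases."""
    used: Dict[int, List[int]] = defaultdict(list)
    for case_id, control_ids in matches.items():
        for control_id in control_ids:
            used[control_id].append(case_id)
    return {cid: case_ids for cid, case_ids in used.items() if len(case_ids) > 1}


def under_matched(matches: Dict[int, List[int]], k: int) -> List[int]:
    """Return the cases that have fewer than k controls."""
    return [case_id for case_id, control_ids in matches.items() if len(control_ids) < k]


# Invented example: control 11 is shared, and case 3 has only one control.
matches = {1: [10, 11], 2: [11, 12], 3: [13]}

print(reused_controls(matches))   # {11: [1, 2]}
print(under_matched(matches, 2))  # [3]
```

What to do with the flagged controls and cases (redraw, keep the reuse, or relax the matching rules) is the study-specific decision mentioned above.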

I have found an answer to the problem: weighted maximum likelihood estimators, as referred to by Manski.
