I'm making a program which, at one point, is supposed to estimate the probability of having a certain disease based on the patient's age. I'm going to feed the mean age at which the disease occurs, along with the standard deviation, into the program, but I can't figure out how to actually calculate the probability. I've been reading for a while now about probability, cumulative probability, normals, and standards, and I'm all lost.
For instance, let's say the mean age for having a heart attack in men is 58 years, with a standard deviation of ±15 years. Now if I have a 37 year old man for example, how can I calculate the probability of him having a heart attack?
Please keep in mind that I am neither a statistician nor a statistics student, so go really easy on me!
Where are you getting your data from? Is your data only available in mean and standard deviation form?
All things are known because we want to believe in them.
Well, I look up the data. I can look up other data if it's needed. It's just that after a while of researching I was under the impression that these two pieces of data are what you need to calculate probability. If there's a better or easier way, it'd be great.
Your calculator will be extremely crude and insensitive if you ignore all other potential covariates. For example, risk factors can include diabetes, systolic blood pressure, total cholesterol, sex, race, use of statins, use of antihypertensives, etc.
So any individual would get the same probability from your calculator based solely on age. Say a 59 year old Caucasian female with no risk factors will get the same probability as a 59 year old African American male with a bunch of the above risk factors. Does that seem right and educational to a patient? You would need to find the average for every subgroup, etc., and even then the approach would be less than optimal.
Or you could just use this online calculator/ application that already exists:
http://my.americanheart.org/professi...ubHomePage.jsp
I salute your zeal, but sometimes it's best to leave it to the experts in the area.
Stop cowardice, ban guns!
I am taking into account other risk factors using their relative risk. It's just the age I can't figure out. Besides, this program is for academic purposes and will not be used by patients. I'd be grateful if someone would be so kind as to help me with this one point.
Ideally you would determine risk using a multivariable proportional hazards regression model and convert the fitted coefficients (log hazard ratios for the covariates of interest) into probabilities. Skipping such an approach would fail to account for multicollinearity, interactions, etc., and could give spurious results.
How are you going to merge your independently calculated probabilities for covariates together, for an overall risk?
Lastly, I know you want to get an answer, but you just have the average age of people with a heart attack, so you seem to be missing quite a bit of information. What percentage of people have a heart attack?
I don't merge them. I'm showing each of them individually. Look, forget about the program and about other risk factors for now. Let's just focus on age. I have read, that given the mean and standard deviation of a variable with a normal distribution curve, you can determine the probability of a point lying anywhere on the curve area. If you would please just tell me how to do that I would be grateful.
Last edited by justry; 10-24-2014 at 12:59 PM.
Do you have the source that says you can calculate it based on just those parameters?
It does not seem likely that you can calculate risk for a person who will never have a heart attack based on data from only people who had a heart attack. Any probability you determine would have to be conditional on the person having a heart attack at some point during their life.
Well, we can't really do it for a single point. You need to specify an interval (but that's easy enough when working with ages). Note, though, that this information doesn't get you what you want.
For example let's say I have a random number generator that gives numbers between 1 and 100000. The probability of getting exactly "23532" is very small. Now let's say I keep track of the ages of the people that do somehow get a 23532. I could tell you the mean and standard deviation of their ages.
Obviously the age isn't related to what number they got. So knowing their age doesn't change the probability of them getting 23532.
This is analogous to what you're asking us to do (although in your case age probably is related, the analogy still holds: you're looking at the wrong information). The distribution of the ages of people that have heart attacks isn't really the important information. What you need is, for each age, the probability of having a heart attack. That's a different question that you can't answer with the numbers you have.
Also, even if we were to assume that the calculation you asked us to do was the right thing to do (which it isn't), there isn't enough information, as we don't know what the actual distribution is. All it would really tell you (if we knew the distribution) is, GIVEN that somebody had a heart attack, the probability that they were a certain age. That's not what you care about.
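To make the distinction concrete, here is a rough Python sketch of how Bayes' rule connects the two questions. Only the mean of 58 and SD of 15 come from this thread, and the normal shape is an assumption. The annual heart attack rate `p_ha` and the population share `p_age` are made-up placeholders, not real data; they are exactly the pieces of information that the mean and SD alone don't provide.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative probability of a normal distribution at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def prob_age_given_ha(lo, hi, mean=58.0, sd=15.0):
    """P(lo <= age < hi | heart attack), ASSUMING ages at heart attack are normal."""
    return normal_cdf(hi, mean, sd) - normal_cdf(lo, mean, sd)

def prob_ha_given_age(p_age_given_ha, p_ha, p_age):
    """Bayes' rule: P(HA | age) = P(age | HA) * P(HA) / P(age)."""
    return p_age_given_ha * p_ha / p_age

# What the mean and SD can give: the share of heart attacks that occur at age 37.
p_37_given_ha = prob_age_given_ha(37.0, 38.0)

# HYPOTHETICAL placeholders, NOT real data:
p_ha = 0.005   # assumed chance that a man has a heart attack in a given year
p_age = 0.015  # assumed share of the male population that is 37 years old

print("P(age 37 | heart attack):", round(p_37_given_ha, 4))
print("P(heart attack | age 37):", round(prob_ha_given_age(p_37_given_ha, p_ha, p_age), 4))
```

The second number changes completely if you change `p_ha` or `p_age`, which is why the age distribution of patients can't answer the question by itself.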
I don't have emotions and sometimes that makes me very sad.
Alright, I'm not going to get in the way of you learning more. Here's one way you can calculate the probability. It rests on assumptions you have to make before you can really go ahead with reporting the probabilities: that you are ignoring other risk factors, and that the "age for having a heart attack" (what about men who have multiple heart attacks?) follows a nice normal distribution.
Here's how you can calculate a probability:
1) Z-score: The formula is z = (your age - mean age of the distribution) / (standard deviation of the distribution). So for this example,
z = (37 - 58) / 15 = -1.400
2) Take the z-score, and find a table like this: http://www.normaltable.com/
In this table, you will find that for z=+1.400, the cumulative probability is 0.9192.
Do you get this far?
Now you do some calculations, exploiting the symmetry of the normal distribution.
Since the cum. prob for z=+1.400 is 0.9192, then the cum. prob for z=-1.400 is 1-0.9192 = 0.0808.
[you subtract from 1 because all probabilities add up to 1]
So what you have found, is that if you have a man who WILL have a heart attack, the probability of him having it by 37 years old is 8.08%.
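If you'd rather compute this in a program than look it up in a table, here is a minimal Python sketch of the same steps. It uses the standard library's `math.erf` to get the normal cumulative probability, and it carries the same assumption as above: that age at heart attack is normally distributed with mean 58 and SD 15.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative probability, P(Z <= z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mean_age = 58.0  # mean age at heart attack (from the example)
sd_age = 15.0    # standard deviation (from the example)
age = 37.0

# Step 1: z-score
z = (age - mean_age) / sd_age

# Step 2: cumulative probability (replaces the table lookup and the symmetry trick)
p = normal_cdf(z)

print(f"z = {z:.3f}")             # z = -1.400
print(f"P(age <= 37) = {p:.4f}")  # P(age <= 37) = 0.0808
```

Because the function handles negative z directly, you don't need the "1 minus" step.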
Ah, I see what you mean Dason.
Well, how then does one estimate that probability? I mean, ignoring all other factors that have a bearing on getting a heart attack. We know that age is a strong risk factor for heart attacks. We know that heart attacks have an annual incidence of 785,000 cases. We know that the peak incidence of heart attacks is at 55-65 years. Let's say for simplicity's sake that the mean is 60. The age at first heart attack curve is normally distributed, and the standard deviation is 12 years.
Now a person who is 50-60 years old is at the highest risk for a heart attack, correct? And a person who is 20 years old is at a much lower risk, since 95% of the people with heart attacks will lie within 2 standard deviations from the mean, right?
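As a quick check on the 2 standard deviations figure, here is a small Python sketch using the mean of 60 and SD of 12 supposed above, and assuming the normal shape holds. Note that these numbers describe where the ages of heart attack patients fall, not anyone's risk of having one.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative probability, P(Z <= z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mean_age, sd_age = 60.0, 12.0  # values supposed in the post, not real data

# Share of heart attack patients within 2 SD of the mean (ages 36 to 84)
within_2sd = normal_cdf(2.0) - normal_cdf(-2.0)

# Where a 20 year old falls on that curve
z_20 = (20.0 - mean_age) / sd_age
share_by_20 = normal_cdf(z_20)  # share of heart attacks occurring by age 20

print(f"within 2 SD: {within_2sd:.4f}")  # within 2 SD: 0.9545
print(f"z for age 20: {z_20:.2f}")       # z for age 20: -3.33
print(f"share of heart attacks by age 20: {share_by_20:.5f}")
```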
How do we quantify this risk? I don't need to necessarily calculate the actual probability, I just need to at least come up with a number or score to indicate the weight or bearing of the age factor in a certain person. Example a 60 year old man has a score of 10, while a 20 year old man has a score of say, 3? Is something like this possible?
Thanks for your patience!
Most all of your descriptive data are still for those who had a heart attack.
You need to have a population at risk (some may never have a heart attack), then do time to event analyses.
Look at the full manuscript in Circulation at the link I posted. They already did all of this. They also used log(age), (log(age))**2, log(age)*Tot Chol, log(age)*HDL, ... to calculate risk. Their risk for a cardiac event was determined using multiple longitudinal data sets, with risk based on time to event analytics.
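As a rough illustration of how a model like that turns into a probability, here is a minimal Python sketch of the general form used by calculators built on proportional hazards models: risk = 1 - S0 ** exp(lp - mean_lp), where lp is the weighted sum of the terms. Every coefficient, the baseline survival, and the mean linear predictor below are invented placeholders, NOT the published values; the real ones would have to come from the paper.

```python
import math

# HYPOTHETICAL placeholders for illustration only; not the published values.
COEFS = {
    "ln_age": 2.5,
    "ln_age_sq": 0.1,
    "ln_age_x_chol": 0.002,
    "ln_age_x_hdl": -0.004,
}
BASELINE_SURVIVAL = 0.95  # assumed baseline survival S0 at the time horizon
MEAN_LP = 11.5            # assumed mean linear predictor in the cohort

def linear_predictor(age, tot_chol, hdl):
    """Weighted sum of the model terms (log age, its square, and interactions)."""
    ln_age = math.log(age)
    terms = {
        "ln_age": ln_age,
        "ln_age_sq": ln_age ** 2,
        "ln_age_x_chol": ln_age * tot_chol,
        "ln_age_x_hdl": ln_age * hdl,
    }
    return sum(COEFS[name] * value for name, value in terms.items())

def risk(age, tot_chol, hdl):
    """Risk over the time horizon: 1 - S0 ** exp(lp - mean_lp)."""
    lp = linear_predictor(age, tot_chol, hdl)
    return 1.0 - BASELINE_SURVIVAL ** math.exp(lp - MEAN_LP)

# With these made-up numbers, risk rises with age when other inputs are held fixed.
print(risk(40, 200, 50), risk(60, 200, 50))
```

The point is the shape of the calculation: the coefficients and baseline survival come from following a population at risk over time, which is the data the mean and SD of patient ages can't supply.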
Yes, this is actually what I was asking about. I remember having seen something like this before, but of course I couldn't remember the formula.
I think this will do nicely as far as my program is concerned, but I'm also interested in learning, and you're all saying that this would not be the proper way to do it. I'd love to know the proper way if possible. What data is required? If it's easily obtainable, I can do it properly.
Thanks hlsmith, will read now.