Drowning in data

I'm a GP, and am interested in performing better analysis on patient data.

The raw data is around 100 MB.

I am fairly adept by now at bodging scripts together, and at the moment I'm using Perl to analyse the data and graph the values. That's working very well.

When looking at an individual's data, the key question is prediction: where is the data going? Obviously any prediction is inaccurate and has a wide margin of error, but I'm not looking for a value. What I want to do is categorise a string of patient data into

or decreasing.

I then want to colour-code the points on the graph according to trend.

There aren't many data points per patient: perhaps 6 or 7 spread out over a few years.

For now I'm using least squares and taking the slope. But that isn't ideal, and I assume it would miss a U- or n-shaped pattern of results.
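As a rough sketch of one way to catch a U or n shape: alongside the straight-line slope, fit a parabola and look at the sign of its squared term (positive suggests a U, negative an n). This is written in Python rather than Perl for brevity, uses only the standard library, and the function names are made up for illustration. It assumes at least three distinct test dates per patient.

```python
def linear_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def curvature(xs, ys):
    """Coefficient c of a least-squares parabola a + b*x + c*x**2.

    x is centred on its mean first, which keeps the sums small and does
    not change c. Needs at least three distinct x values.
    """
    mx = sum(xs) / len(xs)
    cx = [x - mx for x in xs]
    # Sums of powers of x, and of y times powers of x (normal equations).
    s = [sum(x ** k for x in cx) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(cx, ys)) for k in range(3)]
    m = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting on the 3x3 system.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[col])]
    return m[2][3] / m[2][2]


# Illustrative (made-up) series: time in years, a U-shaped set of results.
years = [0.0, 1.0, 2.0, 3.0, 4.0]
u_shaped = [4.0, 1.0, 0.0, 1.0, 4.0]
print(linear_slope(years, u_shaped), curvature(years, u_shaped))
```

For this made-up series the slope is essentially zero while the curvature is clearly positive, which is exactly the case a slope-only check would miss.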

I appreciate that the accuracy will be questionable. But the key thing here is that, since each stream will be compared to 5000+ other data streams, what I want to do is identify which sets don't fit in with the pattern of the others.

So a slope value that is very different from the other 5000 would be worth a closer look.

So that would then be coloured red.
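A minimal sketch of that flagging step, assuming one slope per patient has already been computed. It scores each slope against the median of all slopes, using the median absolute deviation (MAD) as the spread so that a handful of extreme patients don't distort the yardstick. The function names and the 3.5 cut-off are illustrative assumptions, not an established clinical rule. Python, standard library only.

```python
from statistics import median


def robust_z_scores(values):
    """Distance of each value from the median, in robust (MAD-based) units."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # More than half the values are identical; no usable spread.
        return [0.0 for _ in values]
    scale = 1.4826 * mad  # makes MAD comparable to a standard deviation
    return [(v - med) / scale for v in values]


def flag_unusual(slopes, threshold=3.5):
    """Indices of slopes that sit far from the rest (candidates for red)."""
    scores = robust_z_scores(slopes)
    return [i for i, z in enumerate(scores) if abs(z) > threshold]


# Made-up slopes for seven patients; the last one is far from the others.
slopes = [0.1, 0.0, -0.1, 0.05, -0.05, 0.02, 5.0]
print(flag_unusual(slopes))
```

The same scoring could be applied to any other per-patient summary, not just the slope.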

So what techniques better than least squares exist to do this?


Super Moderator
For a data set this size, over time, I would suggest looking into generalized additive models.

For descriptive purposes, you might want to try fitting a smoother (a locally weighted one, e.g. LOWESS) - this will give you an idea of the overall trend in the data.
Whilst the dataset is big, the individual patient datasets are 1-20 datapoints each.

LOWESS plots everything (I think).

What I want is a dashboard for a few values.

I'm no statistician but what I need is something that weighs recent things more than old things, and that weighs a trend lasting a year more than a trend lasting a day.

If that makes sense.
To break it down.

What I, and every other doctor, have is 5,000 or so patients.

Each will have a series of tests done, and each series of tests will contain anything from 1 to 100+ individual values.

So for any given test, how do you highlight those patients who have an increasing or deteriorating trend of values?

The slope of a least squares analysis will do this crudely. But I'm looking for a better way.

I'm working with Perl as a programming language, and the choice of tests is limited. But if I knew exactly which approach to use, I could invest time in developing it.

I can weight my least squares - that could improve things by weighting towards more recent values. How could I learn more about that technique?
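A minimal sketch of what weighting towards recent values could look like, assuming an exponential decay: each point's weight halves for every `half_life` units of time before the latest test. The decay shape, the `half_life` parameter and the function name are all assumptions chosen for illustration. Python, standard library only.

```python
def weighted_slope(times, values, half_life):
    """Weighted least-squares slope with weights that decay with age.

    times and half_life must be in the same unit (e.g. days or years).
    A very large half_life gives almost equal weights, i.e. the ordinary
    least-squares slope.
    """
    latest = max(times)
    weights = [0.5 ** ((latest - t) / half_life) for t in times]
    sw = sum(weights)
    # Weighted means of time and value.
    mt = sum(w * t for w, t in zip(weights, times)) / sw
    mv = sum(w * v for w, v in zip(weights, values)) / sw
    sxx = sum(w * (t - mt) ** 2 for w, t in zip(weights, times))
    sxy = sum(w * (t - mt) * (v - mv)
              for w, t, v in zip(weights, times, values))
    return sxy / sxx


# Made-up series: flat for a while, then rising at the end.
times = [0, 1, 2, 3, 4, 5]
values = [5, 5, 5, 5, 6, 7]
print(weighted_slope(times, values, half_life=1))     # recent rise dominates
print(weighted_slope(times, values, half_life=1e9))   # close to plain OLS
```

A shorter half-life makes the slope react faster to a recent change, at the cost of being noisier when there are only a few points.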


Ninja say what!?!
Yeah, least squares assumes independence, which you do not have in this case (you have repeated measurements from the same patients). I would recommend a GEE model here if all you want to do is determine the increasing or decreasing slope of the measurements over time.

One thing I should warn you about though is to watch out for confounders when you are doing this.
Confounders - ha - medical results are nothing but a series of confounders.

I think the tricky thing for you stats guys is the idea that I don't really need rigour. I can't predict what a result will be in 1 week, let alone 6 weeks. So I don't particularly want to describe what will happen.

What I want to do is describe the results I've got, for 4000 separate data streams, quickly, so that I can flag which need closer attention.

All the worry about confounders and so on comes once you focus in on a result. But the first step is to decide which of the 4000 need focusing on, and that's what I'm thinking about now.

Results can be


For example.

And if I saw those results, and high was bad, the last two would probably be the ones I'd want to look at first.

I guess I could look at the slope between each pair of consecutive values, and weight each slope in accordance with, say, the square root of the date.

.....that's the sort of lines of thinking I'm having.

Is there an established way of doing this?

And the intervals between tests will not be the same either.
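To make that line of thinking concrete, here is a rough sketch of one possible reading of it; this is not an established method. It takes the slope of each segment between consecutive results, weights it by the square root of the segment's duration (so a year-long trend counts for more than a day-long one) and by how recently the segment ended, then averages. Unequal intervals are handled naturally because each slope is divided by its own time gap. The weighting formula, `half_life` and the function name are all assumptions. Python, standard library only.

```python
import math


def segment_trend(times, values, half_life):
    """Weighted average of the slopes between consecutive results.

    Weight of a segment = sqrt(duration) * 0.5 ** (age / half_life),
    where age is the time from the end of the segment to the latest test.
    times and half_life must share a unit (e.g. days).
    """
    points = sorted(zip(times, values))
    latest = points[-1][0]
    num = 0.0
    den = 0.0
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip results recorded at the same time
        w = math.sqrt(dt) * 0.5 ** ((latest - t1) / half_life)
        num += w * (v1 - v0) / dt
        den += w
    return num / den if den else 0.0


# Made-up series in days: stable for two years, then a jump in one day.
times = [0, 365, 730, 731]
values = [10, 10, 10, 11]
print(segment_trend(times, values, half_life=365))
```

In this made-up case the one-day jump has a steep slope but a small weight, so the overall score stays modest; changing the square root or the half-life changes how strongly short, recent movements show up.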


Super Moderator
:mad: I guess I should have told you beforehand that I'm an epidemiologist/biostatistician. It's not a good feeling to see people looking down on my work.
I wouldn't take a comment that was followed up by "I think the tricky thing for you stats guys is the idea that I don't really need rigour" (!) as too dire a reflection on your field... Epidemiologists and biostatisticians do fantastic work :)