Splines

noetsi

Fortran must die
#1
I have several variables that I think have non-linear effects. One way to address that is to fit splines. But as far as I can tell, splines do not produce anything like a slope. How do you interpret a spline in terms of its impact on the DV?

Since I am brand new to this code, I wanted to see if I did it correctly. All the class variables are dummies, so they cannot have non-linear effects.


Code:
proc gampl data=dora.fin plots seed=12345;
   class edclo ethd femaled SSDIALLOWED SD SSIALLOWED WhiteD;
   model q2wage = spline(inserv) spline(sum_of_tuition) spline(rate);
   output out=outs;
run;
Following advice I saw, I compared the AIC from this model [non-linear, I think, or generalized additive] with the AIC from proc genmod, which assumes linear effects the way I ran it. The linear proc genmod model had a lower AIC (359317 versus 360676).

Does this mean I should not worry about the non-linear effect in interpreting the interval predictors I suspect are non-linear?
 

hlsmith

Less is more. Stay pure. Stay poor.
#2
How come the other terms aren't listed in the model? Did it assume they were in there and ran anyways?
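
A guess at what happened, offered as a sketch rather than a diagnosis: as far as I know, PROC GAMPL only puts a variable in the model if it appears in the MODEL statement, and linear (parametric) terms go inside param(). Listing the dummies in CLASS alone would leave them out. The version below adds them, and the PROC GENMOD call uses the same predictors so the two AIC values refer to the same set of terms. Variable names are taken from post #1; dist=normal is an assumption about how the wage model was run.

```sas
/* Additive model: dummies as linear terms, interval predictors as splines */
proc gampl data=dora.fin plots seed=12345;
   class edclo ethd femaled SSDIALLOWED SD SSIALLOWED WhiteD;
   model q2wage = param(edclo ethd femaled SSDIALLOWED SD SSIALLOWED WhiteD)
                  spline(inserv) spline(sum_of_tuition) spline(rate);
run;

/* Linear comparison model with the same predictors (normal errors assumed) */
proc genmod data=dora.fin;
   class edclo ethd femaled SSDIALLOWED SD SSIALLOWED WhiteD;
   model q2wage = edclo ethd femaled SSDIALLOWED SD SSIALLOWED WhiteD
                  inserv sum_of_tuition rate / dist=normal;
run;
```

Both procedures report AIC in their fit statistics. The comparison is only meaningful when the two models differ in the spline terms and nothing else.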

I just look at the output plots. Please post those. Each plot should show a "df=" value, I believe. It tells you roughly how many polynomial-like bends the fitted relationship has. Typically, I use my judgment on whether I think there really is a non-linear relationship, based on background knowledge and what the plot looks like.

I typically don't find a non-linear relationship, and if I do, the variable is usually a confounding term that I don't necessarily need an estimate for; I just need to adjust for it. Otherwise, if you truly have a non-linear relationship to interpret, it is kind of tricky to explain beyond using images.
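
To make the "no single slope" point concrete, here is a sketch of the additive model being fitted, assuming the dummies enter as ordinary linear terms. The symbols (the betas, the D_k for the dummies, the f's for the spline terms) are notation introduced here, not anything from the SAS output.

```latex
\[
\text{q2wage} = \beta_0 + \sum_{k} \beta_k D_k
  + f_1(\text{inserv}) + f_2(\text{sum\_of\_tuition}) + f_3(\text{rate}) + \varepsilon
\]
\[
\text{Effect of moving rate from } a \text{ to } b
\text{ with the other predictors held fixed:}\quad f_3(b) - f_3(a)
\]
```

In a linear model f_3(rate) is just beta times rate, so that difference is beta(b - a) wherever you start. With a spline it depends on both a and b, which is why the component plot of f_3 carries the interpretation rather than a single coefficient.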
 

noetsi

Fortran must die
#3
I don't know this proc at all hlsmith. Until you mentioned it I had never heard about it. So I don't know why it did or did not do anything.

I am not sure this is what you were asking about.

One problem is that in my field there is little theory or statistical analysis, or if there is, I don't know where it is published. I have spent a lot of years looking. Simple linear regression is as complicated as the analysis gets.
 

Attachments

hlsmith

Less is more. Stay pure. Stay poor.
#4
They look fairly linear. The rate figure seems a little quadratic, but the confidence bands are fairly wide in the right tail. You always say you have a bunch of data, so split it into three partitions. Use the first two to find the best transformation for rate, then fit that one in the holdout set as a spline and see if that fit is better!
 

noetsi

Fortran must die
#5
Sorry for my ignorance here hlsmith, I am not sure what you mean by "then fit that one in the holdout set as a spline and see if that fits is better!"

I understood you use the first two thirds of the data to decide the transformation (although what if the first 1/3 of the data disagrees with the second portion of the data). :p

thanks. I need to learn about splines.
 

hlsmith

Less is more. Stay pure. Stay poor.
#6
Is this time series data? If not, three random splits would be equivalent. If it is time series data, the splits would have to be strategic!
 

noetsi

Fortran must die
#7
It is not time series. I am looking at how various predictors predict income of our customers (I guess you could argue it is time series since Y occurs in the future relative to x, but I am not treating it as time series data).
 

hlsmith

Less is more. Stay pure. Stay poor.
#8
So you can randomly split your data into 2 or 3 partitions. Because the split is random, the partitions should look the same on average across variables. :)
 

noetsi

Fortran must die
#9
So do I use the first two portions of the data to choose a transformation and then run it against the third, unused portion to generate the coefficients? Or against the whole data set?
 

hlsmith

Less is more. Stay pure. Stay poor.
#10
The approach I would use is to fit a bunch of transformations on the first partition. Then score the second partition and select the transformation with the best fit. Then get estimates from the third partition by fitting the model with that transformation there.
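
A rough sketch of those steps, under several assumptions: the data set and variable names come from post #1, while work.fin_split, part, rate_sq, y_fit and pred_sq are made-up names. PROC GLM stands in for whatever procedure is actually used, only one candidate (quadratic in rate) is shown, and the dummies are left out for brevity.

```sas
/* 1. Random three-way split (part = 1, 2, 3); the seed is arbitrary */
data work.fin_split;
   set dora.fin;
   if _n_ = 1 then call streaminit(12345);
   u = rand('uniform');
   if u < 1/3 then part = 1;
   else if u < 2/3 then part = 2;
   else part = 3;
   drop u;
run;

/* 2. Candidate transformation: quadratic in rate.
      The response is set to missing for part 2, so those rows
      are scored by the model but not used to fit it. */
data work.fit12;
   set work.fin_split;
   where part in (1, 2);
   rate_sq = rate**2;
   if part = 1 then y_fit = q2wage;
   else y_fit = .;
run;

proc glm data=work.fit12 noprint;
   model y_fit = inserv sum_of_tuition rate rate_sq;
   output out=work.pred_sq predicted=pred_sq;
run;
quit;

/* 3. Out-of-sample mean squared error on part 2 for this candidate */
proc sql;
   select mean((q2wage - pred_sq)**2) as mse_part2
   from work.pred_sq
   where part = 2;
quit;
```

Steps 2 and 3 would be repeated for each candidate (log, square root, and so on), keeping the one with the lowest part 2 error. The final model is then fit with a WHERE part = 3 filter to get the estimates.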

@Jake - any comments?
 

noetsi

Fortran must die
#11
Won't using only one third of the data for the final estimates (I have 20 thousand cases) change the coefficients of all the variables?
 

noetsi

Fortran must die
#12
So is the way splines work that you interpret the coefficient at different levels of the spline, rather than the single coefficient you get with a normal regression? And do you control for other variables as you normally would in regression when a spline is involved?