# Splines

#### noetsi

##### Fortran must die
I have several variables that I think have non-linear effects. One way to address that is to generate splines. But as far as I can tell, splines do not produce anything like a slope. How do you interpret a spline in terms of its impact on the DV?

Since I am brand new to this code I wanted to see if I did it correctly. All the class variables are dummies, so they cannot be non-linear.

Code:
proc gampl data=dora.fin plots seed=12345;
CLASS edclo ethd  femaled  SSDIALLOWED  SD
SSIALLOWED  WhiteD ;
model q2wage = spline(inserv) spline(sum_of_tuition) spline(rate)
;

output out=outs;
run;
Following advice I saw, I compared the AIC from this model [non-linear, I think, or generalized additive] with the one from proc genmod, which, the way I ran it, does not allow for non-linearity. The linear proc genmod model had a lower AIC (359317 versus 360676).

Does this mean I should not worry about the non-linear effect in interpreting the interval predictors I suspect are non-linear?
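For readers following along, the AIC comparison described above can be put on a common scale with a few lines of arithmetic. This is only a sketch in Python (not SAS) using the two AIC values quoted in the post. The labels `model_a` and `model_b` are placeholders, and the comparison is only meaningful if both procedures were fit to the same rows and response and report the likelihood on the same scale, which is an assumption here.

```python
import math

# AIC values quoted in the post; the model labels are placeholders.
aic_values = {"model_a": 359317.0, "model_b": 360676.0}

# Lower AIC is preferred; delta is each model's distance from the best one.
best = min(aic_values, key=aic_values.get)
delta = {name: aic - aic_values[best] for name, aic in aic_values.items()}

# Relative likelihood of each model versus the best: exp(-delta / 2).
rel_lik = {name: math.exp(-d / 2.0) for name, d in delta.items()}

for name in aic_values:
    print(name, delta[name], rel_lik[name])
```

A difference of this size favours the lower-AIC model very strongly, but it says nothing about whether either model fits well in absolute terms.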

#### hlsmith

##### Less is more. Stay pure. Stay poor.
How come the other terms aren't listed in the model? Did it assume they were in there and run anyway?

I just look at the output plots. Please post those. On the plots there should be a "df=", I believe. It tells you roughly how many polynomial-like bends there are in the relationship. Typically, I use my judgement on whether I think there really is a non-linear relationship based on background knowledge and what the plot looks like.

I typically don't find a non-linear relationship, and if I do, the variable is usually a confounder that I don't necessarily need an estimate for; I just need to adjust for it. Otherwise, if you truly have a non-linear relationship to interpret, it is kind of tricky to explain beyond using images.
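One way to get something slope-like out of a spline is to report the local slope at a few chosen values of the predictor. The sketch below is in Python rather than SAS and uses a toy piecewise-linear spline with made-up coefficients and a made-up knot. It is not the kind of spline proc gampl fits, only an illustration that the "slope" of a spline depends on where you evaluate it.

```python
def spline_effect(x, b1=2.0, b2=-1.5, knot=10.0):
    """Toy linear spline: slope b1 below the knot, b1 + b2 above it.

    All coefficients and the knot are hypothetical illustration values.
    """
    return b1 * x + b2 * max(0.0, x - knot)


def local_slope(f, x, h=1e-4):
    """Central finite difference: change in f per unit change in x near x."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


slope_low = local_slope(spline_effect, 5.0)    # evaluated below the knot
slope_high = local_slope(spline_effect, 15.0)  # evaluated above the knot
print(round(slope_low, 3), round(slope_high, 3))  # prints: 2.0 0.5
```

The same finite-difference idea works on predictions from any fitted smooth: predict at two nearby values of the predictor, hold everything else fixed, and divide the change in the prediction by the change in the predictor.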

#### noetsi

##### Fortran must die
I don't know this proc at all hlsmith. Until you mentioned it I had never heard about it. So I don't know why it did or did not do anything.

One problem is that in my field there is little theory or statistical analysis, or if there is, I don't know where it is published. I have spent a lot of years looking. Simple linear regression is as complicated as it gets.

[Attachment: plots, 35.5 KB]

#### hlsmith

##### Less is more. Stay pure. Stay poor.
They look fairly linear. The rate figure seems a little quadratic, but the confidence bands are fairly wide in the right tail. You always say you have a bunch of data, so split the data into three partitions. Use the first two to find the best transformation for rate, then fit that one in the holdout set as a spline and see if that fits is better!

#### noetsi

##### Fortran must die
Sorry for my ignorance here hlsmith, I am not sure what you mean by "then fit that one in the holdout set as a spline and see if that fits is better!"

I understand that you use the first two thirds of the data to decide the transformation (although what if the first third of the data disagrees with the second?).

Thanks. I need to learn about splines.

#### hlsmith

##### Less is more. Stay pure. Stay poor.
Is this time series data? If not, three random splits would be the same. If it is time series data, the splits would have to be strategic!

#### noetsi

##### Fortran must die
It is not time series. I am looking at how various predictors predict income of our customers (I guess you could argue it is time series since Y occurs in the future relative to x, but I am not treating it as time series data).

#### hlsmith

##### Less is more. Stay pure. Stay poor.
So you can randomly split your data into 2 or 3 partitions. Because the split is random, all three partitions should look the same, on average, across variables.

#### noetsi

##### Fortran must die
So do I use the first two portions of the data to choose a transformation and then run it against the third, unused data set to generate the coefficients? Or against the whole data set?

#### hlsmith

##### Less is more. Stay pure. Stay poor.
The approach I would use is to fit a bunch of transformations on the first partition. Then score the second partition and select the transformation with the best fit. Then get estimates from the third partition by applying that transformation in a model fit to that partition.
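The three steps above can be sketched roughly as follows. This is Python rather than SAS, uses only the standard library, and runs on synthetic data generated inside the block, so the data, the candidate transformations (`identity`, `log`, `square`), and the single-predictor straight-line fit are all assumptions made for illustration. A real analysis would include the other covariates and would likely use a proper modelling procedure.

```python
import math
import random

random.seed(0)

# Synthetic stand-in data (an assumption, not the thread's data): y depends on log(x).
n = 3000
xs = [random.uniform(1.0, 100.0) for _ in range(n)]
ys = [5.0 + 3.0 * math.log(x) + random.gauss(0.0, 0.1) for x in xs]

# Randomly split the row indices into three equal partitions.
idx = list(range(n))
random.shuffle(idx)
third = n // 3
part1, part2, part3 = idx[:third], idx[third:2 * third], idx[2 * third:]

# Hypothetical candidate transformations of the predictor.
candidates = {
    "identity": lambda x: x,
    "log": math.log,
    "square": lambda x: x * x,
}


def fit_line(rows, transform):
    """Ordinary least squares of y on transform(x); returns (intercept, slope)."""
    t = [transform(xs[i]) for i in rows]
    y = [ys[i] for i in rows]
    t_bar, y_bar = sum(t) / len(t), sum(y) / len(y)
    sxy = sum((a - t_bar) * (b - y_bar) for a, b in zip(t, y))
    sxx = sum((a - t_bar) ** 2 for a in t)
    slope = sxy / sxx
    return y_bar - slope * t_bar, slope


def mse(rows, transform, coefs):
    """Mean squared prediction error on the given rows."""
    b0, b1 = coefs
    return sum((ys[i] - (b0 + b1 * transform(xs[i]))) ** 2 for i in rows) / len(rows)


# Step 1: fit every candidate on partition 1.
fits = {name: fit_line(part1, f) for name, f in candidates.items()}
# Step 2: score each fitted candidate on partition 2 and keep the best.
scores = {name: mse(part2, candidates[name], fits[name]) for name in candidates}
best = min(scores, key=scores.get)
# Step 3: re-estimate the chosen form using partition 3 only.
final_intercept, final_slope = fit_line(part3, candidates[best])
print(best, round(final_intercept, 2), round(final_slope, 2))
```

Because the selection in step 2 never touches partition 3, the final estimates are not biased by having searched over several transformations. The cost is that they are based on a third of the rows.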

#### noetsi

##### Fortran must die
Won't using one third of the data for the final estimates (I have 20 thousand cases) change the coefficients of all the variables?

#### noetsi

##### Fortran must die
So is the way splines work that you interpret the coefficient at different levels of the spline, rather than a single coefficient as in a normal regression? And do you control for other variables as you normally would in regression when a spline is involved?

#### noetsi

##### Fortran must die
When you are using a method such as a generalized additive model and specifying some variables as splines, are you still controlling for the other variables in the model that are not splines?

#### hlsmith

##### Less is more. Stay pure. Stay poor.
Yup, it is an adjusted (conditional) model.
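A small numeric sketch of what "adjusted" means in an additive model, again in Python with made-up numbers. The prediction is an intercept, plus a dummy-variable effect, plus a smooth function of a continuous predictor. Every coefficient here is hypothetical. The point is only that, with no interaction terms, the change attributed to the smooth term is the same whichever group the dummy puts you in.

```python
def predict(d, x, intercept=100.0, b_dummy=-8.0, b1=2.0, b2=-1.5, knot=10.0):
    """Additive toy model: intercept + dummy effect + linear-spline effect of x.

    All coefficients and the knot are hypothetical illustration values.
    """
    smooth = b1 * x + b2 * max(0.0, x - knot)
    return intercept + b_dummy * d + smooth


# Effect of moving x from 5 to 15, holding the dummy fixed at 0 and then at 1.
effect_d0 = predict(0, 15.0) - predict(0, 5.0)
effect_d1 = predict(1, 15.0) - predict(1, 5.0)
print(effect_d0, effect_d1)  # prints: 12.5 12.5
```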