F-Test for Regression Analysis

smokiestprune

New Member
Hi,

I performed linear regressions with a first order and a second order polynomial on the same dataset. I now want to determine whether the more complex (second order polynomial) model is significantly better than the first order one. I'm doing this within the scope of a design standard for mechanical tests. An example is worked out in the standard, which I tried to follow, but I assume I'm doing something wrong (or, probably less likely, there is a mistake in the example). The dataset of the example is the following:

I'm performing my regression analyses in Python. For the 1st order linear regression I have:

Code:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import PolynomialFeatures

df_sn = df.copy()

# add a log_strain and log_life column
df_sn["log_strain"] = np.log10(df_sn["strain_range"])
df_sn["log_life"] = np.log10(df_sn["fatigue_life"])

# y: the variable which is dependent on x
# --> log_life is the dependent variable
y = df_sn['log_life']

# X: the independent variable, with an intercept column added
# (sm.OLS does not include one by default)
# --> log_strain is the independent variable
X = sm.add_constant(df_sn['log_strain'])

lin_reg = sm.OLS(y, X).fit()

print(lin_reg.params)
lin_reg.summary()
For the 2nd order linear regression I have:

Code:
# Create a 2nd order polynomial
poly = PolynomialFeatures(degree=2)

# Fit & transform the log_strain values into [1, x, x^2] columns
poly_feats = poly.fit_transform(np.array(df_sn['log_strain']).reshape(-1, 1))

X = poly_feats
y = df_sn.log_life

# Ordinary least squares regression with 2nd order polynomial
# (PolynomialFeatures already provides the constant column)
quad_reg = sm.OLS(y, X).fit()

print(quad_reg.params)
quad_reg.summary()
According to the standard, the need for a quadratic model can be determined using a "general linear test". The general linear test uses the residual sum of squares, RSSE, and the corresponding degrees of freedom for each of the fitted linear and quadratic expressions. They give the following formula:

Now for the dataset in the first image, they report these results:

I get the same results for the R^2 and Std. Dev., so I'm pretty sure my models have been fitted correctly and in the same way. But when I try to determine the F calculated, I'm unable to figure out what I'm doing wrong.

When I try to calculate it according to the formula of the standard as such:

Code:
>>> rsse1, rsse2 = lin_reg.ssr, quad_reg.ssr
>>> v1, v2 = lin_reg.df_resid, quad_reg.df_resid
>>> print(v1, v2)
17.0 16.0
>>> print(Fcalc)
-0.03330950467712325
I get something completely different from what is presented in the standard. I then tried to determine the F calculated like this:

Code:
>>> from statsmodels.stats.anova import anova_lm
>>> anovaResults = anova_lm(lin_reg, quad_reg)
>>> print(anovaResults)
   df_resid       ssr  df_diff   ss_diff          F    Pr(>F)
0      17.0  1.484616      0.0       NaN        NaN       NaN
1      16.0  0.739930      1.0  0.744686  16.102855  0.001005
Which gives a completely different F. I'm not even sure if I'm using anova_lm correctly. Could somebody help me figure out the correct way of determining whether the 1st order and 2nd order models are significantly different?
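For reference, here is the usual nested-model (partial F-test) form of the general linear test computed by hand from my RSSE and residual-df values above. I'm assuming the standard's formula follows this common form, which may not be the case given the 8.02 it reports:

```python
# Partial F-test (general linear test) by hand, using the RSSE and
# residual df values from the two fits above. The formula below is
# the common nested-model form; the exact expression in the standard
# may differ.
rsse1, v1 = 1.484616, 17.0  # 1st order model: RSSE, residual df
rsse2, v2 = 0.739930, 16.0  # 2nd order model: RSSE, residual df

F = ((rsse1 - rsse2) / (v1 - v2)) / (rsse2 / v2)
print(F)  # ~16.10, the same value anova_lm reports
```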

hlsmith

Less is more. Stay pure. Stay poor.
What is the provided answer? Please post the question as presented in the book, in particular to explain why you are logging the values.

hlsmith

Less is more. Stay pure. Stay poor.
P.S., I imagine this is for educational purposes, but a review of the residuals and a visualization of the data is typically superior to testing whether a model has better fit.
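For example, a minimal sketch of that residual check on made-up curved data (using NumPy's polyfit instead of statsmodels, purely illustrative and not the thread's dataset):

```python
import numpy as np

# Made-up data with mild curvature (NOT the thread's dataset;
# just to illustrate the residual check)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 1.0 - 2.0 * x + 0.5 * x**2 + rng.normal(scale=0.02, size=x.size)

# Residuals of a 1st order and a 2nd order polynomial fit
resid1 = y - np.polyval(np.polyfit(x, y, 1), x)
resid2 = y - np.polyval(np.polyfit(x, y, 2), x)

# The linear fit leaves a systematic bow in resid1 (positive at the
# ends, negative in the middle); resid2 should look like pure noise,
# which is visible even without a formal test.
```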

smokiestprune

New Member
The answer provided is the F calculated in the last image: 8.02. I'm logging the values because it's a linear regression on a log-log plot. It's a regression analysis to determine an S-N curve. In the example here they use a strain-life curve, but it's the same principle (just strain instead of stress for the y-axis).

E.G.

The reason I need to do the test is that the standard says it is required when reporting the data in a test report.