Cyclical Data Transformation

For data that may look linear but is in fact cyclical (time, most commonly), is there a sine-based data transformation that has been proven to work well? I feel kind of silly making dummy fields, and now that I'm playing around with multiple imputation, I'm worried that if I'm missing a month, the imputation may set both March and May to 1 if I dummy-code the data.

Thanks, brain trust
So I grabbed Google Flu data, which is seasonal, and performed a traditional dummy-variable analysis using aggregate United States flu values:
lm(formula = Flu ~ FEB + MAR + APR + MAY + JUN + JUL + AUG +
SEP + OCT + NOV + DEC, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-1987.8  -406.1  -146.7    38.0  5544.0

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 2431.3 151.4 16.063 < 2e-16 ***
FEB 545.5 218.9 2.492 0.0131 *
MAR -413.0 217.2 -1.902 0.0580 .
APR -1226.3 217.2 -5.646 3.16e-08 ***
MAY -1419.0 212.6 -6.675 8.56e-11 ***
JUN -1562.6 226.6 -6.896 2.17e-11 ***
JUL -1667.2 222.5 -7.493 4.59e-13 ***
AUG -1543.6 220.6 -6.996 1.16e-11 ***
SEP -978.7 224.5 -4.360 1.67e-05 ***
OCT -459.2 214.1 -2.145 0.0325 *
NOV -284.1 215.6 -1.318 0.1883
DEC 220.7 217.2 1.016 0.3102
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 908.1 on 389 degrees of freedom
Multiple R-squared: 0.3988, Adjusted R-squared: 0.3818
F-statistic: 23.45 on 11 and 389 DF, p-value: < 2.2e-16
You can see that JAN (the intercept) and FEB are positively correlated with flu, and APR-OCT are negatively correlated with it. Somewhat common sense, although I thought I'd see more in NOV, DEC, and MAR. Whatevs.

Then, I performed the following transformation on the date: COS((2*pi)*([day of year]/365)), and regressed the transformed value against Flu:
lm(formula = Flu ~ Transform, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-1315.1  -467.2  -190.4    38.8  5520.6

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1689.10 45.81 36.87 <2e-16 ***
Transform 1001.09 65.13 15.37 <2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 916.5 on 399 degrees of freedom
Multiple R-squared: 0.3719, Adjusted R-squared: 0.3703
F-statistic: 236.3 on 1 and 399 DF, p-value: < 2.2e-16
I specifically chose cosine because this makes it peak toward the end/beginning of the year and hit its nadir in the summer, giving me the expected flu curve. You can see the result I got, and it is significant, but how do I compare the two results?
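One common way to compare the two fits, since they use different numbers of parameters, is AIC. A minimal sketch in R, assuming a data frame `data` with columns `Flu`, `Month`, and `DayOfYear` (those column names are my guesses, not from the thread):

```r
# Assumed columns: Flu (counts), Month ("JAN".."DEC"), DayOfYear (1..365).
data$Transform <- cos(2 * pi * data$DayOfYear / 365)

m_dummy <- lm(Flu ~ factor(Month), data = data)  # same fit as the 11-dummy model
m_cos   <- lm(Flu ~ Transform,     data = data)

AIC(m_dummy, m_cos)  # lower AIC is preferred
```

AIC trades fit against parameter count, which matters here because the dummy model spends 11 degrees of freedom where the cosine model spends only 1.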

So I used a random 60% sample of the flu data as a training set for both of my models above, re-crunched the regressions, and then ran the models on the 40% test set. I then did a paired t-test of the actual flu numbers versus the predictions from each model:

Test           | Mean    | CI95 lo | CI95 hi | Median  | 25th%  | 75th%   | p (paired t)
Observed       | 1631.08 | 1485.97 | 1776.19 | 1508    | 902    | 2021    | (ref)
Dummy          | 1785.18 | 1663.13 | 1907.24 | 1933.6  | 993.6  | 2515.4  | 0.0108
Transformation | 1805.18 | 1681.28 | 1929.09 | 1889.56 | 998.37 | 2540.19 | 0.0031

It seems that the transformation model works better at predicting than the dummy model.

What say you, brain trust? Am I reaching?


Ninja say what!?!
> It seems that the transformation model works better at predicting than the dummy model.
>
> What say you, brain trust? Am I reaching?
I think that if your goal is prediction, you shouldn't worry about reaching, but rather how well your model is predicting. Rather than a t-test, why not use cross-validation? For the 40% test set, use the model derived from the training set and calculate predicted values. Then see how far off your actual values are from the predicted ones and average over the set.
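That suggestion might look like the following in R. This is only a sketch; `data`, `Flu`, and `Transform` are assumed names carried over from the earlier posts:

```r
set.seed(42)  # for a reproducible split
idx   <- sample(nrow(data), size = round(0.6 * nrow(data)))
train <- data[idx, ]
test  <- data[-idx, ]

m    <- lm(Flu ~ Transform, data = train)
pred <- predict(m, newdata = test)

mean(abs(test$Flu - pred))       # mean absolute error on the held-out 40%
sqrt(mean((test$Flu - pred)^2))  # root mean squared error
```

Repeating the split many times (or using k-fold cross-validation) and averaging the errors gives a more stable comparison than a single 60/40 split.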
Thanks, link, for answering! As you say, the proof is in the pudding. I would like to know, though, whether some dedicated statisticians have proven mathematically that such a transformation is valid.
The problem with this transformation is that if you don't know whether there is a cyclical relationship, you have to try a sine and a cosine transformation, regressed independently, and pick the better result, which REEKS of post hoc analysis. Also, by picking the better of the two transformations, it is conceptually possible that an artifactual cyclicality will appear, one with a nice p-value that doesn't actually occur in real life.
That's why I want to know if someone has blazed this trail already. Not necessarily with sine and cosine, but something that'll turn cyclical data into continuous data.
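For what it's worth, the standard way around the sine-versus-cosine choice is to put both terms in one regression (harmonic regression); the fit then estimates amplitude and phase jointly, so no post hoc pick between the two is needed. A sketch, with the column names assumed as before:

```r
data$S <- sin(2 * pi * data$DayOfYear / 365)
data$C <- cos(2 * pi * data$DayOfYear / 365)

m <- lm(Flu ~ S + C, data = data)

# The two coefficients combine into a single cycle with a free phase:
b <- coef(m)
amplitude <- sqrt(b["S"]^2 + b["C"]^2)
phase     <- atan2(b["S"], b["C"])  # radians; where in the year the peak falls
```

Because any phase-shifted cosine can be written as a weighted sum of one sine and one cosine, this single model covers both of the separate transformations (and everything in between).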



TS Contributor
You have stumbled on something called a Fourier transform.

Every function Y(x) (not just cyclical ones) can be represented in the form you see in the paper. The sin/cos series you see is called an 'orthogonal basis'. How many sines and cosines will you include in your model? Infinity? 2? 3? Maybe 22?
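A truncated version of that series can be fit directly as a regression: keep the first k harmonics as predictors. A sketch in R, where k and the column names are my assumptions:

```r
k <- 2                       # number of harmonics to keep
t <- data$DayOfYear / 365    # fraction of the year

for (j in 1:k) {
  data[[paste0("S", j)]] <- sin(2 * pi * j * t)
  data[[paste0("C", j)]] <- cos(2 * pi * j * t)
}

# Builds "Flu ~ S1 + C1 + S2 + C2" for k = 2
terms <- paste0(c("S", "C"), rep(1:k, each = 2), collapse = " + ")
m <- lm(as.formula(paste("Flu ~", terms)), data = data)
```

Choosing k can itself be done by AIC or cross-validation: each extra harmonic adds two parameters and captures a finer wiggle in the seasonal shape.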
So I did a Fourier transform on the flu data with R, and I think I got imaginary numbers back. Here are some samples:

Fraction of year | Transformed
0.742465753      | 199.54794521 + 0.00000000i
0.761643836      |  -2.24882026 + 0.80043764i
0.780821918      |  -0.55672289 + 0.75627255i
0.800000000      |  -1.81340395 + 0.95325734i

What on earth do I do with this??


Ambassador to the humans
Well, it sounds like you applied a Fourier transform to a specific data set, and yes, you will most likely get complex (imaginary) results.

I'm guessing what you wanted was some sort of Fourier approximation?
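In base R, `fft()` returns complex coefficients; taking their modulus gives the amplitude spectrum, and the largest peak (ignoring the zero-frequency term) shows the dominant cycle. A sketch, assuming `flu` is an evenly spaced numeric series (the name is mine):

```r
z   <- fft(flu)
n   <- length(flu)
amp <- Mod(z) / n            # amplitude at each frequency

# Skip element 1 (the mean / zero-frequency term) and look at the first half:
spec     <- amp[2:(n %/% 2)]
dominant <- which.max(spec)  # cycles per series length
period   <- n / dominant     # samples per cycle
```

A series with one annual cycle sampled weekly should show its dominant peak near one cycle per roughly 52 samples, which is one way to confirm the seasonality before committing to a sine/cosine model.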
Thanks for replying; sorry for the delay in my response. I think my spam filter is catching my subscriptions to this thread.

I searched for Fourier approximation in R, and I can't find much practical stuff. Is there a way I can do this magical mystery Fourier approximation?