Change Point Detection using Least Sum of Squared Residuals (Two Phase Linear Regression)

#1
I have cumulative rainfall data and would like to detect the change point with the least sum of squared residuals (SSR) using a two-phase linear regression model. Here is the data:

Code:
a<-structure(list(DAY=1:200,CUMSUM=c(0.4975167,0.4975167,0.4975167,0.4975167,0.4975167,
0.4975167,0.55045359,0.6252087,0.68326339,0.77695034,0.77695034,0.77695034,0.77695034,
0.77695034,0.77695034,0.77695034,0.77695034,0.77695034,0.77695034,0.77695034,0.77695034,
0.77695034,0.77695034,0.77695034,0.77695034,0.77695034,0.77695034,0.78782371,0.78782371,
0.78782371,1.03950021,1.03950021,1.03950021,1.044249829,1.887166329,7.275246329,15.43777033,
15.86143903,20.02444103,25.91613603,26.27224823,31.62583723,31.62583723,31.62583723,
31.62583723,32.06164673,32.06164673,32.09113609,32.09113609,32.31486939,32.53086649,
32.69404529,32.69887801,32.69887801,32.69887801,32.69887801,32.69887801,32.69887801,
32.69887801,32.69887801,32.69887801,32.76850286,32.76850286,32.76850286,34.38806886,
35.15059696,35.15059696,35.17191016,35.17191016,35.17191016,35.89604506,37.79523006,
38.91062906,42.01345806,43.07697206,43.24430266,47.23448666,47.64692766,47.64692766,
47.64692766,47.64692766,47.64692766,47.71434354,49.10115554,49.60093624,49.71193614,
49.71193614,49.71193614,49.71193614,49.71193614,49.75737655,50.03237955,50.49420995,
50.543521,53.758917,69.469847,71.634262,80.561103,81.0511546,81.8669166,82.2741689,
84.8077339,92.6058159,94.8547169,95.2502439,95.2502439,95.2502439,96.3743419,
106.7631619,117.8849019,118.9028679,124.7232399,131.9479449,144.0681049,157.2011649,
170.0676949,171.5463129,173.2228369,174.8507509,176.5680759,177.5140754,179.8159774,
180.3869275,180.708029,182.810761,205.045081,208.064288,221.407228,223.440328,
225.378739,227.574139,230.316327,234.359699,239.339686,249.285726,254.530601,258.851446,
259.876842,262.868797,269.3764,279.346905,289.781865,296.474332,316.070712,360.530472,
394.090652,420.136432,427.588307,435.5426,454.47683,475.07557,476.34619,480.382171,
485.839454,487.668204,491.538405,518.020495,551.653865,574.162415,588.321755,607.128845,
619.989315,643.445565,670.522415,687.704505,697.931485,713.849635,726.942465,736.040755,
753.143285,767.589345,780.219885,781.401867,781.6820652,781.6820652,781.9534316,782.0640614,
782.0960854,782.1381057,782.1381057,782.1381057,782.2945485,782.4209258,784.6749738,
789.4316768,804.3474768,819.1349368,834.4669568,836.6907208,854.9105708,858.1095158,
862.6569488,864.7032878,867.1775338,873.0479408,877.8382878,896.0620678,927.5685878,
962.9229278,992.6912478), .Names = c("DAY","CUMSUM"), class = "data.frame",
row.names = c(NA, -200L))

The method of change-point detection with least SSR using two-phase linear regression is given in the picture below:

[Attachment: method.png]

The expected output should be plotted as in the figure below:

[Attachment: Figure.png]

I searched a lot but could not find a suitable package in R.
Can you suggest a method to get this result in R or NCL?
Thanks in advance.

Link to the research article that follows the same method, i.e. change-point detection with least SSR using two-phase linear regression
 
#7
Have you looked at the changepoint package?
Hi @Dason, I installed the package 'chngpt' and found that it works only with a threshold limit input by the user.
But here I want the change point with the least sum of squared residuals using two-phase linear regression.
An application of this method is given in a nice article accessible from this link:
Link to the research article that follows the same method, i.e. change-point detection with least SSR using two-phase linear regression
A crisp and clear outline of the method from the above-mentioned article, which is the method I am talking about, is attached here as an image file.
Please have a look at it.
I hope you got it @Dason.
 

Dason

Ambassador to the humans
#8
I just checked out that package... It wasn't the one I mentioned, but it did mention the two-phase method you brought up, so maybe you should dig into its documentation more. Or use the actual package I mentioned.
 
#9
Dr. @Dason,
Thank you for the important reply.
I found that change-point detection in the package "changepoint" is based on changes in mean, variance, or both [cpt.mean(), cpt.var(), and cpt.meanvar()].
What I asked about is change-point detection with the least sum of squared residuals using two-phase linear regression.
I hope it is relevant info.
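For reference, here is a minimal sketch of what the changepoint package does on these data instead: it looks for a single change in mean and variance in the daily increments, which is not the least-SSR two-phase regression I am asking about. The call below is just my assumption of typical usage with the data frame a from post #1.

Code:
# Mean/variance change point on the daily rainfall increments (changepoint package)
library(changepoint)                           # install.packages("changepoint") if needed
daily <- c(a$CUMSUM[1], diff(a$CUMSUM))        # daily rainfall from the cumulative sums
fit.mv <- cpt.meanvar(daily, method = "AMOC")  # single change in mean and variance
cpts(fit.mv)                                   # estimated change-point location
plot(fit.mv)                                   # data with the fitted segments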
 

hlsmith

Less is more. Stay pure. Stay poor.
#11
Alright, I am back. Yes, the initial package @Dason referenced does seem to be based on means and variances, which is interesting. My current dataset is cumulative counts, though. I am going to turn it into daily counts and explore the chngpt package, which seems to work with Poisson models; however, my counts will be large, so they may be well approximated by a Gaussian.

Of note, my series of counts shows exponential growth, though an exogenous event occurred mid-series to dampen it (the change point).

Of note, my brain has never really been able to process using the bootstrap and cross-validation with time series. How does the former work, given that you would be resampling ordered data? The latter could use a subsequent series, but if applied to the current data, wouldn't it also pull out ordered observations, creating holes in the series along with redundancies for some dates?

We will see.
 

hlsmith

Less is more. Stay pure. Stay poor.
#12
OK, I was able to run my code successfully, and it gave me the change point on the day I had already identified visually. I had content knowledge about my data-generating process, though the impact was lagged, so there was some uncertainty about when the change occurred. However, when I subsequently model this series I use the selected change-point day + 1, which looks like a better fit and was what I had already selected for the change. The change point + 1 was listed in the output as the upper value for the change per the procedure. Regardless, I would have selected it based on the visual fit, though it is nice to have analytic support for the selection, even if it is post hoc.

I just used the following code:
Code:
# convert the cumulative daily counts back to daily counts; 3 was the count on day 1
Infected_Counts <- ave(Infected, FUN=function(x) c(0, diff(x)))
Infected_Counts[1] <- 3


install.packages("chngpt")
library(chngpt)

x <- 1:length(Infected_Counts)
print(d.AD <- data.frame(x, Infected_Counts))
fit.4=chngptm(formula.1=Infected_Counts ~ 1, formula.2=~x, data=d.AD, family="poisson",
              type="segmented", var.type="bootstrap", verbose=1, ci.bootstrap.size=1)
summary(fit.4)
plot(fit.4)
I would be happy to discuss the output and actual analytic procedure to ensure I know what is going on myself.
 

hlsmith

Less is more. Stay pure. Stay poor.
#13
Hey, I hope you don't mind that I am commandeering this thread while I rationalize things. The output provides the following:

chngpts = 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
fastgrid.ok = FALSE ; est.method = grid
logliks = -751.5650216 -747.2395804 -742.6638648 -737.0285905 -730.930023 -725.7322149 -720.0614531 -712.7695921 -707.8429454 -704.3647837 -700.6251235 -696.9187636 -693.1645942 -690.5993699 -691.0295108 -693.99184 -696.2312196 -701.4691934 -707.3874673 -712.1572056 -716.479383 -721.1186161 -725.7595248 -729.9121133 -733.293471 -736.5864436 -739.4186653 -742.379522 -745.030937 -747.2982942 -748.1365563 -748.4082391 -749.2013441 -748.388589 -746.5350909 -744.0474837 -739.9317112 -739.342964 -738.2807113 -743.2383354 -741.7844046 -732.8087177 -727.9708072 -737.1333515

I can see that the log-likelihood for chngpt = 20 is the highest (least negative), so that is the change point selected. I may skim the package documentation now to understand the selection process and what model/submodel comparison is happening.
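As a quick post-hoc check, here is a minimal base-R sketch that retypes the grid and log-likelihood values from the verbose output above and picks the maximizer (this just reproduces the grid-search selection, not the full chngptm machinery):

Code:
# Candidate change points and their profile log-likelihoods,
# retyped from the verbose chngptm output above
chngpts <- 7:50
logliks <- c(-751.5650216, -747.2395804, -742.6638648, -737.0285905, -730.930023,
             -725.7322149, -720.0614531, -712.7695921, -707.8429454, -704.3647837,
             -700.6251235, -696.9187636, -693.1645942, -690.5993699, -691.0295108,
             -693.99184, -696.2312196, -701.4691934, -707.3874673, -712.1572056,
             -716.479383, -721.1186161, -725.7595248, -729.9121133, -733.293471,
             -736.5864436, -739.4186653, -742.379522, -745.030937, -747.2982942,
             -748.1365563, -748.4082391, -749.2013441, -748.388589, -746.5350909,
             -744.0474837, -739.9317112, -739.342964, -738.2807113, -743.2383354,
             -741.7844046, -732.8087177, -727.9708072, -737.1333515)
chngpts[which.max(logliks)]   # the grid point with the maximum log-likelihood (20)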
 

hlsmith

Less is more. Stay pure. Stay poor.
#14
@Dason - how do I access the bootstrap samples? I wanted to better see what is going on!

Code:
install.packages("chngpt")
library(chngpt)
set.seed(2019)
x <- 1:length(Infected_Counts)
print(d.AD <- data.frame(x, Infected_Counts))
fit.4=chngptm(formula.1=Infected_Counts ~ 1,
              formula.2=~x,
              data=d.AD,
              family="poisson",
              type="segmented",
              var.type="bootstrap",
              verbose=1,
              ci.bootstrap.size = 100, #can change this
              alpha = 0.05,
              save.boot = TRUE)
summary(fit.4)
plot(fit.4)
Side note, here is the paper associated with the package: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1863-x

P.S. The package has the dependency: boot
P.P.S. The chngptm function code: https://rdrr.io/cran/chngpt/src/R/chngptm.R
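I have not found a documented accessor for the saved bootstrap replicates myself, so as a sketch I would just inspect the fitted object and look for whatever save.boot = TRUE stored (the vcov component below is a guess, not a documented slot):

Code:
# Poke around the fitted chngptm object for the saved bootstrap pieces
names(fit.4)                      # top-level components of the fit
str(fit.4$vcov, max.level = 1)    # with var.type = "bootstrap" the replicates
                                  # are presumably kept somewhere in here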
 

hlsmith

Less is more. Stay pure. Stay poor.
#15
Well, as I am patiently waiting for @Dason to enlighten me, I ran your data. @Dason, see the following for a reproducible example.

Code:
rain <- c(0.4975167,0.4975167,0.4975167,0.4975167,0.4975167,
  0.4975167,0.55045359,0.6252087,0.68326339,0.77695034,0.77695034,0.77695034,0.77695034,
  0.77695034,0.77695034,0.77695034,0.77695034,0.77695034,0.77695034,0.77695034,0.77695034,
  0.77695034,0.77695034,0.77695034,0.77695034,0.77695034,0.77695034,0.78782371,0.78782371,
  0.78782371,1.03950021,1.03950021,1.03950021,1.044249829,1.887166329,7.275246329,15.43777033,
  15.86143903,20.02444103,25.91613603,26.27224823,31.62583723,31.62583723,31.62583723,
  31.62583723,32.06164673,32.06164673,32.09113609,32.09113609,32.31486939,32.53086649,
  32.69404529,32.69887801,32.69887801,32.69887801,32.69887801,32.69887801,32.69887801,
  32.69887801,32.69887801,32.69887801,32.76850286,32.76850286,32.76850286,34.38806886,
  35.15059696,35.15059696,35.17191016,35.17191016,35.17191016,35.89604506,37.79523006,
  38.91062906,42.01345806,43.07697206,43.24430266,47.23448666,47.64692766,47.64692766,
  47.64692766,47.64692766,47.64692766,47.71434354,49.10115554,49.60093624,49.71193614,
  49.71193614,49.71193614,49.71193614,49.71193614,49.75737655,50.03237955,50.49420995,
  50.543521,53.758917,69.469847,71.634262,80.561103,81.0511546,81.8669166,82.2741689,
  84.8077339,92.6058159,94.8547169,95.2502439,95.2502439,95.2502439,96.3743419,
  106.7631619,117.8849019,118.9028679,124.7232399,131.9479449,144.0681049,157.2011649,
  170.0676949,171.5463129,173.2228369,174.8507509,176.5680759,177.5140754,179.8159774,
  180.3869275,180.708029,182.810761,205.045081,208.064288,221.407228,223.440328,
  225.378739,227.574139,230.316327,234.359699,239.339686,249.285726,254.530601,258.851446,
  259.876842,262.868797,269.3764,279.346905,289.781865,296.474332,316.070712,360.530472,
  394.090652,420.136432,427.588307,435.5426,454.47683,475.07557,476.34619,480.382171,
  485.839454,487.668204,491.538405,518.020495,551.653865,574.162415,588.321755,607.128845,
  619.989315,643.445565,670.522415,687.704505,697.931485,713.849635,726.942465,736.040755,
  753.143285,767.589345,780.219885,781.401867,781.6820652,781.6820652,781.9534316,782.0640614,
  782.0960854,782.1381057,782.1381057,782.1381057,782.2945485,782.4209258,784.6749738,
  789.4316768,804.3474768,819.1349368,834.4669568,836.6907208,854.9105708,858.1095158,
  862.6569488,864.7032878,867.1775338,873.0479408,877.8382878,896.0620678,927.5685878,
  962.9229278,992.6912478)
rain
dailyrain <- ave(rain, FUN=function(x) c(0, diff(x)))
dailyrain[1] <- 0.4975167
dailyrain
set.seed(2019)
x <- 1:length(dailyrain)
print(d.AD <- data.frame(x, dailyrain))
fit.rain=chngptm(formula.1=dailyrain ~ 1,
              formula.2=~x,
              data=d.AD,
              family="gaussian",
              type="hinge",
              var.type="bootstrap",
              verbose=1,
              ci.bootstrap.size = 10, #can change this
              alpha = 0.05,
              save.boot = TRUE)
summary(fit.rain)
# the log-likelihood values reported in the verbose output come from fitting this model with chngptm
plot(fit.rain)
Let me know if you figure out where it saves the bootstrap samples. It says your change point is 67 (CI: 44, 90).
[Attachment: plot of the fitted change-point model]

P.S. The residuals seem heteroscedastic; not sure what that may mean for your intentions.
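To eyeball that, here is a minimal base-R sketch (not chngpt output) that refits a simple hinge model at the reported change point of day 67 and plots the residuals against the day; the hinge term below is my own construction for illustration:

Code:
# Refit a one-knot (hinge) model at the reported change point and
# look at the residual spread before vs. after it
cp <- 67                                   # change point reported by chngptm above
d.AD$hinge <- pmax(d.AD$x - cp, 0)         # 0 before cp, (x - cp) after
refit <- lm(dailyrain ~ hinge, data = d.AD)
plot(d.AD$x, resid(refit), xlab = "Day", ylab = "Residual")
abline(h = 0, lty = 2)
abline(v = cp, lty = 3)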
 
#16
Hello @hlsmith, I am here. Sir, the explanation and the change-point detection at the 0.95 significance level are really wonderful.
But I was curious about change-point detection with the minimum sum-of-squared-residuals value using two-phase linear regression.
Here is the method I was talking about.
I request you to have a look at the image below.
With this method, the change point should lie at day 121 or 122.
Dear Sir, I hope you don't mind, and that this picture makes things clearer.
Your input on this is highly important. Waiting for your reply.

Link to the research article that follows the same method, i.e. change-point detection with least SSR using two-phase linear regression

[Attachment: method.png]
 

hlsmith

Less is more. Stay pure. Stay poor.
#17
Well, this is not my skill set, but you could feed the number of observations into a function to create candidate cut points, then run each split through lm() to get the SSR values, or just run every combination yourself :)
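Something like this quick-and-dirty base-R sketch of that brute-force idea, using the data frame a from post #1. Note it fits the two lines independently (they are not forced to meet at the change point), which may differ from the article's exact formulation:

Code:
# Brute-force two-phase regression: for every candidate change point cp, fit one
# line to days 1..cp and another to days (cp+1)..n, and keep the cp that gives
# the least total sum of squared residuals (SSR)
n <- nrow(a)
cand <- 3:(n - 3)                          # keep at least 3 points in each segment
ssr <- sapply(cand, function(cp) {
  fit1 <- lm(CUMSUM ~ DAY, data = a[1:cp, ])
  fit2 <- lm(CUMSUM ~ DAY, data = a[(cp + 1):n, ])
  sum(resid(fit1)^2) + sum(resid(fit2)^2)
})
best <- cand[which.min(ssr)]
best                                       # day with the least total SSR

# Plot the data with the two fitted segments at the selected change point
plot(a$DAY, a$CUMSUM, xlab = "DAY", ylab = "CUMSUM")
fit1 <- lm(CUMSUM ~ DAY, data = a[1:best, ])
fit2 <- lm(CUMSUM ~ DAY, data = a[(best + 1):n, ])
lines(a$DAY[1:best], fitted(fit1), lwd = 2)
lines(a$DAY[(best + 1):n], fitted(fit2), lwd = 2)
abline(v = a$DAY[best], lty = 2)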