Correlation to determine the best reference series for homogenization

Before asking this, I read similar questions, but none of them led to a satisfying answer for my specific case.

I want to homogenize a 64-year (1940-2003) precipitation time series from the Dominican Republic. For that, it is really important to select a good reference series from a group of candidates.

Let's say "sjo" is the base series, for which I want to find a good reference series; "bani", "plc" and "ra" are the reference candidates, because they are close to "sjo". In the attached map (JPG), the red point is the base station and the green ones are the reference candidates:

I ran three correlation analyses (in R, with the function cor()), one for each of these monthly variables: raw precipitation, normalized difference, and Box-Cox transformed values. These variables correspond, respectively, to the fields that begin with "p", "dian" and "pnorm".

The normalized difference comes from the first difference series method (FDM) proposed by Peterson, and is computed as:
[Pm(t) - Pm(t-1)] / [Pm(t) + Pm(t-1)],
where Pm(t) is the precipitation for month m in year t, and Pm(t-1) is the precipitation for the same month one year earlier. I followed the remark in Peterson et al. (1998) that FDM applied to precipitation might work better with the normalized difference.
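A minimal sketch of how this normalized difference could be computed in R for a monthly series stored as one long vector. The helper name norm_diff, the made-up values, and the choice to treat 0/0 (two dry months) as missing are assumptions for illustration only:

```r
# Normalized first difference for a monthly series:
# [P(t) - P(t-12)] / [P(t) + P(t-12)], i.e. same month, one year earlier.
# `norm_diff` is a hypothetical helper name, not a function from a package.
norm_diff <- function(p, lag = 12) {
  stopifnot(length(p) > lag)
  n <- length(p)
  cur  <- p[(lag + 1):n]   # value in year t
  prev <- p[1:(n - lag)]   # same month in year t-1
  d <- (cur - prev) / (cur + prev)
  # 0/0 (both months dry) gives NaN; treated as missing here (an assumption)
  d[!is.finite(d)] <- NA
  # pad the first year with NA so the result lines up with the input
  c(rep(NA, lag), d)
}

# Made-up monthly totals (mm) for two consecutive years
p <- c(10, 20, 0,  5, 80, 120, 60, 90, 150, 130,  70, 30,
       30, 20, 0, 15, 40, 120, 20, 30,  50, 130, 210, 10)
dian <- norm_diff(p)
dian[13:16]
```

The first twelve values are NA because there is no previous year to compare with.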

As can be seen on page 1 of the attached PDF, correlation was calculated for the whole time series (1940-2003). For raw precipitation and Box-Cox transformed values, "bani" is the best correlated with "sjo" (cells with a yellow background show the maximum correlation coefficient). Notice that for raw precipitation, "bani" is much more strongly correlated than the others. For the normalized difference, "ra" is only slightly more correlated than the rest. However, every candidate station has a statistically significant correlation with "sjo" at the 0.05 significance level, suggesting that ANY of them could be used as a reference series.
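For readers who want to reproduce this kind of whole-series comparison, here is a rough sketch with synthetic data. The series are made up (noisy copies of a random "sjo"), so only the mechanics carry over: cor.test() gives both the coefficient and its p-value for each candidate. Note that each p-value only tests whether that correlation differs from zero, not whether one candidate is better than another.

```r
set.seed(1)
n <- 64 * 12                              # monthly values, 1940-2003
sjo <- rgamma(n, shape = 2, scale = 50)   # synthetic base series
# Synthetic candidates: copies of sjo with increasing noise (illustration only)
cand <- data.frame(
  bani = sjo + rnorm(n, sd = 30),
  plc  = sjo + rnorm(n, sd = 60),
  ra   = sjo + rnorm(n, sd = 90)
)

# Pearson correlation and p-value of each candidate against sjo
res <- t(sapply(cand, function(x) {
  ct <- cor.test(sjo, x)
  c(r = unname(ct$estimate), p.value = ct$p.value)
}))
res

# Candidate with the highest correlation
best <- rownames(res)[which.max(res[, "r"])]
best
```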

This is a bit confusing, so I decided to make a more detailed analysis, splitting the series into 5-year periods and evaluating the correlation between the series for the same 3 variables: raw precipitation, normalized difference and Box-Cox transformed values.

The tables on pages 2 to 8 of the attached PDF show the results of these per-period correlations; the last page summarizes how many times each station had the maximum correlation value for each variable. As can be seen, "bani" most frequently has the highest correlation for all 3 variables (in every case, in more than 7 of the twelve 5-year periods analyzed).
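The per-period procedure can be sketched roughly as follows. The data are synthetic again, and I am assuming the twelve periods are 1940-1944 through 1995-1999, so that 2000-2003 falls outside the blocks; adjust the breaks if your periods are defined differently.

```r
set.seed(2)
years <- rep(1940:2003, each = 12)        # year of each monthly value
n <- length(years)
sjo <- rgamma(n, shape = 2, scale = 50)   # synthetic base series
cand <- data.frame(                       # synthetic candidates
  bani = sjo + rnorm(n, sd = 30),
  plc  = sjo + rnorm(n, sd = 60),
  ra   = sjo + rnorm(n, sd = 90)
)

# Twelve 5-year blocks, [1940,1945) ... [1995,2000); later years become NA
block <- cut(years, breaks = seq(1940, 2000, by = 5), right = FALSE)
idx <- split(seq_len(n), block)           # NA blocks are dropped by split()

# Correlation of each candidate with sjo inside each block (rows = blocks)
r_by_block <- t(sapply(idx, function(i) sapply(cand[i, ], cor, y = sjo[i])))
round(r_by_block, 2)

# How often each candidate has the highest correlation
winner <- colnames(r_by_block)[max.col(r_by_block, ties.method = "first")]
wins <- table(factor(winner, levels = colnames(r_by_block)))
wins
```

Counting wins ignores how large the margin is, so it may be worth also looking at the mean or median correlation per candidate across blocks.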

With these results, I think that "bani" is the best candidate as a reference series for "sjo", but I'm not sure about it. Is the 5-year period analysis OK? Should I carry out some other analysis?


Hi nonviolencia,

Personally, I think working with correlations is one of the right ways to find a reference for Jose de Ocoa. Yet you seem to struggle to find an argument for your decision to make "bani" your reference station.

Let me try to give another possible way of basing your decision on facts.

I would suggest doing a linear regression (if your data are normally distributed, or nearly so). Let the rain data for Jose be the dependent variable. The independent variables are then the values of the other three stations. Check the p-values for each station and also look at the standardized regression coefficients. I think the standardized regression coefficient for bani should be the highest, leading you to the conclusion that it might be the best reference. As you are using R, I will post an example with random values.

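# Generate totally randomized rain values. Let's pretend they are liters per year
year <- seq(1940, 2003, 1)  # years -> use your own data
sjo.rain   <- as.numeric(arima.sim(n = 64, list(ar = c(0.999999))))  # randomized -> use your own data
pbani.rain <- as.numeric(arima.sim(n = 64, list(ar = c(0.999997))))  # randomized -> use your own data
plc.rain   <- as.numeric(arima.sim(n = 64, list(ar = c(0.999991))))  # randomized -> use your own data
pra.rain   <- as.numeric(arima.sim(n = 64, list(ar = c(0.999992))))  # randomized -> use your own data
# Build data.frame
rain <- data.frame(year, sjo.rain, pbani.rain, plc.rain, pra.rain)
# Correlation matrix
cor(rain[, -1])
par(mfrow = c(2, 2))  # visualize the correlations
plot(sjo.rain ~ pbani.rain, data = rain)
plot(sjo.rain ~ plc.rain, data = rain)
plot(sjo.rain ~ pra.rain, data = rain)
# Linear regression on all reference candidates - normal distribution is not always given, but this is just for showcasing!
fit <- lm(sjo.rain ~ pbani.rain + plc.rain + pra.rain, data = rain)
summary(fit)  # p-values for each station
# Calculation of the standardized regression coefficients
library(QuantPsyc)  # install the package first: install.packages("QuantPsyc")
lm.beta(fit)  # get standardized regression coefficients
# The candidate with the largest absolute standardized coefficient has the highest "influence" on the dependent variable sjo.rain and would be the one to use as a reference.
# In my run with randomized values that was pbani.rain; the values are random, so another run can give a different result.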
You may of course also use a simple t-test to look for significant differences between the station means. Here I think you should look for the lowest t-value (hence the highest p-value): a low t and a high p mean there is no significant difference between two stations, which may be an indicator of similar means.
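A small sketch of that comparison, again with made-up yearly values (the means and standard deviations below are arbitrary assumptions). It runs t.test() for sjo against each candidate and sorts the candidates by the absolute t-value, lowest first:

```r
set.seed(3)
n <- 64
sjo <- rnorm(n, mean = 1000, sd = 150)   # made-up yearly totals
cand <- list(                            # made-up candidates
  pbani = rnorm(n, mean = 1000, sd = 150),
  plc   = rnorm(n, mean = 1100, sd = 150),
  pra   = rnorm(n, mean = 1300, sd = 150)
)

# Welch two-sample t-test (the t.test() default) of sjo against each candidate
tt <- t(sapply(cand, function(x) {
  res <- t.test(sjo, x)
  c(t = unname(res$statistic), p.value = res$p.value)
}))

# Candidates ordered from most to least similar mean
tt[order(abs(tt[, "t"])), ]
```

Keep in mind that a non-significant difference in means says nothing about whether the year-to-year variations match, so this works best as a complement to the correlation and regression results.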

With best regards

Thanks, Sebastian, for this detailed answer. I'll certainly try these steps and let you know.

Also, I invite you to see a discussion on this same topic in another forum...

...where you'll see that I applied a t-test to assess the differences between the correlation values of pairs of variables. But your suggestion of doing a t-test on the same variable between two stations is a good idea, and I will try it.