Which test should I use?

#1
Hello all,

I am a novice in statistics, so I hope I can explain my question clearly. Here goes.

I have a dataset of archaeological sites that are distributed over a large area. I divided this area into regions according to vegetation characteristics. I've summarized the sites' distribution across the different regions, and now I want to compare those results with the natural distribution. Put differently, I want to know whether sites are distributed randomly or according to environmental preferences (e.g. in green areas rather than deserts).

Hope everything is clear
Thanks for your help
 

Karabiner

TS Contributor
#2
How many sites are there, and how many areas? How many different types of areas are there? How many criteria did you use to identify the type of area, and were the criteria quantitative or categorical? The areas are of different sizes, I suppose?

With kind regards

Karabiner
 

gianmarco

TS Contributor
#3
Hello,
I do not know what type of data you have. Are you working with GIS software? Do you have GIS data, for instance a geometry representing your areas and a geometry representing your sites?

Assuming that you have the above type of data, and assuming that your regions completely cover your study area (with no gap in-between), you may want to test if the distribution of points within your set of polygons (totally covering the study area) can be considered random, or if the observed points count in each polygon is larger or smaller than expected.

I have built a function in my R package 'GmAMisc'. The function is called 'pointsInPolygons()'. The calculations for the above scenario are based on the binomial distribution: the probability of observing exactly x points is dbinom(x, size=n.of.points, prob=p), where 'x' is the observed number of points within a given polygon, 'n.of.points' is the total number of points, and 'p' is the area of that polygon divided by the sum of all the polygons' areas. The probability that x or fewer points will be found within a given polygon is pbinom(x, size=n.of.points, prob=p).
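For readers who do not use R, here is a minimal sketch of the same two binomial quantities in plain Python. It is not the code of 'pointsInPolygons()', only an illustration of the dbinom/pbinom calculation described above. The total of 437 points and the area share of 0.59 are the figures the original poster gives later in this thread; the observed count of 230 is made up purely for illustration.

```python
from math import comb


def binom_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p); counterpart of R's dbinom(x, n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p); counterpart of R's pbinom(x, n, p)."""
    return sum(binom_pmf(k, n, p) for k in range(x + 1))


n_points = 437      # total number of sites (from the thread)
area_share = 0.59   # polygon area / total area (from the thread)
observed = 230      # HYPOTHETICAL observed count, for illustration only

expected = n_points * area_share

# Probability of exactly the observed count.
p_exact = binom_pmf(observed, n_points, area_share)
# Probability of the observed count or fewer (fewer points than expected?).
p_lower = binom_cdf(observed, n_points, area_share)
# Probability of the observed count or more (more points than expected?).
p_upper = 1 - binom_cdf(observed - 1, n_points, area_share)

print(f"expected count: {expected:.1f}")
print(f"P(X = {observed})  = {p_exact:.4g}")
print(f"P(X <= {observed}) = {p_lower:.4g}")
print(f"P(X >= {observed}) = {p_upper:.4g}")
```

A small lower-tail probability would suggest fewer points in the polygon than its area alone would predict; a small upper-tail probability would suggest more.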

If you have GIS data, you can feed them into R and use the function.

I attach an example of the output (observed vs. expected counts of points within polygons, and p values). Attachment: Screenshot 2020-10-11 at 19.02.20.png

Best
 
#4
Hi,
Thanks for your reply,
I have 437 sites. For now, I want to look at them together but later I might also decide to subdivide them into smaller groups (three, or seven). This is based on archaeological data such as chronology and typology (tool types).
There are five area types, which were defined by ecologists as environmentally distinct (arid, semi-arid, dunes, human-influence and other).
Data are categorical. The areas are not similar in size.

I'm posting below actual numbers in the hope it might help.

area name        area      %
Saharo-Arabian   43984.6   59
Sand             13977.6   18.7
Irano-Turanian    6405.9    8.6
synanthropic      2710.8    3.6
other             7491.1   10
total area       74570.1  100

number of sites 437

thanks again
 
#5
Hi, Gianmarco
I am working with GIS and do have the geometry. Your suggestion sounds like exactly what I need. The polygons do cover the entire study area, but the environmental zones I'm using (the areas) aren't continuous but patchy (i.e. there are two or more non-neighbouring areas that belong to the same type).
In a later test, I would like to perform a similar test with much smaller areas (thoroughly surveyed areas) that do not cover the entire area.
In any case, I'm ashamed to say I don't know how to work with R.

Thanks for your help
 

Karabiner

TS Contributor
#6
If area type and site density are independent, then the
437 sites would be distributed across types according
to type size.

E.g. 59% of 437 = 258 sites would be expected
in type 1 areas.

You can compare the sample distribution with the expected
distribution across the 5 types using a chi-square test.
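As a rough sketch of that comparison in plain Python: the areas and the total of 437 sites are the figures posted earlier in this thread, but the observed counts per area type are NOT in the thread and are invented here only so that the example runs. The p-value uses the closed-form upper tail of the chi-square distribution, which is valid for an even number of degrees of freedom (here 5 - 1 = 4).

```python
from math import exp

# Areas per type, as posted in the thread (units as given there).
areas = {
    "Saharo-Arabian": 43984.6,
    "Sand": 13977.6,
    "Irano-Turanian": 6405.9,
    "synanthropic": 2710.8,
    "other": 7491.1,
}
n_sites = 437

# HYPOTHETICAL observed site counts (sum to 437); replace with real counts.
observed = {
    "Saharo-Arabian": 230,
    "Sand": 70,
    "Irano-Turanian": 60,
    "synanthropic": 35,
    "other": 42,
}

# Expected counts if sites were spread in proportion to area.
# The listed areas sum to 74570.0, marginally below the posted total of 74570.1.
total_area = sum(areas.values())
expected = {name: n_sites * a / total_area for name, a in areas.items()}

# Pearson chi-square goodness-of-fit statistic.
chi2 = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in areas)
df = len(areas) - 1


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X >= x) for chi-square with even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return exp(-half) * total


p_value = chi2_sf_even_df(chi2, df)

for name in areas:
    print(f"{name:15s} observed {observed[name]:4d}  expected {expected[name]:7.1f}")
print(f"chi-square = {chi2:.2f}, df = {df}, p = {p_value:.3g}")
```

With real counts, a small p-value would indicate that the sites are not distributed across the five types in proportion to area. If SciPy is available, scipy.stats.chisquare(f_obs, f_exp) should give the same statistic and p-value.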

With kind regards

Karabiner
 

gianmarco

TS Contributor
#7
I would agree with that, provided that the areas completely cover the study area, with no gap in-between.
 

gianmarco

TS Contributor
#8
Sorry for not getting back to you sooner. I have started teaching and my spare time is now very limited.
Also, I am scratching my head about how to approach the issue, since your areas are discrete and (to the best of my understanding) there are in-between zones. I was wondering, are there any sites in those in-between zones?