Spatial Analysis III
Exploratory Spatial Data Analysis
The first law of geography states “where distance itself hinders interaction between places. The farther two places are apart, the greater the hindrance” (Tobler, 1970).
On exploring spatial data we want to identify and visualize spatial patterns of certain events.
In the image above, there is a comparative map of the gun violence rate in Brazil in 2009 and 2019. At first sight, you can notice the colours from blue (low-low) to red (high-high) violence rate in which the 25 states are clustered in specific ranges.
It is possible to see that in the central and southern areas of the map, there was a decrease in the violence rate, contrasting with the increase in the north and northeast areas. To systematically do such analysis, you can follow these steps:
- Choose a criterion of a neighbourhood;
- Build a matrix of spatial lag (𝐖);
- Calculate global and local autocorrelation;
- Estimate models.
1. Neighbourhood Matrices
The first step to perform an Exploratory Spatial Data Analysis (ESDA) is to establish neighbourhoods between the studied localities so we can verify their autocorrelations, verify some heterogeneities, and even detect outliers.
According to Anselin and Rey (2014), the establishment of neighbourhoods is performed by a spatial weighting matrix 𝐖 that can assume several types, being the most common:
- the Adjacency matrix;
- the geographical proximity;
- and the socio-economic proximity.
2. Matrix 𝐖
2.1. Contiguity Matrix
The idea of Contiguity is from the assumption of the existence of a common physical border between spatial units. In this line of reasoning, it can be stated that Brazil is contiguous to Argentina, but it’s not to China, for example.
Based on the previous assumption, a Contiguity Matrix is a matrix 𝐖with binary values in which the value 1 is determined in the presence of a common physical border, and 0 for absence.
We can describe your terms mathematically wᵢⱼ as:
- wᵢⱼ = 1, 𝑖𝑓 𝑡ℎ𝑒𝑟𝑒 𝑖𝑠 𝑐𝑜𝑛𝑡𝑖𝑔𝑢𝑖𝑡𝑦 𝑏𝑒𝑡𝑤𝑒𝑒𝑛 𝑖 𝑒 𝑗;
- wᵢⱼ = 0, 𝑖𝑓 𝑡ℎ𝑒𝑟𝑒 𝑖𝑠 𝑛𝑜𝑡 𝑐𝑜𝑛𝑡𝑖𝑔𝑢𝑖𝑡𝑦 𝑏𝑒𝑡𝑤𝑒𝑒𝑛 𝑖 𝑒 𝑗.
- wᵢᵢ = 0, by convention.
Furthermore, you can choose the criteria of continuity like in a Chess game: Queen Criteria if there are neighbours at any direction; Rook Criteria if there are neighbours in the vertical or horizontal direction; or Bishop Criteria if there are neighbours in the diagonal direction.
Also, you can decide how many levels of contiguity will be considered, i.e. the neighbour of your neighbour is a level 2 contiguity.
Hands-on
In R, you can establish the neighbourhood criteria by using the poly2nb function (polygons to neighbours). In this example, we are connecting the cities of São Paulo Estate (in the shapefile shp_sp) that are connected by the Queen Criteria.
When the summary function is applied to the neighbour_queen object you can see in detail the following info, including the linked number distribution (that can be read like, there are 143 regions with 5 links):
To create the W matrix, you can use the nb2mat function:
To trace a higher level of contiguity, the nblag function (neighbours lag) can be used:
The result is plotted below, where you can see more connections being formed between cities.
2.2. Geographical Proximity
These criteria establish a distance dᵢ(k) to determine if the regions i and j are neighbours. This criterion is used when spatial proximity could indicate greater interaction.
- wᵢⱼ = 1, 𝑖𝑓 dᵢⱼ ≤ dᵢ(k);
- wᵢⱼ = 0,𝑖𝑓 dᵢⱼ > dᵢ(k).
- wᵢᵢ = 0, by convention.
Hands-On
This time, we will use the proximity criteria to define Bahia State cities that are neighbours within a distance of 90km.
From the plot, we can see that the cities located in the north and west of the state have fewer neighbours due to their areas being bigger and thus, their centroids being further from each other.
2.3. K Nearest neighbours
This criterion establishes a number of k neighbours to each region. Its main advantage is to not create isolated regions, which can happen in Contiguity Criteria if there are islands on the map, and also in the geographic Distance if the region is located too far from others.
- wᵢᵢ = 0, by convention.
Hands-On
Now we will use the k-nearest criteria to define Santa Catarina State cities 3 nearest neighbours.
2.4. Social, Economic Distance
We can also establish a W matrix using non-geographic criteria to calculate the “distance” between regions. For example, you can approximate regions due to their HDI, crime rate, health-related statistics, etc.
Hands-On
The following algorithm has a very similar approach to the geographical distance method. We are using the shapefile from São Paulo State and join some economic data, including the PIB per capita to each city. We are setting the distance to 0.01 pattern deviation to link the neighbours and to create the W Matrix.
Note: the social_distance function can be found in my repository.
3. W Matrix Standardization
So far, we have been getting the matrix 𝐖 in its binary form. However, this way we assume all neighbours have the same weight/impact on each region. In most cases, it is suggested to adopt some process of standardization, which the most common are 𝐖 row-standardization, double standardization and variance stabilizing.
3.1. Row-Standardization
In this method, we take the individual weight and divide it by the total weight of its relative row.
Wᵢⱼ = wᵢⱼ / ∑ⱼ wᵢⱼ
In which, the sum of the total weight in each row has to be 1, while the sum of all elements of the matrix will be equal to the number of observations n:
S₀ = ∑ᵢ ∑ⱼ Wᵢⱼ = n
In case there are q observations without neighbours, then S₀ = n — q.
To use this standardization, simply change the parameter style in the nb2mat function to “W”, instead of “B” for Binary.
3.2. Double-Standardization
In this method, we want to create weight the matrix by both rows and cols. The result is a stochastic matrix, where the S₀ = 1.
Wᵢⱼ = wᵢⱼ / ∑ᵢ ∑ⱼ wᵢⱼ
To use this standardization, simply change the parameter style in the nb2mat function to “U”.
3.3. Standardization by Stabilizing the Variance
This method was proposed by Tiefelsdorf, Griffith and Boots (1999) and is useful when predicting models and the data need to present homoscedasticity, which means the homogeneity of variances.
The standardized value of spatial weight is obtained in two steps:
- First, each original weight of the row 𝑖must be divided by the square root of the sum of the weights squared of their respective 𝑖 row, originating a new weight called
w*ᵢⱼ = wᵢⱼ / sqrt ( ∑ⱼ w²ᵢⱼ )
- Hereupon, each weight w*ᵢⱼ must be multiplied by the factor presents in:
𝑛−𝑞 / 𝑄
in which 𝑛 it is the total of observations; 𝑞 is the total of observations without neighbours; 𝑄 is given by:
𝑄= ∑ᵢ ∑ⱼ w*ᵢⱼ
To use this standardization in R, simply change the parameter style in the nb2mat function to “S”.
4. Spatial Autocorrelation
After establishing the neighbourhoods, and their respective W matrix of spatial lag, we can verify if the observed data are distributed in a random way or it is a spatial pattern, that is, if there is the spatial autocorrelation involving the studied phenomenon.
- The global autocorrelation metrics are directed to measure the degree of the spatial relation of a phenomenon regarding all values observed in the database.
- In opposition, the local autocorrelation metrics measure the autocorrelations of observations, one by one, regarding their neighbourhood established.
4.1. Global Autocorrelation
The statistics 𝐼 was first proposed by Moran (1948) and, years later, Cliff and Ord (1973, 1981) presented a more developed work on the original ideas of Moran, determining the following formula:
In which 𝑛 represents the number of observations; 𝑧 points out the standardized values of the dependent variable 𝑌 by the procedure 𝑧𝑠𝑐𝑜𝑟𝑒𝑠.
To validate this statistic, we test the null hypothesis which indicates that the values 𝑌ᵢ are independent of the values of neighbouring observations, that is, that there is no spatial autocorrelation for a given significance level (random spatial data).
𝐻₀: I = − (1/ 𝑛−1)
Pearson Correlation Parallel
How does the I statistics from Moran Global autocorrelation compares to the Pearson Correlation p statistic?
The Pearson correlation ρ is calculated through the covariance of X and Y variables. The range goes from -1 to 1, the extremes would represent a perfect correlation, while zero is a non-correlation between variables.
In the Moran autocorrelation, we are analysing how the variable behaves in itself by a given W neighbourhood matrix. The value can also vary from -1 to 1 (in a few cases, it can break that rule), but the null value will be the -1/(n-1) in which there is no autocorrelation.
A positive I value means similarity in behaviour between regions. That means, when there is a high Y value in a region, the neighbouring regions will also be high (high-high) and when there is a low value, the surroundings will be low as well (low-low). Also, in a positive I value, the spillover effect can occur, when one region can affect another.
A negative I value means non-similarity between regions. It means that high Y-value regions will probably have low value (high-low) and the opposite also occurs (low-high).
Hands-On
Let's check the spatial auto-correlation between cities in São Paulo by their Human Development Index (HDI). First, let’s plot the data:
Then, to calculate the autocorrelation, we need to get the W matrix containing the neighbourhood set, in this case, it is the queen contiguity order 1. To use the matrix object, we need to convert it to a list and then use the function Moran test.
From the output, we notice the I = 0.2328, which is a positive autocorrelation. The null hypothesis is rejected by the low p-value.
H₀: I = - 1 / (645 -2) = -0.00155521
To plot the Moran Autocorrelation Diagram, we use:
Some information, from the diagram plot: the x-axis is the value from the Z-scores for the interested Y variable (HDI, in this case), the y-axis is the Z-score value by the W matrix, the vertical and horizontal lines split the plot by the mean values in each direction.
Extra: How to manually create and calculate the Weighted Values and the same plot:
4.2. Local Autocorrelation
Anselin (1995) proposed a way to measure the local autocorrelations called Local Indicators of Spatial Association (LISA). The LISA technique aims to identify local patterns of spatial association (Anselin, 1995, p. 93).
According to Lansley and Cheshire (2016), the LISA technique investigates the spatial relations between the data, considering the established neighbourhoods.
Among the types of LISA proposed by Anselin (1995) — e.g. Gamma Local, Geary Local, etc. — the most commonly referenced type will be presented: the Local Moran.
4.2.1. Local Moran Auto-correlation
Iᵢ = 𝑧ᵢ ∑ ⱼ 𝑤ⱼ zⱼ
in which, in a similar way to Moran’𝐼s Statistics, 𝑧𝑖and zⱼ represent the standardized values of the dependent variable; the considered sum includes only each neighbour that 𝑗belongs to the neighbourhood 𝐽𝑖established by the spatial weighing 𝑊matrix. and the spatial weight 𝑤ᵢⱼ are, preferentially, row standardized to facilitate the interpretation, without omitting that, by convention, 𝑤𝑖𝑖 = 0.
Hands-On
In this example, we are going to continue examining the São Paulo State HDI by city, but now zooming into each local autocorrelation.
To calculate in R, the Moran local autocorrelation:
The result is a table with each city Ii statistic. You can evaluate the significance level by looking at the Pr(z).
Extra: To manually calculate the Moran Local Autocorrelation Iᵢ:
Then, let’s give a first look at the calculated statistic by plotting them back into the map:
At last, we will categorize the data into the HH, LL, LH and HL quadrants and exclude the non-significant observations.
Result of plotting the Moran Autocorrelation Significant clusters:
4.2.1. Getis-Ord General G Local Autocorrelation
Getis and Ord (1992) proposed another way to study the spatial association of observations of a given database, based on the spatial concentration.
The biggest difference between Moran and Getis-Ord statistics is that while the first classify clusters, the other will set zones with high or low values. Also, to use the least, the neighbourhood criteria have to be based on geographic distances.
Gᵢ (d) = ∑ⁿj-₁ wᵢⱼ (d) Yⱼ / ∑ⁿj-₁ Yⱼ
In which 𝑤ᵢⱼ represents the binary weight of a spatial weighting matrix by distances and, by convention, 𝑤ᵢᵢ= 0; the numerator represents the sum of all neighbouring values 𝑌ⱼ within the neighbourhood established by the distance 𝑑 of 𝑖, without 𝑌ᵢ; and the denominator represents the sum of all neighbouring values 𝑌ⱼ without 𝑌ⱼ.
Hands-On
In this example, we will also analyse the HDI index local autocorrelation, but in Bahia State and using the Getis-ord g statistic:
The final result, displaying 8 quantiles and showing where the hot and cold spots on the map:
Credits:
This content is based on my class notes from USP ESALQ MBA Data Science & Analytics class.
You can check my repository here: https://github.com/marisakamoto/DataScienceMBA/tree/main/03_SpatialAnalysis