Clustering and Regionalization#

The world’s hardest questions are complex and multi-faceted. Effective methods to learn from data recognize this. Many questions and challenges are inherently multidimensional; they are affected, shaped, and defined by many different components all acting simultaneously. In statistical terms, these processes are called multivariate processes, as opposed to univariate processes, where only a single variable acts at once. Clustering is a fundamental method of geographical analysis that draws insights from large, complex multivariate processes. It works by finding similarities among the many dimensions in a multivariate process, condensing them down into a simpler representation. Thus, through clustering, a complex and difficult-to-understand process is recast into a simpler one that even non-technical audiences can use.

Introduction#

Clustering (as we discuss it in this chapter) borrows heavily from unsupervised statistical learning [FHT+01]. Often, clustering involves sorting observations into groups without any prior idea about what the groups are (or, in machine learning jargon, without any labels, hence the unsupervised name). These groups are delineated so that members of a group should be more similar to one another than they are to members of a different group. Each group is referred to as a cluster, while the process of assigning objects to groups is known as clustering. If done well, these clusters can be characterized by their profile, a simple summary of what members of a group are like in terms of the original multivariate phenomenon.

Since a good cluster is more similar internally than it is to any other cluster, these cluster-level profiles provide a convenient shorthand to describe the original complex multivariate phenomenon we are interested in. Observations in one group may have consistently high scores on some traits but low scores on others. The analyst only needs to look at the profile of a cluster in order to get a good sense of what all the observations in that cluster are like, instead of having to consider all of the complexities of the original multivariate process at once. Throughout data science, and particularly in geographic data science, clustering is widely used to provide insights on the (geographic) structure of complex multivariate (spatial) data.

In the context of explicitly spatial questions, a related concept, the region, is also instrumental. A region is similar to a cluster, in the sense that all members of a region have been grouped together, and the region should provide a shorthand for the original data within the region. For a region to be analytically useful, its members also should display stronger similarity to each other than they do to the members of other regions. However, regions are more complex than clusters because they combine this similarity in profile with additional information about the location of their members: they should also describe a clear geographic area. In short, regions are like clusters (in that they have a consistent profile) whose members are also geographically coherent.

The process of creating regions is called regionalization [DRSurinach07]. A regionalization is a special kind of clustering where the objective is to group observations which are similar in their statistical attributes, but also in their spatial location. In this sense, regionalization embeds the same logic as standard clustering techniques, but it also applies a series of geographical constraints. Often, these constraints relate to connectivity: two candidates can only be grouped together in the same region if there exists a path from one member to another member that never leaves the region. These paths often model the spatial relationships in the data, such as contiguity or proximity. However, connectivity does not always need to hold for all regions, and in certain contexts it makes sense to relax connectivity or to impose different types of geographic constraints.

In this chapter we consider clustering techniques and regionalization methods. In the process, we will explore the socioeconomic characteristics of neighborhoods in San Diego. We will extract common patterns from the cloud of multi-dimensional data that the Census Bureau produces about small areas through the American Community Survey. We begin with an exploration of the multivariate nature of our dataset by suggesting some ways to examine the statistical and spatial distribution before carrying out any clustering. Focusing on the individual variables, as well as their pairwise associations, can help guide the subsequent application of clusterings or regionalizations. We then consider geodemographic approaches to clustering: the application of multivariate clustering to spatially referenced demographic data. Two popular clustering algorithms are employed: k-means and Ward’s hierarchical method. As we will see, mapping the spatial distribution of the resulting clusters reveals interesting insights on the socioeconomic structure of the San Diego metropolitan area. We also see that in many cases, clusters are spatially fragmented. That is, a cluster may actually consist of different areas that are not spatially connected. Indeed, some clusters will have their members strewn all over the map. This will illustrate why connectivity might be important when building insight about spatial data, since these clusters will not at all provide intelligible regions. With this insight in mind, we will move on to regionalization, exploring different approaches that incorporate geographical constraints into the exploration of the social structure of San Diego. Applying a regionalization approach is not always required, but it can provide additional insights into the spatial structure of the multivariate statistical relationships that traditional clustering is unable to articulate.

from esda.moran import Moran
from libpysal.weights import Queen, KNN
import seaborn
import pandas
import geopandas
import numpy
import matplotlib.pyplot as plt

Data#

We return to the San Diego tracts dataset we have used earlier in the book. In this case, we will not only rely on its polygon geometries, but also on its attribute information. The data comes from the American Community Survey (ACS) from 2017. Let us begin by reading in the data.

# Read file
db = geopandas.read_file("../data/sandiego/sandiego_tracts.gpkg")

To make things easier later on, let us collect the variables we will use to characterize census tracts. These variables capture different aspects of the socioeconomic reality of each area and, taken together, provide a comprehensive characterization of San Diego as a whole. We thus create a list with the names of the columns we will use later on:

cluster_variables = [
    "median_house_value",  # Median house value
    "pct_white",  # % tract population that is white
    "pct_rented",  # % households that are rented
    "pct_hh_female",  # % female-led households
    "pct_bachelor",  # % tract population with a Bachelors degree
    "median_no_rooms",  # Median n. of rooms in the tract's households
    "income_gini",  # Gini index measuring tract wealth inequality
    "median_age",  # Median age of tract population
    "tt_work",  # Travel time to work
]

Let’s start building up our understanding of this dataset through both visual and statistical summaries. The first stop is considering the spatial distribution of each variable alone. This will help us draw a picture of the multi-faceted view of the tracts we want to capture with our clustering. Let’s use (quantile) choropleth maps for each attribute and compare them side-by-side (Fig. 1):

f, axs = plt.subplots(nrows=3, ncols=3, figsize=(12, 12))
# Make the axes accessible with single indexing
axs = axs.flatten()
# Start a loop over all the variables of interest
for i, col in enumerate(cluster_variables):
    # Select the axis where the map will go
    ax = axs[i]
    # Plot the map
    db.plot(
        column=col,
        ax=ax,
        scheme="Quantiles",
        linewidth=0,
        cmap="RdPu",
    )
    # Remove axis clutter
    ax.set_axis_off()
    # Set the axis title to the name of variable being plotted
    ax.set_title(col)
# Display the figure
plt.show()

[Figure 1: Quantile choropleth maps of each clustering variable across San Diego census tracts]

Several visual patterns jump out from the maps, revealing both commonalities as well as differences across the spatial distributions of the individual variables. Several variables tend to increase in value from the east to the west (pct_rented, median_house_value, median_no_rooms, and tt_work), while others have a spatial trend in the opposite direction (pct_white, pct_hh_female, pct_bachelor, median_age). This will help show the strengths of clustering; when variables have different spatial distributions, each variable contributes distinct information to the profiles of each cluster. However, if all variables display very similar spatial patterns, the amount of useful information across the maps is actually smaller than it appears, so cluster profiles may be much less useful as well. It is also important to consider whether the variables display any spatial autocorrelation, as this will affect the spatial structure of the resulting clusters.

Recall from Chapter 6 that Moran’s I is a commonly used measure for global spatial autocorrelation. We can use it to formalize some of the intuitions built from the maps. Recall from earlier in the book that we will need to represent the spatial configuration of the data points through a spatial weights matrix. We will start with queen contiguity:

w = Queen.from_dataframe(db)

Now let’s calculate Moran’s I for the variables being used. This will measure the extent to which each variable contains spatial structure:

# Set seed for reproducibility
numpy.random.seed(123456)
# Calculate Moran's I for each variable
mi_results = [
    Moran(db[variable], w) for variable in cluster_variables
]
# Structure results as a list of tuples
mi_results = [
    (variable, res.I, res.p_sim)
    for variable, res in zip(cluster_variables, mi_results)
]
# Display on table
table = pandas.DataFrame(
    mi_results, columns=["Variable", "Moran's I", "P-value"]
).set_index("Variable")
table
Moran's I P-value
Variable
median_house_value 0.646618 0.001
pct_white 0.602079 0.001
pct_rented 0.451372 0.001
pct_hh_female 0.282239 0.001
pct_bachelor 0.433082 0.001
median_no_rooms 0.538996 0.001
income_gini 0.295064 0.001
median_age 0.381440 0.001
tt_work 0.102748 0.001

Each of the variables displays significant positive spatial autocorrelation, suggesting clear spatial structure in the socioeconomic geography of San Diego. This means it is likely the clusters we find will have a non-random spatial distribution.

Spatial autocorrelation only describes relationships between observations for a single attribute at a time. So, the fact that all of the clustering variables are positively autocorrelated does not say much about how attributes co-vary over space. To explore cross-attribute relationships, we need to consider the spatial correlation between variables. We will take our first dip in this direction by exploring the bivariate correlation in the maps of the covariates themselves. This would mean comparing each pair of choropleths to look for associations and differences. Given there are nine attributes, there are 36 pairs of maps that must be compared.

This would be too many maps to process visually. Instead, we focus directly on the bivariate relationships between each pair of attributes, devoid for now of geography, and use a scatterplot matrix (Fig. 2).

_ = seaborn.pairplot(
    db[cluster_variables], kind="reg", diag_kind="kde"
)

[Figure 2: Scatterplot matrix of the nine clustering variables]

Two different types of plots are contained in the scatterplot matrix. On the diagonal are the density functions for the nine attributes. These allow for an inspection of the univariate distribution of the values for each attribute. Examining these, we see that our selection of variables includes some that are negatively skewed (pct_white and pct_hh_female) as well as positively skewed (median_house_value, pct_bachelor, and tt_work).
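
We can back up this visual impression with a quick numeric check (an illustrative aside using pandas’ sample skewness, where negative values indicate left skew and positive values right skew):

# Sample skewness of each clustering variable
db[cluster_variables].skew().round(2)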

The second type of visualization lies in the off-diagonal cells of the matrix; these are bivariate scatterplots. Each cell shows the association between one pair of variables. Several of these cells indicate positive linear associations (median_age vs. median_house_value, median_house_value vs. median_no_rooms), while other cells display negative correlation (median_house_value vs. pct_rented, median_no_rooms vs. pct_rented, and median_age vs. pct_rented). The one variable that tends to have consistently weak associations with the other variables is tt_work, and in part this appears to reflect its rather concentrated distribution, as seen in its density plot in the lower-right corner of the diagonal.

Indeed, this kind of concentration in values is something you need to be very aware of in clustering contexts. Distances between data points are of paramount importance in clustering applications. In fact, (dis)similarity between observations is calculated as the statistical distance between them. Because distances are sensitive to the units of measurement, cluster solutions can change when you re-scale your data.

For example, say we locate an observation based on only two variables: house price and Gini coefficient. In this case:

db[["income_gini", "median_house_value"]].head()
income_gini median_house_value
0 0.5355 732900.000000
1 0.4265 473800.000000
2 0.4985 930600.000000
3 0.4003 478500.000000
4 0.3196 515570.896382

The distance between observations in terms of these variates can be computed easily using scikit-learn:

from sklearn import metrics
metrics.pairwise_distances(
    db[["income_gini", "median_house_value"]].head()
).round(4)
array([[     0.    , 259100.    , 197700.    , 254400.    , 217329.1036],
       [259100.    ,      0.    , 456800.    ,   4700.    ,  41770.8964],
       [197700.    , 456800.    ,      0.    , 452100.    , 415029.1036],
       [254400.    ,   4700.    , 452100.    ,      0.    ,  37070.8964],
       [217329.1036,  41770.8964, 415029.1036,  37070.8964,      0.    ]])

In this case, we know that the housing values are in the hundreds of thousands, but the Gini coefficient (which we discussed in the previous chapter) is constrained to fall between zero and one. So, for example, the distance between the first two observations is nearly totally driven by the difference in median house value (which is 259100 dollars) and ignores the difference in the Gini coefficient (which is about .11). Indeed, a change of a single dollar in median house value will correspond to the maximum possible difference in Gini coefficients. So, a clustering algorithm that uses this distance to determine classifications will pay a lot of attention to median house value, but very little to the Gini coefficient!

Therefore, as a rule, we standardize our data when clustering. There are many different methods of standardization offered in the sklearn.preprocessing module, and these map onto the main methods common in applied work. We review a small subset of them here. The scale() method subtracts the mean and divides by the standard deviation:

\[ z = \frac{x_i - \bar{x}}{\sigma_x}\]

This “normalizes” the variate, ensuring the rescaled variable has a mean of zero and a variance of one. However, the variable can still be quite skewed, bimodal, etc. and insofar as the mean and variance may be affected by outliers in a given variate, the scaling can be too dramatic. One alternative intended to handle outliers better is robust_scale(), which uses the median and the inter-quartile range in the same fashion:

\[ z = \frac{x_i - \tilde{x}}{\lceil x \rceil_{75} - \lceil x \rceil_{25}}\]

where \(\lceil x \rceil_p\) represents the value of the \(p\)th percentile of \(x\). Alternatively, sometimes it is useful to ensure that the maximum of a variate is \(1\) and the minimum is zero. In this instance, the minmax_scale() is appropriate:

\[ z = \frac{x_i - \min(x)}{\max(x) - \min(x)} \]

In most clustering problems, the robust_scale() or scale() methods are useful. Further, transformations of the variate (such as log-transforming or Box-Cox transforms) can be used to non-linearly rescale the variates, but these generally should be done before the above kinds of scaling. Here, we will analyze robust-scaled variables. To detach the scaling from the analysis, we will perform the former now, creating a scaled view of our data which we can use later for clustering. For this, we import the scaling method:
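
from sklearn.preprocessing import robust_scale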

And create the db_scaled object which contains only the variables we are interested in, scaled:

db_scaled = robust_scale(db[cluster_variables])
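
As a quick check (an illustrative aside, not part of the chapter’s workflow), we can recompute the pairwise distances from above on robust-scaled versions of the same two variables; with the units removed, both the Gini coefficient and house values now contribute to the distances:

# Pairwise distances between the first five observations, this time on
# robust-scaled values of the two variables used earlier
metrics.pairwise_distances(
    robust_scale(db[["income_gini", "median_house_value"]])[:5, :]
).round(4)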

In conclusion, exploring the univariate and bivariate relationships is a good first step into building a fully multivariate understanding of a dataset. To take it to the next level, we would want to know to what extent these pair-wise relationships hold across different attributes, and whether there are patterns in the “location” of observations within the scatterplots. For example, do nearby dots in each scatterplot of the matrix represent the same observations? These types of questions are exactly what clustering helps us explore.

Geodemographic clusters in San Diego census tracts#

Geodemographic analysis is a form of multivariate clustering where the observations represent geographical areas [WB18]. The output of these clusterings is nearly always mapped. Altogether, these methods use multivariate clustering algorithms to construct a known number of clusters (\(k\)), where the number of clusters is typically much smaller than the number of observations to be clustered. Each cluster is given a unique label, and these labels are mapped. Using the clusters’ profile and label, the map of labels can be interpreted to get a sense of the spatial distribution of socio-demographic traits. The power of (geodemographic) clustering comes from taking statistical variation across several dimensions and compressing it into a single categorical one that we can visualize through a map. To demonstrate the variety of approaches in clustering, we will show two distinct but very popular clustering algorithms: k-means and Ward’s hierarchical method.

K-means#

K-means is probably the most widely used approach to cluster a dataset. The algorithm groups observations into a pre-specified number of clusters so that each observation is closer to the mean of its own cluster than it is to the mean of any other cluster. The k-means problem is solved by iterating between an assignment step and an update step. First, all observations are randomly assigned one of the \(k\) labels. Next, the multivariate mean over all covariates is calculated for each of the clusters. Then, each observation is reassigned to the cluster with the closest mean. If the observation is already assigned to the cluster whose mean it is closest to, the observation remains in that cluster. This assignment-update process continues until no further reassignments are necessary.
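
To make the assignment and update steps concrete, here is a minimal, purely illustrative sketch in numpy (the function name kmeans_sketch is ours, it assumes no cluster empties out during the iterations, and it is not how scikit-learn, which we use below, actually implements k-means):

def kmeans_sketch(X, k=5, n_iter=100, seed=1234):
    rng = numpy.random.default_rng(seed)
    # Randomly assign each observation one of the k labels
    labels = rng.integers(0, k, size=X.shape[0])
    for _ in range(n_iter):
        # Update step: multivariate mean of each cluster
        means = numpy.array(
            [X[labels == j].mean(axis=0) for j in range(k)]
        )
        # Assignment step: squared distance to every cluster mean
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if (new_labels == labels).all():
            break  # no further reassignments are necessary
        labels = new_labels
    return labels

For example, kmeans_sketch(db_scaled) would return one label per tract.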

The nature of this algorithm requires us to select the number of clusters we want to create. The right number of clusters is unknown in practice. For illustration, we will use \(k=5\) in the KMeans implementation from scikit-learn.

# Import the KMeans implementation from scikit-learn
from sklearn.cluster import KMeans

This illustration will also be useful as virtually every algorithm in scikit-learn, the (Python) standard library for machine learning, can be run in a similar fashion. To proceed, we first create a KMeans clusterer object that contains the description of all the parameters the algorithm needs (in this case, only the number of clusters):

# Initialize KMeans instance
kmeans = KMeans(n_clusters=5)

Next, we set the seed for reproducibility and call the fit method to apply the algorithm specified in kmeans to our scaled data:

# Set the seed for reproducibility
numpy.random.seed(1234)
# Run K-Means algorithm
k5cls = kmeans.fit(db_scaled)

Now that the clusters have been assigned, we can examine the label vector, which records the cluster to which each observation is assigned:

# Print first five labels
k5cls.labels_[:5]
array([2, 1, 3, 1, 4], dtype=int32)

In this case, the first observation is assigned to cluster 2, the second and fourth ones are assigned to cluster 1, the third to number 3, and the fifth receives the label 4. It is important to note that the integer labels should be viewed as denoting membership only; the numerical differences between the values for the labels are meaningless. The profiles of the various clusters must be further explored by looking at the values of each dimension. But, before we do that, let’s make a map.

Spatial distribution of clusters#

Having obtained the cluster labels, Figure 3 displays the spatial distribution of the clusters by using the labels as the categories in a choropleth map. This allows us to quickly grasp any sort of spatial pattern the clusters might have. Since clusters represent areas with similar characteristics, mapping their labels allows us to see to what extent similar areas tend to have similar locations. Thus, this gives us one map that incorporates the information from all nine covariates.

# Assign labels into a column
db["k5cls"] = k5cls.labels_
# Set up figure and ax
f, ax = plt.subplots(1, figsize=(9, 9))
# Plot unique values choropleth including
# a legend and with no boundary lines
db.plot(
    column="k5cls",
    categorical=True,
    legend=True,
    linewidth=0,
    ax=ax,
)
# Remove axis
ax.set_axis_off()
# Display the map
plt.show()

[Figure 3: Spatial distribution of the five k-means clusters]

The map provides a useful view of the clustering results; it allows for a visual inspection of the extent to which Tobler’s first law of geography is reflected in the multivariate clusters. Recall that the law implies that nearby tracts should be more similar to one another than tracts that are geographically more distant from each other. We can see evidence of this in our cluster map, since clumps of tracts with the same color emerge. However, this visual inspection is obscured by the complexity of the underlying spatial units. Our eyes are drawn to the larger polygons in the eastern part of the county, giving the impression that more observations fall into that cluster. While this seems to be true in terms of land area (and we will verify this below), there is more to the cluster pattern than this. Because the tract polygons are all different sizes and shapes, we cannot solely rely on our eyes to interpret the spatial distribution of clusters.

Statistical analysis of the cluster map#

To complement the geovisualization of the clusters, we can explore the statistical properties of the cluster map. This process allows us to delve into what observations are part of each cluster and what their characteristics are. This gives us the profile of each cluster so we can interpret the meaning of the labels we’ve obtained. We can start, for example, by considering cardinality, or the count of observations in each cluster:

# Group data table by cluster label and count observations
k5sizes = db.groupby("k5cls").size()
k5sizes
k5cls
0    248
1    244
2     88
3     39
4      9
dtype: int64

There are substantial differences in the sizes of the five clusters, with two very large clusters (0, 1), one medium-sized cluster (2), and two small clusters (3, 4). Cluster 0 is the largest when measured by the number of assigned tracts, but cluster 1 is not far behind. This confirms our discussion from the map above, where we got the visual impression that tracts in cluster 1 seemed to have the largest area by far, but we missed exactly how large cluster 0 would be.

Let’s see if this is the case. One way to do so involves using the dissolve operation in geopandas, which combines all tracts belonging to each cluster into a single polygon object. After we have dissolved all the members of the clusters, we report the total land area of the cluster:

# Dissolve areas by cluster, aggregate by summing,
# and keep column for area
areas = db.dissolve(by="k5cls", aggfunc="sum")["area_sqm"]
areas
k5cls
0     739.184478
1    8622.481814
2    1335.721492
3     315.428301
4     708.956558
Name: area_sqm, dtype: float64

We can then use cluster shares to show visually, in Figure 4, a comparison of the two membership representations (based on land and tracts):

# Bind cluster figures in a single table
area_tracts = pandas.DataFrame({"No. Tracts": k5sizes, "Area": areas})
# Convert raw values into percentages
area_tracts = area_tracts * 100 / area_tracts.sum()
# Bar plot
ax = area_tracts.plot.bar()
# Rename axes
ax.set_xlabel("Cluster labels")
ax.set_ylabel("Percentage by cluster");

[Figure 4: Percentage of tracts vs. percentage of land area in each k-means cluster]

Our visual impression from the map is confirmed: cluster 1 contains tracts that together comprise 8,622 square miles (about 22,330 square kilometers), which accounts for well over half of the total land area in the county:

areas[1] / areas.sum()
0.7355953810798617

Let’s move on to build the profiles for each cluster. Again, the profiles are what provide the conceptual shorthand, moving from the arbitrary label to a meaningful collection of observations with similar attributes. To build a basic profile, we can compute the (unscaled) means of each of the attributes in every cluster:

# Group table by cluster label, keep the variables used
# for clustering, and obtain their mean
k5means = db.groupby("k5cls")[cluster_variables].mean()
# Transpose the table and print it rounding each value
# to three decimals
k5means.T.round(3)
k5cls 0 1 2 3 4
median_house_value 356997.331 538463.934 544888.738 1292905.256 609385.655
pct_white 0.620 0.787 0.741 0.874 0.583
pct_rented 0.551 0.270 0.596 0.275 0.377
pct_hh_female 0.108 0.114 0.065 0.109 0.095
pct_bachelor 0.023 0.007 0.005 0.002 0.007
median_no_rooms 4.623 5.850 4.153 6.100 5.800
income_gini 0.400 0.397 0.449 0.488 0.391
median_age 32.783 42.057 32.590 46.356 33.500
tt_work 2238.883 2244.320 2349.511 1746.410 9671.556

Note in this case we do not use scaled measures. This is to create profiles that are easier to interpret and relate to. We see that cluster 3, for example, is composed of tracts that have the highest average median_house_value, and also the highest level of inequality (income_gini); and cluster 0 contains a younger population (median_age) who tend to live in housing units with fewer rooms (median_no_rooms). For interpretability, it is useful to consider the raw features, rather than scaled versions that the clusterer sees. However, you can also give profiles in terms of rescaled features.
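
For example (an illustrative aside; the name k5means_scaled is ours), the same profile table can be built on the scaled features the clusterer actually used by wrapping db_scaled back into a data frame:

# Cluster means on the robust-scaled features (what the clusterer "sees")
k5means_scaled = (
    pandas.DataFrame(db_scaled, columns=cluster_variables)
    .assign(k5cls=k5cls.labels_)
    .groupby("k5cls")
    .mean()
)
k5means_scaled.T.round(3)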

Average values, however, can hide a great deal of detail and, in some cases, give wrong impressions about the type of data distribution they represent. To obtain more detailed profiles, we could use the describe command in pandas, after grouping our observations by their clusters:

#-----------------------------------------------------------#
# Illustrative code only, not executed
#-----------------------------------------------------------#
# Group table by cluster label, keep the variables used
# for clustering, and obtain their descriptive summary
k5desc = db.groupby('k5cls')[cluster_variables].describe()
# Loop over each cluster and print a table with descriptives
for cluster in k5desc.T:
    print('\n\t---------\n\tCluster %i' % cluster)
    print(k5desc.T[cluster].unstack())
#-----------------------------------------------------------#

However, this approach quickly gets out of hand: more detailed profiles can simply return to an unwieldy mess of numbers. A better way of constructing cluster profiles is to draw the distributions of cluster members’ data. To do this, we need to “tidy up” the dataset. A tidy dataset [W+14] is one where every row is an observation, and every column is a variable. This is akin to the long format referred to in Chapter 9, and contrasts with the wide format we used when looking at inequality over time. A few steps are required to tidy up our labeled data:

# Index db on cluster ID
tidy_db = db.set_index("k5cls")
# Keep only variables used for clustering
tidy_db = tidy_db[cluster_variables]
# Stack column names into a column, obtaining
# a "long" version of the dataset
tidy_db = tidy_db.stack()
# Take indices into proper columns
tidy_db = tidy_db.reset_index()
# Rename column names
tidy_db = tidy_db.rename(
    columns={"level_1": "Attribute", 0: "Values"}
)
# Check out result
tidy_db.head()
k5cls Attribute Values
0 2 median_house_value 732900.000000
1 2 pct_white 0.916988
2 2 pct_rented 0.373913
3 2 pct_hh_female 0.052896
4 2 pct_bachelor 0.000000

Now we are ready to plot. Figure 5, generated with the code below, shows the distribution of each cluster’s values for each variable. This gives us the full distributional profile of each cluster:

# Scale fonts to make them more readable
seaborn.set(font_scale=1.5)
# Set up the facets
facets = seaborn.FacetGrid(
    data=tidy_db,
    col="Attribute",
    hue="k5cls",
    sharey=False,
    sharex=False,
    aspect=2,
    col_wrap=3,
)
# Build the plot from `sns.kdeplot`
_ = facets.map(seaborn.kdeplot, "Values", shade=True).add_legend()

[Figure 5: Distributions of each variable by k-means cluster]

Note that we create the figure using the facetting functionality in seaborn, which notably streamlines the process of creating multi-plot figures whose dimensions and content are data-driven. This happens in two steps: first, we set up the frame (facets), and then we “map” a function (seaborn.kdeplot) to the data within that frame.

The figure allows us to see that, while some attributes such as the percentage of female-led households (pct_hh_female) display largely the same distribution for each cluster, others paint a much more divided picture (e.g., median_house_value). Taken altogether, these graphs allow us to start delving into the multi-dimensional complexity of each cluster and the types of areas behind them.

Hierarchical Clustering#

As mentioned above, k-means is only one clustering algorithm. There are plenty more. In this section, we will take a similar look at the San Diego dataset using another staple of the clustering toolkit: agglomerative hierarchical clustering (AHC). Agglomerative clustering works by building a hierarchy of clustering solutions that starts with all singletons (each observation is a single cluster in itself) and ends with all observations assigned to the same cluster. These extremes are not very useful in themselves. But, in between, the hierarchy contains many distinct clustering solutions with varying levels of detail. The intuition behind the algorithm is also rather straightforward:

  1. begin with each observation as its own cluster;

  2. find the two closest observations based on a distance metric (e.g., Euclidean);

  3. join them into a new cluster;

  4. repeat steps (2) and (3) until reaching the degree of aggregation desired.

The algorithm is thus called “agglomerative” because it starts with individual clusters and “agglomerates” them into fewer and fewer clusters containing more and more observations each. Also, like with k-means, AHC requires the user to specify a number of clusters in advance. This is because, following from the mechanism the method uses to build clusters, AHC can provide a solution with as many clusters as observations (\(k=n\)), or with only one (\(k=1\)).

Enough of theory, let’s get coding! In Python, AHC can be run with scikit-learn in very much the same way we did for k-means in the previous section. First we need to import it:

from sklearn.cluster import AgglomerativeClustering

In this case, we use the AgglomerativeClustering class and again use the fit method to actually apply the clustering algorithm to our data:

# Set seed for reproducibility
numpy.random.seed(0)
# Initialize the algorithm
model = AgglomerativeClustering(linkage="ward", n_clusters=5)
# Run clustering
model.fit(db_scaled)
# Assign labels to main data table
db["ward5"] = model.labels_

As above, we can check the number of observations that fall within each cluster:

ward5sizes = db.groupby("ward5").size()
ward5sizes
ward5
0    198
1     10
2     48
3    287
4     85
dtype: int64

Further, we can check the simple average profiles of our clusters:

ward5means = db.groupby("ward5")[cluster_variables].mean()
ward5means.T.round(3)
ward5 0 1 2 3 4
median_house_value 365932.350 625607.090 1202087.604 503608.711 503905.198
pct_white 0.589 0.598 0.871 0.770 0.766
pct_rented 0.573 0.360 0.285 0.287 0.657
pct_hh_female 0.105 0.098 0.107 0.112 0.076
pct_bachelor 0.023 0.006 0.002 0.009 0.006
median_no_rooms 4.566 5.860 6.010 5.738 3.904
income_gini 0.405 0.394 0.480 0.394 0.442
median_age 31.955 34.250 45.196 40.695 33.540
tt_work 2181.970 9260.400 1766.354 2268.718 2402.671

And again, we can tidy our dataset:

# Index db on cluster ID
tidy_db = db.set_index("ward5")
# Keep only variables used for clustering
tidy_db = tidy_db[cluster_variables]
# Stack column names into a column, obtaining
# a "long" version of the dataset
tidy_db = tidy_db.stack()
# Take indices into proper columns
tidy_db = tidy_db.reset_index()
# Rename column names
tidy_db = tidy_db.rename(
    columns={"level_1": "Attribute", 0: "Values"}
)
# Check out result
tidy_db.head()
ward5 Attribute Values
0 4 median_house_value 732900.000000
1 4 pct_white 0.916988
2 4 pct_rented 0.373913
3 4 pct_hh_female 0.052896
4 4 pct_bachelor 0.000000

And create a plot of the profiles’ distributions (Fig. 6):

# Set up the facets
facets = seaborn.FacetGrid(
    data=tidy_db,
    col="Attribute",
    hue="ward5",
    sharey=False,
    sharex=False,
    aspect=2,
    col_wrap=3,
)
# Build the plot as a `sns.kdeplot`
facets.map(seaborn.kdeplot, "Values", shade=True).add_legend();

[Figure 6: Distributions of each variable by AHC (Ward) cluster]

For the sake of brevity, we will not spend much time on the plots above. However, the interpretation is analogous to that of the k-means example.

On the spatial side, we can explore the geographical dimension of the clustering solution by making a map of the clusters. To make the comparison with k-means simpler, Figure 7, generated with the code below, displays both side-by-side:

db["ward5"] = model.labels_# Set up figure and axf, axs = plt.subplots(1, 2, figsize=(12, 6))### K-Means ###ax = axs[0]# Plot unique values choropleth including# a legend and with no boundary linesdb.plot( column="ward5", categorical=True, cmap="Set2", legend=True, linewidth=0, ax=ax,)# Remove axisax.set_axis_off()# Add titleax.set_title("K-Means solution ($k=5$)")### AHC ###ax = axs[1]# Plot unique values choropleth including# a legend and with no boundary linesdb.plot( column="k5cls", categorical=True, cmap="Set3", legend=True, linewidth=0, ax=ax,)# Remove axisax.set_axis_off()# Add titleax.set_title("AHC solution ($k=5$)")# Display the mapplt.show()

[Figure 7: K-means and AHC cluster maps side by side]

While we must remember our earlier caveat about how irregular polygons can baffle our visual intuition, a closer visual inspection of the cluster geography suggests a clear pattern: although they are not identical, both clustering solutions capture very similar overall spatial structure. Furthermore, both solutions slightly violate Tobler’s law, in the sense that all of the clusters have disconnected components. The five multivariate clusters in each case are actually composed of many disparate geographical areas, strewn around the map according only to the structure of the data and not its geography. That is, in order to travel to every tract belonging to a cluster, we would have to journey through other clusters as well.
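
One quick way to quantify this fragmentation (an illustrative aside; the exact counts depend on the labels obtained above) is to dissolve each k-means cluster and count how many disconnected pieces the resulting geometry contains:

# Number of disconnected pieces in each k-means cluster; values above
# one indicate spatial fragmentation
db.dissolve(by="k5cls").geometry.apply(
    lambda geom: len(geom.geoms) if geom.geom_type == "MultiPolygon" else 1
)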

Regionalization: spatially constrained hierarchical clustering#

Contiguity constraint#

Fragmented clusters are not intrinsically invalid, particularly if we are interested in exploring the overall structure and geography of multivariate data. However, in some cases, the application we are interested in might require that all the observations in a class be spatially connected. For example, when detecting communities or neighborhoods (as is sometimes needed when drawing electoral or census boundaries), they are nearly always distinct, self-connected areas, unlike our clusters shown above. To ensure that clusters are not spatially fragmented, we turn to regionalization.

Regionalization methods are clustering techniques that impose a spatial constraint on clusters. In other words, the result of a regionalization algorithm contains clusters with areas that are geographically coherent, in addition to having coherent data profiles. Effectively, this means that regionalization methods construct clusters that are all internally connected; these are the regions. Thus, a region’s members must be geographically nested within the region’s boundaries.

This type of nesting relationship is easy to identify in the real world. Census geographies provide good examples: counties nest within states in the U.S.; or lower layer super output areas (LSOAs) nest within middle layer super output areas (MSOAs) in the UK. The difference between these real-world nestings and the output of a regionalization algorithm is that the real-world nestings are aggregated according to administrative principles, while regions’ members are aggregated according to statistical similarity. In the same manner as the clustering techniques explored above, these regionalization methods aggregate observations that are similar in their attributes; the profiles of regions are useful in a similar manner as the profiles of clusters. But, in regionalization, the clustering is also spatially constrained, so the region profiles and members will likely be different from those of the unconstrained solutions.

As in the non-spatial case, there are many different regionalization methods. Each has a different way to measure (dis)similarity, how the similarity is used to assign labels, how these labels are iteratively adjusted, and so on. However, as with clustering algorithms, regionalization methods all share a few common traits. In particular, they all take a set of input attributes and a representation of spatial connectivity in the form of a binary spatial weights matrix. Depending on the algorithm, they also require the desired number of output regions. For illustration, we will take the AHC algorithm we have just used above and apply an additional spatial constraint. In scikit-learn, this is done using our spatial weights matrix as a connectivity option. This parameter will force the agglomerative algorithm to only allow observations to be grouped in a cluster if they are also spatially connected:

# Set the seed for reproducibility
numpy.random.seed(123456)
# Specify cluster model with spatial constraint
model = AgglomerativeClustering(
    linkage="ward", connectivity=w.sparse, n_clusters=5
)
# Fit algorithm to the data
model.fit(db_scaled)
AgglomerativeClustering(connectivity=<628x628 sparse matrix of type '<class 'numpy.float64'>'
                        with 4016 stored elements in Compressed Sparse Row format>,
                        n_clusters=5)


Let’s inspect the output visually (Fig. 8):

db["ward5wq"] = model.labels_# Set up figure and axf, ax = plt.subplots(1, figsize=(9, 9))# Plot unique values choropleth including a legend and with no boundary linesdb.plot( column="ward5wq", categorical=True, legend=True, linewidth=0, ax=ax,)# Remove axisax.set_axis_off()# Display the mapplt.show()

[Figure 8: Regions from AHC with a queen contiguity constraint]

Introducing the spatial constraint results in fully connected clusters with much more concentrated spatial distributions. From an initial visual impression, it might appear that our spatial constraint has been violated: there are tracts for both cluster 0 and cluster 1 that appear to be disconnected from the rest of their clusters. However, closer inspection reveals that each of these tracts is indeed connected to another tract in its own cluster by very narrow shared boundaries.

Changing the spatial constraint#

The spatial constraint in regionalization algorithms is structured by the spatial weights matrix we use. An interesting question is thus how the choice of weights influences the final region structure. Fortunately, we can directly explore the impact that a change in the spatial weights matrix has on regionalization. To do so, we use the same attribute data but replace the queen contiguity matrix with a spatial k-nearest neighbor matrix, where each observation is connected to its four nearest observations, instead of those it touches.

w = KNN.from_dataframe(db, k=4)

With this matrix connecting each tract to the four closest tracts, we can run another AHC regionalization:

# Set the seed for reproducibility
numpy.random.seed(123456)
# Specify cluster model with spatial constraint
model = AgglomerativeClustering(
    linkage="ward", connectivity=w.sparse, n_clusters=5
)
# Fit algorithm to the data
model.fit(db_scaled)
AgglomerativeClustering(connectivity=<628x628 sparse matrix of type '<class 'numpy.float64'>'
                        with 2512 stored elements in Compressed Sparse Row format>,
                        n_clusters=5)


And plot the final regions (Fig. 9):

db["ward5wknn"] = model.labels_# Setup figure and axf, ax = plt.subplots(1, figsize=(9, 9))# Plot unique values choropleth# including a legend and with no boundary linesdb.plot( column="ward5wknn", categorical=True, legend=True, linewidth=0, ax=ax,)# Remove axisax.set_axis_off()# Display the mapplt.show()

[Figure 9: Regions from AHC with a 4-nearest-neighbor constraint]

Even though we have specified a spatial constraint, the constraint applies to the connectivity graph modeled by our weights matrix. Therefore, using k-nearest neighbors to constrain the agglomerative clustering may not result in regions that are connected according to a different connectivity rule, such as the queen contiguity rule used in the previous section. However, the regionalization here is fortuitous; even though we used the 4-nearest tracts to constrain connectivity, all of our clusters are also connected according to the queen contiguity rule.
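
We can verify this claim programmatically (an illustrative sketch using libpysal’s w_subset; note that w now holds the k-nearest neighbor weights, so we rebuild a queen contiguity matrix first). A count of one connected component per region confirms it is queen-connected:

# Subset the queen contiguity graph to each knn-constrained region and
# count its connected components (1 means the region is fully connected)
from libpysal.weights import w_subset

w_queen = Queen.from_dataframe(db)
for label in range(5):
    members = db.index[db["ward5wknn"] == label].tolist()
    print(label, w_subset(w_queen, members).n_components)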

So, which one is a “better” regionalization? Well, regionalizations are often compared based on measures of geographical coherence, as well as measures of cluster coherence. The former involves measures of cluster shape that can answer questions like “are clusters evenly sized, or are they very differently sized? Are clusters very strangely shaped, or are they compact?”, while the latter generally focuses on whether cluster observations are more similar to their current clusters than to other clusters. This goodness of fit is usually better for unconstrained clustering algorithms than for the corresponding regionalizations. We’ll show this next.

Geographical coherence#

One very simple measure of geographical coherence involves the “compactness” of a given shape. The most common of these measures is the isoperimetric quotient [HHV93]. This compares the area of the region to the area of a circle with the same perimeter as the region. To obtain the statistic, we can recognize that the circumference of the circle \(c\) is the same as the perimeter of the region \(i\), so \(P_i = 2\pi r_c\). Then, the area of the isoperimetric circle is \(A_c = \pi r_c^2 = \pi \left(\frac{P_i}{2 \pi}\right)^2\). Simplifying, we get:

\[ IPQ_i = \frac{A_i}{A_c} = \frac{4 \pi A_i}{P_i^2}\]

For this measure, more compact shapes have an IPQ closer to 1, whereas very elongated or spindly shapes will have IPQs closer to zero. The more boundaries a group’s members share, the smaller the group’s overall perimeter and thus the larger its IPQ; for the spatially fragmented clustering solutions, whose members share few boundaries, we would therefore expect the IPQ to be very small indeed.
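
As a quick sanity check of the formula (an illustrative aside using shapely, which geopandas already depends on): a circle should score close to 1, while a unit square scores \(\pi/4 \approx 0.785\):

# IPQ of two reference shapes: a (near-)circular buffer and a unit square
from shapely.geometry import Point, box

shapes = {"circle": Point(0, 0).buffer(1), "square": box(0, 0, 1, 1)}
for name, geom in shapes.items():
    ipq = 4 * numpy.pi * geom.area / geom.length ** 2
    print(name, round(ipq, 3))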

Computing this, then, can be done directly from the area and perimeter of a region:

results = []
for cluster_type in ("k5cls", "ward5", "ward5wq", "ward5wknn"):
    # Compute the region polygons using a dissolve
    regions = db[[cluster_type, "geometry"]].dissolve(by=cluster_type)
    # Compute the actual isoperimetric quotient for these regions
    ipqs = (
        regions.area * 4 * numpy.pi / (regions.boundary.length ** 2)
    )
    # Cast to a dataframe
    result = ipqs.to_frame(cluster_type)
    results.append(result)
# Stack the series together along columns
pandas.concat(results, axis=1)
k5cls ward5 ward5wq ward5wknn
0 0.022799 0.018571 0.074571 0.139216
1 0.063241 0.149260 0.129419 0.245486
2 0.038124 0.046086 0.542491 0.050342
3 0.048154 0.064358 0.281874 0.542491
4 0.161781 0.027323 0.112495 0.123690

From this, we can see that the shape measures for the regions produced by the two regionalizations are much better than those for the unconstrained clustering solutions. As we’ll show in the next section, this comes at the cost of goodness of fit. The two spatial solutions also differ from one another in the compactness of their individual regions, although, judging from the table above, neither is uniformly more compact than the other across all five regions.

Many other measures of shape regularity exist. Most of the well-used ones are implemented in the esda.shapestats module, which also documents the sensitivity of the different measures of shape.

Feature coherence (goodness of fit)#

Many measures of feature coherence, or goodness of fit, are implemented in scikit-learn’s metrics module, which we used earlier to compute distances. This module also contains a few goodness of fit statistics that measure, for example:

  • metrics.calinski_harabasz_score() (CH): the between-cluster variance divided by the within-cluster variance, so that larger values indicate better-separated clusters.

  • metrics.silhouette_score(): the average standardized distance from each observation to its “next best fit” cluster—the most similar cluster to which the observation is not currently assigned.

To compute these, each scoring function requires both the original data and the labels which have been fit. We’ll compute the CH score for all the different clusterings below:

ch_scores = []
for cluster_type in ("k5cls", "ward5", "ward5wq", "ward5wknn"):
    # Compute the CH score
    ch_score = metrics.calinski_harabasz_score(
        # Using scaled variables
        robust_scale(db[cluster_variables]),
        # Using these labels
        db[cluster_type],
    )
    # Append the cluster type with the CH score
    ch_scores.append((cluster_type, ch_score))
# Re-arrange the scores into a dataframe for display
pandas.DataFrame(
    ch_scores, columns=["cluster type", "CH score"]
).set_index("cluster type")
CH score
cluster type
k5cls 115.118055
ward5 98.529245
ward5wq 62.518714
ward5wknn 54.378576

For all functions in metrics whose names end in “score”, higher numbers indicate greater fit, whereas functions whose names end in “loss” work in the other direction. Thus, the k-means solution has the highest Calinski-Harabasz score, while the Ward clustering comes second. The regionalizations both come well below the clusterings, too. As we said before, the improved geographical coherence comes at a pretty hefty cost in terms of feature goodness of fit. This is because regionalization is constrained and, mathematically, cannot achieve the same goodness of fit as the unconstrained solutions, unless we get lucky and the unconstrained solution happens to be a valid regionalization.
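
The silhouette score mentioned above can be computed in exactly the same fashion (an illustrative extension of the code above; higher values again indicate a better fit):

# Silhouette scores for the four solutions, using the same scaled features
sil_scores = []
for cluster_type in ("k5cls", "ward5", "ward5wq", "ward5wknn"):
    sil_score = metrics.silhouette_score(
        robust_scale(db[cluster_variables]), db[cluster_type]
    )
    sil_scores.append((cluster_type, sil_score))
pandas.DataFrame(
    sil_scores, columns=["cluster type", "silhouette"]
).set_index("cluster type")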

Solution similarity#

The metrics module also contains useful tools to compare whether the labelings generated from different clustering algorithms are similar, such as the Adjusted Rand Score or the Mutual Information Score. To show that, we can see how similar clusterings are to one another:

ami_scores = []
# For each cluster solution
for i_cluster_type in ("k5cls", "ward5", "ward5wq", "ward5wknn"):
    # For every other clustering
    for j_cluster_type in ("k5cls", "ward5", "ward5wq", "ward5wknn"):
        # Compute the adjusted mutual info between the two
        ami_score = metrics.adjusted_mutual_info_score(
            db[i_cluster_type], db[j_cluster_type]
        )
        # Save the pair of cluster types with the score
        ami_scores.append((i_cluster_type, j_cluster_type, ami_score))
# Arrange the results into a dataframe
results = pandas.DataFrame(
    ami_scores, columns=["source", "target", "similarity"]
)
# Spread the dataframe out into a square
results.pivot("source", "target", "similarity")
target k5cls ward5 ward5wknn ward5wq
source
k5cls 1.000000 0.574792 0.267554 0.302755
ward5 0.574792 1.000000 0.250029 0.258057
ward5wknn 0.267554 0.250029 1.000000 0.648272
ward5wq 0.302755 0.258057 0.648272 1.000000

From this, we can see that the two unconstrained clusterings (k-means and Ward) are quite similar to one another, as are the two regionalizations, which in this case are in fact slightly more similar to each other than the clusterings are. The regionalizations are generally not very similar to the clusterings, however, as would be expected from our discussion above.

Conclusion#

Overall, clustering and regionalization are two complementary tools to reduce complexity in multivariate data and build better understandings of their spatial structure. Often, there is simply too much data to examine every variable’s map and its relation to all other variable maps. Thus, clustering reduces this complexity into a single conceptual shorthand by which people can easily describe complex and multi-faceted data. Clustering constructs groups of observations (called clusters) with coherent profiles, or distinct and internally consistent distributional/descriptive characteristics. These profiles are the conceptual shorthand, since members of each cluster should be more similar to the cluster at large than they are to any other cluster. Many different clustering methods exist; they differ on how the cluster is defined, how “similar” members must be to clusters, and how these clusters are obtained. Regionalization is a special kind of clustering that imposes an additional geographic requirement. Observations should be grouped so that each spatial cluster, or region, is spatially coherent as well as data-coherent. Thus, regionalization is often concerned with connectivity in a contiguity graph for data collected in areas; this ensures that the regions that are identified are fully internally connected. However, since many regionalization methods are defined for an arbitrary connectivity structure, these graphs can be constructed according to different rules as well, such as the k-nearest neighbor graph. Finally, while regionalizations are usually more geographically coherent, they are also usually worse-fit to the features at hand. This reflects an intrinsic tradeoff that, in general, cannot be removed.

In this chapter, we discussed the conceptual basis for clustering and regionalization, as well as showing why clustering is done. Further, we have demonstrated how to build clusters using a combination of (geographic) data science packages, and how to interrogate the meaning of these clusters as well. More generally, clusters are often used in predictive and explanatory settings, in addition to being used for exploratory analysis in their own right. Clustering and regionalization are intimately related to the analysis of spatial autocorrelation as well, since the spatial structure and covariation in multivariate spatial data is what determines the spatial structure and data profile of discovered clusters or regions. Thus, clustering and regionalization are essential tools for the geographic data scientist.

Questions#

  1. What disciplines employ regionalization? Cite concrete examples for each discipline you list.

  2. Contrast and compare the concepts of clusters and regions.

  3. In evaluating the quality of the solution to a regionalization problem, how might traditional measures of cluster evaluation be used? In what ways might those measures be limited and need expansion to consider the geographical dimensions of the problem?

  4. Discuss the implications for the processes of regionalization that follow from the number of connected components in the spatial weights matrix that would be used.

  5. Consider two possible weights matrices for use in a spatially constrained clustering problem. Both form a single connected component for all the areal units. However, they differ in the sparsity of their adjacency graphs (think Rook being less dense than Queen graphs).

    a. How might the sparsity of the weights matrix affect the quality of the clustering solution?

    b. Using pysal.lib.weights.higher_order, construct a second-order adjacency matrix of the weights matrix used in this chapter.

    c. Compare the pct_nonzero for both matrices.

    d. Rerun the analysis from this chapter using this new second-order weights matrix. What changes?

  6. The idea of spatial dependence, that near things tend to be more related than distant things, is an extensively studied property of spatial data. How might solutions to clustering and regionalization problems change if dependence is very strong and positive? very weak? very strong and negative?

  7. Using a spatial weights object obtained as w = pysal.lib.weights.lat2W(20,20), what are the number of unique ways to partition the graph into 20 clusters of 20 units each, subject to each cluster being a connected component? What are the unique numbers of possibilities for w = pysal.lib.weights.lat2W(20,20, rook=False)?

Next steps#

For a “classical” introduction to clustering methods in arbitrary data science problems, it is difficult to beat the Introduction to Statistical Learning:

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning (2nd Edition). Springer: New York.

For regionalization problems and methods, a useful discussion of the theory and operation of various heuristics and methods is provided by:

Duque, Juan Carlos, Raúl Ramos, and Jordi Suriñach. 2007. “Supervised Regionalization Methods: A survey.” International Regional Science Review 30(3): 195-220.

Finally, methods for geodemographics are comprehensively covered in the book by:

Harris, Rich, Peter Sleight, and Richard Webber. 2005. Geodemographics, GIS, and Neighbourhood Targeting. Wiley.

And a more recent overview and discussion is provided by:

Singleton, Alex and Seth Spielman. 2014. “The past, present, and future of geodemographic research in the United States and the United Kingdom.” The Professional Geographer 66(4): 558-567.
