Building custom regions from 2020 Census blocks in Python

Categories: python, gis, data science
Author: Kyle Walker
Published: June 26, 2023

Earlier this month, I gave a two-part workshop series on analyzing the newly released 2020 Decennial US Census data with R. If you missed the workshop series, you can buy the videos and materials on my Workshops page. One topic I addressed was how to handle the impact of differential privacy on block-level accuracy in the new Census data.

Differential privacy refers to a method used by the Census Bureau to infuse “noise” into data products to preserve respondent confidentiality. Counts for larger areas and larger groups will still be accurate, but differential privacy makes smaller counts less reliable. In fact, the Census Bureau makes the following recommendation about block-level data:

DON’T use data for individual blocks. Instead, aggregate data into larger areas, or use statistical models that combine data from many blocks. Block data are published to permit the analysis of user-constructed geographic areas composed of multiple blocks, for example, new voting districts that consist of collections of blocks within a politically defined geography.

This isn’t likely to be satisfactory advice for analysts for a couple of key reasons. First, analysts working with Census data in rural areas often need block-level data to understand demographic trends, as block groups (the next level up in the Census hierarchy) may be too large in sparsely populated areas. Second, “aggregating data” is not as simple as the quote makes it sound. Creating data aggregations requires techniques from GIS and data science that may be beyond the experience of the average Census data user.

In this post, I’ll illustrate a technique for creating custom regions from Census block data. We’ll be using the pygeoda package for this task, a Python wrapper of the C++ library that powers GeoDa, a GUI tool for exploratory spatial data analysis and spatial modeling. Working with GeoDa in this way is particularly fun for me. I was a qualitative geographer in graduate school before encountering GeoDa; it was the tool that sparked my interest in spatial data science and in many ways motivated my eventual career path.

Let’s grab some block data using pygris for Delta County, Texas, a rural county of about 5,000 residents northeast of the Dallas-Fort Worth metro area. If you haven’t previously cached the Texas block shapefile, this will take a few minutes to download.

import geopandas as gp
import pygeoda
from pygris import blocks, block_groups
from pygris.data import get_census

# Get the block data for a county in Texas
delta_blocks = blocks(state = "TX", county = "Delta", year = 2020, cache = True)
Using FIPS code '48' for input 'TX'
Using FIPS code '119' for input 'Delta'

Given that Delta County is fairly small, we can use .explore() to make a performant interactive map of the 571 Census blocks.

delta_blocks.explore(tooltip = False, popup = True)
[Interactive map of the 571 Census blocks in Delta County]

There is one town of reasonable size in Delta County, Cooper. However, Census geography above the block level is too coarse for detailed demographic analysis of the town. For example, we can briefly review block groups in Delta County:

delta_bgs = block_groups(state = "TX", county = "Delta", year = 2020, cache = True)

delta_bgs.explore(tooltip = False, popup = True)
Using FIPS code '48' for input 'TX'
Using FIPS code '119' for input 'Delta'
[Interactive map of Delta County block groups]

We see that the 571 blocks in Delta County are organized into only 4 block groups. Cooper is split between two block groups, both of which also include territory outside the built-up area of the town. This means that we can’t really use block groups to do more detailed analysis of Cooper’s in-town demographics, and other settlements in the county (like Pecan Gap in the northwest) are subsumed by much larger block groups.

The solution we’ll use is regionalization: the process of building larger, aggregated areas from smaller geographies in ways that are spatially coherent and account for the characteristics of those small areas. To get started, let’s grab some demographic data from the new Demographic and Housing Characteristics (DHC) file. While there are many ways to get Census data in Python, pygris includes a lower-level function, get_census(), to help you grab data from the Census API and merge it with your Census shapes.

We’ll get block-level data on total population and the non-Hispanic white population from the DHC, limiting the request to Delta County.

delta_data = get_census(
    dataset = "dec/dhc",  # Demographic and Housing Characteristics file
    year = 2020,
    variables = ["P1_001N", "P5_003N"],  # total population; non-Hispanic white population
    params = {
        "for": "block:*",
        "in": ["state:48", "county:119"]
    },
    return_geoid = True,  # build a GEOID column for merging
    guess_dtypes = True   # parse numeric columns rather than returning strings
)

We can then merge our block-level Census data to our block geometries and calculate some derived columns. As the block shapes acquired with blocks() have a column on land area, ALAND20, we can calculate population density; we’ll also calculate the percentage of the block population that is non-Hispanic white. These columns will be used in the regionalization algorithm to cluster together demographically similar blocks.

# Join the DHC estimates to the block geometries
delta_geo = delta_blocks[['GEOID20', 'geometry', 'ALAND20']].merge(
    delta_data, left_on = "GEOID20", right_on = "GEOID"
)

# People per square meter of land area
delta_geo["pop_density"] = delta_geo["P1_001N"] / delta_geo["ALAND20"]

# Share of each block's population that is non-Hispanic white
delta_geo["percent_white"] = delta_geo["P5_003N"] / delta_geo["P1_001N"]

# Blocks with zero population yield NaN shares; fill with 0
delta_geo.fillna(0, inplace = True)

delta_geo.head()
GEOID20 geometry ALAND20 P1_001N P5_003N GEOID pop_density percent_white
0 481199502001069 POLYGON ((-95.69040 33.37604, -95.68957 33.376... 5896 15 9 481199502001069 0.002544 0.600000
1 481199502001083 POLYGON ((-95.69610 33.36993, -95.69488 33.369... 12686 8 7 481199502001083 0.000631 0.875000
2 481199502002056 POLYGON ((-95.68872 33.37532, -95.68783 33.375... 6681 5 3 481199502002056 0.000748 0.600000
3 481199502001022 POLYGON ((-95.70546 33.37336, -95.70434 33.374... 84159 4 1 481199502001022 0.000048 0.250000
4 481199502002066 POLYGON ((-95.68616 33.37308, -95.68539 33.373... 6291 7 6 481199502002066 0.001113 0.857143
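As a quick sanity check, the block-level counts should sum to the county total of roughly 5,000 residents mentioned above:

# Verify that the block populations sum to the county total (~5,000)
delta_geo["P1_001N"].sum()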

After preparing our Census data, we can now move to spatial analysis. pygeoda.open() converts a GeoPandas GeoDataFrame to an object suitable for use with pygeoda. Next, we’ll create spatial weights to represent spatial relationships between Census blocks. We’ll use rook weights, which means that blocks are considered neighbors if they share at least one boundary segment (not just a single point). This step is critical for regionalization, as we want to ensure that our regions are spatially coherent.

# Convert the GeoDataFrame to a pygeoda object
delta_gda = pygeoda.open(delta_geo)

# Rook contiguity: blocks that share a boundary segment are neighbors
w = pygeoda.rook_weights(delta_gda)

w
Weights Meta-data:
 number of observations:                  571
           is symmetric:                 True
               sparsity: 0.008244361905404535
        # min neighbors:                    1
        # max neighbors:                   22
       # mean neighbors:   4.7075306479859895
     # median neighbors:                  4.0
           has isolates:                False

We get some basic information about the weights object. The least-connected block has 1 neighbor, and the most-connected block has 22; the median number of neighbors is 4.
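If you want to inspect the contiguity graph directly, the weights object can report the neighbors of any observation by position. Here’s a minimal sketch, assuming the get_neighbors() method available on pygeoda weights objects, which returns positional indices we can map back to block GEOIDs:

# Positional indices of the neighbors of the first block
nbr_idx = w.get_neighbors(0)

# Map those positions back to block GEOIDs
delta_geo["GEOID20"].iloc[list(nbr_idx)]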

With this information in hand, we can run the regionalization algorithm. The algorithm we’ll choose is max-p. Max-p regionalization attempts to find the maximum number of spatially contiguous regions such that each region exceeds a given size threshold, while maximizing within-region homogeneity. We’ll use population density and percent non-Hispanic white as our clustering variables, and total population as our bounding variable. Setting min_bound to 100 tells the algorithm that each derived region must have at least 100 people in it. A couple of other technical details: max-p is highly sensitive to the algorithm’s starting point, so it is recommended to set a random-number seed (and potentially evaluate results across multiple seeds, as sketched below). For more stable performance, cpu_threads = 1 should also be used.

# Attributes used to measure similarity between blocks
cluster_variables = delta_gda[['pop_density', 'percent_white']]

# Each region must meet a minimum total of this variable
bound_variable = delta_gda['P1_001N']

regions = pygeoda.maxp_greedy(
    w = w,
    data = cluster_variables,
    method = "fullorder-averagelinkage",
    bound_variable = bound_variable,
    min_bound = 100,      # each region needs at least 100 people
    random_seed = 1983,   # max-p is sensitive to the starting point
    cpu_threads = 1       # single-threaded for stable, reproducible results
)
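As noted above, the greedy search depends on its starting point. One way to hedge against an unlucky seed is to re-run the algorithm across several seeds and keep the best-fitting solution. Here’s a minimal sketch, assuming the returned dict exposes the 'The ratio of between to total sum of squares' fit measure reported by pygeoda’s clustering functions:

# Re-run max-p across several seeds and keep the best-fitting solution
best_fit, best_solution = -1, None

for seed in [1983, 42, 2020]:
    candidate = pygeoda.maxp_greedy(
        w = w,
        data = cluster_variables,
        method = "fullorder-averagelinkage",
        bound_variable = bound_variable,
        min_bound = 100,
        random_seed = seed,
        cpu_threads = 1
    )
    # A higher between-to-total ratio indicates more homogeneous regions
    fit = candidate.get('The ratio of between to total sum of squares')
    if fit is not None and fit > best_fit:
        best_fit, best_solution = fit, candidate

For the remainder of the post, we’ll stick with the single-seed solution stored in regions.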

Thanks to the C++ back-end of pygeoda, the function is lightning-fast. It returns a dict object with information about the regionalization solution; we’ll pluck the cluster labels out of it and assign them to our original GeoDataFrame as a column. We’ll then use a dissolve() operation to build new geographies from our regionalization solution, and calculate some derived statistics for those regions.

# Attach the cluster labels to the block-level GeoDataFrame
delta_geo['region'] = regions.get('Clusters')

# Merge block geometries and sum the attribute columns within each region
delta_regions = delta_geo.dissolve(by = 'region', aggfunc = 'sum').reset_index()

# Recompute the derived columns from the summed counts
delta_regions["pop_density"] = delta_regions["P1_001N"] / delta_regions["ALAND20"]
delta_regions["percent_white"] = delta_regions["P5_003N"] / delta_regions["P1_001N"]

print(f'Number of regions: {delta_regions.shape[0]}\nMinimum population: {delta_regions["P1_001N"].min()}')
Number of regions: 39
Minimum population: 101

We see that the algorithm built 39 regions in Delta County. The minimum population of any region is 101, confirming that our minimum population threshold was satisfied. Let’s take a look at the derived regions:

delta_regions.explore(column = "region", categorical = True, legend = False)
[Interactive map of the 39 derived regions]

Zooming into Cooper shows that we have several new geographies built from Census blocks in the town, allowing us to do more detailed demographic analysis. Additionally, Pecan Gap gets its own aggregated geography in this solution. We can also use these regions in future data work, as block IDs are mapped to regions in the delta_geo object we created.
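For example, a block-to-region crosswalk can be exported from delta_geo for reuse in other projects; a minimal sketch (the file name is just an example):

# Save a block-to-region lookup for later analyses
crosswalk = delta_geo[["GEOID20", "region"]]
crosswalk.to_csv("delta_block_regions.csv", index = False)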

Regionalization is a powerful tool in spatial data science, and it can be used to solve problems in a wide range of fields. If you want to learn more, check out my workshops or send me a note to discuss further!