Geocoding and custom subsets¶

pygris offers an interface to the US Census Bureau's Geocoding and GeoLookup APIs. The Census geocoder is typically not as accurate as commercial alternatives as it is based on street centerlines geocoding referenced to the TIGER/Line database; see Prener and Fox (2021) for more details. For analysts who need to do geocoding in the United States, however, it is a useful tool for a few reasons:

It is free to use without an API key;
It offers both single-address and batch geocoding options;
It can return the GEOID of the Census geographic unit in which a given address or coordinate pair falls.

Geocoding functionality in pygris is found in the geocode module. The geocode() function takes either a single-line address or an address specified as arguments and returns a Pandas dataframe of possible matches, subject to the value of the limit argument.

In [1]:

Copied!

from pygris.geocode import geocode

geocode(address = "1600 Pennsylvania Ave NW, Washington DC")
from pygris.geocode import geocode

geocode(address = "1600 Pennsylvania Ave NW, Washington DC")

Out[1]:

	longitude	latitude	GEOID	address
0	-77.036544	38.898691	110019800001034	1600 Pennsylvania Ave NW, Washington DC

The estimated longitude and latitude of the location, along with the estimated Census block GEOID, are returned. As a convenience, users can also specify the option as_gdf = True to return a GeoDataFrame for mapping or analysis.

In [2]:

Copied!

geocode(address = "1600 Pennsylvania Ave NW, Washington DC", as_gdf = True).explore(marker_type = "marker")
geocode(address = "1600 Pennsylvania Ave NW, Washington DC", as_gdf = True).explore(marker_type = "marker")

Out[2]:

Make this Notebook Trusted to load map: File -> Trust Notebook

Given that the Census geocoder uses street centerlines geocoding, not rooftop geocoding, the point will be placed on the street near the location, not necessarily right on top of it.

The geolookup() function can be used to find the Census GEOID of a location defined by X and Y coordinates.

In [3]:

Copied!

from pygris.geocode import geolookup

geolookup(longitude = -98.90629, latitude= 32.75639)
from pygris.geocode import geolookup

geolookup(longitude = -98.90629, latitude= 32.75639)

Out[3]:

	GEOID	longitude	latitude
0	484299502001041	-98.90629	32.75639

Batch geocoding of addresses in a Pandas DataFrame can be accomplished with the batch_geocode function. batch_geocode() takes columns describing addresses and sends them to the Census Bureau's batch geocoding API. The object returned includes estimated longitude and latitude coordinates for each address along with contextual information about the Census geographies that the addresses fall within.

Let's take a look at some sample addresses from New York City.

In [4]:

Copied!





import pandas as pd
from pygris.geocode import batch_geocode

my_addresses = pd.DataFrame(
    {"building": ["Chrysler Building", "Empire State Building", "Flatiron Building"],
     "address": ["405 Lexington Ave", "20 W 34th St", "175 5th Ave"],
     "city": "New York",
     "state": "New York",
     "zip": ["10174", "10018", "10010"]}
)

my_addresses
import pandas as pd
from pygris.geocode import batch_geocode

my_addresses = pd.DataFrame(
    {"building": ["Chrysler Building", "Empire State Building", "Flatiron Building"],
     "address": ["405 Lexington Ave", "20 W 34th St", "175 5th Ave"],
     "city": "New York",
     "state": "New York",
     "zip": ["10174", "10018", "10010"]}
)

my_addresses

Out[4]:

	building	address	city	state	zip
0	Chrysler Building	405 Lexington Ave	New York	New York	10174
1	Empire State Building	20 W 34th St	New York	New York	10018
2	Flatiron Building	175 5th Ave	New York	New York	10010

We can geocode the addresses by passing the dataframe to the batch_geocode() function and mapping the columns as arguments to the function appropriately.

In [5]:

Copied!

my_points = batch_geocode(my_addresses, id_column = "building",
                          address = "address", city = "city", state = "state",
                          zip = "zip", as_gdf = True)

my_points
my_points = batch_geocode(my_addresses, id_column = "building",
                          address = "address", city = "city", state = "state",
                          zip = "zip", as_gdf = True)

my_points

Out[5]:

	id	address	status	match_quality	matched_address	tiger_line_id	tiger_side	state	county	tract	block	longitude	latitude	geometry
0	Flatiron Building	175 5th Ave, New York, New York, 10010	Match	Exact	175 5TH AVE, NEW YORK, NY, 10010	59653610	R	36	61	5600	2002	-73.989966723772	40.7408989937	POINT (-73.98997 40.7409)
1	Empire State Building	20 W 34th St, New York, New York, 10018	Match	Non_Exact	20 W 34TH ST, NEW YORK, NY, 10118	59653429	L	36	61	7600	1001	-73.985242583802	40.748653379015	POINT (-73.98524 40.74865)
2	Chrysler Building	405 Lexington Ave, New York, New York, 10174	Match	Exact	405 LEXINGTON AVE, NEW YORK, NY, 10174	59658691	R	36	61	9200	1003	-73.975868909053	40.751569772735	POINT (-73.97587 40.75157)

An interactive map helps us audit the geocoding result:

In [6]:

Copied!

my_points.explore(marker_type = "marker")

my_points.explore(marker_type = "marker")

Out[6]:

Make this Notebook Trusted to load map: File -> Trust Notebook

While the geocoded points for the Chrysler Building and the Empire State Building are right next to the actual buildings, the Flatiron Building is geocoded one block away. This underscores the point that users who need very high accuracy should probably look to other geocoding resources. However, if these results are tolerable, the Census geocoder offers a compelling option for users who require a free batch geocoder.

Custom subsets of Census shapefiles¶

Users of pygris will typically request data by state or county, which are the geographies at which Census shapefiles are usually available. However, there may be instances where more custom subsets are required. As pygris relies on geopandas.read_file() to read in shapefiles, it benefits from the options available to read in specific rows from those shapefiles only.

The parameter subset_by will return a custom subset of a Census shapefile depending on the type of object passed as an argument. Options are as follows:

If a user supplies a tuple of format (minx, miny, maxx, maxy), it will be interpreted as a bounding box and rows will be returned that intersect that bounding box;
If a user supplies a integer or a slice object, the first n rows (or the rows defined by the slice object) will be returned;
If a user supplies an object of type geopandas.GeoDataFrame or of type geopandas.GeoSeries, rows that intersect the input object will be returned. CRS misalignment will be resolved internally;
Users can even leverage the geocoding functionality in pygris to pass a dict object of format {"address": "buffer_dist"}, which will return geographies that intersect a given buffer distance (specified in meters) of a geocoded address.

Let's take a look at how this works. Say we want to retrieve Census tracts within 5km of the Texas State Capitol in Austin. This is straightforward with subset_by:

In [7]:

Copied!

from pygris import tracts

capitol_tracts = tracts(state = "TX", cb = True,
                        subset_by = {"1100 Congress Ave., Austin, TX 78701": 5000})

capitol_tracts.explore()
from pygris import tracts

capitol_tracts = tracts(state = "TX", cb = True,
                        subset_by = {"1100 Congress Ave., Austin, TX 78701": 5000})

capitol_tracts.explore()

Using the default year of 2024
Using FIPS code '48' for input 'TX'

Out[7]:

Make this Notebook Trusted to load map: File -> Trust Notebook

Getting data for multiple states is not quite as straightforward. Let's say you want all the tracts within 10km of Subaru Park, home of the Philadelphia Union soccer team in Chester, PA. This buffer will cross into New Jersey and into Delaware from Pennsylvania.

One method is to use a list comprehension to store state-wise tracts in a list, then use pandas.concat() to assemble the result:

In [8]:

Copied!

import pandas as pd

union_tracts_list = [tracts(cb = True, state = x, subset_by = {"2501 Seaport Dr, Chester, PA 19013": 10000}) for x in ['DE', 'PA', 'NJ']]

union_tracts = pd.concat(union_tracts_list)

union_tracts.explore()
import pandas as pd

union_tracts_list = [tracts(cb = True, state = x, subset_by = {"2501 Seaport Dr, Chester, PA 19013": 10000}) for x in ['DE', 'PA', 'NJ']]

union_tracts = pd.concat(union_tracts_list)

union_tracts.explore()

Using the default year of 2024
Using FIPS code '10' for input 'DE'

Using the default year of 2024
Using FIPS code '42' for input 'PA'

Using the default year of 2024
Using FIPS code '34' for input 'NJ'

Out[8]:

Make this Notebook Trusted to load map: File -> Trust Notebook

The other alternative is to pull down a national tracts shapefile by omitting the state and subsetting with subset_by. Given that the file will first have to be downloaded, it is recommended that you use cache = True to store this file in a local cache. Once you have it, however, you can subset for whatever locations you need, when you need!