ZCTA

The process of building an interactive map of ZIP codes using ZIP Code Tabulation Areas polygons.

Check out the map here or continue reading…

# What Do ZIP Codes Look Like?

I’ve been working with US address and census data lately, and recently found myself questioning what I actually knew about ZIP codes. That line of questioning didn’t lead where I thought it was going to go, but you can read more about my thoughts on that here). One place it did end up was wondering, what do ZIP codes actually look like?

After realizing that I had no geographic understanding of ZIP codes, I decided that I wanted a more intuitive way to look at them and try to make sense of what they are, what they look like and how they’re related to other geographical constructs I’m more familiar with, such as states, counties, census blocks, parcels, etc.

To satisfy my curiosity, I started playing around in my head with the idea of an interactive map with a simple interface, something that would just highlight the area of a ZIP code as you type it. I knew I was going to build a Mapbox GL JS map as my go-to, and with some tips and tricks I had picked up using expression filters in my San Juan Island parcels map, I had some good ideas in mind.

# We Need Boundaries

I didn’t realize this when I set out to make this map, but ZIP codes aren’t actually well-defined shapes. Instead, they define a specific post office or a rough collection of addresses served by a post office. All of these addresses then have the same ZIP code to allow for easier and more efficient sorting and labeling of mail destined for said addresses.

But not having a well-define shape is a bit of a problem when you’re trying to map something! I needed shapes if I was going to make the map I had in mind. After doing a bit of research, I knew that some third party vendors will sell you ZIP code boundaries for a pretty penny, a product mostly intended for targeted ad and direct mail entrepreneurs I assume. I’m pretty sure these boundaries aren’t “official” though, whatever that means, and are just a somewhat manufactured boundary lines drawn around some set or subset of addresses for any given ZIP code.

# ZCT Eh?

That’s all well and good but I didn’t really want to pay anybody for data to make this map, so where do I turn? That’s right, the Census Bureau. Turns out, the Census Bureau also had this problem at some point and decided to introduce ZIP Code Tabulation Areas (or ZCTA’s) back in 2000.

“The Census Bureau first examined all of the addresses within each census block to define the list of ZIP Codes by block. Next, the most frequently occurring ZIP Code within each block was assigned to the entire census block as a preliminary ZCTA code. After all of the census blocks with addresses were assigned a preliminary ZCTA code, blocks were aggregated by code to create larger areas.” - Census Bureau/ZCTA’s

Luckily for me, ZCTA’s are exactly what I need. They’re both (roughly) representative of actual US Postal Service ZIP Codes while also having well-defined geographic boundaries that are based on census blocks. Perfect!

The only drawback to using ZCTA’s in place of ZIP Codes, ignoring some minor inaccuracies, is that there’s only about an 80% overlap between ZCTA’s and ZIP Codes. While there are almost 42,000 ZIP Codes in the United States, there are only 33,000 or so ZCTA’s. ZIP Codes with very few if any addresses don’t get aggregated as ZCTA’s in the Census Bureau’s process, hence the only 80% overlap.

# Census Data

ZCTA data from the 2010 census can be found here as a product of the Census Bureau’s 2019 TIGER/LineĀ® Shapefile data set. This data comes as a ~500 MB zipped shapefile that includes ZCTA’ for the entire United States.

Now that we have some data, let’s use OGR/GDAL to do some data manipulation.

gdalinfo --version
GDAL 2.4.1, released 2019/03/15

First things first, we need to double-check the projection of the shapefile using ogrinfo.

ogrinfo -so tl_2019_us_zcta510/tl_2019_us_zcta510.shp >> info.txt

It looks like the shapefile data is in a NAD 83 (EPSG:4269) projection, but we want it in WGS 84 (EPSG:4326) and we also want to convert it into GeoJSON at the same time.

ogr2ogr -f "GeoJSON" zcta2019.geojson tl_2019_us_zcta510/tl_2019_us_zcta510.shp -s_srs EPSG:4269 -t_srs EPSG:4326

Double-checking the output GeoJSON file, I was a little surprised to see an odd-looking CRS property (“urn:ogc:def:crs:OGC:1.3:CRS84”) after this conversion. There shouldn’t even be a ‘crs’ property according to the current GeoJSON specification, but after skimming this related GDAL issue on GitHub, I think it’s fine to ignore it for now.

Now we can count how many features we have in the GeoJSON file using the wonderful jq CLI.

jq '.features | length' zcta2019.geojson

It looks like there are 33,144 features in the feature collection, which is about what we expect for the right number of ZCTA’s.

# Map Tiles

The plan is to host the map in Mapbox Studio. But we can’t upload the GeoJSON file directly to Mapbox Studio because it’s too large (~500 MB or so) for their file transfer limit of 5 MB for a GeoJSON dataset. There’s a much higher limit of 25 GB for MBTile files though, so we’ll need to go ahead and convert the GeoJSON file into an MBTile file. For that, let’s use tippecanoe.

tippecanoe \
-zg \
-o zcta2019.mbtiles \
--drop-densest-as-needed \
--detect-shared-borders \
--extend-zooms-if-still-dropping \
--detect-shared-borders \
--simplification=10 \
--include=ZCTA5CE10 \
zcta2019.geojson -f

It took me a couple of tries to get these settings how I want them. Tippecanoe is admittedly like black magic sometimes, so I usually have to iterate through a couple of different options before the tiles start to look good. The biggest issue I usually have with large data sets is with features getting dropped at various zoom levels. These options worked really well and I’ll probably try to use them again in other projects. One thing to note is that we can selectively include only the ‘ZCTA5CE10’ property (the actual 5-digit ZCTA code) of every feature in order to save a little bit on output MBTile file size.

I stumbled upon a really helpful tool in this process: mbview. It lets you view your MBTiles locally and can really help when you’re trying to iteratively tweak tippecanoe settings and quickly see the results.

# Mapbox GL JS

Now that we have an MBTile file and everything’s looking good when we view it in mbview, it’s time to upload the vector tiles to Mapbox Studio. All we need to do is upload our MBTile file as a new tileset, copy down the tileset ID after it processes and then we’re ready to go.

For this map idea, we’ll just do some dynamic styling of vector features, namely the ZCTA boundaries. Switching over to an index.html file, we start with a simple Mapbox GL JS map template. Next, we’ll add some layers using the ZCTA boundary tile set, drawing both lines and fills (as highlights) in different colors. For the initial fill highlighting, let’s filter on just one somewhat special ZCTA (89049) using the JavaScript code below.

map.on('load', function () {
	map.addLayer({
		"id": "zcta-line",
		"type": "line",
		"source": {
			type: 'vector',
			url: 'mapbox://' + sourceZCTA
		},
		"source-layer": layerZCTA,
		'layout': {},
		'paint': {
			'line-color': '#5BC8AC',
		'line-width': 1.0
		}
	});

	map.addLayer({
		"id": "zcta-fill-highlighted",
		"type": "fill",
		"source": {
			type: 'vector',
			url: 'mapbox://' + sourceZCTA
		},
		"source-layer": layerZCTA,
		"paint": {
			"fill-color": "#E6D72A",
			"fill-opacity": 0.5
		},
		"filter": ["in", "ZCTA5CE10", "89049"]
	});
	map.fitBounds(new mapboxgl.LngLatBounds(
		[[-169.44537803261935, 12.125806331322835],[-68.4197336672313, 72.20864483430375]]
	));
})

And that’s it for layers! The magic of this map is really in the dynamic filtering of the “fill-highlighted” layer, using a similar effect to what I did in the San Juan Island parcels map.

To make that magic happen, we need to set an event listener on the text input for the 5-digit ZCTA’s. The event function cleans up the input to only allow numbers and then builds an array of all ZCTA’s that match the digits in that input. For example, if the input is only ‘1’, it returns all ZCTA’s that start with ‘1’. If the input is ‘123’, it returns all ZCTA’s that start with ‘123’, and so on. This array of matching ZCTA’s is then fed into an expressions filter that populate the “fill-highlighted” layer with only those ZCTA features in the matching array.

map.setFilter('zcta-fill-highlighted',
	["match", 
	['get', 'ZCTA5CE10'],
	matchingZCTA,
	true, false
]);

The last step in the effect is to set the bounds of the map so it nicely frames all of the ZCTA’s that have been highlighted. To do that, we can calculate the bounding box for all of the ZCTA’s features (i.e. polygons) in the matching set, and then fit the bounds of the map to this calculated bounding box, covering all ZCTA’s.

map.fitBounds(new mapboxgl.LngLatBounds(
	[[bounds.xMin, bounds.yMin], [bounds.xMax, bounds.yMax]]
));

# The Map

Check out the results here and look up some different ZCTA codes!

Overall, I’m pretty happy with the simplicity of the map, and it’s exactly what I had pictured in my head. I’ve been using it to explore my old ZIP codes to get a better idea of what those boundaries look like, helping myself get more familiar with the structure of ZIP codes and ZCTA’s across the country. It’s a bit minimalist, but it works for now!

# A Bit Sluggish

There were two big things I noticed right away though. One was that the map was pretty sluggish responding to new ZCTA inputs, hanging for a brief moment before zooming in or out to new bounds. The second issue was that the fill highlighting worked fine when you typed in ZCTA digits, but as soon as you started deleting digits, going backwards towards larger and larger sets of matching ZCTA’s (e.g. inputs of ‘123’, then ‘12’, then ‘1’), the highlighted polygons wouldn’t always include all of the ZCTA features that were expected for that input.

After thinking about it for a bit, I realized these two issues were directly related. For one, the browser was taking a performance hit having to compute bounding boxes for every ZCTA polygon in a matching set, running every time the ZCTA input changed. With only a single 5-digit input (aka only one ZCTA), it wasn’t a problem. But with 1-digit inputs (aka all ZCTA’s that start with that digit), this calculation could include on the order of a couple thousand features. It was doing a brute-force iteration through every feature, expanding the overall bounding box with the min/max points from any given feature. Not great.

The second issue is one that I remembered from the San Juan Island parcels map I made. When you set a fill highlight on a layer, Mapbox GL JS can only “see” features that are in the map viewport and therefore an unintentional subset of the features are actually filtered. So if you have features that you try to filter (e.g. to set a fill highlight) but those features are outside of the map bounds when you set that filter, those features won’t be fill highlighted. In this case, the issue pops up when you try to work backwards on the number of input digits, since the map is zoomed in close on a small set of features but the filter is trying to “see” features outside of that bound.

In order to get a new bounding box of this expanded set of features (aka matching ZCTA’s), I first had to get the feature set of matching ZCTA codes from one of the vector layers. Both the “line” and “fill-highlighted” layers contain these data, here we’re extracting features from the “line” layer.

var features = map.querySourceFeatures('zcta-line', {
	'sourceLayer': 'zcta2019',
	filter: ["match", 
		['get', 'ZCTA5CE10'],
		matchingZCTAs,
		true, false
	]
});

And here lies the problem. The querySourceFeatures method only returns features within the map viewport, so it’s a bit of a chicken and an egg problem. To get a bounding box, we need the features. But to filter on all of the features (to set the fill highlights), we need to set a bounding box so the map viewport “sees” all of the features in the first place.

One option to fix this issue would be to ensure that we always have visibility of every feature, regardless of whether or not they’re in the map viewport. We could load the original GeoJSON file that contains all of our ~33k ZCTA polygon’s and that we used to build our vector tiles. Then, we could filter the full feature collection to match the ZCTA codes to the input digits. There are two problems with this though. First, the GeoJSON file is quite large, about 1.8Gb with all of the properties included. We could minimize this file a little bit, but that’s still a huge burden to put on our client’s connection by asking it to load such a big file. Second, the client would still be taking a performance hit in having to both filter features (via regex string matching on ZCTA codes) and calculating bounding boxes over and over again.

But we have another option. What if we could pre-compute bounding boxes for every feature?

# Do The Work Ahead of Time

If we can pre-computer a bounding box for every possible input permutation and store it in a key:value dictionary, we could solve both of our problems here. First, we could eliminate both the need to computer bounding boxes on the fly and the need to do any string-match filtering. Second, we would know a head of time what the bounding box would be for any valid input of ZCTA digits, which we could use to set our map bounds to ensure that our map viewport can first “see” all of the features before they’re filtered. But the questions are (i) can we calculate all bounding boxes for all ZCTA input permutations and (ii) will the resulting file of these bounding boxes be reasonably small enough?

Luckily, our set of ZCTA input digit permutations is fairly well constrained. The ZCTA input allows all combinations of the numbers 0-9 as a 1- to 5-digit input, i.e. everything from “0” to “99999” and everything in between. The calculation of all possible permutations then is simple:

10^1 + 10^2 + 10^2 + 10^3 + 10^4 + 10^5 = 111,110

That…doesn’t seem so bad right? We just need to a way to generate 111,110 different bounding boxes to cover all of these possible permutations of ZCTA digits. There are lots of different ways to accomplish this of course. Since we’ve been on a roll using mostly all CLI tools up to this point, let’s stick to that theme (kinda) and do this all in a Bash script. Cause why not?

We can break this up into a two-part process. First, we’ll just iterate through all 1- to 5-digit permutations, filtering our main GeoJSON file (with all ~33k features) by ZCTA codes that match each permutation, and then saving those resulting feature collections to disk as a bunch of new GeoJSON files.

Since we’re talking about filtering JSON (even GeoJSON) files, let’s use jq again for this. Luckily, jq includes a handy method called ‘test’ which we can use to perform a regex operation on the ZCTA5CE10 property that we want to match with our permutation.

jq -rc '[.features[] | select(.properties.ZCTA5CE10 | test("^123";"i"))] zcta2019.geojson

The output of this should be a GeoJSON feature, or null if no matching ZCTA is found. Now instead of just matching on “123”, we need to iterate through all of those 111,110 permutations of 1- to 5-digit numbers. However, we want to avoid the I/O overhead of loading this big GeoJSON every time, so every time we run this jq filter we should instead run it on the GeoJSON file we just saved for the previous permutation. For example, if we run a jq filter to match on the digit “1” using the full GeoJSON file (zcta2019.geojson), we can save a GeoJSON file with these matching features as D1.geojson. Then, when we run a filter to match the digits “12”, we use the D1.geojson file instead which will naturally be smaller and have fewer features to filter against as compared to the starting zcta2019.geojson file. We can save a good bit of I/O time this way. Either way, this will take a bit of time. But in the end, we’ll have 111,110 GeoJSON files on disk totalling about X Gb.

Now that we have feature collections for every permutation of ZCTA digits, we need to compute bounding boxes for each collection. For that, we can use geojson-extent, yet another fantastic Mapbox CLI tool. It takes in any valid GeoJSON object and returns a bounding box for that object in the form [minLongitude, minLatitude, maxLongitude, maxLatitude].

$ geojson-extent < D55040.geojson
[-93.450745,45.41354199905688,-93.124058,45.53697499905696]

We’re going to use this tool to create a key-value JSON file that contains a bounding box for every 1- to 5-digit ZCTA permutation. We just have to iterate through all 111,110 GeoJSON files we just created in the earlier step, calling geosjon-extent on each file.

We can then build up a new JSON file zcta2019_bbox.json,adding new lines that consist of keys (digits) with values (bounding box arrays) for every ZCTA permutation, ignoring any null bounding boxes if we don’t have any features for a given permutation. This results in a JSON file that’s only ~2 MB and contains 40,121 ZCTA digit permutations with their corresponding bounding box arrays.

{
	"0":[-75.559566,17.673931,-64.565193,47.459833],
	"00":[-67.951798,17.673931,-64.565193,18.518707],
	"006":[-67.951798,17.92437,-66.252504,18.518707],
	"0060":[-67.240437,18.110649,-66.659293,18.513195],
	"00601":[-66.836366,18.111929,-66.659293,18.250344],
	"00602":[-67.240437,18.312691,-67.126434,18.417478],
	"00603":[-67.16943,18.385925,-67.056404,18.513195],
	...
}

Now we just need to load this bounding box data in our HTML file so we can use it in our Mapbox GL JS functions.

<script src="zcta2019_bbox.js"></script>

I know this is a bit of cheat for loading JSON data in a web page, but it’s quick and dirty and works for now!

But now that we have a bounding box dictionary available in our map, we can set the map bounds directly using only the input digits. We’ll add a small bit of padding around the bounding box, mostly just so the map view looks a little better and doesn’t clip any ZCTA polygon edges. And last, we’ll run our expressions filter on the “fill-highlighted” layer to highlight all ZCTA features that match our input digits.

	var b = bbox[input];
	var padding = 0.01;
	map.fitBounds(new mapboxgl.LngLatBounds(
		[[b[0]-padding, b[1]-padding],[b[2]+padding, b[3]+padding]]
	));
	map.setFilter('zcta-fill-highlighted',
		["match", 
		['get', 'ZCTA5CE10'],
		matchingZCTA,
		true, false
	]);

# A Better Map

Since we can now set the map bounds directly from the input digits, we’ve eliminated any computational overhead of calculating bounding boxes on the fly and we’ve ensured that the map viewport covers all possible features before we filter. Aside from the (not inconsequential) hit we take from having the client load an extra ~2 MB file, this is a pretty good solution and the overall feel of the map/input experience is much, much smoother. Not bad!

# Some Possible Updates

Here are some things I would update if I were to put more time into this map.

# Better Colors

With the right 10-color palette, the map would look a lot nicer if every feature (at least initially) was filled with a color according to it’s 1st digit, aka grouped by the ZIP code “national areas”.

# Dynamic Labelling

I think it’d be neat to include dynamic, progressing labelling of ZCTA digits at every polygon centroid depending on the input. With each input, every highlighted polygons are labeled with their remaining digits.

As an example:

input: 9 >> labels: 8002, 8003, 8004,etc

input: 98 >> labels: 002, 003, 004, etc

input: 980 >> labels: 02, 03, 04, etc

input: 9800 >> labels: 2, 3, 4, etc

# Minimize Bounding Box File

The bounding box file size could probably by minimized by limiting the number of digits after the decimal for every bounding box coordinate (e.g. 42.15758 becomes 42.157). This would reduce coordinate precision but for the use of these bounding boxes is pretty rough so we’re not too worried about precision, as long as a bounding box doesn’t clip any feature polygons. But it might be worth it to save a a couple hundred kB’s.

# Or Just Don’t Use A File

Of course, not using a hacky JSON file loading method is probably a better idea. We could upload our ZCTA polygons to a PostGIS instance, calculate a separate table of bounding boxes and then expose that data as a web API. We would add the overhead of a GET request for every input change to get a bounding box, but for a lot of reasons this would almost definitely be a worth while trade.

# Inspirational ZIP Code Maps

I happened to find a couple of interesting ZIP code maps after I put this map together, but I wanted to include them because they’ve served as “after-the-fact” inspiration and make me want to continue building out more features for the map.

zipdecode by Ben Fry…a very similar effect, but uses point data (possibly post office locations?) rather than polygons; includes labeling too!

ZIPScrabble by Robert Kosara…a visualization connecting all US ZIP codes in ascending order creating some interesting patterns.