Zip Codes

Zip codes seem like a good explanatory field since they represent geographical location, but there are problems with their use:

There are a large number of zip codes, which will slow modeling.
Zip codes that are missing from your training data will not be included as explanatory factors even if they should be.

Here are strategies for modeling zip codes.

Bin Zip Codes by Characteristics

Attach a score to each Zip Code that groups like Zip Codes into bins. If, for example, all wealthy zip codes are equally representative of wealth, then they should all be ranked together, whether or not a member of the target population happens to live in a particular Zip Code. Without binning, this does not happen. For example, in a fundraising dataset, a large donor who resides in Greenwich, CT causes that zip code to be ranked high, but if there are no cases living in Winnetka, IL, that zip code will be ranked low. People living in Winnetka would then get a low wealth score just because nobody in the target population has lived there so far.

Either a numeric score or a categorical grouping would be attached to each Zip Code and used in the model. If, for example, the "A-Wealthiest" group scored high because it was more highly represented in the Target than "D-Midlevel Wealth", every community in that group would receive the same high score in the model calculation.
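As a rough illustration of the idea outside ADVIZOR, the Python sketch below assigns each zip code to a wealth group from an external score, so that a zip code with nobody in the target population still shares its group's treatment. The income figures, the cutoffs, the "B", "C", and "E" labels, and the names `WEALTH_CUTOFFS` and `wealth_group` are all hypothetical; only the "A-Wealthiest" and "D-Midlevel Wealth" labels come from the text above.

```python
# Hypothetical cutoffs on an average-income score per zip code (illustrative only).
# Ordered from highest to lowest; the first cutoff a score meets decides the group.
WEALTH_CUTOFFS = [
    (250_000, "A-Wealthiest"),
    (150_000, "B-Wealthy"),          # hypothetical label
    (100_000, "C-Upper Midlevel"),   # hypothetical label
    (50_000, "D-Midlevel Wealth"),
]
DEFAULT_GROUP = "E-Other"  # hypothetical label for scores below the last cutoff


def wealth_group(avg_income):
    """Return the group label for a zip code's external wealth score."""
    for cutoff, label in WEALTH_CUTOFFS:
        if avg_income >= cutoff:
            return label
    return DEFAULT_GROUP


# Made-up scores keyed by 5-digit zip code strings.
zip_scores = {"06830": 310_000, "60093": 295_000, "12345": 62_000}
zip_groups = {z: wealth_group(score) for z, score in zip_scores.items()}
```

With these made-up scores, the two high-scoring zip codes land in the same group, so the model treats them alike even if only one of them contains members of the target population.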

Two sources we use for this data:

Forbes Wealth Zips: flags the 500 wealthiest zips in the US.
IRS stats on wealth, population density, etc. This is more comprehensive in that every zip in the US gets scored; it can also be used to flag urban (high density) vs. rural (low density) zips. This can be useful because, for example, urban low income communities are quite different from rural low income communities in many ways. You can get IRS zip code data from http://www.irs.gov/uac/SOI-Tax-Stats-Individual-Income-Tax-Statistics-ZIP-Code-Data-(SOI)

To use this data in ADVIZOR:

Load the additional table(s) into the project. If the project is set to reload data, then this new data will need to be on an accessible network drive.
Copy the relevant scores from the new table to the existing entity table in the project with the Table Link wizard, using Zip Code as the common key.
- Be sure that zip codes in both tables are either both integers or both strings so that the fields will match. You can use the Expression Builder to match types if this is needed.
- Make sure that the encoding is consistently 5 or 9 digit codes. If there is a mixture, convert them all to 5 digit using the Expression Builder.
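The two checks above (matching types and a consistent 5-digit encoding) can be sketched in Python for anyone preparing the tables before loading them. This is a minimal sketch rather than ADVIZOR's Expression Builder syntax; `normalize_zip`, the sample rows, and the field names are hypothetical. It assumes that integer zip codes may have lost their leading zeros and that 9-digit codes may or may not contain a hyphen.

```python
import re


def normalize_zip(value):
    """Return a 5-digit zip code string, or None if the value cannot be parsed."""
    text = str(value).strip()
    # 5 digits, a hyphen, and a 4-digit extension: keep the first 5 digits.
    hyphenated = re.fullmatch(r"([0-9]{5})-[0-9]{4}", text)
    if hyphenated:
        return hyphenated.group(1)
    # Up to 5 digits: restore leading zeros lost by integer storage.
    if re.fullmatch(r"[0-9]{1,5}", text):
        return text.zfill(5)
    # 6 to 9 digits: assume an unhyphenated 9-digit code, pad, keep the first 5.
    if re.fullmatch(r"[0-9]{6,9}", text):
        return text.zfill(9)[:5]
    return None


# Hypothetical score table and entity rows with mixed zip encodings.
score_by_zip = {
    normalize_zip(z): group
    for z, group in [(6830, "A-Wealthiest"), ("60093", "A-Wealthiest")]
}
entities = [
    {"id": 1, "zip": "06830-1234"},
    {"id": 2, "zip": 60093},
    {"id": 3, "zip": "unknown"},
]
for row in entities:
    # Rows whose zip cannot be parsed or matched get None instead of a score.
    row["wealth_group"] = score_by_zip.get(normalize_zip(row["zip"]))
```

Normalizing both tables to the same string form before linking avoids silent mismatches, such as the integer 6830 failing to match the string "06830".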

Group Zip Codes

Group the Zip Codes with fewer than 25 members in the Base Population into an "Insufficient Data" bin. The model can be run against the Zip Codes themselves as long as the Zip Codes with low participation are first binned this way. Conceptually, each value of a categorical field like Zip Code should have at least 25 members of the Base Population. With a population of 0 or 1, any scoring is essentially random. By the time the membership reaches 10, the non-systematic risk is reduced; by 25, it is largely eliminated. You still need to be careful with classification models that have a small target population relative to the base population. If that percentage is, say, 1%, then you should raise this number to perhaps 100 per category.

Assuming a standard forecast model, or a classification model with Target/Base in the 5%+ range, a good strategy is to bin every Zip Code with fewer than 25 members in the Target into an "Insufficient Data" bin and then run the model against the remaining Zip Codes. With this method you will learn about the statistically relevant Zip Codes, but you will not learn about the Zip Codes in the "Insufficient Data" bin. It is possible that some members of this bin should be scored high but are not, simply because there is insufficient data in their Zip Code.

You should use this approach if you believe there are unique and possibly qualitative aspects of these various communities that cannot be adequately represented by any of the scores described in the previous method.

To do this in ADVIZOR:

Determine your Base Population for the model.
Add a column to the entity table labeled "OneCount" with a value of "1" for each member that is in the base population using the Expression Builder.
Roll up the entity table on Zip Code, and Sum "OneCount". This will tell you how many base population entities are in each zip code.
Copy "OneCount" from this rollup table back to the entity table using Zip Code as the key.
- Be sure that zip codes in both tables are either both integers or both strings so that the fields will match. You can use the Expression Builder to match types if needed.
- Make sure that zip codes are consistently 5 or 9 digit codes; if there is a mix, convert them all to 5 digit using the Expression Builder.
Create a new column in the entity table: "ZipCodeForModel" using the expression "if OneCount < 25 then "Insufficient Data" else string(Zip Code)".

Note that the zip codes should be strings so that they are evaluated as discrete items, not a range of numbers.
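For readers who want to check the logic outside ADVIZOR, the steps above can be sketched in Python. This is an illustration under assumptions, not ADVIZOR's own implementation: the function name `zip_code_for_model` and the row fields `zip` and `in_base` are hypothetical, and the threshold of 25 is the one suggested in the text.

```python
from collections import Counter

MIN_MEMBERS = 25  # threshold from the text; raise it when the target is rare


def zip_code_for_model(entities, min_members=MIN_MEMBERS):
    """Add a "ZipCodeForModel" value to every entity row."""
    # Equivalent of summing "OneCount" in a rollup on Zip Code:
    # count only the entities that belong to the base population.
    one_count = Counter(e["zip"] for e in entities if e["in_base"])
    for e in entities:
        count = one_count[e["zip"]]
        # Keep zip codes as strings so they are treated as discrete categories.
        e["ZipCodeForModel"] = (
            "Insufficient Data" if count < min_members else str(e["zip"])
        )
    return entities


# Made-up data: 30 base-population members in one zip code, 3 in another.
entities = [{"zip": "06830", "in_base": True} for _ in range(30)]
entities += [{"zip": "60093", "in_base": True} for _ in range(3)]
zip_code_for_model(entities)
```

In this made-up data, the zip code with 30 members keeps its own value, while the zip code with 3 members falls into the "Insufficient Data" bin.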

