sterlingdesign / address-normalizer
Classes used to parse and normalize physical and mailing addresses
pkg:composer/sterlingdesign/address-normalizer
Requires
- php: >=8.1
- ext-dom: *
- ext-libxml: *
- ext-mbstring: *
- sterlingdesign/dom: ^2.0.1
README
This is a library written in PHP that can be used to read physical and mailing addresses and normalize them.
Motivation
It turns out that automating work with addresses can be somewhat complicated. By way of example, here is a list of addresses:
- 70 South Street, Unit A, Fort Wayne, IN, 46809-0200
- 70 South St., Unit A, Ft Wayne, IN, 46809-0200
- 70 South St, Suite A, Fort Wayne, IN, 46809
- 70 S Street, Unit A, Fort Wayne, IN, 46809
- 70A S St, Ft Wayne, IN, 46809-0200
In reality, most people can look at these and understand that they are all the same physical location. However, let's say you have a list of 500,000 addresses, and you need to automate the comparison process. Not so easy, especially considering that the above example only illustrates a few of the most common variations in describing a physical address.
The key to being able to compare addresses using software is to be able to re-write the address into some sort of standardized format: a process of normalization.
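To make the idea concrete, here is a minimal sketch of normalization by word substitution. It is not this library's API: the function name `toyNormalize` and the four-entry `ABBREVIATIONS` table are hypothetical, chosen only to cover the sample addresses above.

```php
<?php
declare(strict_types=1);

// Hypothetical, deliberately tiny abbreviation table (an assumption for
// illustration); a real table would be far larger.
const ABBREVIATIONS = [
    'STREET' => 'ST',
    'FORT'   => 'FT',
    'SOUTH'  => 'S',
    'SUITE'  => 'STE',
];

/**
 * Toy normalizer: upper-case the input, turn periods and commas into
 * spaces, split on whitespace, then replace whole words that appear in
 * the abbreviation table.
 */
function toyNormalize(string $address): string
{
    $upper = mb_strtoupper($address);
    $clean = preg_replace('/[.,]/', ' ', $upper) ?? $upper;
    $words = preg_split('/\s+/', trim($clean)) ?: [];

    $mapped = array_map(
        static fn (string $word): string => ABBREVIATIONS[$word] ?? $word,
        $words
    );

    return implode(' ', $mapped);
}

$a = toyNormalize('70 South Street, Unit A, Fort Wayne, IN, 46809-0200');
$b = toyNormalize('70 South St., Unit A, Ft Wayne, IN, 46809-0200');
$c = toyNormalize('70 South St, Suite A, Fort Wayne, IN, 46809');

var_dump($a === $b); // bool(true)
var_dump($a === $c); // bool(false)
```

The first two spellings collapse to the same string, but the third does not: "Suite" versus "Unit" and a missing ZIP+4 extension cannot be reconciled by substitution alone. That gap is why the address has to be parsed into components before it can be compared reliably.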
Research and Theory
The topic has been well developed by several organizations, notably the United States Postal Service (USPS). Obviously, the software they use to automate machine reading and bar code printing has been well funded over the years. When I first started looking at the problem, I expected to find good guidance from USPS sources. There is definitely some good material on their portal, much of it geared towards helping consumers write addresses that can be read by machine.
However, implementations of software seem to be mostly proprietary due to the fact that mailing and addressing services are a lucrative business for some.
The best theory I came across for implementing a normalization process was in an online posting at https://endswithsaurus.wordpress.com/2009/07/23/a-lesson-in-address-storage/, which says:
During the course of my research, I’ve found that the most flexible generic format for address data is this:
- Street Number [Int]
- Street Number Suffix [VarChar] – A~Z 1/3 1/2 2/3 3/4 etc
- Street Name [VarChar]
- Street Type [VarChar] – Street, Road, Place etc. (I’ve found 262 unique street types in the English-speaking world so far… and still finding them)
- Street Direction [VarChar] – N, NE, E, SE, S, SW, W, NW
- Address Type [VarChar] – For example Apartment, Suite, Office, Floor, Building etc.
- Address Type Identifier [VarChar] – For instance the apartment number, suite, office or floor number or building identifier.
- Minor Municipality (Village/Hamlet) [VarChar]
- Major Municipality (Town/City) [VarChar]
- Governing District (Province, State, County) [VarChar]
- Postal Area (Postal Code/Zip/Postcode) [VarChar]
- Country [VarChar]
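The quoted field list maps naturally onto a small immutable value object. The sketch below is one way to express it in PHP 8.1; the class name `AddressParts`, its property names, and the `equals` helper are hypothetical and are not taken from this library. The sample decomposition is done by hand and is only one plausible reading of the first address in the Motivation list.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical value object mirroring the quoted field list.
 * Names are illustrative only, not this library's API.
 */
final class AddressParts
{
    public function __construct(
        public readonly ?int $streetNumber = null,
        public readonly ?string $streetNumberSuffix = null,    // A~Z, 1/2, 3/4, ...
        public readonly ?string $streetName = null,
        public readonly ?string $streetType = null,            // Street, Road, Place, ...
        public readonly ?string $streetDirection = null,       // N, NE, E, SE, S, SW, W, NW
        public readonly ?string $addressType = null,           // Apartment, Suite, Floor, ...
        public readonly ?string $addressTypeIdentifier = null, // apartment or suite number
        public readonly ?string $minorMunicipality = null,     // village / hamlet
        public readonly ?string $majorMunicipality = null,     // town / city
        public readonly ?string $governingDistrict = null,     // province / state / county
        public readonly ?string $postalArea = null,            // postal code / ZIP
        public readonly ?string $country = null,
    ) {
    }

    /** Two addresses are considered equal when every component is identical. */
    public function equals(self $other): bool
    {
        return get_object_vars($this) === get_object_vars($other);
    }
}

// "70 South Street, Unit A, Fort Wayne, IN, 46809-0200", split by hand.
$first = new AddressParts(
    streetNumber: 70,
    streetName: 'SOUTH',
    streetType: 'ST',
    addressType: 'UNIT',
    addressTypeIdentifier: 'A',
    majorMunicipality: 'FORT WAYNE',
    governingDistrict: 'IN',
    postalArea: '46809-0200',
);

// "70 South St., Unit A, Ft Wayne, IN, 46809-0200" should reduce to the same parts.
$second = new AddressParts(
    streetNumber: 70,
    streetName: 'SOUTH',
    streetType: 'ST',
    addressType: 'UNIT',
    addressTypeIdentifier: 'A',
    majorMunicipality: 'FORT WAYNE',
    governingDistrict: 'IN',
    postalArea: '46809-0200',
);

var_dump($first->equals($second)); // bool(true)
```

Once two address strings are reduced to the same set of components, comparing them becomes a field-by-field equality check rather than a string comparison.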
After doing my own testing and homework, I found that the structure proposed by the above author needs an additional component. The article seems to describe physical addresses only. We also need to deal with delivery addresses.
So it seems there are 4 types of addresses:
- PHYSICAL - As above
- POBOX - The only acceptable format for this address line is "PO BOX ##"
- RR - The address line must contain RR ## BOX ## (SECONDARY_DESCRIPTOR SECONDARY_ID)*
- MILITARY - These seem to be formatted in any old way with little consistency. The USPS has formatting requirements that are rarely adhered to.
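The four address types can be told apart with a few pattern checks on the delivery line. The following is a rough sketch only: the enum `AddressKind` and the function `classifyAddressLine` are hypothetical names, the input is assumed to be already cleaned of punctuation, and the military check is a loose heuristic based on the common APO/FPO/DPO and PSC/CMR markers, which is an assumption and not something this library is known to do.

```php
<?php
declare(strict_types=1);

// Hypothetical enum for the four address types listed above.
enum AddressKind
{
    case Physical;
    case PoBox;
    case RuralRoute;
    case Military;
}

/**
 * Classify a single, already-cleaned delivery line.
 * Anything that matches no special pattern is treated as a physical address.
 */
function classifyAddressLine(string $line): AddressKind
{
    $line = mb_strtoupper(trim($line));

    // POBOX: exactly "PO BOX ##".
    if (preg_match('/^PO BOX \d+$/', $line) === 1) {
        return AddressKind::PoBox;
    }

    // RR: "RR ## BOX ##" followed by zero or more
    // (SECONDARY_DESCRIPTOR SECONDARY_ID) pairs.
    if (preg_match('/^RR \d+ BOX \d+( [A-Z]+ [A-Z0-9]+)*$/', $line) === 1) {
        return AddressKind::RuralRoute;
    }

    // MILITARY (assumed heuristic): an APO/FPO/DPO marker anywhere,
    // or a line starting with PSC or CMR followed by a number.
    if (preg_match('/\b(APO|FPO|DPO)\b/', $line) === 1
        || preg_match('/^(PSC|CMR) \d+/', $line) === 1
    ) {
        return AddressKind::Military;
    }

    return AddressKind::Physical;
}

var_dump(classifyAddressLine('PO Box 123') === AddressKind::PoBox);              // bool(true)
var_dump(classifyAddressLine('RR 2 Box 152 Apt 4') === AddressKind::RuralRoute); // bool(true)
var_dump(classifyAddressLine('70 South Street') === AddressKind::Physical);      // bool(true)
```

The checks run from most to least specific, so a line falls through to PHYSICAL only after the stricter formats have been ruled out. Because military addresses are written so inconsistently in practice, a real classifier would likely need more than this one heuristic.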
This Implementation
During development and testing, most of the data was for US addresses. Many international addresses have characteristics, such as containing multiple minor municipalities, that are not fully handled.
The current 1.x version I've made available here works about 99% of the time for my needs. It could stand more work, especially code clean-up and stabilizing the interface.