A few days ago, I had a task to extract addresses from unstructured text, like the following:
Hey man! Joe lives here: 44 West 22nd Street, New York, NY 12345. Can you contact him now? If you need any help, call me on 12345678
The address in that sentence (44 West 22nd Street, New York, NY 12345) must be extracted and returned as a string. At first glance this does not seem hard to do with regular expressions, but when you actually try, you find that the regular expression monster grows with every new case while the precision of the recognized addresses stays the same. Regular expressions are the way to go when you have semi-structured text where you can match some "labels" that mark where an address begins. They are fast, and you don't need a huge dataset of addresses to train an address chunker (more on that later). All you need is a predefined regular expression, which you then tune for the cases where it fails.
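To make the semi-structured case concrete, here is a minimal sketch of the label-based approach. The labels ("Address", "Location") and the address shape (house number first, two-letter state and 5-digit ZIP last) are my assumptions for illustration, not a pattern taken from a real project.

```python
import re

# Hypothetical pattern: a label such as "Address:" followed by a US-style
# address that starts with a house number and ends with "ST 12345".
LABELED_ADDRESS = re.compile(
    r"(?i:Address|Location)\s*:\s*"               # label that anchors the match
    r"(?P<address>\d+\s[^\n]*?,\s*[A-Z]{2}\s+\d{5})"
    r"(?!\d)"                                     # do not stop inside a longer number
)


def extract_labeled_address(text):
    """Return the address that follows a known label, or None."""
    match = LABELED_ADDRESS.search(text)
    return match.group("address") if match else None


record = "Name: Joe\nAddress: 44 West 22nd Street, New York, NY 12345\nPhone: 12345678"
print(extract_labeled_address(record))
```

The same function returns None for the free-form sentence from the beginning of the post, because nothing in it plays the role of a label. That is exactly the case where this approach stops scaling.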
So, with regular expressions off the table for unstructured text, the other option is to use Natural Language Processing to process the text and extract addresses. After spending some time getting familiar with NLP, I realized it matched the way I was thinking about this problem in the first place: not as a set of rules defined beforehand (a regular expression), but as a classifier that is able to predict the class of a chunk of text based on previous observations. Think of it as training on the same piece of text shown above, but with a mark saying that this part of the text is an address and the rest is just noise.
Using NLP for address extraction
If you are not familiar with NLP and have never done anything with it in Python, I suggest getting a brief introduction from the book Natural Language Processing with Python. It is a great resource for diving into the NLP field using NLTK, a toolkit written in Python, and it contains a large number of examples.
Now, in order to train our classifier we need a dataset of tagged sentences in IOB format. This means that the sentence above must be tagged in a way that shows the classifier where the address begins, continues and ends.
Let's see it as a Python list:
[('Hey', 'O'), ('man', 'O'), ('!', 'O'), ('Joe', 'O'), ('lives', 'O'), ('here', 'O'), (':', 'O'), ('44', 'B-GPE'), ('West', 'I-GPE'), ('22nd', 'I-GPE'), ('Street', 'I-GPE'), (',', 'I-GPE'), ('New', 'I-GPE'), ('York', 'I-GPE'), (',', 'I-GPE'), ('NY', 'I-GPE'), ('12345', 'I-GPE'), ('.', 'O'), ('Can', 'O'), ('you', 'O'), ('contact', 'O'), ('him', 'O'), ('now', 'O'), ('?', 'O'), ('If', 'O'), ('you', 'O'), ('need', 'O'), ('any', 'O'), ('help', 'O'), (',', 'O'), ('call', 'O'), ('me', 'O'), ('on', 'O'), ('12345678', 'O')]
The list contains tuples where the first element is a word and the second is its IOB tag:
O – Outside of an address;
B-GPE – Beginning of an address string;
I-GPE – Inside an address string.
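Reading the tags back is the mirror image of the tagging. Here is a small sketch that collects every B-GPE/I-GPE run into a string. The function names are mine, and the rule for gluing punctuation back to the previous word is a simplifying assumption.

```python
def join_tokens(tokens):
    """Join tokens with spaces, attaching commas and periods to the previous word."""
    text = ""
    for token in tokens:
        if text and token not in {",", "."}:
            text += " "
        text += token
    return text


def extract_addresses(tagged_tokens):
    """Return one string per B-GPE/I-GPE run found in an IOB-tagged sentence."""
    runs, current = [], []
    for word, tag in tagged_tokens:
        if tag == "B-GPE":
            # A new address starts; close the previous one if there is any.
            if current:
                runs.append(current)
            current = [word]
        elif tag == "I-GPE" and current:
            current.append(word)
        else:
            # "O" (or a stray I-GPE without a B-GPE) ends the current run.
            if current:
                runs.append(current)
            current = []
    if current:
        runs.append(current)
    return [join_tokens(run) for run in runs]


sample = [('lives', 'O'), ('here', 'O'), (':', 'O'), ('44', 'B-GPE'),
          ('West', 'I-GPE'), ('22nd', 'I-GPE'), ('Street', 'I-GPE'),
          (',', 'I-GPE'), ('New', 'I-GPE'), ('York', 'I-GPE'), (',', 'I-GPE'),
          ('NY', 'I-GPE'), ('12345', 'I-GPE'), ('.', 'O')]
print(extract_addresses(sample))
```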
Using this dataset and a feature extraction method, we show the classifier which chunk of text we want to extract and provide a way (the feature detection method) to "map" features to IOB tags. This is a very simplified version of what is going on under the hood of the classifier. If you need more details: this post uses NLTK's ClassifierBasedTagger, which uses a Naive Bayes classifier by default.
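To give an idea of what a feature detection method can look like, here is a minimal sketch. The (tokens, index, history) signature is the one ClassifierBasedTagger calls its feature detector with. The features themselves are my guesses at what is useful for addresses, not the ones used in the repository.

```python
import re


def address_features(tokens, index, history):
    """Describe the token at `index` as a dict of features.

    tokens  -- the words of the sentence
    index   -- position of the word being classified
    history -- IOB tags already predicted for tokens[:index]
    """
    word = tokens[index]
    prev_word = tokens[index - 1] if index > 0 else "<START>"
    next_word = tokens[index + 1] if index + 1 < len(tokens) else "<END>"
    prev_tag = history[index - 1] if index > 0 else "<START>"
    return {
        "word": word.lower(),
        "is_digit": word.isdigit(),
        "is_capitalized": word[:1].isupper(),
        "is_all_caps": word.isupper(),
        "looks_like_zip": bool(re.fullmatch(r"\d{5}", word)),
        "prev_word": prev_word.lower(),
        "next_word": next_word.lower(),
        "prev_tag": prev_tag,  # lets the classifier learn that I-GPE follows B-GPE
    }


features = address_features(["here", ":", "44", "West"], 2, ["O", "O"])
print(features)
```

A function like this can be passed as the feature_detector argument of ClassifierBasedTagger, with the list of tagged sentences as the train argument. The classifier then learns which feature values tend to go with which IOB tag.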
Where to get dataset?
In the repository you can find an already compiled dataset of texts with US addresses. It is a pickled Python list that contains more than 9000 IOB-tagged sentences.
This list was compiled using several methods: taking an existing dataset of hotel and pizza place contact-info web pages; generating fake text with fake addresses; and inserting random addresses into the nltk.corpus.treebank corpus.
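The "fake text with fake addresses" method can be sketched roughly like this. The templates, the tokenizer and the function names are all made up for illustration; the real dataset was not necessarily built this way.

```python
import random
import re

# Hypothetical sentence templates with a single {address} placeholder each.
TEMPLATES = [
    "Please send the package to {address} before Friday.",
    "Our office moved to {address} last year.",
    "{address} is where the meeting takes place.",
]


def tokenize(text):
    """Split text into words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def make_tagged_sentence(template, address):
    """Fill the template and tag the address tokens as B-GPE/I-GPE."""
    before, after = template.split("{address}")
    tagged = [(token, "O") for token in tokenize(before)]
    for i, token in enumerate(tokenize(address)):
        tagged.append((token, "B-GPE" if i == 0 else "I-GPE"))
    tagged.extend((token, "O") for token in tokenize(after))
    return tagged


def generate_dataset(addresses, size, seed=0):
    """Build `size` tagged sentences from random template/address pairs."""
    rng = random.Random(seed)
    return [
        make_tagged_sentence(rng.choice(TEMPLATES), rng.choice(addresses))
        for _ in range(size)
    ]


example = make_tagged_sentence(
    "Joe lives here: {address}. Call me",
    "44 West 22nd Street, New York, NY 12345",
)
print(example)
```

Because the position of the address is known when the sentence is generated, the tags come for free and no manual labeling is needed.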
One idea that I haven't tried is to scrape a web site that uses structured data and automatically retrieve the addresses marked with an address tag. Since this data is already structured, the only thing left is to convert it to IOB format and add it to the dataset.
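Since I haven't tried this, take the following only as a rough sketch of the conversion step. It assumes the page marks addresses with a plain HTML address element and ignores downloading the page, scripts, styles and other markup schemes.

```python
import re
from html.parser import HTMLParser


def tokenize(text):
    """Split text into words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


class AddressTagger(HTMLParser):
    """Tag text inside <address> elements as B-GPE/I-GPE and the rest as O."""

    def __init__(self):
        super().__init__()
        self.tagged = []
        self._depth = 0        # how many <address> elements we are inside
        self._started = False  # has the current address produced a token yet?

    def handle_starttag(self, tag, attrs):
        if tag == "address":
            self._depth += 1
            self._started = False

    def handle_endtag(self, tag):
        if tag == "address" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        for token in tokenize(data):
            if self._depth == 0:
                self.tagged.append((token, "O"))
            elif not self._started:
                self.tagged.append((token, "B-GPE"))
                self._started = True
            else:
                self.tagged.append((token, "I-GPE"))


def html_to_iob(html):
    """Convert an HTML fragment into a list of (word, IOB tag) tuples."""
    parser = AddressTagger()
    parser.feed(html)
    parser.close()
    return parser.tagged


page = "<div>Visit us at <address>44 West 22nd Street, New York, NY 12345</address> today.</div>"
print(html_to_iob(page))
```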
The source code of the chunker is pretty straightforward. All you need is a brief introduction to NLP concepts and you are ready to go.
But what about the results? There are cases when a part of the text that is not an address is tagged as one. The problem lies in the different formats of addresses, and these are only US addresses... One way to reduce this ambiguity and to push the accuracy of address recognition toward 100% is to use the USPS database in combination with regular expressions and a gazetteer (a dictionary of named geographic places including names, types, locations, etc.).
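As a very rough illustration of that last idea, a post-filter could reject chunker output that does not pass a couple of cheap checks. The tiny state set below stands in for a real gazetteer, the checks are my own assumptions, and nothing here talks to the USPS database.

```python
import re

# Tiny illustrative gazetteer; a real one would cover all states, cities, etc.
US_STATES = {"NY", "CA", "TX", "IL", "WA"}

# Two-letter state code followed by a ZIP or ZIP+4 code.
STATE_ZIP_RE = re.compile(r"\b([A-Z]{2})\s+(\d{5}(?:-\d{4})?)\b")


def looks_like_us_address(candidate):
    """Cheap sanity check for a string produced by the address chunker."""
    match = STATE_ZIP_RE.search(candidate)
    return (
        bool(match)
        and match.group(1) in US_STATES  # state code is in the gazetteer
        and candidate[:1].isdigit()      # assumption: starts with a house number
    )


print(looks_like_us_address("44 West 22nd Street, New York, NY 12345"))
print(looks_like_us_address("call me on 12345678"))
```

Candidates that fail the filter could be dropped or sent for manual review, depending on whether missing an address or returning a wrong one is the more costly mistake.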