Extracting addresses from text

A few days ago, I had a task to extract addresses from unstructured text, like the following:

Hey man! Joe lives here: 44 West 22nd Street, New York, NY 12345. Can you contact him now? If you need any help, call me on 12345678

The address in the middle of that text (44 West 22nd Street, New York, NY 12345) must be extracted from the sentence and returned as an address string. At first glance this does not seem hard to do with regular expressions, but once you actually try, you find the regular expression monster growing every moment while the precision of the recognized address strings stays the same. When you have semi-structured text where you can match some “labels” that mark where an address begins, a regular expression is the way to go. It’s fast and you don’t need a huge dataset of addresses to train an address chunker (more on that later); all you need is a predefined regular expression that you then tune for the cases where it fails. With truly unstructured text like the example above, there are no such labels to rely on.
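
To make the semi-structured case concrete, here is a minimal sketch of a label-anchored regular expression. The labels ("address", "lives here") and the rule that an address ends with a 5-digit ZIP code are assumptions made for this illustration, not something taken from the original project.

```python
import re

# Hypothetical label-based pattern: it only works when the text announces the
# address with a known label such as "Address:" or "lives here:".
LABELLED_ADDRESS = re.compile(
    r"(?:address|lives here)\s*:\s*"   # assumed labels that introduce an address
    r"(?P<address>.+?\b\d{5}\b)",      # lazily capture up to a 5-digit ZIP code
    re.IGNORECASE,
)


def find_labelled_address(text):
    """Return the first labelled address in text, or None if nothing matches."""
    match = LABELLED_ADDRESS.search(text)
    return match.group("address") if match else None


sample = ("Hey man! Joe lives here: 44 West 22nd Street, New York, NY 12345. "
          "Can you contact him now?")
print(find_labelled_address(sample))
```

As soon as the label is missing or the address has another shape, the pattern returns nothing, which is exactly where the tuning effort starts to grow.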

So, with regular expressions off the table, the other option is to use Natural Language Processing (NLP) to process the text and extract addresses. After spending some time getting familiar with NLP, it turned out to be the way I had been thinking about this problem in the first place: not as rules you define beforehand (a regular expression), but as a classifier that is able to predict the class of a chunk of text based on previous observations. Think of it as training on the same piece of text shown above, but marked up to say which part of the text is an address and which parts are just noise.

Using NLP for address extraction

If you are not familiar with the term NLP and have never done anything with it in Python, I suggest getting a brief introduction from the Natural Language Processing with Python book. It is a great resource for diving into the NLP field using NLTK, a toolkit that is written in Python and comes with a large number of examples.

Now, in order to train our classifier we need a dataset of tagged sentences in IOB format. This means that the sentence above must be tagged in a way that shows the classifier where an address begins, continues and ends.

Let’s see it in terms of a Python list:

[('Hey', 'O'), ('man', 'O'), ('!', 'O'), ('Joe', 'O'), ('lives', 'O'), ('here', 'O'), (':', 'O'), ('44', 'B-GPE'), ('West', 'I-GPE'), ('22nd', 'I-GPE'), ('Street', 'I-GPE'), (',', 'I-GPE'), ('New', 'I-GPE'), ('York', 'I-GPE'), (',', 'I-GPE'), ('NY', 'I-GPE'), ('12345', 'I-GPE'), ('.', 'O'), ('Can', 'O'), ('you', 'O'), ('contact', 'O'), ('him', 'O'), ('now', 'O'), ('?', 'O'), ('If', 'O'), ('you', 'O'), ('need', 'O'), ('any', 'O'), ('help', 'O'), (',', 'O'), ('call', 'O'), ('me', 'O'), ('on', 'O'), ('12345678', 'O')]

The list contains a number of tuples where the first element is a word and the second is an IOB tag:

O – Outside of the address;

B-GPE – Beginning of the address string;

I-GPE – Inside the address string;
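
The tagging scheme can be sketched in a few lines of plain Python. The helper names tag_iob and extract_chunks are made up for this illustration; the first one tags a known address span, the second one reads the address back from an IOB-tagged sentence.

```python
def tag_iob(tokens, start, end, label="GPE"):
    """Tag tokens[start:end] as one address chunk and everything else as O."""
    tagged = []
    for index, token in enumerate(tokens):
        if not (start <= index < end):
            tag = "O"
        elif index == start:
            tag = "B-" + label
        else:
            tag = "I-" + label
        tagged.append((token, tag))
    return tagged


def extract_chunks(tagged, label="GPE"):
    """Collect every B-/I- run from an IOB-tagged sentence as a list of tokens."""
    chunks, current = [], []
    for token, tag in tagged:
        if tag == "B-" + label:
            if current:
                chunks.append(current)
            current = [token]
        elif tag == "I-" + label and current:
            current.append(token)
        else:
            if current:
                chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks


tokens = ["Joe", "lives", "here", ":", "44", "West", "22nd", "Street", ".", "Call", "me"]
tagged = tag_iob(tokens, start=4, end=8)
print(tagged[3:6])
print([" ".join(chunk) for chunk in extract_chunks(tagged)])
```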

Using this dataset and a feature extraction method, we show the classifier what chunk of text we want to extract and provide a way (the feature detection method) to “map” features to IOB tags. This is a very simplified version of what is going on under the hood of the classifier; if you need more details, this post uses ClassifierBasedTagger, which is based on a Naive Bayes classifier.
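
As a rough idea of what a feature detection method can look like, here is a sketch with the (tokens, index, history) signature that NLTK's ClassifierBasedTagger expects from its feature_detector. The concrete features are assumptions chosen for this example and are not the ones from the original chunker; the sketch also assumes that tokens are plain words.

```python
def address_features(tokens, index, history):
    """Describe the token at `index` for a classifier-based tagger.

    `tokens` is the list of words in the sentence and `history` holds the IOB
    tags already predicted for the tokens before `index`.
    """
    word = tokens[index]
    prev_word = tokens[index - 1] if index > 0 else "<START>"
    next_word = tokens[index + 1] if index + 1 < len(tokens) else "<END>"
    prev_tag = history[index - 1] if index > 0 else "<START>"
    return {
        "word": word.lower(),
        "is_digit": word.isdigit(),   # house numbers and ZIP codes
        "is_title": word.istitle(),   # "West", "Street", "York"
        "is_upper": word.isupper(),   # state abbreviations such as "NY"
        "length": len(word),
        "prev_word": prev_word.lower(),
        "next_word": next_word.lower(),
        "prev_tag": prev_tag,         # lets I-GPE follow B-GPE or I-GPE
    }


features = address_features(["lives", "here", ":", "44", "West"], 3, ["O", "O", "O"])
print(features)
```

The previous tag is the feature that lets the classifier learn that an I-GPE token normally follows a B-GPE or another I-GPE token.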

Where to get dataset?

In the repository you can find an already compiled dataset of texts with US addresses. It is a pickled Python list which contains more than 9000 IOB-tagged sentences.

This list was compiled using different methods:

  • taking an existing dataset of hotel/pizza contact info web pages;
  • generating fake text with fake addresses;
  • inserting random addresses into the nltk.corpus.treebank corpus.

One idea that I haven’t tried is to scrape a web site which uses structured data and automatically retrieve the addresses marked with an address tag. That is already structured data, so the only thing left is to convert it to IOB format and add it to the dataset.


The source code of the chunker is pretty straightforward; all you need is a brief introduction to NLP concepts and you are ready to go.

But what about results? There are cases when a part of the text which is not an address is tagged as one. The problem lies in the different formats of addresses, and that’s only US addresses… One way to reduce this ambiguity and push the accuracy of address recognition closer to 100% is to use the USPS database in combination with regular expressions and a gazetteer (a dictionary of named geographic places including names, types, locations, etc.).

Build a Reddit bot in Python for similar posts suggestion.


RSPBot – source code.

In this post we are going to build a Reddit bot for related posts suggestion. Think of it as a “related” page for posts in a subreddit. The idea is pretty simple: when someone creates a new post, the RSPBot replies with a bunch of similar posts (if available).

Here is an example for r/smallbusiness:

Original post:

Getting my small business bank account today. Should I go with a small credit union or a large bank?

Related posts:

Best small business bank account?

Bank recommendations for small business.

Best Small Business Bank

Which bank do you use for you small business and why?

Recommendations for Small Business Bank

Actually it’s a good way to measure how important this topic is to a particular subreddit audience, what problem they are trying to solve and in what way.

But before moving on to another topic, I suggest taking a look at the etiquette for Reddit bots. This is a set of rules and suggestions on what to do and what not to do with your bot. Don’t ignore those rules, as a banned bot is a dead sad bot 🙂

Scraping Reddit

Let’s think about the following workflow: rspbot monitors a subreddit for new post submissions, extracts the post title, performs a search for similar posts in the same subreddit, and then replies with a list of related posts. It sounds like an easy task, but think about active subreddits where you receive a large number of new submissions and then search for each post title: there is a chance that after some time you will just flood Reddit with search API queries.

What if we scrape subreddit titles beforehand, use them as our database of post titles, and create a mechanism to search through them and then reply with a bunch of similar topics? That sounds great, as it simplifies our workflow: we can perform all post title matching on our local machine.

Scraping Reddit posts is pretty simple as there is great API documentation, but we will use a Python API wrapper, PRAW, as it encapsulates all API queries and provides an easy-to-use programming interface. More importantly, PRAW will split up a large request into multiple API calls, each separated by some delay, in order not to break the Reddit API guidelines.

Here is an excerpt from the subreddit scraper source code:

import json
import os

import praw

# `auth` holds the credentials of the app registered on Reddit, e.g.:
# auth = {'user_agent': '...', 'client_id': '...', 'client_secret': '...'}

def add_comment_tree(root_comment, all_comments):
    comment_prop = {'body': root_comment.body,
                    'ups': root_comment.ups}
    if root_comment.replies:
        comment_prop['comments'] = list()
        for reply in root_comment.replies:
            add_comment_tree(reply, comment_prop['comments'])
    # attach this comment (with its nested replies) to the parent list
    all_comments.append(comment_prop)

def get_submission_output(submission):
    return {
            'permalink': submission.permalink,
            'title': submission.title,
            'created': submission.created,
            'url': submission.url,
            'body': submission.selftext,
            'ups': submission.ups,
            'comments': list()
    }

def save_submission(output, submission_id, output_path):
    # flush to file with submission id as name
    out_file = os.path.join(output_path, submission_id + ".json")
    with open(out_file, "w") as fp:
        json.dump(output, fp)

def parse_subreddit(subreddit, output_path, include_comments=True):
    reddit = praw.Reddit(user_agent=auth['user_agent'], client_id=auth['client_id'],
                         client_secret=auth['client_secret'])
    subreddit = reddit.subreddit(subreddit)
    # NOTE: submissions() is not available in newer PRAW releases; a listing
    # such as subreddit.new(limit=None) is a possible substitute there.
    submissions = subreddit.submissions()
    for submission in submissions:
        print("Working on ... ", submission.title)
        output = get_submission_output(submission)
        if include_comments:
            for comment in submission.comments:
                add_comment_tree(comment, output['comments'])
        # flush to file with submission id as name
        save_submission(output, submission.id, output_path)

if __name__ == '__main__':
    parse_subreddit("smallbusiness", "/tmp/smallbusiness/", include_comments=False)

The parse_subreddit method iterates over all posts in the subreddit, extracts the information defined in get_submission_output (with an optional list of comments) and saves it to a JSON file named after the post ID.

If you don’t need comments, then just set include_comments to False, as it speeds up the scraper significantly.

Make sure you have created an app on Reddit and received your client ID and client secret.

Content-based recommendation engine

So, we have the subreddit posts as a list of JSON files with all the information we need; now we need to build a mechanism for searching through the files and extracting similar posts as recommendations for the user.

In rspbot we convert text documents to a matrix of token occurrences and then use kernels as measures of similarity. Basically, we use the scikit-learn class HashingVectorizer together with the linear_kernel function to get a similarity matrix.
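
To show the idea without any dependencies, here is a standard-library sketch of the same technique: hash each token into a bucket, normalise the vector, and use a dot product (a linear kernel) as the similarity score. It is not scikit-learn's implementation and not the rspbot code; the bucket count, the CRC32 hash and all function names are assumptions made for this illustration.

```python
import math
import re
import zlib

N_FEATURES = 2 ** 18  # assumed number of hash buckets for this sketch


def hash_vectorize(text, n_features=N_FEATURES):
    """Map a title to a sparse {bucket: weight} vector, L2-normalised."""
    counts = {}
    for token in re.findall(r"\w\w+", text.lower()):
        bucket = zlib.crc32(token.encode("utf-8")) % n_features
        counts[bucket] = counts.get(bucket, 0) + 1
    norm = math.sqrt(sum(value * value for value in counts.values()))
    return {bucket: value / norm for bucket, value in counts.items()} if norm else {}


def linear_similarity(a, b):
    """Dot product of two sparse vectors (cosine similarity once normalised)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b.get(bucket, 0.0) for bucket, weight in a.items())


def most_similar(query, titles, top_n=3):
    """Return up to top_n (score, title) pairs with a non-zero score."""
    query_vector = hash_vectorize(query)
    scored = [(linear_similarity(query_vector, hash_vectorize(title)), title)
              for title in titles]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(round(score, 3), title) for score, title in scored[:top_n] if score > 0]


titles = [
    "Best small business bank account?",
    "How do I file taxes as an LLC?",
    "Recommendations for Small Business Bank",
]
for score, title in most_similar("Which bank for my small business?", titles, top_n=2):
    print(score, title)
```

Because the hashing step needs no vocabulary, a new post title can be vectorised on its own and compared against the stored vectors without refitting anything.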

After we have transformed the list of post titles into a matrix of numbers, we can save this matrix as a binary file and, when running rspbot, just load this file and use it for getting similar posts. This improves performance, as we don’t need to parse the whole list of posts from JSON files to build the matrix representation; we just load it from the file and re-use it from the previous run.
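
The load-or-build pattern described above could look roughly like this. The post does not say which binary format rspbot uses, so pickle, the file name and the helper names below are assumptions, and the tiny list of lists only stands in for the real matrix.

```python
import os
import pickle
import tempfile


def load_or_build(cache_path, build):
    """Return the cached object if cache_path exists, else build and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as fp:
            return pickle.load(fp)
    result = build()
    with open(cache_path, "wb") as fp:
        pickle.dump(result, fp)
    return result


def build_matrix():
    # Stand-in for the expensive step of parsing JSON files and vectorising titles.
    return [[1.0, 0.0], [0.5, 0.5]]


cache_file = os.path.join(tempfile.mkdtemp(), "titles_matrix.pkl")
first = load_or_build(cache_file, build_matrix)   # builds and writes the file
second = load_or_build(cache_file, build_matrix)  # reads the file back
print(first == second)
```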

Make sure you have a basic understanding of how HashingVectorizer and linear_kernel work (check out the scikit-learn tutorial on basic concepts) before moving on to the module source code.

Monitoring Reddit for new posts

Now we have a set of post titles as JSON files and a mechanism for matching similar posts. The next step is to make the bot listen for new submissions, get all related posts if available, and then reply with a list of suggested topics.

With PRAW, monitoring for new submissions is as easy as writing a for loop:

subreddit = reddit.subreddit('AskReddit')
for submission in subreddit.stream.submissions():
    # do something with submission
    print(submission.title)

Let’s summarize the whole bot “life” in a list of steps:

  1. Scrape the subreddit into a list of JSON files that serves as the related post suggestion database.
  2. Convert the list of JSON files to a matrix of token occurrences for finding related posts. This step is performed only the first time the bot is started; later we just re-use the existing matrix saved as a binary file.
  3. Monitor for new submissions; for each new submission, convert it to a matrix of token occurrences and combine it with the “old” matrix, then search for related posts and, if any are found, display a list of titles with URLs.
  4. Profit! 🙂
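
The last part of step 3, turning the matches into a reply, can be sketched as a small formatting helper. The function name, the item limit and the sample URL are hypothetical; the only thing relied on is that Reddit comments are written in Markdown.

```python
def format_reply(related, max_items=5):
    """Build the comment body from (title, url) pairs; return None if empty."""
    if not related:
        return None
    lines = ["Related posts:", ""]
    for title, url in related[:max_items]:
        lines.append("* [{}]({})".format(title, url))
    return "\n".join(lines)


related = [
    # Hypothetical URL used only for this example.
    ("Best small business bank account?",
     "https://www.reddit.com/r/smallbusiness/comments/abc123/"),
]
print(format_reply(related))
```

Returning None for an empty list lets the bot skip the reply entirely when nothing similar was found.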


Top 20 Store Locator Design

Store Locator

Everyone who has tried to buy something online or to find directions to a store has noticed a small link on a store website like “Where to buy”, “Locate a ..”, “Find a store”, “Find us” and many other interpretations of the same simple concept: getting the customer to buy from your business.

This is a simple but at the same time very powerful way to engage with potential customers and get them through all the steps of buying from you, starting from finding your website, through finding a product in a specific category, to finally getting directions to the supplier location.

Maps are everywhere

I remember the time when I was trying to buy a power bank while traveling to Berlin. I typed the things I needed to buy into the search box and followed a link to a web site with a list of the products I was searching for. Then I clicked on the store locator, and there was just a plain list with street names and zip codes of store locations.

Guess what? I closed the browser tab and headed to another website with similar products, and there I saw the same plain dump of addresses, which I needed to copy-paste into a map search box to figure out which store location was the closest to me. Iterating over the next couple of websites, I found one which provided a detailed map of where to buy, how to get there from my location and even delivery to the hotel. That was the place where I bought the product, even though their website was not in the top 10 of Google search results.

What can you learn from this? Attention to small details like this can benefit your business in a significant way. You see all those stores that were ruled out even though they were first in the list? The choice fell on the one with the best customer experience.

I like maps and want to show you how well-known companies use a “store locator” to drive customers to their stores and what you can learn from them.

#1 Store Locator: Apple

Apple Store Locator

The Apple store locator page is lightweight in terms of user experience. There are three key elements on the page: a search box, a store list and a map view. The best matching store for the location you typed appears as a pop-up window on the map with a store photo, detailed street address and phone number.

The store list on the left side is labeled with letters in alphabetical order. It’s easier to remember “store C in Boston, MA” and later scroll the list to the letter C than to hold in memory: Apple CambridgeSide, 100 CambridgeSide Place, Cambridge, MA 02141.

Going to store details:

Apple Store Details

Photos of the store with detailed street address, phone number and opening hours. Going to driving directions and map:

Apple Store Directions

Now there is a full map with a branded marker and a “How to get here:” section. I like this one; it’s like your friend telling you in simple words how to get there, which exit to take if you are getting there by car or public transit, and where to park.

#2 Store Locator: Sprint

Sprint Store Locator

Sprint is one of the largest mobile network operators in the US, so it lets you filter stores by Repairs, Sales, Bill payments, etc. As you have probably noticed, markers on the map come in different shapes to differentiate between regular and repair stores.

Sprint Store Details

Clicking on a particular item in the list shows you the list of services, opening hours and appointment scheduling, and one nice feature is “Text to my phone or email”, which is very convenient when you don’t have time to write things down.

Sprint Appointment

If you are an existing customer, you can click on “Make appointment”, fill out the needed information and get into the queue without getting on the phone.

Sprint Directions

And finally, driving directions both in map view and as text, with the direction and street name emphasized in bold.

#3 Store Locator: Tesco

Tesco Store Locator

The Tesco store locator is very informative in terms of services, accessibility, facilities and opening hours (see that Open Now green text on white background). The opening hours section is divided by Cafe/Phone, which is a nice attention to detail for your customer.

#4 Store Locator: ALDI

ALDI Store Locator

ALDI shows store details in a pop-up window on the map with payments, hours, parking availability and the Weekly Ad. Actually, showing customers the deals of the week in a particular store is a good way to get their attention.

#5 Store Locator: Target


Target Store Locator

The Target store locator design is clean and light; it’s not loaded with banners and stuff. There is no map when you type an address in the search box, but there is a pretty clean-looking, tiled list of stores with phone numbers and opening hours.


Target Store Info

On the right side there is a list of stores near this one, so you can select another one which is open late at night.

Target Store Map

Here we have a detailed map of stores in different categories, which is a good way to find what you need instead of wandering around.

Target My Store

You can set the current store as My Store, so whenever you search for something on the Target website, My Store will be used as the default location where you can pick up your order.

Target My Store Search

Here you are, your store is marked as my store.

#6 Store Locator: Walmart

Walmart Store Locator

Walmart stores are clustered on the map with a detailed list on the left side. “You are here” is your location, retrieved when you allow geolocation in the internet browser.



In the store details there is another field for searching the stock of this store.

Walmart Refine Search

You can refine the search on the store locator to filter by store features, distance, etc.

Walmart Store Details

Store info shows you the weekly ad, store savings and “Make this my store”.

Walmart My Store

The My Store feature makes it fast to search for items in a particular store.

#7 Store Locator: Staples

Store Locator Staples
Staples Store Info

Here is a separate section for Store Events which lists upcoming classes or educational events.

Staples Store Directions


#8 Store Locator: SainsBurys

SainsBurys Store Locator

The store locator is a full-page map with a search box and a “Search as I move the map” option, which updates the list of stores while you are navigating the map.

SainsBurys Store List
SainsBurys Store Info

In addition to the address/phone there is the name of the store manager.

SainsBurys Stores Directions

Map and text view of directions to the store by Car/Walk/Cycle/etc.


#9 Store Locator: Lowes

Store Locator Lowes
Lowes Store Details
Lowes My Store


#10 Store Locator: T-Mobile

T-Mobile Store Locator

The T-Mobile store locator contains an in-store wait feature that gives a rough estimate of how long you will wait in the queue.

T-Mobile Store Info

Here is a fancy picture of the store, with a “Get in line” option next to the in-store wait label.

#11 Store Locator: J.Crew

J.Crew Store Locator
J.Crew Store Locator Map View

#12 Store Locator: Best Buy

Best Buy Store Locator
Best Buy Store Info

#13 Store Locator: Whole Foods

Whole Foods Store Locator
Whole Foods Store Info
Whole Foods Local Sales

Local sales, flyers or weekly ads are a nice feature for your website to drive customers to your stores to use discounts. There is even a Download Sales Flyer as PDF option, so you can save the flyer to your mobile phone and use it later.

#14 Store Locator: Sears

Sears Store Locator
Sears Store Info
Sears Store List by States


#15 Store Locator: Home Depot

Home Depot Store Locator
Home Depot Store Details


There is an Upcoming Workshop section which lists events customers can take part in.

#16 Store Locator: Starbucks

Starbucks Store Locator
Starbucks Store Details
Starbucks Store Locator Filters

#17 Store Locator: McDonald’s

McDonald’s Store Locator
McDonald’s Store List View

#18 Store Locator: FedEx

FedEx Store Locator
FedEx Filtering Services

#19 Store Locator: 7-Eleven

7-Eleven Store Locator

#20 Store Locator: MTG

Store Locator Mtg
Mtg List of Stores
Mtg Store Details


Here is a brief summary of what you can learn from the companies above:

  • Don’t throw a list of addresses at the customer; make it easy for them to find your business. Use a map with an additional list view.
  • Use branded markers. Don’t use the default Google Maps markers.
  • Differentiate markers based on your store type. Is it a service? A pick-up point? A regular store? Make it visible on the map.
  • Don’t restrict yourself to putting only store locations on markers. A store locator is just another way to get customers to buy from you. Add weekly ads and events.
  • Add a My Store feature to set your customer’s default store. That makes it the default choice not only among your own stores, but over your competitors as well.
  • Add photos of the store to make it more visible and transparent for the customer.
  • Add directions to your store on the map as well as in text form. See the Apple store locator example.
  • Add opening hours, services, phone numbers, store manager name, etc., and, more importantly, keep this information up to date.
  • Enjoy 🙂


Jenkins – Managing multiple jobs parameters

Here is the problem: you have a large number of jobs on multiple Jenkins instances with the same parameter, say the name of the branch to pull source code from. Now you want to rename your branch from “dev” to “dev_core” and update the configuration in your development workflow accordingly.

What to do? At first glance, plugins are the way to go, but wait…

While there are many Jenkins plugins out there that change parameters for multiple jobs at the same time, you don’t always have permission to install plugins, or installing them is prohibited by corporate policy due to security concerns.

So, there is another way to automate this task using the Jenkins API and Python, but make sure you have all the permissions needed to modify Jenkins job configuration.

Here is the Python script to manage parameters for multiple Jenkins jobs:

Set parameters for multiple Jenkins jobs

import jenkins
import argparse
import configparser
import os
import xml.etree.ElementTree as xml_parser

from urllib.error import HTTPError
from http.client import HTTPException

# Get user credentials for jenkins.
# Create new module: jenkins_credentials.py with the following structure:
# credentials = {"user": "user name", "password": "your password"}
try:
    from jenkins_credentials import credentials
except ImportError:
    credentials = {"user": "", "password": ""}

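CONFIG_SECTION = "configuration"
# Option holding the Jenkins URL. The original definition is missing from the
# listing, so the name "jenkins" is an assumption.
CONFIG_JENKINS = "jenkins"
CONFIG_JOBS = "jobs"
CONFIG_PARAMETERS = "parameters"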

OLD_PARAM_VALUE = "old_value"
NEW_PARAM_VALUE = "new_value"

def get_config_parser(config_file):
    config_parser = configparser.ConfigParser()
    # keep option names case-sensitive
    config_parser.optionxform = str
    config_parser.read(config_file)
    return config_parser

def get_job_config(jenkins_instance, job):
    try:
        job_config = jenkins_instance.get_job_config(job)
    except HTTPException as e:
        job_config = None
        print("Error occurred while getting job configuration for {} due to {}".format(
            job, e))
    return job_config

def apply_configuration(jenkins_instance, jobs, parameters):
    for job in jobs:
        print("### Working with ", job)
        job_config = get_job_config(jenkins_instance, job)
        if job_config is not None:
            xml_root = xml_parser.fromstring(job_config)
            for name, value in parameters:
                match = './/hudson.model.StringParameterDefinition[name="{}"]'.format(name)
                param_element = xml_root.find(match)
                value_element = None
                if param_element is not None:
                    value_element = tuple(param_element.iterfind("defaultValue"))
                if value_element:
                    value_element, = value_element
                    print("Old value for parameter {} is {}, setting to {}".format(name,
                        value_element.text, value))
                    value_element.text = value
                else:
                    print("WARNING: parameter with name {} not found in job {}".format(name, job))
            # Upload configuration changes to Jenkins
            print("Applying configuration for ", job)
            jenkins_instance.reconfig_job(job,
                xml_parser.tostring(xml_root, encoding="unicode"))

def parse_configuration(config, section):
    result = {"jenkins_instance": None,
              CONFIG_JOBS: list(),
              CONFIG_PARAMETERS: list()}
    for param, value in config.items(section):
        if param == CONFIG_JENKINS:
            try:
                result["jenkins_instance"] = \
                    jenkins.Jenkins(value, credentials["user"], credentials["password"])
            except HTTPError as e:
                result["jenkins_instance"] = None
                print("Error occurred while making connection to {} due to {}".format(
                    value, e))
        elif param == CONFIG_JOBS:
            result[CONFIG_JOBS] = value.split(",")
        else:
            # configuration parameter to change
            result[CONFIG_PARAMETERS].append((param, value))
    return result

if __name__ == "__main__":
    usage = "\nSet parameters for multiple Jenkins jobs:\n" \
            "Usage: python jenkins_job_config --config config_file_name \n"
    parser = argparse.ArgumentParser(prog="Jenkins jobs runner", usage=usage)
    parser.add_argument("--config", type=str,
                        default=os.path.join(os.getcwd(), "jenkins_jobs.config"))
    args = parser.parse_args()
    config_reader = get_config_parser(args.config)
    for server_config in config_reader.sections():
        print("@@@ Jenkins server", server_config)
        config = parse_configuration(config_reader, server_config)
        if config["jenkins_instance"] is not None:
            apply_configuration(config["jenkins_instance"], config[CONFIG_JOBS],
                                config[CONFIG_PARAMETERS])
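
The XML edit at the heart of the script can be tried without a Jenkins server. The sketch below runs the same ElementTree lookup against a hypothetical, heavily trimmed job configuration; the sample XML and the helper name set_default_value are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed job configuration used only for this sketch.
SAMPLE_CONFIG = """<project>
  <properties>
    <hudson.model.ParametersDefinitionProperty>
      <parameterDefinitions>
        <hudson.model.StringParameterDefinition>
          <name>branch</name>
          <defaultValue>dev</defaultValue>
        </hudson.model.StringParameterDefinition>
      </parameterDefinitions>
    </hudson.model.ParametersDefinitionProperty>
  </properties>
</project>"""


def set_default_value(config_xml, name, value):
    """Return config_xml with the default of string parameter `name` replaced.

    Returns None when the parameter or its defaultValue element is missing.
    """
    root = ET.fromstring(config_xml)
    match = './/hudson.model.StringParameterDefinition[name="{}"]'.format(name)
    param = root.find(match)
    if param is None:
        return None
    default = param.find("defaultValue")
    if default is None:
        return None
    default.text = value
    return ET.tostring(root, encoding="unicode")


updated = set_default_value(SAMPLE_CONFIG, "branch", "dev_core")
print(updated)
```

The predicate [name="branch"] selects the parameter definition whose name child has exactly that text, so other string parameters in the same job stay untouched.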

Configuration file template:
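
The template itself is not reproduced here, so the following is only a guess at its shape, derived from how parse_configuration reads the file: one section per Jenkins server, one option with the server URL, a comma-separated jobs option, and every other option treated as a parameter to change. The section name, URL, job names and the "jenkins" option name are all assumptions.

```python
import configparser

# Hypothetical template: the section name, URL, job names and the "jenkins"
# option name are assumptions made for this sketch.
TEMPLATE = """
[build-server]
jenkins = http://localhost:8080
jobs = job_one,job_two
branch = dev_core
"""

parser = configparser.ConfigParser()
parser.optionxform = str  # keep option names case-sensitive
parser.read_string(TEMPLATE)

for section in parser.sections():
    settings = dict(parser.items(section))
    url = settings.pop("jenkins")
    jobs = settings.pop("jobs").split(",")
    # whatever is left is a parameter name mapped to its new default value
    print(section, url, jobs, settings)
```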



Matplotlib Tutorial: Plotting most popular iPhone applications size over time

Strange things happen after you update some applications on your mobile phone. More often than not there is not enough space on the device, because some applications are greedier than others. That was the case with my iPhone after a regular update of the most popular applications: I received a message about removing something as there was no space left on the device (sigh!).

So, there was an initial question: how does application size change over time? Two years ago there were no problems with the same application, but now it eats up twice the space!

One of the best ways to understand this kind of information is to plot the data, showing the relation between two variables: application size and the date of the update release.

Getting data

Here is the problem: there is no history of application releases in iTunes, only the latest one. After some quick research, the Internet Archive seemed like a good option for viewing the history of a specific web page over time, in our case the iTunes web page for a specific application: see the Facebook application history as an example.

Unfortunately, not all history is saved for a specific web page. For example, GMail history is saved only starting from March 25, 2017, but we can still get an idea of how application size changes over a short period of time.

Getting the history of a web page is easy using the Internet Archive search endpoint: https://web.archive.org/cdx/search/cdx?. All you need to do is pass the web page URL to this endpoint, and the returned result is a list of items with the time stamp and URL of the page for that date.
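
A minimal sketch of working with that endpoint is shown below. It only builds the query URL and turns a JSON response into snapshot links, without making a network call; the function names and the sample response row are made up for this illustration, and example.com stands in for the real application page.

```python
import json
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(page_url, limit=None):
    """Build the search URL that asks for the capture history in JSON form."""
    params = {"url": page_url, "output": "json"}
    if limit is not None:
        params["limit"] = limit
    return CDX_ENDPOINT + "?" + urlencode(params)


def snapshot_urls(cdx_json):
    """Turn a JSON response (first row is the header) into (timestamp, url) pairs."""
    rows = json.loads(cdx_json)
    if not rows:
        return []
    header, records = rows[0], rows[1:]
    ts = header.index("timestamp")
    original = header.index("original")
    return [(row[ts], "https://web.archive.org/web/{}/{}".format(row[ts], row[original]))
            for row in records]


# Made-up response with a single capture, shaped like the JSON output format.
sample_response = json.dumps([
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,example)/app", "20170325000000", "http://example.com/app",
     "text/html", "200", "ABC", "1234"],
])
print(build_cdx_query("example.com/app"))
print(snapshot_urls(sample_response))
```

Each snapshot link can then be downloaded and parsed for the size and release date shown on the archived page.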

Here is the source code for getting the history of the most popular iPhone applications. The idea is simple:

  1. Iterate over a predefined list of application URLs.
  2. Get the history of the URL for each application.
  3. Parse the application size and release date.
  4. Save the data in a form that is readable by the plotter.

Plotting with Matplotlib

Matplotlib is one of the most popular 2D plotting libraries for Python. If you are familiar with MATLAB, you will have no problem using this library. Otherwise, there is a set of tutorials for everyone from beginner to advanced user.

The source code for plotting the relation between release date and size is pretty straightforward: for each application data file a new subplot is created and used for plotting the data. All subplots are located in one figure, which can be displayed on the screen or saved in different formats. In our case all application subplots are saved to the resulting image file using savefig.
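
The one-subplot-per-application layout could be sketched as follows. This is not the original plotter: the record format, the function names and the sample numbers are assumptions, and Matplotlib is imported inside plot_sizes so that the data preparation part runs even where the library is not installed.

```python
from datetime import datetime


def to_series(records):
    """Group (app, 'YYYY-MM-DD', size_mb) records into per-app sorted series."""
    series = {}
    for app, date_text, size_mb in records:
        point = (datetime.strptime(date_text, "%Y-%m-%d"), float(size_mb))
        series.setdefault(app, []).append(point)
    for points in series.values():
        points.sort()
    return series


def plot_sizes(series, out_path):
    """Draw one subplot per application and save the whole figure."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without opening a window
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(nrows=len(series), ncols=1,
                             figsize=(8, 3 * len(series)), squeeze=False)
    for ax, (app, points) in zip(axes[:, 0], sorted(series.items())):
        dates = [date for date, _ in points]
        sizes = [size for _, size in points]
        ax.plot(dates, sizes, marker="o")
        ax.set_title(app)
        ax.set_ylabel("Size, MB")
    fig.autofmt_xdate()
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)


# Made-up sample numbers, not measurements.
records = [
    ("App A", "2017-03-25", 120.0),
    ("App A", "2016-11-02", 95.5),
    ("App B", "2017-01-10", 60.0),
]
series = to_series(records)
# plot_sizes(series, "app_sizes.png")  # uncomment when Matplotlib is installed
```

Passing squeeze=False keeps axes two-dimensional even for a single application, so the same loop works for any number of subplots.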


Here is the source code for the Internet Archive parser and the Matplotlib plotter, with additional parsed data for the most popular iPhone applications.

Bootstrap vs Custom CSS

Recently I had this idea to create a small web app to manage a list of tasks with sub-tasks. Coming from the systems programming field to writing a web application and looking at a dozen web frameworks, there was the obvious question: what to use?

Using Flexbox to layout tasks elements

After some time of research and trying things out, the choice fell on the Bootstrap framework, as it’s actively maintained and the community is large in case you have trouble getting your code to work. Another option is to use pure CSS + JavaScript to style and write the logic for components similar to those included in Bootstrap out of the box, and much more.

It is tempting to go with the first option, as you see all the building blocks in Bootstrap ready to use and imagine how easy it would be to build a web app using its layout classes and components. With the other option you are left alone with a ton of CSS documentation plus your own implementation with additional third-party libraries, not to mention hacking CSS in order to overcome different kinds of issues related to specific versions of web browsers.

Some time ago I chose Bootstrap over custom development for another project. So far so good, until I hit a wall when I needed to do something that was not built into Bootstrap classes and components and not already answered on blogs/forums. Extending the existing classes was a nightmare.

This time I tried to do custom CSS and to learn how to do things the right way. Here is a brief summary of what you need to learn in order to have a base understanding before moving further into styling a complex web application with a large number of components.

  1. First of all, get an introduction to what CSS is all about. Sure, you’ve read some CSS and know something about it, but this introduction is more about organizing your knowledge of selectors, pseudo-classes and pseudo-elements, the box model and units, in order to switch from hacking stuff together until it works to understanding why it works like this and how to fit it to your needs.
  2. CSS Layout – that’s the important and the hard one. If you want to arrange elements on a webpage in a specific order, you need to understand how to do this correctly without hard-coding left/right/top/bottom values. I recommend taking a glance at floats and positioning and then investing some time in a deep understanding of how Flexbox and Grid work. These are powerful concepts to grasp, as they will decrease the time spent on figuring out why an element is not positioned correctly or not showing at all.
  3. Media Queries – you need to think about different devices; it’s not only your desktop web browser window anymore. There are different kinds of output devices, and media queries are an easy way to remove/replace/add styles based on different input parameters.


So, after you grind through all the documentation along with writing simple web applications in JavaScript + CSS, supporting different devices and fixing bugs in different web browser versions, you have an understanding of how Bootstrap or a similar web framework works: it encapsulates all this functionality and allows you to forget about it for some period of time.

Now that you understand how this works under the hood, the question of what to use resolves itself. If you need it fast, you can use a web framework; if you need a lightweight solution, no problem, just use the custom approach or combine a web framework with custom CSS and JavaScript. The important part is that you have an understanding of how it works and how to fix it on your own.

Axis Order – What number comes first longitude or latitude?

What number comes first, latitude or longitude? The answer is pretty simple: it depends on what software/library you are using.

Here is a compiled list of popular desktop/web applications, formats and programming libraries and the axis order they use.

This list was compiled after some time of using those packages and, of course, fixing bugs caused by passing a tuple of coordinates in latitude, longitude format to a package which uses (surprise!) longitude, latitude order. That was a fun night!

Latitude, Longitude     Longitude, Latitude
Leaflet                 GeoJSON
Google Maps API         KML
Apple MapKit            Shapefile
Bing Maps API           WKT
OpenStreetMap           WKB
HERE maps               OpenLayers
                        Mapbox API
                        MySQL spatial
                        Oracle spatial
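
One way to avoid that kind of bug is to convert explicitly at the boundary between two packages instead of passing bare tuples around. The helpers below are a small sketch with made-up names; they take coordinates in latitude, longitude order and emit the longitude-first form that GeoJSON and WKT expect.

```python
def latlon_to_lonlat(point):
    """Swap a (latitude, longitude) pair into (longitude, latitude) order."""
    lat, lon = point
    return (lon, lat)


def to_geojson_point(lat, lon):
    """Build a GeoJSON Point; GeoJSON positions are [longitude, latitude]."""
    return {"type": "Point", "coordinates": [lon, lat]}


def to_wkt_point(lat, lon):
    """Build a WKT point with x = longitude and y = latitude."""
    return "POINT ({} {})".format(lon, lat)


# Sample coordinates in New York, given in latitude, longitude order.
lat, lon = 40.5, -74.0
print(latlon_to_lonlat((lat, lon)))
print(to_geojson_point(lat, lon))
print(to_wkt_point(lat, lon))
```

Naming the arguments lat and lon and using keyword arguments at call sites makes a swapped order much easier to spot in review.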

In case you find something wrong or want to add a new entry to the table above, just leave a comment and I will fix/add the entry.