Server-side Markers clustering using PostgreSQL + PostGIS and NodeJS
In this post I will discuss server-side clustering of points using a combination of PostgreSQL and the PostGIS extension. Then I will present a NodeJS application that simply pulls data from the database using a pair of latitude/longitude bounds describing the area of interest.
Why cluster places?
Suppose you are working on a website that allows anyone to pay a small fee to plant a tree in a chosen location. For each tree there is a pin on the map with a tree marker and the name of the generous person who skipped their morning coffee for the sake of restoring forests around the world. Now, imagine for a moment that your website has become popular and you have 10,000 trees planted in some small area; your map now looks like this:
Ughhh, where on earth are those trees located?
Wouldn’t it be better to gather the trees into some kind of bubble and put the number of trees on it, so a person could spot areas with a high density of trees and decide whether that’s a good place to plant another one?
yes, that’s much better
Installing and setting up prerequisite software
Here is the list of what needs to be installed beforehand:
When the installation of PostgreSQL + PostGIS is done, we need to create a new database and enable the PostGIS extension. Fire up psql in a terminal and type the following commands:
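The exact commands from the original post were not preserved; a minimal equivalent looks like this (the database name treemap is an assumption):

```sql
-- Create the database and enable PostGIS inside it.
-- The database name "treemap" is an assumption.
CREATE DATABASE treemap;
\c treemap
CREATE EXTENSION postgis;
```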
In our case a pretty simple clustering approach will be used:
Take any point that is not part of an existing cluster.
Find all points within a predefined radius that are not part of an existing cluster.
Mark the found points as belonging to that particular cluster.
Repeat until there are no free points left without a cluster.
PostgreSQL + PostGIS implementation
Before we start writing SQL queries, we need to understand the basic idea behind clustering points. We will implement a naive approach of finding points within a specific radius, where the radius is defined per zoom level. So, at zoom level two the radius will be 100 km, and with each zoom-in step the radius will be approximately halved. This is a slow solution, since for each zoom level we need to gather all points again with a different radius; a more optimized solution would be to cluster at the largest zoom level first and then use those clustered points as input for clustering at the next zoom level up, and so on.
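The zoom-to-radius rule above can be sketched as a small helper (the 100 km base at zoom level two comes from the text; exact halving at every step is an interpretation of "approximately divided by two"):

```javascript
// Cluster-gathering radius in meters for a given zoom level:
// 100 km at zoom level two, halved with each deeper zoom level.
function clusterRadius(zoom) {
  const baseRadius = 100000; // meters at zoom level 2
  return baseRadius / Math.pow(2, zoom - 2);
}
```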
Now we want to create a table named places, with each place’s latitude and longitude and a cluster id for each zoom level. The cluster id is an additional column identifying the cluster a point belongs to. For each zoom level an additional table will be created, named clusters_zoom followed by the zoom level.
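The original schema was not preserved; here is a sketch of what it could look like for zoom level two (table and column names are assumptions):

```sql
-- Places to cluster; one cluster id column per zoom level.
CREATE TABLE places (
  id             serial PRIMARY KEY,
  location       geography(Point, 4326),
  cluster_zoom_2 integer  -- repeated per zoom level: cluster_zoom_3, ...
);

-- One table per zoom level holding the resulting clusters.
CREATE TABLE clusters_zoom_2 (
  id           serial PRIMARY KEY,
  centroid     geography(Point, 4326),
  points_count integer
);
```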
First of all, we need sample data with places we want to cluster. The easiest way is to generate random points within a predefined bounding box. In our example the bounding box is located in San Francisco:
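One way to generate such sample data, assuming a places table with a geography location column and a rough San Francisco bounding box (the exact coordinates and row count are assumptions):

```sql
-- 10,000 random points inside an approximate San Francisco bounding box.
INSERT INTO places (location)
SELECT ST_SetSRID(
         ST_MakePoint(-122.51 + random() * 0.13,  -- lng: -122.51 .. -122.38
                        37.71 + random() * 0.10), -- lat:   37.71 ..  37.81
         4326)::geography
FROM generate_series(1, 10000);
```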
Sample data generated in bounding box.
Here is the main SQL function, which is executed for each zoom level and gathers points into clusters with the predefined radius. In this example the function is hard-coded for zoom level two; it will be automated to run for each zoom level in the NodeJS module.
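The original function body was not preserved; here is a sketch of what a make_cluster function for zoom level two could look like, following the steps listed earlier (table and column names are assumptions):

```sql
CREATE OR REPLACE FUNCTION make_cluster() RETURNS void AS $$
DECLARE
  seed RECORD;
  new_cluster_id integer;
BEGIN
  -- 1. Take any point that is not yet part of a cluster.
  SELECT id, location INTO seed
  FROM places
  WHERE cluster_zoom_2 IS NULL
  LIMIT 1;

  IF NOT FOUND THEN
    RETURN;
  END IF;

  -- 2. Create the cluster row and remember its id.
  INSERT INTO clusters_zoom_2 (centroid, points_count)
  VALUES (seed.location, 0)
  RETURNING id INTO new_cluster_id;

  -- 3. Mark all free points within the radius (100 km at zoom two) as
  --    belonging to the new cluster. ST_DWithin on geography values
  --    measures the distance in meters.
  UPDATE places
  SET cluster_zoom_2 = new_cluster_id
  WHERE cluster_zoom_2 IS NULL
    AND ST_DWithin(location, seed.location, 100000);

  -- 4. Store the cluster centroid and point count.
  UPDATE clusters_zoom_2 c
  SET centroid = sub.center,
      points_count = sub.cnt
  FROM (SELECT ST_Centroid(ST_Collect(location::geometry))::geography AS center,
               count(*) AS cnt
        FROM places
        WHERE cluster_zoom_2 = new_cluster_id) sub
  WHERE c.id = new_cluster_id;
END;
$$ LANGUAGE plpgsql;
```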
The make_cluster function is pretty straightforward:
All points are gathered using the ST_DWithin function, which returns all geometries within the specified distance of one another. Pay attention to the fact that ST_DWithin takes geometry or geography values as its first two parameters and a distance as the last one. In our case we use geography, as we need to measure the distance in meters.
After a new cluster is created, the clusters_zoom and places tables are updated with the new cluster id.
Now we run this function to fill the clusters_zoom table while there are still points without a cluster assigned to them:
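A sketch of that driver loop, assuming the make_cluster function and column names from the previous snippets:

```sql
-- Keep creating clusters until every place has one.
DO $$
BEGIN
  WHILE EXISTS (SELECT 1 FROM places WHERE cluster_zoom_2 IS NULL) LOOP
    PERFORM make_cluster();
  END LOOP;
END
$$;
```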
Let’s assume that we have a table with every tree’s latitude, longitude, and some additional user-supplied data. What we want is to start the clustering process in the background, as it takes some time to finish, and then notify the user that all data was processed successfully.
To do this we can create a separate module called clusterapp.js, which will run as a background process with a database connection string as its input parameter.
Before running clusterapp.js the following Node modules need to be installed:
The source code is pretty simple and automates the SQL queries from the steps above:
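The original source was not preserved; a sketch of what clusterapp.js could look like, assuming the pg module (npm install pg), a make_cluster(zoom) variant parameterized by zoom level, and the column naming used in the earlier snippets:

```javascript
// clusterapp.js — run as: node clusterapp.js <connection string>
const { Client } = require('pg');

const MIN_ZOOM = 2;
const MAX_ZOOM = 18; // the zoom range is an assumption

async function clusterAllZoomLevels(connectionString) {
  const client = new Client({ connectionString });
  await client.connect();
  try {
    for (let zoom = MIN_ZOOM; zoom <= MAX_ZOOM; zoom++) {
      // Keep clustering until no place at this zoom level is left
      // without a cluster id.
      for (;;) {
        const free = await client.query(
          `SELECT count(*) AS cnt FROM places
           WHERE cluster_zoom_${zoom} IS NULL`
        );
        if (parseInt(free.rows[0].cnt, 10) === 0) break;
        await client.query('SELECT make_cluster($1)', [zoom]);
      }
    }
  } finally {
    await client.end();
  }
}

clusterAllZoomLevels(process.argv[2])
  .then(() => console.log('clustering finished'))
  .catch(console.error);
```

Once the promise resolves, the user can be notified that all data was processed.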
After the clustering process is done, we need to think about how to use it in our application. We have a backend, a PostgreSQL database with a bunch of tables of clustered points, and we need an API layer to retrieve clusters for a selected area at a specific zoom level.
To gather all points in a specific area we can use the PostGIS && bounding-box intersection operator on the centroids from the clustered-points table and the actual bounding box of the area of interest, created with the ST_MakeEnvelope function.
Here is the full query:
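The query itself was not preserved; with the schema assumed earlier it could look like this for zoom level two (the coordinates are example values):

```sql
-- All clusters whose centroid intersects the requested bounding box
-- (xmin, ymin, xmax, ymax in WGS84 lng/lat order).
SELECT id,
       points_count,
       ST_X(centroid::geometry) AS lng,
       ST_Y(centroid::geometry) AS lat
FROM clusters_zoom_2
WHERE centroid::geometry &&
      ST_MakeEnvelope(-122.51, 37.71, -122.38, 37.81, 4326);
```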
In the NodeJS implementation we create an api.js module with two endpoints:
clusters: returns all clusters at the zoom level and inside the bounding box passed as parameters.
places: returns the original, non-clustered data at a specific zoom level and bounding box.
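The original api.js was not preserved; here is a sketch of the two endpoints, assuming Express and pg (npm install express pg) on top of the schema used in the earlier snippets. The routes and query-parameter names are assumptions:

```javascript
// api.js — run as: node api.js <connection string>
const express = require('express');
const { Pool } = require('pg');

const pool = new Pool({ connectionString: process.argv[2] });
const app = express();

// Turn rows with a GeoJSON geometry column into a FeatureCollection.
function toFeatureCollection(rows) {
  return {
    type: 'FeatureCollection',
    features: rows.map(({ geom, ...properties }) => ({
      type: 'Feature',
      geometry: JSON.parse(geom),
      properties,
    })),
  };
}

// GET /clusters?zoom=2&west=..&south=..&east=..&north=..
app.get('/clusters', async (req, res) => {
  const { zoom, west, south, east, north } = req.query;
  const result = await pool.query(
    `SELECT id, points_count, ST_AsGeoJSON(centroid::geometry) AS geom
     FROM clusters_zoom_${parseInt(zoom, 10)}
     WHERE centroid::geometry && ST_MakeEnvelope($1, $2, $3, $4, 4326)`,
    [west, south, east, north]
  );
  res.json(toFeatureCollection(result.rows));
});

// GET /places?west=..&south=..&east=..&north=.. — original, unclustered points.
app.get('/places', async (req, res) => {
  const { west, south, east, north } = req.query;
  const result = await pool.query(
    `SELECT id, ST_AsGeoJSON(location::geometry) AS geom
     FROM places
     WHERE location::geometry && ST_MakeEnvelope($1, $2, $3, $4, 4326)`,
    [west, south, east, north]
  );
  res.json(toFeatureCollection(result.rows));
});

app.listen(3000, () => console.log('API listening on :3000'));
```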
Now you can run the api.js module and start testing queries in your browser. The response for each query will be GeoJSON with the data pulled from the database.
How to use in real application?
It doesn’t matter which map provider you are using, as you have server-side clustering with an API layer on top of it.
Here is a basic example using Google Maps, which listens to the bounds_changed event and gets all clusters inside the current bounding box:
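A minimal sketch of the map side, assuming a map object already created with the Google Maps JavaScript API and a /clusters endpoint shaped like the one described above (the route and parameter names are assumptions):

```javascript
// Re-fetch clusters for the visible area whenever the map bounds change.
map.addListener('bounds_changed', () => {
  const bounds = map.getBounds();
  if (!bounds) return;
  const sw = bounds.getSouthWest();
  const ne = bounds.getNorthEast();

  const url =
    `/clusters?zoom=${map.getZoom()}` +
    `&west=${sw.lng()}&south=${sw.lat()}` +
    `&east=${ne.lng()}&north=${ne.lat()}`;

  fetch(url)
    .then((resp) => resp.json())
    .then((geojson) => {
      // One marker per cluster, labelled with its point count.
      // (A real app would also remove the previous markers here.)
      geojson.features.forEach((feature) => {
        const [lng, lat] = feature.geometry.coordinates;
        new google.maps.Marker({
          map,
          position: { lat, lng },
          label: String(feature.properties.points_count),
        });
      });
    });
});
```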
As you probably noticed in the api.js module, when getting clusters from PostgreSQL the ST_Buffer function is used, which extends the bounding box by some predefined distance. This is useful when you are fetching clusters on the bounds-changed event: it’s better to make one query over an extended visible area and then use the cached result while the user pans the map than to flood your server with a query for every bounds change.
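For illustration, here is the earlier cluster query with the envelope padded via ST_Buffer (the 10 km padding distance is an assumption):

```sql
SELECT id, points_count
FROM clusters_zoom_2
WHERE centroid::geometry &&
      ST_Buffer(
        ST_MakeEnvelope(-122.51, 37.71, -122.38, 37.81, 4326)::geography,
        10000  -- padding in meters around the visible area
      )::geometry;
```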
Another thing that can speed up the clustering process is hierarchical clustering. Instead of running the same clustering function on all places in your database and just changing the radius based on the zoom level, we can run the function on previously clustered data. So, for example, the input points for the clustering function at the ninth zoom level would be the clustered data from the tenth zoom level, and so on.