When working on adding location-based search to a DynamoDB database project, I discovered there's more than one way to handle geospatial queries. The general principle is that geospatial queries can be managed by storing a geohash as an attribute for each item in the database. A geohash is a string that has been created based on longitude and latitude. The length of the geohash determines how wide area it covers - essentially the whole earth is divided into boxes and you can choose the size of the boxes you want to work with. I will not go into a deeper explanation of it here as there are plenty of good resources online that explain in detail the way geohashes work.
Looking for a way to implement this, made me first look into the DynamoDB Geohash library. I wanted to understand whether this package would work with my existing schema where I was already planning to store all of the items in a single DynamoDB table.
With dynamodb-geo-npm
I started by creating a sample project where I would create a new table using the library to see how exactly it works. The table that was created had this schema:
The geohash attribute contains the full geohash, whereas the hashKey attribute contains the first 5 characters of the geohash to ensure the items are evenly distributed across partitions.
In addition, the library created a local secondary index and that is where the actual search based on the geohash will happen:
Creating queries using the library is very simple, you simply define the coordinates of your centre point and the radius in meters around it where you want your search locations to be located.
The limitations is that the library doesn’t support composite primary keys. So let’s say you have different type of items, such as schools and libraries. Now, if you wanted to search only for the libraries within certain location, you would ideally create some kind of composite key that allows you to separate these type of items as required by your access patterns. When you are not able to have that level of separation, you end up getting both types of items in the search results. So in order for this to work, you would need to have a separate DynamoDB table for schools and another one for libraries or alternatively apply filters in the application layer after retrieving the results.
Without dynamodb-geo-npm
Based on the limitations, I wanted to try how to implement this in a way that would work with the existing schema where all items are stored within one table following the principle of single-table design. This required a couple of steps that had been handled behind the scenes when using the library.
The first step was to use the ngeohash
library to create a geohash for each item that is saved in the database. For that you need the coordinates of the location and the library will then create a geohash from these coordinates. This geohash will be a string containing letters and numbers. Previously when testing with dynamodb-geo, the geohashes were actually number-only hashes as the library internally converted the format from the alphanumeric hash. I then saved the geohash in the existing database and created a global secondary index that would allow me to search based on the item type (library or school etc) and the geohash:
It is important to have the geohash as the sort key, so that you are able to later use the ‘begins with’ type queries.
In order to query from this tale, the first step is to determine a ‘bounding box’ around the centre point. This means that you have a geopoint and then determine a radius such as 2000m around it. Using the geolib
library it was easy to get a bounding box coordinates around a centre point:
As the bounding box is a ‘box’ it will end up having some coordinates that are actually outside of the defined radius as described above. These locations can be filtered out later if only the locations within a certain distance are required.
The next step was to determine what the geohashes based on these coordinates are. This can be done using the ngeohash library. The library will simply return you an array of geohashes based on the defined coordinates and the length of geohash that you want. The length of geohash determines its precision and the size of area it covers. So that is something that you will have to decide based on your exact scenario. For example for the 2000m radius a 6 character geohash might be the best option as using only 5 characters would give you too wide cells whereas using 7 characters would be too fine-grained and end up requiring more queries than necessary. The concept might seem confusing at first but testing with a few different variations will make it easier to see what the options are and what makes sense. Based on the coordinates and length of desired geohash, the ngeohash library will return you an array of geohash cells:
These cells are now again wider than the actual bounding box coordinates were, just based on the way these cells are calculated. So you will end up having again more results that don’t belong inside the exact search radius. So again, these need to be filtered out in the end if you require results that are only within the defined radius.
Now when you have a list of hashes, you would need to make a query for each of these hashes. These can be done using the GSI. As the GSI sort key was created using the full geohash, now when searching for items we need to use the ‘begins with’ option as we are searching items within a certain radius rather than the exact geolocation as geohashes have a hierarchical property where prefixes represent larger areas:
It is worth mentioning that although this system will require several database queries, the same would apply if using the dynamodb-geo library as it would also do several database queries behind the scenes (according to the documentation typically 8 queries are executed per radius search).
As a result, you would have a list of library locations that are located within those geohashes. Each item would have it’s coordinates and based on those coordinates it is possible to do further organising such as finding out which location is closest to the exact coordinates of the search point and calculate what the distance is by using the haversine formula. And as mentioned before, do to the way the search was done, there would be some items in the results that don’t fit in within the exact radius, so those would need to be filtered out if that’s your requirement.
Conclusion
Creating a geohash based search without using the dynamodb-geo library is not a very complicated thing to do and the steps are quite straightforward. But of course deciding which option to go with is a more complicated decision and there are many trade-offs to consider. The library would be optimised and if you are working with your own schema, the latency and cost will depend on the way you have designed your database and access patterns. The main aspects impacting the performance of your design would the geohash precision (length), number of queries that are needed and the post-processing overhead depending how much you need to further process the items after the database queries have been completed. So as usual with DynamoDB, this would require careful considerations and monitoring.