logo
Spatial Data indexing performance with PostGIS
Leveraging indexing to optimize query performance.
13 May 2024
github

Introduction


Taxi trajectory data is an open-source dataset available on Kaggle, uploaded by "Chris Cross" in 2018. This dataset contains information about taxi movements in Porto, Portugal. The dataset consists of 9 columns with a total of 1,710,670 rows. There are 5,901 rows with empty trajectories considered invalid. Additionally, 81 duplicated trip IDs will be removed, and the first record of each duplication will be deleted.


Data Preprocessing


There are a few things that needs to be addressed before we can perform query and analyze the data. Firstly, the trajectory is stored in a collection of array instead of linestring that can be processed by PostGIS. We need to convert this data into geography data type. Another problem is that missing values are indicated by empty string and empty arrays of trajectory so we need to replace it with NULL and delete data that does not have any trajectory. All of these problem can be addressed with a few queries before loading the data to the target table. We can use a combination of do statement, conditional statement, and regular expression.


To augment the dataset, several popular hotels have been incorporated, along with their respective numbers of reviews and ratings. Additional column of ’popularity’ also have been calculated based on how many taxi have visited that hotel. We can infer this information by querying trajectory’s end point if it is within 5-meter radius of the hotel’s location.


A cluster of destination also have been introduced to identify common destination to travel and get a better insight of our data. Pyspark has been chosen as the framework for working this dataset, given its scalability and computational efficiency. The data is exported to a CSV file and loaded to Spark Data Frame. Following predictions made using K-Means, the Spark DataFrame will be exported as a CSV file and the data will be reloaded into PostgreSQL.


Performance Testing


Exploration of index utilization for optimizing spatial query performance in PostgreSQL with PostGIS. Testing involved R-tree, Quad-tree, and no-index scenarios.


Query 1


SELECT count(*) FROM public.taxi_quad
    WHERE ST_DWithin(
    ST_GeogFromText('SRID=4326;POINT(-8.68 41.25)'),start_point, 3150)

The first task addresses the question of how many people are traveling from the airport. We can answer this question by locating the Porto’s airport and querying whether the starting point of the taxi trajectory starts from within 3km, 3.1km, or 3.15km. Each radius will be tested for performance testing. For this problem, we are going to use ST DWithin function from PostGIS to determine whether a certain point is within those radii. From graphs below we can see that it is clear that R-tree performs better consistently throughout query tests. Whereas quad-tree does not seem to provide any benefit on this particular query.


trendtrend

Query 2


SELECT hotels.hotel, count(*) FROM public.taxi_quad, public.hotels
        WHERE ST_DWithin(
            hotels.geom,
            taxi_quad.end_point, 3)
GROUP BY hotel
ORDER BY count DESC

For second query, we are trying to search most popular hotel. In this case, popular is visited by a lot of taxis. We can find this information by querying the end of the taxi’s trip is near the hotel. There are several parameters that are tested in this task which are within 1, 3, and 5 meters of the hotel. From graph below, it clearly shows that r-tree has the lowest planning time amongst those three indexing method. Whereas, for execution time, it seems like quad-tree and r-tree has anomalies on several runs of the test. But, overall, r-tree still wins for this query.


trendtrend

Query 3

SELECT ST_HausdorffDistance(
           	ST_GeomFromText('LINESTRING(...)'),
           	polyline::geometry
        ) AS sim FROM taxi_rtree
ORDER BY sim ASC
LIMIT 5

For third query, we are trying to search similar trajectory to a popular destination. Hausdorff distance is used to measure how similar one trajectory to another. This function is available in PostGIS using ST HausdorffDistance function. The target trajectory is coming from the most common destination which is in cluster 11.


From graph below, we can see that planning time with R tree has a very consistent result across 95 queries. That said, accross three indexing method, there are no significant difference unlike previous query which indicates that PostgreSQL is using sequential scan to perform this query. We can confirm this by observing the explain analyze output.


trendtrend

Query 4

SElECT AVG(ST_NumPoints(polyline::geometry)) * 15 AS time FROM taxi_quad, (
            SELECT geom FROM hotels
            ORDER BY popularity DESC
            LIMIT 1
        ) AS subquery
    WHERE ST_DWithin(subquery.geom, taxi_quad.end_point, 10) AND trip_id IN (
        SELECT trip_id
        FROM taxi_quad
        WHERE ST_DWithin(taxi_quad.start_point, ST_GeogFromText('SRID=4326;POINT(-8.68 41.25)'), 3150)
    )

In query 4 we are trying to find the average time that is needed to travel from airport to the most popular hotel. We can retrieve how much time spent in a trip by counting number of points in the polyline and multiply it by 15 seconds. We can also query the starting point that is within the airport and the end location within the hotel. Therefore, we can calculate the average traveling time.


Based on result, we can see that from query planning graph, R tree performs significantlly and consistently better than the other two indexing methods. However, query execution graph, no index seems to perform better. This might happen because the database engine can utilize the index to get a faster estimation of query and when it executes the plan it needs to access both index data structure and the data itself.


trendtrend

Analysis

trendtrend

From this experiment, it is evident that indexing on spatial data is not as easy as it might seem. It might and might not work on certain cases. It really depends on the use case, our query structure and how we can creatively leverage it. The indexing algorithm is also another thing that we should consider when we are going to index our multidimentional column. On this particular dataset R tree performs better rather than quad tree and no index.