BlogTechnology and KnowHow

Introduction to PostGIS

Symphony logo
Symphony
December 8th, 2021

In this article, we will cover some basics of PostGIS and what it can do, along with some tips and tricks that can be used to simplify solutions or improve performance. In short - PostGIS is a Postgres extension that adds support for storing and manipulating spatial data types. We would typically need to use spatial data storage when we are building software applications that store, manipulate and visualize data on a map. We can use Google Maps or similar applications as a good example of what a typical geospatial visualization software does.

Before we can use PostGIS functionalities, we need to install the extension in Postgres.

CREATE EXTENSION IF NOT EXISTS postgis;


Spatial Data Types

PostGIS supports a few different types of (geo)spatial data types. Most of them have “cousins” in the world of graphic design. But unlike graphic software, where coordinates of the objects are relative to a screen or a piece of paper, geospatial coordinates reference points on the surface of the Earth. That makes it possible to render such objects on the map, but also to analyze interactions between them. This will come in handy as we start using spatial objects and operations to solve real-world problems.

Vectors

Similar to graphic design software, spatial vector data supports basic geometrical shapes, like point, line string, and polygon. In addition to basic geometries, PostGIS supports some more advanced ones:

  • Multi-version of basic geometries - a homogeneous collection of points, line strings or, polygons
  • 3D versions of basic geometries - the same as basic geometries with the Z-coordinate added
  • Geometry collection - a collection of any geometries, ​​homogeneous or heterogeneous
  • Polyhedral surface - complex 3D surface

Maps and navigation applications heavily rely on vector objects to model features of the map. Looking at the screenshot below, most objects on Google Maps can be represented as polygons (e.g., buildings) or points (e.g., businesses) or lines (e.g., roads). When viewing a map in 3D mode, buildings are often represented as polyhedral surfaces.

To create a table using the “geometry” data type, we can run the statement below

CREATE TABLE building (

id UUID PRIMARY KEY,

geom geometry

);


This creates a table with column “geom” of type geometry which is a generic type for all vector objects. Think of it as a base class in the OOP world. That means that we can combine points, lines, polygons, and other vector objects in the same column. If we know in advance which geometries we will be dealing with, we can specify that as a part of the column type definition. In that case, PostGIS will not allow other geometry types to be inserted in the same column. This is always a preferred way of storing data, as some operations expect geometries to be of the same type.

CREATE TABLE building (

id UUID PRIMARY KEY,

geom geometry(Polygon)

);


Furthermore, we can also include the SRID (spatial reference identifier) in the column type definition, forcing all values to conform to the same SRID. SRID will be covered in more detail later.

CREATE TABLE building (

id UUID PRIMARY KEY,

geom geometry(Polygon,4326)

);


Rasters

The spatial raster data type is also similar to its graphical design cousins (JPEG, PNG, TIFF, and other raster files we use in our daily lives), but there are some differences. 

  • Unlike the regular rasters, where one pixel is a dot on a screen or paper, spatial rasters have spatial resolution that defines pixel width and height. So each pixel of a spatial raster covers a uniformly sized rectangle on the map.
  • Spatial rasters have one or more bands, and each band has a matrix of values for all “pixels”. Data type of each band is set separately, and it can be almost any numeric type - binary (useful for masking), integers or floating-point values. In a way, it’s a generalization of the 24bit RGB rasters that we are used to working with in the graphical design world. A spatial equivalent of a 24bit RGB raster would be a 3-band raster where each band is defined as an unsigned 8bit int. However, with the flexibility of storing any numerical values in addition to colors, we can leverage rasters to store various information - terrain elevation, population density, vegetation information or metrics, etc.

Raster data support is included in a separate postgis extension that needs to be installed before we can work with rasters.

CREATE EXTENSION IF NOT EXISTS postgis_raster;


Then we can create a table using the raster type

CREATE TABLE satellite_image (

id UUID PRIMARY KEY,

rast raster

);


Point Cloud

Point cloud data format can be seen as a mix between rasters and vectors. It’s similar to rasters in a way that it represents a discrete data set, composed of individual points rather than shapes. However, unlike rasters, there is no resolution or density, so points can be anywhere in the 3D space. Comparing point clouds to vector types - it’s similar to a collection of 3D vector points.

Point cloud data is typically obtained from LiDAR, 3D scanners or similar devices that measure physical properties of objects in the 3D space. When visualized, it looks similar to the image below. The trees (or any other objects) look like continuous 3D objects, but they are all made up of discrete points in space.

Point cloud support is included in a separate postgis extension that needs to be installed before we can work with rasters.

CREATE EXTENSION pointcloud;

CREATE EXTENSION pointcloud_postgis;


Spatial Operations

When dealing with the “regular” non-spatial data, we typically join and filter tables based on exact values in a column containing primitive values representing object identifiers (integers, strings or maybe UUIDs). This is due to the nature of the problems we are typically solving in a relational database. Our typical queries on a non-spatial dataset may look something like this

SELECT *

FROM book b

INNER JOIN publisher p ON p.id = b.publisher_id;


Or this

SELECT *

FROM book b

WHERE b.publisher_id = 12345;


However, with spatial data, we generally do not have a real-world use-case that requires us to filter spatial objects by equality or to join the tables by matching spatial objects using the equality comparator. If we think for a moment about how Google Maps app works while we use it - zoom, pan, click on objects, we can deduce that the most commonly used operation on spatial data is the intersection. Whenever we pan or zoom the map, the system needs to figure out which objects should be fetched from the storage and rendered on the screen. That is typically done by intersecting objects with the rectangle that represents our visible portion of the map. The query below finds buildings that intersect a given rectangle on the map

SELECT *

FROM building b

WHERE ST_Intersects(b.geom, ST_MakeEnvelope(24, 47, 25, 48, 4326));


Another commonly used operation is distance calculation which is often used to determine which objects are located in the proximity of a given point on the map.

Spatial Indexing

When indexing primitive values, the database typically uses Hash or B-Tree to construct an index. Due to the difference in operations that are typically used with spatial data, this approach cannot be applied here. Spatial index needs to be constructed in a way that would allow us to efficiently find spatial objects from a collection of spatial objects that intersect with a given spatial object.

To solve this problem, spatial indexes use the R-Tree (“R” as in Rectangle) structure, which builds a tree of rectangles where each of the child node rectangles are contained in the parent node rectangle. At the leaf of the tree are the rectangles that represent bounding boxes of our spatial objects in a PostGIS column.

That way, we can quickly traverse the tree to find which objects intersect with a given object instead of checking each object for intersection. This reduces the time complexity of the filtering operation from O(N) to O(logN).

SQL command that creates a spatial index is very similar to the “regular” index creation

CREATE INDEX building_geom_idx ON building USING GIST(geom);


The only difference here is the “GIST” part, which signals PostGis that we require a “generic index structure” for this index. PostGIS supports three spatial indexes (GIST, SPGIST and BRIN) but in most cases, GIST is a decent choice.

It is worth noting that spatial indices can also be used for raster data, as we often need to get to the relevant rasters quickly. The same syntax can be applied to a raster column, but in this case we are indexing bounding boxes around raster images, so the statement needs to include ST_ConvexHull function.

CREATE INDEX satellite_image_rast_idx ON satellite_image USING GIST(ST_ConvexHull(rast));


As with any indexing, there is a performance tradeoff at the time objects are inserted into the database, as PostGIS needs to insert new objects into the R-Tree index. But whenever we plan to use spatial operations, we should consider adding an index to the columns used in the query, as it dramatically improves performance.

SRIDs

SRID (spatial reference identifier) is an important piece of information that we need to attribute to each spatial object. It includes information about the coordinate system, where the (0, 0) point is on the globe, the resolution of the coordinates, and how coordinates on the map correspond with actual points on the globe.

The most commonly used SRID is WGS84 — SRID 4326 which is used for GPS tracking, Google Maps and many other applications, however there are many more SRIDs that are popular, some offering higher precision than WGS84 in certain areas on the globe. So we always need to be aware of the SRID of the data coming into the system.

PostGIS is very flexible when it comes to SRIDs. In the example above, we created a table “building” with a geometry column without SRID specification. This means that PostGIS will allow polygons with any SRID to be inserted. This can sometimes be useful, even necessary, in cases when we cannot predict or change the SRID of the incoming data, but it should be avoided whenever possible. 

Spatial columns can also have a predefined SRID which forces all objects in that column to use the specified SRIDs. 

CREATE TABLE building (

id UUID PRIMARY KEY,

geom geometry(Polygon, 4326)

);


The first reason for having uniform SRIDs across all objects is that spatial queries require the objects of the same SRID and will fail if we try to intersect objects of different SRIDs

SELECT ST_Intersects(

ST_MakeEnvelope(24, 47, 25, 48, 4979), 

ST_MakeEnvelope(24, 47, 25, 48, 4326)

);


ERROR:  ST_Intersects: Operation on mixed SRID geometries (Polygon, 4979) != (Polygon, 4326)


There’s a workaround for this issue, but it will lead us to the next downside. Whenever we have mismatched SRIDs, we can convert one of the spatial objects to the SRID of the other object.

SELECT ST_Intersects(

ST_Transform(ST_MakeEnvelope(24, 47, 25, 48, 4979), 4326), 

ST_MakeEnvelope(24, 47, 25, 48, 4326)

);


In this query, ST_Transform transforms all the coordinates from the source SRID to the destination SRID and outputs a polygon with SRID 4326 that can be intersected with the other polygon with no errors.

However, there’s a performance penalty that comes with this approach. Firstly, this transformation will take some time. And more importantly, we will be unable to use spatial indexing to improve the performance of ST_Intersects operation because spatial index applies to the geometries in the original SRIDs, not to the transformed geometries in the target SRIDs. The query execution plan will need to perform a table scan on the first table to figure out which of the objects intersect the objects from the second table, AFTER being transformed to the destination SRID.

One way to deal with this problem is to perform ST_Transform on all objects at the time they are inserted into the database and maintain consistency between SRIDs at all times. This has many benefits, but it is worth noting that object transformation is not always exact, and we will lose some precision when transforming from one SRID to another. If the precision is critical for the software, it may be a good idea to store both the original and transformed object in the database and use them interchangeably.

Conclusion

In conclusion, this article gives a quick introduction to PostGIS, what it is, some of the spatial data types and operations it supports and some of the real-world problems that can be solved by leveraging PostGIS. We also got introduced to spatial indexing, the first stop for optimal performance. Hopefully, it helped with climbing the steep learning curve into the GIS world.

About the author

Branislav Stojkovic is a senior software engineer with more than 15 years of industry experience working at our engineering hub in Niš.

Branislav has extensive hands-on experience in designing, building, and maintaining large enterprise systems, mostly in the financial industry, and strong technical knowledge in back-end development using Java and .NET technologies and database modeling. The financial sector is often volatile and requires a deep understanding of the domain knowledge, and Branislav can dive deep into the details but also have a broad, high-level view of the whole system and the roadmap for the future. Agile methodologies are often employed to cope with the volatility of the financial industry. This search for optimal solutions started at an early stage of his career while working with algorithm design and competing in high school and university competitions.