Using DuckDB, PDAL, and Python to Curate Point Clouds into GeoParquet and GeoTiff
A case study using USGS 3DEP Lidar and AIS Shipping Traffic point clouds
Overview:
This post summarizes my video walk-through detailing the use of DuckDB, a high-performance embedded analytical database, for processing large geospatial datasets. The video focuses on using DuckDB with Python to efficiently handle tasks such as fast top-N aggregation and spatial filtering on data from two distinct sources: USGS LiDAR data and Marine Cadastre AIS (Automatic Identification System) shipping traffic data. The main goal is to demonstrate how DuckDB can be leveraged for CPU-bound and I/O-bound operations, especially in situations where traditional Python methods struggle due to the Global Interpreter Lock (GIL). The workflow streams data directly from URLs and combines multiprocessing and multi-threading, making use of DuckDB's ability to bypass the Python GIL.
Key Themes:
DuckDB's Efficiency for Large Datasets: The core theme is using DuckDB's capabilities to handle and process large datasets. The presentation argues that DuckDB is a powerful tool that's "somewhat immune to that Global Interpreter Lock" and can "use as many threads or CPU cores as it needs based on a user setting." This allows it to perform significantly better than standard Python approaches, even when using packages like Pandas.
Top-N Aggregation & Spatial Grouping: The video showcases DuckDB's capacity for performing "fast top-N aggregation and filtering by a group." Specifically, the talk delves into how to find the maximum draft of vessels within spatial bins, grouping by the spatial index and extracting the maximum draft value using a nested structure.
Integrating Diverse Data Sources: The workflow integrates two distinct geospatial data types: LiDAR data from USGS and AIS vessel data from Marine Cadastre. The presentation highlights how these datasets, which initially appear very different, can be combined spatially through a common index based on Uber's H3 library, a discrete global grid system (DGGS), which creates the spatial bins that let both datasets share the same workflow.
Streaming Data from URLs: Both the LiDAR and AIS data are accessed directly from URLs, emphasizing a workflow that streams data into the system rather than requiring it to be downloaded ahead of time, which allows much larger datasets to be handled. This is combined with multiprocessing and multi-threading to speed up the download, ingestion, and processing of large point cloud and geospatial data.
Flexibility and Customization: DuckDB's flexible configuration options are emphasized, including the ability to set thread limits, control spillover memory behavior, and use custom temporary directories. "You can control the number of threads that you want DuckDB to be limited to... Same thing for spillover memory processing... DuckDB is very elegant about trying to manage what happens when it runs out of memory." These options allow for fine-tuning based on specific system resources and constraints.
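As an illustration of those knobs, here is a minimal sketch of a DuckDB connection configured with a thread limit, a memory cap, and a custom spillover directory; the specific values and paths are placeholders, not the settings used in the video.

```python
import duckdb

# Open an in-memory DuckDB connection; passing a file path instead would persist the database to disk.
con = duckdb.connect()

# Cap the number of threads DuckDB may use (placeholder value).
con.execute("SET threads = 8;")

# Cap working memory before DuckDB starts spilling to disk (placeholder value).
con.execute("SET memory_limit = '16GB';")

# Direct spillover to a specific temporary directory and cap its size (placeholder path and value).
con.execute("SET temp_directory = '/tmp/duckdb_spill';")
con.execute("SET max_temp_directory_size = '100GB';")
```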
Important Ideas and Facts:
Data Sources: USGS LiDAR: Publicly available data, often consisting of trillions of points. The data is point-based, can be queried using an octree structure, and can be downloaded in Entwine Point Tile (EPT) format.
Marine Cadastre AIS: Daily files, in compressed CSV format, containing shipping traffic data for all US territories. The AIS data schema has been fixed since 2018, and each file contains 17 attributes describing vessels and their movements. The files are unstructured in the sense that no spatial index is provided (a sketch of streaming one daily file follows below).
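As a rough sketch of what streaming one of those daily files can look like, the snippet below reads a compressed Marine Cadastre CSV straight from a URL into a Pandas DataFrame; the URL pattern and column names are assumptions based on the public AIS file layout, not taken from the video.

```python
import pandas as pd

# Illustrative Marine Cadastre daily-file URL pattern (assumed, not from the video).
url = "https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2023/AIS_2023_01_01.zip"

# Pandas infers the zip compression from the URL, so the CSV is streamed without a manual download step.
# Column names here (MMSI, BaseDateTime, LAT, LON, Draft) follow the fixed post-2018 AIS schema and are assumptions.
ais = pd.read_csv(
    url,
    usecols=["MMSI", "BaseDateTime", "LAT", "LON", "Draft"],
    parse_dates=["BaseDateTime"],
)
print(ais.dtypes)
```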
DuckDB and Python Integration: DuckDB can be used as an engine through Ibis for those unfamiliar with SQL, but the presenter prefers writing SQL directly.
While Pandas is traditionally single-threaded, DuckDB can leverage all available CPU cores, bypassing Python's Global Interpreter Lock (GIL) when configured to do so (a sketch of the DataFrame-as-view pattern follows this group of notes).
DuckDB can also be pointed at a temporary directory, and the maximum size of that directory can be capped, for users who want more control over memory usage.
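A minimal sketch of the DataFrame-as-view pattern, using a small stand-in DataFrame rather than the real AIS data:

```python
import duckdb
import pandas as pd

# A small stand-in DataFrame; in the real pipeline this would be the AIS data streamed from a URL.
ais = pd.DataFrame({"MMSI": [1, 1, 2], "Draft": [7.5, 8.0, 11.2]})

con = duckdb.connect()

# Register the DataFrame so SQL can treat it as a table view; DuckDB scans it with its own
# multi-threaded engine, so the heavy lifting happens outside the Python GIL.
con.register("ais_view", ais)

result = con.execute("SELECT MMSI, max(Draft) AS max_draft FROM ais_view GROUP BY MMSI").df()
print(result)
```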
Data Preparation and Processing: Both datasets are read directly from URLs, demonstrating the streaming data approach.
AIS data is loaded into Pandas DataFrames, where column headers and data types are fixed, before being registered as a table view in DuckDB.
The data from the Pandas DataFrames is then copied to a Parquet file, with spatial indexes created using Uber's H3 system at level 15 (approx. 1 meter cells) and level 11 (approx. 45 meter cells).
The LiDAR data is processed using PDAL, with an initial filter applied as a three-dimensional bounding box; classifications other than ground or topo-bathy are excluded. The Entwine Point Tile data is likewise read directly from a URL (see the sketch after this group of notes).
For both the AIS and LiDAR datasets, H3 cell IDs are computed at the highest resolution (15) for spatial binning before the max function is applied.
A secondary query allows the AIS data to be re-binned into H3 level 11 cells and grouped by time window (daily, monthly, etc.).
An additional function uses a DuckDB table and a threading lock to write the data to a DuckDB database on disk.
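The PDAL side of that pipeline might look roughly like the sketch below: an Entwine Point Tile dataset is read straight from a URL with a 3D bounds crop and a classification filter, then reprojected so H3 cells can be computed on longitude/latitude. The EPT URL, bounds, and classification codes are placeholders (the talk also retains topo-bathy classes), and the h3 calls assume the v4 Python API.

```python
import json
import pdal
import h3

pipeline_def = {
    "pipeline": [
        {
            # Placeholder EPT endpoint; the real dataset name depends on the 3DEP collection of interest.
            "type": "readers.ept",
            "filename": "https://s3-us-west-2.amazonaws.com/usgs-lidar-public/EXAMPLE_DATASET/ept.json",
            # Placeholder 3D bounding box, expressed in the dataset's native CRS.
            "bounds": "([636000, 637000], [850000, 851000], [0, 1000])",
        },
        # Keep only ground returns (ASPRS class 2) in this sketch.
        {"type": "filters.range", "limits": "Classification[2:2]"},
        # Reproject to lon/lat so H3 cells can be derived directly from X/Y.
        {"type": "filters.reprojection", "out_srs": "EPSG:4326"},
    ]
}

pipeline = pdal.Pipeline(json.dumps(pipeline_def))
pipeline.execute()
points = pipeline.arrays[0]  # structured numpy array with X, Y, Z, Classification, ...

# Compute a level-15 H3 cell per point (h3 >= 4.x naming; older versions use geo_to_h3).
h3_15 = [h3.latlng_to_cell(lat, lon, 15) for lon, lat in zip(points["X"], points["Y"])]
```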
Key SQL Operations: The workflow creates a spatial index (H3 cell IDs) and groups by that index to find the maximum draft of AIS vessels within each bin.
The approach nests all fields of each row into a single value, effectively a binary key that speeds up the max-value lookup, and then applies the max function to those nested values grouped by the spatial index (a sketch follows this group of notes).
For the LiDAR data, a similar grouping and top-N approach is used to take the "highest priority classification" within each spatial bin.
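A minimal sketch of that nesting trick, assuming an ais_points table (here created inline with toy values) that already carries a level-15 H3 cell ID in an h3_15 column. My reading of the technique: because DuckDB compares STRUCT values field by field in declaration order, putting Draft first means max() returns the whole record for the deepest-draft vessel in each bin. Table and column names are illustrative.

```python
import duckdb

con = duckdb.connect()

# Toy stand-in for the real AIS point table, with a precomputed h3_15 column.
con.execute("""
    CREATE TABLE ais_points AS
    SELECT * FROM (VALUES
        (1, TIMESTAMP '2023-01-01 00:00:00', 29.70, -93.30,  7.5, 'cell_a'),
        (2, TIMESTAMP '2023-01-01 00:05:00', 29.70, -93.30, 11.2, 'cell_a'),
        (3, TIMESTAMP '2023-01-01 00:10:00', 29.80, -93.40,  4.1, 'cell_b')
    ) AS t(MMSI, BaseDateTime, LAT, LON, Draft, h3_15)
""")

# Pack the whole row into a struct with Draft as the leading field, then take max() per H3 cell:
# the struct ordering is driven by Draft, so the full top record per bin comes back.
top_per_cell = con.execute("""
    SELECT
        h3_15,
        max(struct_pack(
            Draft := Draft,
            MMSI := MMSI,
            BaseDateTime := BaseDateTime,
            LAT := LAT,
            LON := LON
        )) AS top_record
    FROM ais_points
    GROUP BY h3_15
""").df()
print(top_per_cell)
```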
Outputs: AIS data is output as GeoParquet files.
LiDAR data is output as a digital elevation model (DEM) in GeoTiff format, with embedded overviews for optimized browsing (see the sketch after this group of notes).
Both datasets can be co-located because they share a common H3 indexing system.
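To round out the picture, here is a rough sketch of the two output steps, assuming the aggregated AIS result comes from a DuckDB query and the DEM GeoTiff has already been rasterized (e.g., by a PDAL writers.gdal stage). File names and the stand-in query are placeholders; writing fully compliant GeoParquet metadata would additionally rely on DuckDB's spatial extension or GeoPandas.

```python
import duckdb
from osgeo import gdal

con = duckdb.connect()

# Export the aggregated AIS bins to Parquet; the inner SELECT is a stand-in for the real aggregation query.
con.execute("""
    COPY (SELECT 42 AS h3_15, 11.2 AS max_draft)
    TO 'ais_max_draft_h3.parquet' (FORMAT PARQUET)
""")

# Add internal overviews to the DEM GeoTiff so it can be browsed efficiently at multiple zoom levels.
dem = gdal.Open("dem.tif", gdal.GA_Update)  # placeholder path to the rasterized DEM
dem.BuildOverviews("AVERAGE", [2, 4, 8, 16])
dem = None  # flush and close
```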
Quotable Highlights:
"DuckDB can use as many threads or CPU cores as it needs based on a user setting which is a fantastic option"
"...python can leverage DuckDB for CPU-bound and/or I/O-bound processes"
"DuckDB allows you to utilize a Pandas DataFrame as a SQL Table View"
"...the only difference between the AIS and the Lidar pipelines really is the implementation of PDAL to rasterize the points"
Conclusion:
This presentation highlights DuckDB as a potent tool for handling large-scale geospatial data. Its ability to work efficiently with streaming data, bypass the Python GIL, perform fast aggregations and grouping using a common H3 indexing system, and integrate with standard data science tools makes it a compelling option for geospatial processing pipelines. The examples provided with the AIS and LiDAR data demonstrate a real-world application where DuckDB's strengths can be leveraged for faster and more efficient data analysis.

