This year, I decided to take a week off during autumn, which I don't think I have done since university. I was in a building mood, and since I am working on Postgres infrastructure at work at the moment, I decided to spend some of that time reading up on Postgres observability. There are already plenty of great tools such as pgAdmin and DBeaver readily available, but I've always felt that the best way to learn something is through project work, so I set out to implement my own observability platform from scratch in Rust. The project is still far from done and is missing some frontend polish, but thanks to Postgres extensions such as pg_stat_statements, pg_stat_monitor, and pgstattuple, iteration was a fairly rapid process.
PGMon is an open-source monitoring platform for PostgreSQL built entirely in Rust. It runs as a single binary that embeds a Prometheus exporter, an Axum REST API, and a Vite/React UI, giving DBAs full-stack visibility without external services. The system samples core system catalogs using a least-privilege role, correlates the resulting telemetry, and visualizes it in a dashboard.
PGMon dashboard showing some of the more important metrics like concurrent connections, transactions per second, queries per second, as well as latency and wraparound.
The runtime orchestrates five asynchronous loops: hot-path, workload, storage,
hourly, and aggregation, each with its own cadence and execution budget. Every iteration is
timed, logged, and exported as Prometheus histograms and success flags; failures increment
counters and update shared loop health. A readiness probe served at /healthz checks these
metrics to confirm the data is fresh. This structure keeps the scheduler deterministic while
surfacing when a loop misses its target window.
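As a conceptual sketch (in Python rather than PGMon's Rust, with illustrative loop names and budgets), the readiness check boils down to comparing each loop's last successful run against its freshness budget:

```python
import time

# Hypothetical per-loop freshness budgets in seconds (names illustrative).
LOOP_BUDGETS = {"hot_path": 10, "workload": 30, "storage": 120}

def is_ready(last_success, now=None):
    """Ready only if every loop's last successful run is within its budget.

    A loop that has never succeeded (missing key) counts as stale."""
    now = time.monotonic() if now is None else now
    return all(
        now - last_success.get(loop, float("-inf")) <= budget
        for loop, budget in LOOP_BUDGETS.items()
    )
```

With monotonic timestamps, a single missed window flips readiness to false until the loop recovers.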
The workload loop snapshots pg_stat_database,
pg_stat_statements, pg_stat_wal, and temp spill counters on each
pass. By diffing consecutive samples it derives TPS, QPS, mean latency, WAL throughput,
checkpoint ratios, temp rates, and more, storing the results in a ring buffer. The same
summary drives Prometheus gauges, populates the frontend history API, and feeds alert
logic.
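The snapshot-diffing step can be sketched like this; the counter names are illustrative stand-ins for pg_stat_database / pg_stat_statements fields:

```python
def derive_rates(prev, curr, elapsed_s):
    """Turn two cumulative counter snapshots into per-second rates.

    prev and curr are dicts of cumulative counters sampled elapsed_s
    seconds apart."""
    rates = {}
    for key in prev:
        delta = curr[key] - prev[key]
        # Cumulative counters reset when the server restarts; clamp to zero
        # rather than reporting a nonsensical negative rate.
        rates[key + "_per_s"] = max(delta, 0) / elapsed_s
    return rates
```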
When pg_stat_monitor is available, PGMon inspects histogram settings,
aggregates the resp_calls arrays, and interpolates percentile cutoffs to yield
p95/p99 latency. If the extension is missing or misconfigured, the collector downgrades
gracefully so the rest of the telemetry remains intact.
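The percentile interpolation works roughly as in this sketch; the bucket layout is illustrative rather than pg_stat_monitor's exact configuration:

```python
def percentile_from_buckets(bucket_edges, counts, q):
    """Interpolate the q-th percentile from histogram bucket counts.

    counts[i] calls fell in the latency range
    bucket_edges[i]..bucket_edges[i+1]; this mirrors the idea behind
    aggregating pg_stat_monitor's resp_calls arrays."""
    total = sum(counts)
    if total == 0:
        return None
    target = q * total
    seen = 0
    for i, c in enumerate(counts):
        if seen + c >= target:
            lo, hi = bucket_edges[i], bucket_edges[i + 1]
            # Linear interpolation inside the bucket that crosses the target.
            frac = (target - seen) / c
            return lo + frac * (hi - lo)
        seen += c
    return bucket_edges[-1]
```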
Workload analytics view displaying top queries with performance metrics including calls, total time, mean latency, and cache hit ratios.
A shared state engine, protected by asynchronous read–write locks, maintains the latest snapshots, workload history, and alert lifecycle. High-resolution series live in bounded buffers; a dedicated aggregation loop runs after midnight UTC to roll daily summaries, and on Mondays to roll ISO-week summaries. The persistence module checkpoints this state to disk using atomic write-and-rename, allowing the UI and API to restart with historical context intact.
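The write-and-rename pattern itself is simple; here is a minimal Python sketch of it (PGMon's persistence module is Rust, and its on-disk format is its own):

```python
import json
import os
import tempfile

def checkpoint_state(state, path):
    """Persist state via write-to-temp then atomic rename, so a reader
    never observes a half-written checkpoint file."""
    dir_ = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dir_, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # data is durable before the rename
        os.replace(tmp, path)  # atomic on POSIX and Windows
    except BaseException:
        os.unlink(tmp)
        raise
```

The temp file lives in the same directory as the target so the rename stays on one filesystem, which is what makes it atomic.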
Alert lifecycles are tracked via set diffs: when the hot-path loop updates the overview, PGMon computes which alerts started or cleared and records structured events with timestamps. The frontend and API can therefore show not just the current status but the sequence of alert transitions.
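A minimal sketch of the set-diff approach, with hypothetical alert names:

```python
def diff_alerts(previous, current, now):
    """Compare previous and current active-alert sets and emit structured
    start/clear events, timestamped with the caller's clock."""
    started = current - previous
    cleared = previous - current
    events = [{"alert": a, "event": "started", "at": now} for a in sorted(started)]
    events += [{"alert": a, "event": "cleared", "at": now} for a in sorted(cleared)]
    return events
```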
The hourly loop handles the heavier lifts. For partitioned tables, it reads partition
bounds via pg_get_expr, normalizes them into UTC timestamps, infers the dominant cadence
by examining boundary deltas, then flags future scheduling gaps and suggests the next
partition window. For transaction-ID wraparound, it tracks frozen-XID age against
autovacuum_freeze_max_age, exports wraparound metrics, and pushes alerts when thresholds
are crossed. For bloat, it samples tables in both pgstattuple and pgstattuple_approx
modes to measure free space and dead tuples, calculating reclaimable bytes on the fly.
The storage loop executes a CTE-driven query that aggregates relation sizes, cache hit ratios, and dead tuple estimates in one pass. It derives bloat bytes from tuple density, surfaces heap/index/TOAST footprints, and identifies heavy unused indexes based on scan counts and size thresholds. Results are exported to Prometheus and cached in state for the API/Ops UI.
A rule-driven engine turns the collected telemetry into prioritized actions. It correlates storage stats, autovacuum history, bloat samples, and stale statistics to propose VACUUM ANALYZE, VACUUM FULL, ANALYZE, or autovacuum tuning steps. Each recommendation includes severity, SQL to execute, rationale written for humans, estimated runtime, and whether an exclusive lock is required, so operators get actionable next steps instead of raw metrics.
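A toy version of such a rule, with illustrative thresholds and field names:

```python
def recommend(table):
    """Map a table's stats to prioritized maintenance actions.

    A toy stand-in for the rule engine: thresholds, field names, and the
    rationale text are illustrative."""
    recs = []
    if table["dead_tuple_ratio"] > 0.2:
        recs.append({
            "action": "VACUUM ANALYZE",
            "sql": f"VACUUM ANALYZE {table['name']};",
            "severity": "high",
            "exclusive_lock": False,
            "rationale": "High dead-tuple ratio; reclaim space and refresh stats.",
        })
    elif table["days_since_analyze"] > 7:
        recs.append({
            "action": "ANALYZE",
            "sql": f"ANALYZE {table['name']};",
            "severity": "low",
            "exclusive_lock": False,
            "rationale": "Statistics are stale; the planner may pick poor plans.",
        })
    return recs
```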
Recommendations interface showing maintenance suggestions with severity levels, SQL commands, and detailed rationale for VACUUM ANALYZE and autovacuum tuning operations.
Axum routes expose JSON APIs, the Prometheus scrape endpoint, and the bundled
frontend from the same binary. CI enforces cargo fmt -- --check, cargo
clippy --all-targets --all-features -D warnings, complete test suites, and
frontend builds to keep the system shippable. Docker packaging makes local evaluation
straightforward, while production deployments can run the single binary in an air-gapped
environment by pointing it at an existing Postgres role.
PGMon demonstrates a complete Rust observability platform for Postgres: deterministic polling, statistically sound workload analytics, durable historical state, deep domain collectors, and a recommendation layer that bridges telemetry to database operations.
dbfriend is a Python command-line tool designed to simplify the loading and synchronization of spatial data into PostGIS databases. It focuses on data integrity and safety, ensuring that your database operations are reliable and efficient. By handling complex tasks intelligently, dbfriend helps GIS professionals and database administrators streamline their workflows.
All database operations are executed within transactions, ensuring data integrity and automatic rollback on failure.
dbfriend automatically creates backups before modifying any existing tables, keeping up to three historical versions per table for easy restoration and added data safety.
Load data from various spatial file formats, including GeoJSON, Shapefile, GeoPackage, KML, and GML, providing flexibility in handling different data sources.
Prevent duplicates and ensure data consistency by comparing geometries using hashes to detect new, updated, and identical features efficiently.
Update existing geometries based on attribute changes, so your database always reflects the most current data.
Automatically detects and renames geometry columns to a standard format, simplifying data processing and integration.
Verifies CRS compatibility and automatically reprojects data as needed, ensuring spatial data aligns correctly within your database.
Automatically creates spatial indexes on imported data, improving query performance and data retrieval speeds.
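The hash-based comparison described above can be sketched as follows; dbfriend's actual hashing scheme may differ in detail:

```python
import hashlib

def geometry_hash(wkt, attributes):
    """Hash a geometry (as WKT) and its attributes separately, so features
    can later be classified as new, updated, or identical."""
    geom_h = hashlib.sha256(wkt.encode()).hexdigest()
    attr_h = hashlib.sha256(repr(sorted(attributes.items())).encode()).hexdigest()
    return geom_h, attr_h

def classify(existing, incoming):
    """existing/incoming map a feature id to its (geom_hash, attr_hash)."""
    new = [k for k in incoming if k not in existing]
    updated = [k for k in incoming if k in existing and incoming[k] != existing[k]]
    identical = [k for k in incoming if k in existing and incoming[k] == existing[k]]
    return new, updated, identical
```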
dbfriend in action: processing spatial files and managing PostGIS database operations.
Sosilogikk is a Python module that streamlines the use of libraries like Shapely and Fiona for GIS analysis of SOSI (Samordnet Opplegg for Stedfestet Informasjon), the Norwegian vector data format. Sosilogikk allows the user to seamlessly load a .SOS file into a GeoPandas GeoDataFrame in only a few lines of code.
Example structure of a vector object in a SOSI file. Its dot-prefixed attribute and coordinate format makes it difficult to use with Python libraries directly.
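To illustrate the format, here is a toy parser for a heavily simplified SOSI-like snippet; the real grammar is far richer, and sosilogikk handles much more than this narrow case:

```python
def parse_sosi_curve(text, unit=0.01):
    """Extract coordinate pairs from a simplified SOSI-like KURVE block.

    SOSI stores coordinates as scaled integers (northing, easting) under a
    ..NØ section; 'unit' plays the role of the scale declared in the file
    header. Returns (easting, northing) tuples in real-world units."""
    coords = []
    in_coords = False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(".."):
            # Entering a new sub-element: only ..NØ / ..NØH carry coordinates.
            in_coords = line.split()[0] in ("..NØ", "..NØH")
            continue
        if in_coords and line and not line.startswith("."):
            n, e = line.split()[:2]
            coords.append((int(e) * unit, int(n) * unit))
    return coords
```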
Sosilogikk applied to a large SOSI file, resulting in an Excel-like attribute table.
Using the .to_file method, you can easily export the GeoDataFrame to any OGR-supported vector format, so the data can be opened in software like ArcGIS or QGIS.
Drainage lines in Flatgeobuf format visualized in QGIS.
Modern cloud-native GIS applications often need to efficiently store and transmit large volumes of geographic data between services. While GeoJSON is the standard format for geographic data exchange, its text-based nature makes it suboptimal for cloud storage and transmission. This solution combines MongoDB's BSON format with delta encoding to create a highly efficient geographic data pipeline.
BSON (Binary JSON) is MongoDB's binary format, specifically designed for cloud-scale data operations. Unlike traditional JSON, BSON provides native support for different numeric types and binary data, making it ideal for geographic coordinate storage. This becomes particularly important in microservice architectures where data needs to be efficiently serialized, transmitted, and stored across different cloud services. In cloud environments where MongoDB Atlas is increasingly common, this native format compatibility translates to significant performance benefits and reduced processing costs.
Delta encoding is a compression technique that stores the differences (deltas) between consecutive values rather than the values themselves. For geographic coordinates, this is particularly effective because consecutive points in a geometry are typically close to each other, resulting in small delta values that require fewer bits to store.
Visualization of delta encoding: Starting with a sequence of numbers (top row), we compute the differences between consecutive values (second row). Negative differences are then shifted to positive values (third row) for efficient binary representation (bottom row). This process significantly reduces storage requirements while maintaining perfect reversibility. (Adapted from Xia et al., The VLDB Journal, 2024)
The implementation in BSON_encoder.py follows the steps illustrated above: scale the coordinates to integers, delta-encode consecutive points, and shift negative deltas to positive values for compact binary storage. Here's a simplified example showing the transformation:
Original coordinates: [(100.123456, 50.123456), (100.123476, 50.123476)]
After scaling by 1e6: [(100123456, 50123456), (100123476, 50123476)]
Delta encoded: [(100123456, 50123456), (20, 20)] # Second point stored as difference
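A minimal sketch of that pipeline, including the negative-to-positive (zigzag) shift from the figure; BSON_encoder.py's actual code may differ in detail:

```python
SCALE = 1_000_000  # preserves six decimal places, ~0.1 m of precision

def zigzag(n):
    # Interleave the sign into the low bit: 0,-1,1,-2,2 -> 0,1,2,3,4
    return (n << 1) ^ (n >> 63)

def unzigzag(z):
    # Exact inverse, so the encoding is fully reversible.
    return (z >> 1) ^ -(z & 1)

def delta_encode(coords):
    """Scale float coordinates to ints, delta-encode each point against the
    previous one, and zigzag-encode so deltas are small non-negative ints
    ready for compact binary storage."""
    out = []
    prev_x = prev_y = 0
    for x, y in coords:
        ix, iy = round(x * SCALE), round(y * SCALE)
        out.append((zigzag(ix - prev_x), zigzag(iy - prev_y)))
        prev_x, prev_y = ix, iy
    return out
```

Because consecutive points are close together, everything after the first point collapses to tiny integers, which is where the compression comes from.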
In modern cloud architectures, geographic data flows between various services - from storage to processing to web APIs. This combined approach of delta encoding and BSON serialization dramatically reduces the bandwidth required for these operations. Testing with real-world infrastructure data:
Comparison of data sizes: Original GeoJSON format vs BSON-encoded format with delta compression. The combined approach reduces the file size by almost 90% while maintaining full coordinate precision.
While the size reduction is impressive, the real value lies in the format's cloud-native nature. The compressed data remains fully compatible with MongoDB's geospatial queries and indexes, allowing for efficient spatial operations directly on the compressed data. The compression is completely reversible, and the flattened GeoJSON structure results in smaller file sizes even after decompression.
Working in GIS production environments has highlighted an interesting challenge: the gap between what can be automated and what typically is automated. While tools like ArcGIS and QGIS excel at interactive analysis, many workflows would benefit from programmatic automation - yet often remain manual processes.
GIS workflows frequently involve repetitive tasks that are perfect candidates for automation:
The challenge isn't identifying what to automate - it's making automation accessible to GIS professionals who may not have programming experience. This is where Docker has proven particularly valuable.
Docker's containerization approach solves several fundamental challenges in GIS automation:
In practice, implementing Docker in a GIS environment means creating a layer of abstraction between the technical complexity and the end user. This typically involves wrapping Docker commands in a user-friendly interface: the user doesn't need to understand the underlying system; they simply interact with familiar buttons and inputs while Docker handles the environment management behind the scenes. To accomplish this, we can create a launcher script, in the form of a batch file, that presents the user with inputs through a simple graphical user interface. Since I work in an environment where Python comes pre-installed, I chose to use a Python script for this task.
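At its core, such a launcher assembles and runs a docker run command from the user's inputs. A sketch of that core, with illustrative image names and mount points (the GUI layer is omitted):

```python
import subprocess

def build_docker_command(image, input_dir, output_dir, extra_env=None):
    """Assemble the 'docker run' invocation a launcher GUI would execute.

    Mounts the input directory read-only and the output directory writable;
    paths and environment variables come from the user's form inputs."""
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{input_dir}:/data/in:ro",
        "-v", f"{output_dir}:/data/out",
    ]
    for key, value in (extra_env or {}).items():
        cmd += ["-e", f"{key}={value}"]
    cmd.append(image)
    return cmd

def launch(image, input_dir, output_dir, **env):
    # The GUI's "Run" button handler would end up here.
    return subprocess.run(
        build_docker_command(image, input_dir, output_dir, env), check=True
    )
```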
Example of a Docker-based GIS tool launcher in Python that abstracts away the complexity of container management and environment setup.
Using Docker in production has revealed several interesting insights:
The integration of Docker in GIS workflows opens interesting possibilities for the future of spatial data processing. As cloud infrastructure becomes more prevalent in GIS, containerized workflows could become the standard way of handling automated spatial analysis. The key will be maintaining the balance between powerful automation capabilities and user-friendly interfaces.
Python is a powerful language for rapid development, especially in the GIS domain, thanks to libraries like GeoPandas and Shapely. However, when processing large datasets or performing complex calculations, Python's speed can become a limitation. This is where Rust comes in - offering the speed we need while letting us keep Python's ease of use.
Bindings are essentially a way to connect two different programming languages, allowing them to work together. In this case, we use Rust bindings to integrate Rust's high-performance capabilities into Python workflows. This means we can write the most performance-critical parts of our GIS analysis in Rust, while still using Python for the overall workflow.
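Rust bindings for Python are typically generated with a framework such as PyO3. As a stdlib-only illustration of the same idea, Python's ctypes can bind to a compiled C library directly; here it binds sqrt from the system math library (library lookup is platform-dependent, and this sketch assumes a Unix-like system):

```python
import ctypes
import ctypes.util

# Locate the system C math library; fall back to symbols already loaded
# into the current process if lookup fails.
_path = ctypes.util.find_library("m")
libm = ctypes.CDLL(_path) if _path else ctypes.CDLL(None)

# Declare the foreign function's signature so ctypes converts
# arguments and the return value correctly.
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

def fast_sqrt(x):
    """Call the compiled sqrt through the binding."""
    return libm.sqrt(x)
```

A real binding layer like PyO3 automates exactly this kind of signature declaration and data conversion, which is why calling Rust from Python can feel like calling any other Python function.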
Many traditional GIS tools are written in C++, and for good reason - C++ offers excellent performance and has been the go-to language for computationally intensive tasks for decades. However, Rust brings some unique advantages to the table. While matching C++'s performance, Rust's compiler enforces memory safety and thread safety at compile time, preventing many common programming errors before they can become runtime bugs. This is particularly valuable when working with large spatial datasets where data integrity is crucial.
Additionally, Rust's modern tooling and package management system makes it easier to create and maintain bindings compared to C++. The language's focus on safe concurrency also makes it particularly well-suited for parallel processing of spatial data, an increasingly important consideration as datasets continue to grow in size and complexity.
To demonstrate the performance difference between Python and Rust, I performed a simple GIS task: creating buffers around 1 million point geometries, and checking how many of the buffers overlapped with each other. The results were striking - the Rust implementation completed in just 2 seconds, while the Python version took 84 seconds to finish the same task.
Results from the performance comparison.
The implementation owes its speed to Rust's compiled execution and its safe parallelism, which lets the buffer and overlap checks run across threads. The Rust function is then exposed to Python through bindings, so from the Python side it is called like any ordinary function.
By incorporating Rust into a Python-based workflow, we can leverage the strengths of both languages. Python remains the glue that holds the workflow together, providing ease of use and flexibility, while Rust handles the heavy lifting where performance is critical. This combination allows us to build robust GIS applications that are both user-friendly and highly efficient.