Overview

The PDAL Worker is a specialized service for processing point cloud and geospatial data using PDAL (Point Data Abstraction Library) and GDAL. It handles computationally intensive tasks like LiDAR processing, format conversion, and spatial analysis.

Technology Stack

  • Core Library: PDAL for point cloud processing
  • Spatial Library: GDAL for raster and vector data
  • Runtime: Node.js with worker threads
  • Processing: Native C++ libraries via bindings
  • Storage: Local filesystem with cloud storage support
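
As a sketch of the worker-thread model mentioned above, the snippet below hands one job to a Node.js worker thread so heavy processing stays off the main event loop. The worker script path and job payload are illustrative placeholders, not the service's actual names.

import { Worker } from "node:worker_threads";

// Spawn a worker thread for one CPU-intensive job. The script path
// and payload shape are hypothetical placeholders.
const worker = new Worker("./dist/processJob.js", {
  workerData: { input: "data/input.laz", output: "data/output.las" },
});

worker.on("message", (result) => console.log("job finished:", result));
worker.on("error", (err) => console.error("job failed:", err));
worker.on("exit", (code) => {
  if (code !== 0) console.error(`worker exited with code ${code}`);
});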

Key Features

Point Cloud Processing

  • Format Conversion: LAS, LAZ, PLY, and other point cloud formats
  • Filtering: Classification, noise removal, ground extraction
  • Transformation: Coordinate system conversion, scaling, rotation
  • Analysis: Density calculation, height statistics, feature extraction

Raster Processing

  • Format Support: GeoTIFF, JPEG2000, ECW, and other raster formats
  • Reprojection: Coordinate system transformations
  • Mosaicking: Combining multiple raster datasets
  • Analysis: Terrain analysis, vegetation indices, change detection
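
A reprojection, for instance, can be done by shelling out to GDAL's gdalwarp utility. This is a minimal sketch that assumes gdalwarp is on the container's PATH; file paths are illustrative.

import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Reproject a GeoTIFF to the target SRS via gdalwarp.
async function reproject(input: string, output: string, targetSRS: string) {
  await run("gdalwarp", ["-t_srs", targetSRS, input, output]);
}

await reproject("data/input.tif", "data/output.tif", "EPSG:4326");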

Vector Processing

  • Format Conversion: Shapefile, GeoJSON, GML, and other vector formats
  • Geometry Operations: Buffering, intersection, union operations
  • Attribute Processing: Field calculations and data manipulation
  • Spatial Analysis: Proximity analysis, overlay operations
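
Format conversion follows the same pattern with GDAL's ogr2ogr (output format, then destination, then source); file names here are illustrative.

import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Convert a shapefile to GeoJSON: ogr2ogr -f <format> <dst> <src>.
await run("ogr2ogr", ["-f", "GeoJSON", "data/parcels.geojson", "data/parcels.shp"]);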

Architecture

Processing Pipeline

Jobs flow through a common pipeline: the FileResolver locates and validates the input file, the Processor dispatches the job to the PDAL or GDAL engine, and results are written to the configured output path, with temporary files cleaned up afterward.

Project Structure

packages/worker/src/
├── core/
│   ├── FileResolver.ts        # Input file handling
│   ├── Processor.ts           # Main processing logic
│   └── types.ts               # TypeScript definitions
├── utils/
│   └── withCatch.ts           # Error handling utilities
├── GdalEngine.ts              # Raster processing engine
├── GdalProcessor.ts           # GDAL operations
├── PdalProcessor.ts           # PDAL operations
└── index.ts                   # Service entry point

Processing Engines

PDAL Engine

Handles point cloud data with operations like:
  • Readers: LAS, LAZ, PLY, PCD file formats
  • Writers: Output to various point cloud formats
  • Filters: Classification, decimation, outlier removal
  • Pipelines: Chain multiple operations together
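
For example, a PDAL pipeline is a JSON array of stages executed in order. The sketch below (file names illustrative) reads LAZ, keeps ground returns, drops statistical outliers, and writes LAS, with the reader and writer inferred from the file extensions.

// PDAL pipeline expressed as a TypeScript object; serialized to JSON it
// can be run with `pdal pipeline pipeline.json`.
const pipeline = {
  pipeline: [
    "data/input.laz",                                          // readers.las inferred
    { type: "filters.range", limits: "Classification[2:2]" },  // ground returns
    { type: "filters.outlier", method: "statistical" },        // noise removal
    "data/output.las",                                         // writers.las inferred
  ],
};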

GDAL Engine

Handles raster and vector data with operations like:
  • Raster: GeoTIFF processing, reprojection, mosaicking
  • Vector: Shapefile operations, geometry processing
  • Analysis: Terrain analysis, spatial statistics

Configuration

Environment Variables

# Service configuration
NODE_ENV=development
PDAL_SERVER_PORT=3002

# Storage paths
GEOFLOW_DATA_PATH=/app/storage/data
GEOFLOW_TEMP_PATH=/app/storage/temp

# Processing limits
MAX_FILE_SIZE=1GB
MAX_CONCURRENT_JOBS=4
TIMEOUT_MINUTES=30
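
Values such as MAX_FILE_SIZE=1GB arrive as strings, so the worker has to normalize them. The parser below is a minimal sketch; the unit table and defaults are assumptions, not the service's actual implementation.

// Turn a "1GB"-style string into bytes.
const UNITS: Record<string, number> = { KB: 2 ** 10, MB: 2 ** 20, GB: 2 ** 30 };

function parseSize(value: string): number {
  const match = /^(\d+)\s*(KB|MB|GB)$/i.exec(value.trim());
  if (!match) throw new Error(`invalid size: ${value}`);
  return Number(match[1]) * UNITS[match[2].toUpperCase()];
}

const maxFileSize = parseSize(process.env.MAX_FILE_SIZE ?? "1GB");
const maxConcurrentJobs = Number(process.env.MAX_CONCURRENT_JOBS ?? "4");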

Docker Configuration

geoflow-worker:
  build:
    context: .
    dockerfile: packages/worker/Dockerfile
  environment:
    - NODE_ENV=development
    - PDAL_SERVER_PORT=3002
  volumes:
    - ./storage/data:/app/storage/data:rw
    - ./storage/temp:/app/storage/temp:rw
  ports:
    - "3002:3002"

API Endpoints

Processing Jobs

// Start point cloud processing
POST /api/process/pointcloud
{
  "input": "data/input.laz",
  "output": "data/output.las",
  "pipeline": {
    "filters": [
      {"type": "filters.classification", "classification": 2},
      {"type": "filters.outlier"}
    ]
  }
}

// Start raster processing
POST /api/process/raster
{
  "input": "data/input.tif",
  "output": "data/output.tif",
  "operation": "reproject",
  "targetSRS": "EPSG:4326"
}
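
From a client, either endpoint is a plain JSON POST. This sketch assumes the response carries a job identifier (field name assumed) for use with the job management endpoints below.

// Submit a raster reprojection job to the worker.
const res = await fetch("http://localhost:3002/api/process/raster", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    input: "data/input.tif",
    output: "data/output.tif",
    operation: "reproject",
    targetSRS: "EPSG:4326",
  }),
});
const { jobId } = await res.json(); // response shape assumed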

Job Management

// Get job status
GET /api/jobs/:jobId

// Cancel job
POST /api/jobs/:jobId/cancel

// List active jobs
GET /api/jobs
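
A simple client-side poll against the status endpoint might look like this; the status field and its values are assumptions about the response shape.

// Poll the job until it leaves a queued/running state.
async function waitForJob(jobId: string, intervalMs = 2000) {
  for (;;) {
    const res = await fetch(`http://localhost:3002/api/jobs/${jobId}`);
    const job = await res.json();
    if (job.status !== "queued" && job.status !== "running") return job;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}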

Development

Local Development

# Install dependencies
bun install

# Start development server
bun run dev

# Run tests
bun test

# Build for production
bun run build

Testing Processing

# Test PDAL installation
pdal info data/sample.laz

# Test GDAL installation
gdalinfo data/sample.tif

# Run processing test
curl -X POST http://localhost:3002/api/process/test

Performance Optimization

Memory Management

  • Streaming Processing: Large files processed in chunks
  • Temporary Files: Intermediate results stored efficiently
  • Resource Limits: Configurable memory and CPU limits
  • Cleanup: Automatic cleanup of temporary files
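
In Node.js terms, streaming means piping chunks rather than buffering whole files. The copy below stands in for a real chunked processing step; paths are illustrative.

import { createReadStream, createWriteStream } from "node:fs";
import { pipeline } from "node:stream/promises";

// Move a large input into the temp area chunk by chunk; memory use
// stays flat regardless of file size.
await pipeline(
  createReadStream("/app/storage/data/input.laz"),
  createWriteStream("/app/storage/temp/input.laz"),
);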

Parallel Processing

  • Worker Threads: Multiple processing threads for CPU-intensive tasks
  • Job Queue: Asynchronous job processing with prioritization
  • Load Balancing: Distribute work across multiple worker instances
  • Resource Monitoring: Track CPU, memory, and disk usage
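
The concurrency cap can be pictured as a semaphore around job execution. This is a minimal in-process sketch; the real queue with prioritization and load balancing is more involved.

// Never run more than `limit` tasks at once; excess tasks wait in FIFO order.
class JobQueue {
  private available: number;
  private waiting: Array<() => void> = [];

  constructor(limit: number) {
    this.available = limit;
  }

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.available > 0) {
      this.available--;
    } else {
      await new Promise<void>((resolve) => this.waiting.push(resolve));
    }
    try {
      return await task();
    } finally {
      const next = this.waiting.shift();
      if (next) next(); // hand the slot straight to a waiter
      else this.available++;
    }
  }
}

const queue = new JobQueue(Number(process.env.MAX_CONCURRENT_JOBS ?? "4"));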

Caching Strategies

  • Result Caching: Cache frequently used processing results
  • Pipeline Optimization: Reuse optimized processing pipelines
  • Format Detection: Cache file format information

Supported Formats

Point Cloud Formats

  • LAS/LAZ: Standard LiDAR formats
  • PLY: Polygon file format
  • PCD: Point Cloud Data format
  • TXT: ASCII point cloud data
  • E57: ASTM E57 file format

Raster Formats

  • GeoTIFF: Georeferenced TIFF
  • JPEG2000: Compressed raster format
  • ECW: Enhanced Compression Wavelet
  • MrSID: Multi-resolution Seamless Image Database
  • NetCDF: Network Common Data Form

Vector Formats

  • Shapefile: ESRI shapefile format
  • GeoJSON: JSON-based geospatial format
  • GML: Geography Markup Language
  • KML: Keyhole Markup Language
  • GPX: GPS Exchange Format

Error Handling

Processing Errors

  • File Corruption: Detect and report corrupted input files
  • Format Mismatch: Validate input file formats
  • Processing Failures: Detailed error messages with recovery suggestions
  • Timeout Handling: Automatic job cancellation on timeouts

Recovery Strategies

  • Retry Logic: Automatic retry for transient failures
  • Partial Results: Save intermediate results on failures
  • Rollback: Clean up failed processing attempts
  • Logging: Comprehensive error logging for debugging
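
A retry helper with exponential backoff is the usual shape for the first point; attempt counts and delays below are illustrative defaults, not the service's configuration.

// Retry a task on failure, doubling the delay between attempts.
async function withRetry<T>(
  task: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 1_000,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      if (attempt >= attempts) throw err; // exhausted: surface the error
      const delay = baseDelayMs * 2 ** (attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}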

Monitoring & Metrics

Health Checks

  • Service Health: HTTP endpoint for load balancer health checks
  • Processing Status: Real-time job status monitoring
  • Resource Usage: CPU, memory, and disk usage tracking
  • Queue Status: Job queue length and processing rates

Logging

  • Structured Logs: JSON-formatted logs for analysis
  • Log Levels: Configurable verbosity (debug, info, warn, error)
  • Performance Logs: Processing time and resource usage
  • Error Tracking: Automatic error aggregation and alerting

Security Considerations

File Access

  • Path Validation: Prevent directory traversal attacks
  • File Type Validation: Strict input file type checking
  • Size Limits: Configurable maximum file sizes
  • Quarantine: Suspicious files isolated for inspection
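
Path validation boils down to resolving the requested path against the data root and rejecting anything that escapes it. A minimal sketch, assuming GEOFLOW_DATA_PATH as the root:

import path from "node:path";

const DATA_ROOT = path.resolve(process.env.GEOFLOW_DATA_PATH ?? "/app/storage/data");

// Resolve a user-supplied path and refuse ../ traversal out of the root.
function resolveSafe(requested: string): string {
  const resolved = path.resolve(DATA_ROOT, requested);
  if (resolved !== DATA_ROOT && !resolved.startsWith(DATA_ROOT + path.sep)) {
    throw new Error(`path escapes data root: ${requested}`);
  }
  return resolved;
}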

Processing Security

  • Sandboxing: Processing runs in isolated environments
  • Resource Limits: Prevent resource exhaustion attacks
  • Input Sanitization: Validate all processing parameters
  • Access Control: API authentication and authorization

Troubleshooting

Common Issues

  • PDAL/GDAL Not Found: Ensure the native libraries are properly installed in the Docker image
  • Memory Exhaustion: Increase Docker memory limits or reduce concurrent jobs
  • File Permission Errors: Check volume mount permissions in Docker
  • Processing Timeouts: Increase timeout limits for large files

Debug Commands

# Check PDAL version
pdal --version

# Validate pipeline
pdal pipeline --validate pipeline.json

# Test file info
pdal info input.laz

# View processing logs
docker compose logs geoflow-worker