Configuration¶
Configuration is layered:
- Bundled defaults at src/oex/defaults/base.yaml.
- A user YAML (passed via --config or auto-located at configs/<iso3>.yaml).
- CLI overrides (--iso3, --hdx-push, --output-dir, etc.).
Layers merge with OmegaConf, with one
deviation: when a user YAML provides a categories list, it replaces the
default list rather than element-wise merging.
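As a rough illustration of the precedence order, the sketch below merges three plain dictionaries using only the standard library. It does not call OmegaConf; the helper name `merge_layers`, the `output.dir` key and the sample values are assumptions made for this example. It only mirrors the behaviour described above: later layers win, and a user-supplied `categories` list replaces the default list.

```python
from copy import deepcopy

# Keys that are replaced wholesale instead of being merged (per the text above).
REPLACE_WHOLESALE = {"categories"}


def merge_layers(base, override):
    """Return a new dict: `base` updated by `override`, nested dicts merged recursively."""
    result = deepcopy(base)
    for key, value in override.items():
        if key in REPLACE_WHOLESALE:
            # The user's list replaces the default list entirely.
            result[key] = deepcopy(value)
        elif isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge_layers(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


# Hypothetical layers: bundled defaults, user YAML, CLI overrides.
defaults = {
    "output": {"dir": "out", "formats": ["gpkg", "shp"]},
    "categories": [{"name": "Buildings"}, {"name": "Roads"}],
}
user_yaml = {"iso3": "NPL", "categories": [{"name": "Schools"}]}
cli = {"output": {"dir": "/tmp/export"}}

config = merge_layers(merge_layers(defaults, user_yaml), cli)
print(config["categories"])  # only the user's categories survive
```

Nested keys the later layers do not mention (here `output.formats`) keep their default values.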
Mental model: filter, where, select¶
- select lists the columns to keep in the output. Same on both sources. Pure SQL expressions, e.g. names.primary AS name (Overture) or tags['name'] AS name (OSM).
- where is an optional SQL filter on the rows that come out of the source. Combined with the implicit country bbox and boundary clip.
- filter (OSM only) is the OSM tag filter passed to quackosm at parquet build time. It decides which OSM features ever land in the cache. This is the OSM equivalent of a row filter, but it runs during PBF to GeoParquet conversion, not at DuckDB query time. Most categories only need filter; osm.where stays empty.
For Overture there is no filter: the upstream (theme, feature_type) pair
already names a partitioned subset, so where is the only filter.
Boundary¶
boundary:
geom: null # optional inline GeoJSON (string)
geoboundaries_release: CGAZ
geoboundaries_level: ADM0
buffer_meters: 0 # outward buffer in metres (0 = off)
If geom is set (a GeoJSON string), it overrides the geoBoundaries lookup;
otherwise the boundary comes from geoBoundaries CGAZ ADM0 for the ISO3.
buffer_meters is an outward buffer applied to whatever boundary you end up
with. The geometry is reprojected from EPSG:4326 to EPSG:3857, buffered by
the given metre value, then reprojected back. 0 disables it. Use this for
coastal countries or for cross-border features whose centroid sits a few
hundred metres outside the legal boundary (jetties, bridges, airfields).
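The reproject, buffer, reproject-back sequence can be sketched with the standard Web Mercator formulas. This is a simplified illustration, not the tool's code: it expands a lon/lat bounding box instead of buffering a polygon, and the function names are invented for the example.

```python
import math

R = 6378137.0  # sphere radius used by EPSG:3857, in metres


def to_3857(lon, lat):
    """Project lon/lat degrees (EPSG:4326) to Web Mercator metres."""
    x = R * math.radians(lon)
    y = R * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y


def to_4326(x, y):
    """Inverse of to_3857: Web Mercator metres back to lon/lat degrees."""
    lon = math.degrees(x / R)
    lat = math.degrees(2 * math.atan(math.exp(y / R)) - math.pi / 2)
    return lon, lat


def buffered_bbox(bbox, buffer_meters):
    """Expand a (west, south, east, north) box outward by buffer_meters in EPSG:3857."""
    if buffer_meters == 0:
        return bbox  # 0 disables the buffer
    west, south, east, north = bbox
    x0, y0 = to_3857(west, south)
    x1, y1 = to_3857(east, north)
    w, s = to_4326(x0 - buffer_meters, y0 - buffer_meters)
    e, n = to_4326(x1 + buffer_meters, y1 + buffer_meters)
    return (w, s, e, n)


# Illustrative box roughly around Nepal, buffered by 1 km.
print(buffered_bbox((80.0, 26.0, 88.0, 30.0), 1000))
```

One consequence worth keeping in mind: Web Mercator stretches distances by roughly 1/cos(latitude), so if the buffer is applied in EPSG:3857 as described, a given buffer_meters covers somewhat less ground at high latitudes than at the equator.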
A note on engines: the buffer drives the polygon used to extract the
country PBF. For source.osm.engine: geofabrik the country PBF comes
pre-cut from Geofabrik (a small overlap beyond the legal border is
included, but it's bounded). Switch to engine: planet if you need a
larger buffer than Geofabrik's slice; the buffered polygon is passed
straight to osmium extract. Overture is not affected; it reads from
the global S3 bucket.
Output formats¶
output.formats (or per-category formats) accepts any subset of:
| Format | Notes |
|---|---|
| gpkg | GeoPackage. Single file, all geometry types together. Recommended default. |
| shp | ESRI Shapefile. Split by geometry type. Field names truncated to 10 chars. |
| geojson | Single-file text. Easy to inspect, can be large. |
| kml | Opens in Google Earth and most desktop GIS. Single XML file; prefer gpkg above ~1M features. |
Default is [gpkg, shp].
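The fallback order for formats can be pictured as follows. This is a minimal sketch with a hypothetical helper name, assuming a per-category formats list takes priority over output.formats and that unknown format names are rejected; the tool's own validation may differ.

```python
ALLOWED_FORMATS = {"gpkg", "shp", "geojson", "kml"}
DEFAULT_FORMATS = ["gpkg", "shp"]


def resolve_formats(output_formats=None, category_formats=None):
    """Pick the per-category list if set, else output.formats, else the default."""
    chosen = category_formats or output_formats or DEFAULT_FORMATS
    unknown = set(chosen) - ALLOWED_FORMATS
    if unknown:
        raise ValueError(f"unsupported formats: {sorted(unknown)}")
    return list(chosen)


print(resolve_formats())                               # falls back to the default
print(resolve_formats(["gpkg", "geojson"], ["kml"]))   # category override wins
```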
Top-level keys¶
| Key | Type | Notes |
|---|---|---|
| iso3 | string | Required. ISO3 country code. |
| key | string | Required. Slug for HDX dataset names. |
| dataset_name | string \| null | Pretty country/region name in HDX titles. |
| subnational | bool | Sets HDX subnational flag. |
| frequency | string | HDX expected update frequency. |
| boundary | block | See BoundaryConfig. |
| output | block | Output directory and format list. |
| parallel | block | DuckDB threads + memory + thread pool toggle. |
| duckdb | block | http retry/timeout, temp dir, object cache. |
| logging | block | level, format string. |
| hdx | block | HDX site, push toggle, credentials. |
| source | overture + osm | Per-source settings (release, cache dir). |
| categories | list | Per-theme name, hdx, overture, osm. |
Categories¶
Each category carries:
- name: human label (also used as the HDX dataset suffix and as the OSM cache parquet filename).
- formats: optional override for output.formats.
- hdx: title, notes, tags, license, license_url, caveats.
- overture: enabled, theme, feature_type, select, where.
- osm: enabled, filter, select, where.
Both overture.select and osm.select are pure SQL fragments. The geometry
column is appended automatically.
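To make the role of select and where concrete, here is a hedged sketch of how the two fragments might be combined into one query string. The function and table names are hypothetical, select is assumed to be a list of SQL expressions, and the implicit country bbox and boundary clip are left out for brevity.

```python
def build_query(table, select, where=None):
    """Compose a SELECT from user fragments; geometry is always appended."""
    columns = ", ".join(list(select) + ["geometry"])
    sql = f"SELECT {columns} FROM {table}"
    if where:
        sql += f" WHERE {where}"
    return sql


# OSM-style fragments run against the MAP-typed tags column.
print(build_query(
    "osm_cache",
    ["tags['name'] AS name", "tags['building'] AS building"],
    where="tags['building'] IS NOT NULL",
))
```

Because the fragments are pasted into SQL as-is, any expression the engine accepts can appear in select, which is why aliases such as AS name work on both sources.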
OSM source schema (what your SELECT runs against)¶
The OSM cache produced by quackosm has three columns:
| Column | Type | What it is |
|---|---|---|
| feature_id | VARCHAR | OSM type + id, e.g. node/12345 |
| tags | MAP<VARCHAR, VARCHAR> | All retained OSM tags as key->value |
| geometry | geometry | POINT, LINESTRING, POLYGON, MULTIPOLYGON, ... |
tags is a MAP<VARCHAR, VARCHAR>. Access a key with tags['name'] AS name; the
lookup returns NULL when the key is absent. Refer to the OSM tag wiki for which
keys exist; common ones include building, highway, amenity, waterway, landuse,
place, aeroway, railway, name, name:en, addr:*, source.
osm.filter accepts the quackosm tag-filter shape:
osm:
filter:
building: true # any value of `building`
highway: ["primary", "secondary"] # only these values
amenity: ["hospital", "clinic"]
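The meaning of the filter shape above can be illustrated in plain Python. This is a sketch of the matching rule as described here, not quackosm's implementation; it assumes a feature is kept when any one key in the filter matches, and the function name is invented for the example.

```python
def matches_filter(tags, tag_filter):
    """Return True if a feature's tags satisfy at least one filter entry."""
    for key, wanted in tag_filter.items():
        if key not in tags:
            continue
        if wanted is True:
            return True  # any value of this key is accepted
        if isinstance(wanted, list) and tags[key] in wanted:
            return True  # only the listed values are accepted
    return False


tag_filter = {
    "building": True,
    "highway": ["primary", "secondary"],
    "amenity": ["hospital", "clinic"],
}

print(matches_filter({"building": "yes"}, tag_filter))
print(matches_filter({"highway": "residential"}, tag_filter))
```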
Overture source schema (what your SELECT runs against)¶
Overture publishes parquet at s3://overturemaps-us-west-2/release/<release>/theme=<theme>/type=<feature_type>/.
Each (theme, feature_type) has a documented column set. The current release
exposes:
| Theme | Feature type | Notable columns |
|---|---|---|
| addresses | address | id, country, postcode, street, number, unit |
| base | bathymetry | id, depth |
| base | infrastructure | id, names, subtype, class |
| base | land | id, names, subtype, class |
| base | land_cover | id, subtype, cartography.{min,max}_zoom |
| base | land_use | id, names, subtype, class, surface |
| base | water | id, names, subtype, class, is_salt, wikidata |
| buildings | building | id, names, class, subtype, height, num_floors, roof_* |
| buildings | building_part | id, height, num_floors |
| divisions | division | id, names, subtype, country, region, population, wikidata |
| divisions | division_area | id, names, subtype, country, region |
| divisions | division_boundary | id, subtype, class |
| places | place | id, names, categories, addresses, phones, websites |
| transportation | connector | id |
| transportation | segment | id, names, class, subclass, subtype, road_surface |
For the authoritative schema (including types and nested struct shapes),
see the Overture Maps schema reference.
Note that types are renamed across releases (e.g. boundary became
division_boundary in 2026-04-15.0), so pin source.overture.release if
your config relies on a specific schema.
Parallel and memory¶
parallel:
enabled: true
threads: null # null = adaptive: always 1 worker
memory_gb: null # null = adaptive: 60% of total RAM as DuckDB memory limit
threads: null and memory_gb: null both use adaptive defaults. The adaptive
logic always runs one DuckDB worker (DuckDB already parallelises every
operation internally) and allocates 60% of total RAM to the DuckDB memory
limit. This leaves headroom for GDAL writes and the GEOS boundary fallback
during pcode tagging.
Inside Docker, set OEX_MEMORY_GB to the container's --memory value so the
adaptive calculation uses the container limit rather than host RAM.
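The adaptive memory rule can be sketched as a small function. This is an approximation under stated assumptions, not the tool's code: the function name is hypothetical, an explicit memory_gb is assumed to be used verbatim, and OEX_MEMORY_GB is assumed to stand in for total RAM before the 60% factor is applied.

```python
import os


def duckdb_memory_limit_gb(host_ram_gb, memory_gb=None, env=None):
    """Return the DuckDB memory limit in GB under the adaptive rule."""
    env = os.environ if env is None else env
    if memory_gb is not None:
        return float(memory_gb)  # explicit setting, no adaptive logic
    # Inside a container, OEX_MEMORY_GB replaces the host's total RAM.
    total = float(env.get("OEX_MEMORY_GB", host_ram_gb))
    return round(total * 0.6, 2)


print(duckdb_memory_limit_gb(64, env={}))                       # host RAM only
print(duckdb_memory_limit_gb(64, env={"OEX_MEMORY_GB": "8"}))   # container limit
```

Without the environment variable, a container capped at 8 GB on a 64 GB host would be given a limit far above what the container can actually use.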
Pcode tagging¶
source:
pcodes:
enabled: true # default false; true in the HOT schema
levels: [1, 2, 3, 4]
cache_dir: data/pcodes
boundary_resolution: geos # or 'h3_neighbor'
When enabled, every feature gets six extra columns:
| Column | Example value | What it is |
|---|---|---|
| adm0_pcode | NPL | ISO3 country code |
| adm0_name | Nepal | Country name |
| adm1_pcode | NP-BA | First subdivision pcode |
| adm1_name | Bagmati | First subdivision name |
| adm2_pcode | NP-BA-KA | Second subdivision pcode |
| adm2_name | Kathmandu | Second subdivision name |
adm3 and adm4 columns follow the same pattern; many countries have
null values at those levels.
Pcode data comes from fieldmaps.io edge-matched humanitarian boundaries, downloaded once and cached locally.
Boundary resolution¶
The H3 hash join covers 95-99% of features. For the remainder, whose H3 centroid
cell isn't owned by any admin (they sit on the seam between two adjacent admin
polygons), boundary_resolution picks the strategy:
| Value | What it does | Trade-off |
|---|---|---|
| h3_neighbor | Look up the 6 neighbour H3 cells; assign the admin that the most neighbours belong to. | Pure hash join, memory-bounded. Up to ~5 km of slack at admin borders. |
| geos | ST_Contains(admin_geom, centroid) against the admin polygon. | Correct to the metre. Spatial nested-loop join; can OOM on large countries (e.g. CHN at 20 GB). |
Default is geos. The bundled defaults and the HOT schema set
boundary_resolution: h3_neighbor on high-cardinality categories (buildings,
roads, waterways, land use, rivers) so big countries don't OOM; smaller
categories inherit geos for precise borders.
categories:
- name: Buildings
boundary_resolution: h3_neighbor # memory-safe for millions of features
- name: Schools
# inherits geos
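The h3_neighbor strategy amounts to a majority vote among neighbouring cells. The sketch below shows only that voting step, with made-up cell ids and pcodes; it does not use the H3 library, and the function name and tie-breaking (first admin seen wins) are assumptions for the example.

```python
from collections import Counter


def resolve_by_neighbors(neighbor_cells, cell_to_admin):
    """Assign the admin owning the most neighbour cells, or None if none is owned."""
    votes = Counter(
        cell_to_admin[cell] for cell in neighbor_cells if cell in cell_to_admin
    )
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Hypothetical lookup: which admin owns each neighbouring cell.
cell_to_admin = {
    "c1": "NP-BA", "c2": "NP-BA", "c3": "NP-BA",
    "c4": "NP-GA", "c5": "NP-GA",
}
neighbors = ["c1", "c2", "c3", "c4", "c5", "c6"]  # c6 is owned by no admin

print(resolve_by_neighbors(neighbors, cell_to_admin))
```

Since the vote works on whole cells rather than exact geometry, a feature near a border can be assigned to the neighbouring admin, which is the slack mentioned in the table.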
Custom schemas¶
Three ways to plug in your own category set:
- Inline categories: in your country YAML (replaces defaults wholesale).
- categories_file: path/to/schema.yaml on the country YAML.
- Both: categories_file loads the base set, then an inline categories: block overrides for that country.
Each category needs name, plus any of:
- formats (list): override the global output.formats for this category.
- hdx: HDX metadata (title, notes, tags, license, license_url, caveats).
- overture: theme, feature_type, select (SQL), where (SQL).
- osm: filter (quackosm tag filter), select (SQL), where (SQL).
OSM source: engines¶
source:
osm:
engine: geofabrik # or planet
cache_dir: data/osm
snapshot: latest
geofabrik_clip_to_boundary: true
pbf_path: null # required for engine: planet or planet_fallback
planet_fallback: false # try geofabrik, fall back to planet on 404
auto_download_planet: false # when true, download the planet PBF if pbf_path is missing
geofabrik (default): no pre-build. First run per country downloads the
country PBF from Geofabrik and runs quackosm once per category. Cache layout:
<cache_dir>/geofabrik/<iso3>/<snapshot>/<category-slug>.parquet.
planet: clips a country PBF out of a local planet PBF using
osmium extract --strategy=complete_ways, then runs quackosm once with
the union of all category tag filters and keep_all_tags=True. Per-category
extraction at query time is a tag-predicate WHERE on the resulting
<cache_dir>/planet/<iso3>/<snapshot>/country.parquet. Requires the
osmium-tool binary on PATH (one-time dnf install osmium-tool /
apt install osmium-tool / brew install osmium-tool).
planet_fallback: true: keeps engine: geofabrik as primary and only
switches to the planet path when Geofabrik does not publish the country
(e.g. some small territories). Other Geofabrik failures (network errors,
rate limits) are not swallowed.
Download the planet PBF once with oex-cli osm-build-cache; the result
lives at <cache_dir>/_pbf/planet-latest.osm.pbf and you point
pbf_path at it.
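The two cache layouts described above can be written down as a small path helper. This is a sketch with a hypothetical function name; the category slug is taken as an input because how it is derived from the category name is not specified here.

```python
from pathlib import Path


def cache_path(cache_dir, engine, iso3, snapshot, category_slug=None):
    """Build the parquet cache path for the geofabrik or planet engine."""
    root = Path(cache_dir) / engine / iso3 / snapshot
    if engine == "geofabrik":
        if category_slug is None:
            raise ValueError("geofabrik caches one parquet per category")
        return root / f"{category_slug}.parquet"
    if engine == "planet":
        # One shared parquet; categories are filtered at query time.
        return root / "country.parquet"
    raise ValueError(f"unknown engine: {engine!r}")


print(cache_path("data/osm", "geofabrik", "npl", "latest", "buildings").as_posix())
print(cache_path("data/osm", "planet", "npl", "2026-05-01").as_posix())
```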
Pinning a release / snapshot¶
Both sources resolve to the latest data by default. Pin a specific version in the country YAML when you need a reproducible run:
source:
overture:
release: 2026-04-15.0 # default 'latest' -> resolved from S3
osm:
snapshot: 2026-05-01 # default 'latest'
Resolution rules:
- Overture release: any literal release like 2026-04-15.0 is used verbatim with no lookup. latest lists the public S3 bucket and picks the highest YYYY-MM-DD.N.
- OSM snapshot for planet: defaults to the planet PBF's mtime as an ISO date. An explicit value pins the cache directory name; subsequent runs reuse <cache>/planet/<iso3>/<snapshot>/country.parquet without reclipping the planet.
- OSM snapshot for geofabrik: this is a label for the per-country cache dir. Geofabrik only publishes *-latest.osm.pbf URLs (no historical archive), so a fresh build always pulls today's PBF regardless of the label. To truly pin an OSM date, use engine: planet with a planet PBF downloaded on the date you want.
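The Overture rule above (literal values pass through, latest picks the highest YYYY-MM-DD.N) can be sketched without touching S3. The function name is hypothetical and the list of available releases is made up for the example; in practice it would come from listing the bucket.

```python
import re

RELEASE_PATTERN = re.compile(r"^(\d{4})-(\d{2})-(\d{2})\.(\d+)$")


def resolve_release(requested, available):
    """Return `requested` verbatim unless it is 'latest'; then pick the highest release."""
    if requested != "latest":
        return requested
    parsed = []
    for name in available:
        match = RELEASE_PATTERN.match(name)
        if match:
            # Compare numerically so that .10 sorts after .9.
            parsed.append((tuple(int(part) for part in match.groups()), name))
    if not parsed:
        raise ValueError("no release matching YYYY-MM-DD.N found")
    return max(parsed)[1]


# Made-up listing, including an entry that is not a release.
listing = ["2026-03-18.0", "2026-04-15.9", "2026-04-15.10", "README"]
print(resolve_release("latest", listing))
```

Comparing the parsed integers rather than the raw strings matters once a patch number reaches two digits.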
The resolved version is logged before any per-category work, e.g.:
Overture source: release=2026-04-15.0 bucket=overturemaps-us-west-2
OSM source: geofabrik IND, snapshot=2026-05-07, cache=...
It also lands inside every zip's README.txt (Source, Snapshot fields).
HDX publication¶
HDX push is off by default. Enable per run:
hdx:
push: true
site: prod # or 'demo'
api_key: ${oc.env:HDX_API_KEY}
owner_org: your-org
maintainer: your-username
user_agent: my-pipeline/1.0 # optional, defaults to oex
Each category supplies its own HDX metadata block:
- name: Buildings
hdx:
title: Buildings of Nepal
notes: |
Building footprints from Overture (OSM + Microsoft + Google + Esri)
and OpenStreetMap.
tags: [buildings, geodata]
license: hdx-odc-odbl # or a free-form license string
license_url: https://opendatacommons.org/licenses/odbl/1-0/
caveats: Verified at the community level only.
dataset_source: OpenStreetMap contributors # optional override
dataset_source is the value HDX displays under "Source" on the dataset
page. When unset, the runner supplies a default like
OpenStreetMap (Geofabrik IND 2026-05-07) for OSM exports or
Overture Maps Foundation 2026-04-15.0 for Overture. Override it to match
your organisation's standard (HOT-OSM uses the verbatim string
OpenStreetMap contributors).
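The defaulting behaviour for dataset_source can be summarised in a few lines. This sketch only mirrors the two example strings given above; the function name, its parameters and the engine_label argument are assumptions, and the label used for the planet engine is not documented here.

```python
def dataset_source(source, version, *, iso3=None, engine_label="Geofabrik", override=None):
    """Return the HDX "Source" string: the override if set, else a per-source default."""
    if override:
        return override
    if source == "osm":
        return f"OpenStreetMap ({engine_label} {iso3} {version})"
    if source == "overture":
        return f"Overture Maps Foundation {version}"
    raise ValueError(f"unknown source: {source!r}")


print(dataset_source("osm", "2026-05-07", iso3="IND"))
print(dataset_source("overture", "2026-04-15.0"))
print(dataset_source("osm", "2026-05-07", iso3="IND", override="OpenStreetMap contributors"))
```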
When both overture and osm are enabled for a category, both sources
contribute resources to the same HDX dataset (one zip per source per format).