Engineering

Hexagons, Hypertables, and 240 Dead Tags: Migrating a Maritime Data Platform to TimescaleDB

A hexagonal grid overlaid on an ocean chart, with vessel position dots clustered along shipping lanes

Every ship in the world is constantly shouting its name into the void. Its position, its heading, its speed, its destination — broadcast every few seconds via radio, picked up by satellites and shore stations, and funneled into databases that try to make sense of it all. At VesselAPI, we run one of those databases. And for the first year of our existence, we ran it on MongoDB. This is the story of why we stopped.

It's also a story about hexagons and a single mismatched struct tag that quietly broke an entire data pipeline.

The Shape of the Problem

AIS — the Automatic Identification System — is the backbone of maritime surveillance. Most commercial vessels are legally required to carry a transponder that broadcasts their identity and position — SOLAS mandates it for ships of 300 gross tonnage and upwards on international voyages, and many flag states extend the requirement further. The result is a firehose: at peak hours, we ingest roughly 700,000 position reports per hour. Each one is a point in space and time — latitude, longitude, timestamp, vessel identifier, plus speed, heading, and a handful of other fields. Position in time is the core of it.

If you squint at this data, it looks like a document. A position report has fields. You can serialize it as JSON. MongoDB will happily store it. And for the first few months, that was fine. We were building fast, the schema was changing daily, and MongoDB's flexibility was genuinely useful. Hard to have migration problems when there's nothing to migrate.

But here's the thing about vessel positions: they aren't documents. They're measurements. They have a timestamp and a location, and those two properties aren't just metadata — they're the entire point. The questions you ask of this data are fundamentally about time and space: Where was this ship two hours ago? What vessels are within 50 kilometers of Rotterdam right now? Show me everything that passed through the English Channel since Tuesday.

At the time, MongoDB had no native concept of any of this. (It has since added time-series collections, though they remain limited compared to purpose-built solutions.) It didn't understand that timestamps partition naturally into chunks, that old data expires, or that latitude and longitude define a point on a sphere where "within 50 kilometers" is a question with real mathematical structure. You can bolt on 2dsphere indexes and TTL policies, but you're fighting the grain of the database. And at 700,000 rows per hour, fighting the grain gets expensive fast.

What We Needed (and What Exists)

I wrote the requirements on a whiteboard one afternoon and stood back. Time-series ingestion at sustained throughput. Automatic partitioning by time. Compression of old data. Retention policies that don't involve cron. Spatial queries on a sphere. Full-text search. Relational joins. And ideally, something I could operate without a dedicated DBA. Looking at the list, I remember thinking: this is either one very specific database, or three separate ones duct-taped together.

I spent a week evaluating alternatives. InfluxDB handles time-series beautifully but its spatial support was experimental, living in Flux — which is now being deprecated in InfluxDB 3.0, taking the geo package with it. ClickHouse kept coming up in benchmarks but the operational overhead scared me, and PostGIS isn't an option there. MongoDB we already knew about.

TimescaleDB is PostgreSQL with a time-series engine bolted on at a level deep enough that it feels native. And because it is PostgreSQL, you get PostGIS for spatial queries, H3 for hexagonal indexing, GIN indexes for full-text search, and nearly thirty years of battle-tested relational database engineering. Turns out we didn't need three databases duct-taped together. We needed one.

We chose it. Then we had to figure out what "time-series thinking" actually means in practice.


Data With a Shelf Life

The central abstraction in TimescaleDB is the hypertable. From the outside, it looks like a regular PostgreSQL table. You INSERT into it, you SELECT from it, you index it. But underneath, the data is automatically partitioned into chunks — contiguous slices of time, each stored as a separate physical table.

I didn't appreciate how much this changes until I stopped thinking about storage and started thinking about expiry.

Our vessel_positions hypertable uses 1-hour chunks. That means every hour of AIS data lives in its own self-contained partition. When we set a 78-hour retention policy, TimescaleDB doesn't scan through millions of rows looking for old records to delete — it just drops the chunks that have aged out. The entire partition disappears. It takes milliseconds.

16.5 million vessel positions, 78-hour retention, 1-hour chunks. The table is always roughly the same size, no matter how long the system runs.
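The post doesn't show the DDL, but declaring this shape takes two calls. A sketch, assuming the table and column names used throughout this post:

```sql
-- Turn the plain table into a hypertable partitioned into 1-hour chunks
SELECT create_hypertable('vessel_positions', 'timestamp',
                         chunk_time_interval => INTERVAL '1 hour');

-- Expire data by dropping whole chunks once they age past 78 hours
SELECT add_retention_policy('vessel_positions', INTERVAL '78 hours');
```

Both are declarative: the background job scheduler inside TimescaleDB handles the actual chunk drops, so there is no external cleanup process to monitor.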

Compression works the same way. After a chunk is two hours old — meaning we're no longer actively writing to it — TimescaleDB compresses it automatically. We segment the compression by MMSI (the vessel's radio identifier, broadcast in every AIS message) and order by timestamp descending. This means "give me the latest position for vessel X" can be answered from the compressed data without decompressing the entire chunk. The storage savings are substantial; the performance improvement for time-range queries is even better, because the query planner knows which chunks to skip entirely.

ALTER TABLE vessel_positions SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'mmsi',
    timescaledb.compress_orderby = 'timestamp DESC'
);
SELECT add_compression_policy('vessel_positions', INTERVAL '2 hours');

In MongoDB, I had a cron job that ran a cleanup script every few hours. It failed silently for a week once and nobody noticed until disk usage alerted. In TimescaleDB, we just declare the retention policy and the database handles expiry itself. One fewer thing running at 3 AM that I have to worry about.

The Hexagon Problem

Here's a question that sounds simple: find all vessels within 100 kilometers of a given point.

PostGIS can answer this. You create a GIST spatial index on your geometry column, and the function ST_DWithin will find every point within a given distance. Under the hood, it uses the spatial index to eliminate obvious non-candidates via bounding box checks, then computes exact distances for the rest. It works. It's well-engineered.

But when your table has 16 million rows and new ones arrive at 12,000 per minute, "it works" isn't quite enough. The GIST index is good, but it still has to traverse a tree structure built on geometry — bounding boxes nested inside bounding boxes. For high-volume tables with constant inserts, this gets heavy.

So we added a layer in front of it. And that layer is made of hexagons.

H3 is a spatial indexing system originally developed at Uber for matching riders to drivers. It tiles the entire surface of the Earth with hexagons at multiple resolutions — coarser at low resolutions, finer at high ones. Every point on the planet falls inside exactly one hexagon at each resolution level, and each hexagon has a unique integer identifier.

We use resolution 5, where each hexagon has an edge length of roughly 9 kilometers and an area of roughly 253 square kilometers. The entire Earth is covered by about 2 million of these cells. Every vessel position, when it's inserted, gets a computed H3 cell stored alongside it:

h3_cell_res5 H3INDEX GENERATED ALWAYS AS (
    h3_lat_lng_to_cell(location::point, 5)
) STORED

That STORED keyword matters. The H3 cell is computed once, at insert time, and written to disk as an integer. No recalculation needed at query time. And because it's an integer, we can slap a plain B-tree index on it — the simplest, fastest index PostgreSQL knows how to build.
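A sketch of that index (the index name is hypothetical, and pairing it with timestamp is an assumption — a bare single-column index on the cell ID also works):

```sql
-- Plain B-tree over the stored integer cell ID
CREATE INDEX vessel_positions_h3_res5_idx
    ON vessel_positions (h3_cell_res5, timestamp DESC);
```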

Now, when someone asks for vessels within 100 kilometers of a point, the query doesn't go straight to the spatial index. First, we compute which H3 cells overlap the search area — a quick geometric calculation that returns a handful of integer IDs. Then we filter the table to only rows matching those cell IDs, using the B-tree index. Integer equality. Blazing fast. This turns 16 million candidate rows into a few thousand.

Then PostGIS takes over, running ST_DWithin on the survivors for exact distance calculations. A few thousand rows through a precise spatial filter is trivial.

-- Stage 1: H3 pre-filter (integer comparison, B-tree)
h3_cell_res5 = ANY(ARRAY(
    SELECT h3_grid_disk(h3_lat_lng_to_cell(ST_MakePoint($1,$2)::point, 5), $6)
))
-- Stage 2: PostGIS exact filter (geometry, GIST)
AND ST_DWithin(location, ST_SetSRID(ST_MakePoint($1,$2), 4326)::geography, $3)
-- Stage 3: TimescaleDB chunk pruning (time range)
AND timestamp BETWEEN $4 AND $5

Three layers of filtering, each narrowing the candidate set for the next: H3 knocks it down from millions to thousands, PostGIS from thousands to hundreds, and chunk exclusion keeps you from scanning data outside the time window entirely.

Why hexagons, specifically? Because hexagons are the only regular polygon that tiles a plane with uniform adjacency — every neighbor shares an edge, and the distance from center to center is the same in every direction. Squares have diagonal neighbors that sit further away than edge neighbors, which distorts distance calculations. For spatial proximity queries, hexagons give you the least distortion. Uber didn't invent this insight — anyone who's looked at a honeycomb has seen it — but they did build a production-grade library around it.

We tried resolution 4 first. The cells were too big — a single cell covered so much ocean that the pre-filter wasn't filtering much. Resolution 6 was better spatially but generated too many cells per query, and the B-tree had to check them all. Resolution 5 was the one where a 100-kilometer radius query overlapped a manageable number of cells while still meaningfully shrinking the candidate set. We benchmarked it and moved on.

Feeding the Beast

700,000 position reports per hour means roughly 194 inserts per second, sustained. That's not a terrifying number for PostgreSQL — a well-tuned instance can handle far more — but the naive approach still hurts. Individual INSERT statements, each sent as a separate network round trip, spend more time in protocol overhead than actual writing. The database is fast; the network between your application and the database is not.

The obvious answer is PostgreSQL's COPY protocol, which streams raw row data and bypasses per-statement SQL parsing entirely; pgx's CopyFrom implements it using COPY's binary format. We use it for vessel_eta and cache_ais_messages. But for vessel_positions, we can't. The reason is our own schema: the h3_cell_res5 generated column uses the H3INDEX type, and H3INDEX doesn't implement PostgreSQL's binary I/O functions. Binary-format COPY requires binary serialization for every column type it touches. No binary I/O, no COPY.

So we use pgx.Batch with SendBatch instead — the extended query protocol. It packs hundreds of parameterized INSERT statements into a single network round trip, and PostgreSQL executes them server-side without per-statement overhead. Not as fast as COPY, but an order of magnitude better than individual round trips:

batch := &pgx.Batch{}
for _, p := range positions {
    batch.Queue(
        `INSERT INTO vessel_positions
         (mmsi, imo, vessel_name, latitude, longitude, location,
          timestamp, processed_timestamp, suspected_glitch, ...)
         VALUES ($1, $2, $3, $4, $5,
                 ST_SetSRID(ST_MakePoint($5, $4), 4326),
                 $6, $7, $8, ...)`,
        p.MMSI, imo, p.VesselName,
        p.Latitude, p.Longitude,
        p.Timestamp, p.ProcessedTimestamp,
        p.SuspectedGlitch, /* ... */
    )
}
results := pool.SendBatch(ctx, batch)
defer results.Close() // always close: drains the batch and returns the connection

Notice that the geometry is computed server-side with ST_SetSRID(ST_MakePoint(...)). I briefly considered pre-computing EWKB in the application to avoid calling a function 700,000 times per hour. But since we were already on SendBatch rather than COPY, and ST_MakePoint is cheap server-side, the optimization wasn't worth the added complexity. Sometimes the schema you designed to make reads fast makes writes slightly harder. I'd make that trade-off again.

The Character That Broke the Ports

A port at night with container cranes lit against the dark sky, overlaid with a code error showing mismatched struct field names

When your database speaks JSON but your structs still think in BSON.

Here is a bug that could only exist in a migration.

Our port data comes from two external sources. One is a scraper that crawls MyShipTracking's sitemap and extracts basic port info — name, country, coordinates, UN/LOCODE. The other is the World Port Index, maintained by the US National Geospatial-Intelligence Agency, which provides detailed harbor characteristics: depths, pilotage requirements, tug availability, dozens of operational fields.

Neither source writes directly to the production ports table. Instead, they dump raw JSON documents into staging tables — cache_port_mst and cache_port_wpi — and a consolidation step merges them. MST provides breadth (6,488 ports), WPI provides depth (~3,700 ports with rich metadata). The consolidator joins them on UN/LOCODE and writes the merged result to production. Clean separation of concerns.
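The consolidation step itself isn't shown in this post. As a sketch of its shape — only the staging table names and the UN/LOCODE join key come from the text; every column name here is an assumption:

```sql
-- Hypothetical consolidation: MST provides breadth, WPI enriches where it can
INSERT INTO ports (unlocode, name, country, location, wpi_details)
SELECT m.unlocode,
       m.name,
       m.country,
       m.location,
       w.details              -- NULL for MST-only ports with no WPI match
FROM cache_port_mst m
LEFT JOIN cache_port_wpi w USING (unlocode);
```

The LEFT JOIN is the important part: every MST port survives into production, with WPI metadata attached only where a matching UN/LOCODE exists.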

After the migration went live, the MST scraper worked perfectly. 6,488 ports in the staging table. But the WPI staging table was empty. Zero rows. The consolidator, finding nothing to merge, produced nothing. The production ports table: empty. And because port events rely on the ports table for UN/LOCODE enrichment, 156,000 port events were created with no geographic identifier. Everything looked healthy from the outside. The data was garbage.

The root cause was one word.

The WPI port struct, a holdover from the MongoDB era, still carried dual serialization tags — something like:

type WpiPort struct {
    UnloCode string `bson:"unlo_code" json:"unloCode"`
    // ...
}

The staging table upsert function works by marshaling each struct to JSON, then extracting a key field by name. The key parameter was "unlo_code" — the BSON tag name, used by MongoDB's driver. But json.Marshal uses the JSON tag: "unloCode". The function looked for a field called unlo_code in the JSON document, found nothing, and returned an error. Not a silent error, technically — the function threw a clear "key field not found" message. But without alerting on the nightly WPI sync job, a returned error that nobody checks is as good as silent. It ran, it failed, it failed again, every night at 1 AM, for days.

The fix was changing one string: "unlo_code" to "unloCode".

But the real fix was broader. I searched the codebase and found nearly 240 leftover bson:"..." tags scattered across the data contract structs. The MongoDB driver wasn't even imported anymore — these tags were pure vestigial code, left over from the old world. Every one of them was a potential version of the same bug: a name from a system that no longer existed, waiting to be confused with a name from the system that did.

I spent two days on this. Two days staring at logs, convinced the WPI API had changed its response format, before I thought to check the struct tags. It's a naming collision between two eras of the same system, and Go is happy to let it happen — struct tags are opaque strings the compiler ignores completely. No linter will save you. You have to notice it yourself, or wait for production to notice it for you.

What the Numbers Look Like

Production table statistics after migration
| Table | Type | Rows | Notes |
|---|---|---|---|
| vessel_positions | Hypertable | 16.5M | ~700K/hour, H3 + GIST + B-tree indexes |
| vessel_eta | Hypertable | | 78h retention, compressed |
| port_events | Hypertable | 156K | Dedup index, UN/LOCODEs empty (pre-fix) |
| cache_ais_messages | Hypertable | | Raw AIS buffer, 12h retention |
| vessels | Regular | ~50K | Consolidated from multiple sources |
| ports | Regular | ~6,500 | After WPI fix; 0 before |
| light_aids | Regular | 35,237 | Navigation infrastructure |
| dgps_stations | Regular | 163 | DGPS reference stations |
| navtex_messages | Regular | ~93/day | Deduplicated by content hash |

All of this runs on a single EC2 r7i.large — 2 vCPUs, 16 GB of RAM. I keep expecting to need to upgrade and I keep not needing to. PostgreSQL with the right extensions, doing the work that previously required MongoDB plus a constellation of application-level workarounds for everything MongoDB couldn't do natively.

A bulk carrier navigating a Norwegian fjord at dawn, with a faint AIS data trail arcing behind it

A bulk carrier transiting a Norwegian fjord — one of roughly 700,000 position reports we process every hour.

What I'd Tell You at a Bar

Look, MongoDB was the right call when we started. I'd choose it again for that stage. The mistake was staying six months too long — past the point where the data had obviously hardened into a shape and we were just too busy to deal with it.

Honestly, moving the data was the easy part. The hard part — the part that's still ongoing — is finding all the places where the old system's assumptions are embedded in the code. A struct tag referencing a serialization format you don't use anymore. A key parameter someone copied from a different struct's bson tag. Those things survive the migration and sit there until they don't.

We're still finding bson tags. Probably will be for a while.
