How I Used Snowflake Cortex Code to Build an Automated LEGO Data Pipeline (and Write this Blog Post)

Introduction

I wanted to find the rarest LEGO minifigs in existence — the ones that appeared in a single set, in a quantity of one, decades ago. To get there, I needed a proper data pipeline: paginated API ingestion, raw JSON landing in Snowflake, modelled views, and idempotent reloads.

I built the entire thing in a single session using Snowflake's Cortex Code as an agentic AI pair programmer. I didn't write a line of code by hand. I described what I wanted, made architectural decisions when asked, and authenticated when prompted. Cortex Code handled the API research, pagination logic, MERGE statements, validation, and even pushed the finished repo to GitHub. This blog post is the final artefact — also authored by Cortex Code as the last step of the pipeline.

Security Infrastructure First

Before any data touched the wire, I established guardrails. The project uses AGENTS.md as a contract between me and the AI — four lines that enforce: the API key comes from the REBRICKABLE_API_KEY environment variable only, the .env file must never be read directly, and .env-example is the reference for required variables.

But instructions alone aren't enforcement. I added a PreToolUse hook (.cortex/hooks/block-env-read.ps1) that intercepts every tool call before execution. It inspects the file_path, path, command, and pattern fields of incoming tool inputs and blocks any operation that matches .env:

if ($p -match '(^|[\\/])\.env($|[\\/]|["\s])' -or $p -match '\.env$') {
    $result = @{ decision = "block"; reason = "Reading .env is not allowed." }
    Write-Output ($result | ConvertTo-Json)
    exit 2
}

This runs on a wildcard matcher (.*) — meaning every single tool call passes through the guard. The AI cannot accidentally cat, grep, or read the secrets file. It's a hard block, not a soft suggestion.

The Ingestion Pipeline

The pipeline targets four Rebrickable API endpoints: /colors/, /themes/, /sets/, and /minifigs/. Each returns paginated JSON with a next URL pattern and a free-tier rate limit of approximately 1 request per second.

The architecture lands raw JSON as Snowflake VARIANT in a Bronze/RAW layer (RAW_COLORS, RAW_THEMES, RAW_SETS, RAW_MINIFIGS), each with a RECORD_KEY column holding the stable identifier extracted from the payload (id for colors/themes, set_num for sets/minifigs).

Idempotency is handled via MERGE. Each run stages incoming records into a temporary table, then executes:

MERGE INTO RAW_SETS AS tgt
USING RAW_SETS__STAGING AS src ON tgt.RECORD_KEY = src.RECORD_KEY
WHEN MATCHED THEN UPDATE SET tgt.RAW_JSON = src.RAW_JSON, tgt.LOADED_AT = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN INSERT (RECORD_KEY, RAW_JSON, LOADED_AT) VALUES (...)

I proved this works by running the pipeline twice. Second run output: MERGE: 0 inserted, 27069 updated for sets — row counts unchanged, no duplicates.

Live Context with MCP

Cortex Code used Git (via GitHub Desktop's bundled binary) to initialise the repo locally, commit the pipeline code, and push to GitHub using the credential manager. The finished repository is at github.com/Martin-Regan/rebrickable-pipeline — complete with README, .gitignore, .env-example, and a clean main branch containing only the API pipeline.

An analysis branch holds the supplementary bulk CSV ingestion and the fact view, keeping concerns separated.

Data Modelling and Analytical Defence

On top of the raw VARIANT tables sit typed views (DIM_COLORS, DIM_THEMES, DIM_SETS, DIM_MINIFIGS) that parse JSON into proper columns. Before writing any analytical query, I had Cortex Code run DESCRIBE VIEW against every modelled table to confirm column names and types actually exist. This prevents a common AI failure mode: hallucinating column names that sound right but don't exist.

For rarity analysis, I supplemented the API data with Rebrickable's bulk CSV downloads (inventories.csv.gz and inventory_minifigs.csv.gz) and built a Gold-layer fact view, FCT_MINIFIG_APPEARANCES, pre-joining the full chain: minifig → inventory → set → theme.

Rarity definition: a minifig appearing in exactly 1 unique set with a total quantity of 1, sorted by oldest release year. The most surprising finding wasn't about Star Wars — it was that LEGO Friends has 747 "rare" minifigs from only 249 characters. Stephanie alone has 65 exclusive outfit variants. Friends generates rarity through wardrobe changes; Star Wars generates it through universe breadth. The 48 minifigs from 1975 — the original LEGOLAND figures — are the true collectibles.

Future Considerations

With more time I'd explore matching the database rarity metric against secondary market pricing from BrickLink to answer: does "appears once in 1975" actually correlate with high collector value? I'd also ingest inventory_parts.csv to extend the analysis to rare bricks, not just figures — and potentially build a Streamlit app on top for interactive exploration.

Notes from the human

As noted earlier, the entire data pipeline project and this blog post were both created by Cortex Code's AI. My role was to define what I wanted, dictate how I wanted it done, and ensure best practices were followed.

This workflow was exponentially faster than if I had created the data pipeline entirely by hand. It was a wonderful learning experience that really showed me the power of using this type of tool to drastically speed up data engineering workflows. While it took a little bit of coaxing to get the exact results I wanted, with this type of technology improving so quickly, I'm incredibly excited to see what comes next.

Check out the pipeline on my github here.

Author:

Martin Regan

View Profile