Data Infrastructure · Analysis

What DuckDB's Architecture Reveals: A Data Warehouse Is a Behavior

With lakehouse architectures converging, DuckDB's embeddable OLAP engine is proving that a data warehouse is not a product category but a behavior, reshaping how the industry thinks about metadata, storage, and compute.

In this article
  1. What the benchmarks actually measure
  2. What breaks under failure

In June 2024, the DuckDB team released version 1.0.0 with a quiet change note that almost nobody noticed at the time: the duckdb_databases pragma had been deprecated. The function had never been heavily used — it let you enumerate the database files attached to a running session — and the deprecation was unremarkable on its face. But it was a small signal of a much larger architectural bet. DuckDB had long ago abandoned any pretense of multi-database coordination inside a single process, choosing instead to optimize for something narrower and, as it turns out, far more disruptive: a single-node, single-database analytical engine that you embed directly into your application, your laptop, your CI pipeline. No server. No daemon. No separate cluster to manage. The deprecation of duckdb_databases, arcane as it sounds, encoded the project's thesis — that the future of analytical processing would not look like a warehouse at all.

Two years later, it's no longer clear where the warehouse ends and the engine begins. The architecture that dominated analytical data for three decades — a dedicated server running a columnar database, surrounded by ETL pipelines, fronted by a SQL interface, and accessed over the network by clients — has spent the last five years dissolving from all directions. On one side, the lakehouse movement, championed by Databricks and now increasingly adopted by Snowflake's own catalog-centric strategy, collapses the distinction between data lake and data warehouse by storing data in open table formats like Apache Iceberg while layering warehouse-grade performance and ACID guarantees on top. On the other side, in-process OLAP engines like DuckDB, chDB (a ClickHouse derivative), and Polars ask a more radical question: what if the analytical database is just a library you link against? The answer, increasingly, is that you get something that behaves exactly like a warehouse — SQL, columnar scans, predicate pushdown, vectorized execution — without any of the operational machinery. Duck typing, in the Pythonic sense: if it quacks like a warehouse, it is one.

To understand how we arrived at this convergence, you have to go back to a fork in the road that the database industry took in the mid-2000s. The first decade of data warehousing had been dominated by purpose-built appliances — Teradata, Netezza, Greenplum — that married compute and storage in tightly coupled, proprietary enclosures. These systems were expensive, fast for their time, and operationally rigid. Then Hadoop happened. The Hadoop ecosystem, for all its flaws, demonstrated something genuinely new: that you could store exabytes of raw data on commodity hardware in open formats like Parquet and ORC, and run analytical jobs across it without first loading it into a warehouse. The problem was that Hadoop's execution engine — MapReduce, and later Hive on Tez — was catastrophically slow for interactive queries. This gap gave rise to what the database researcher Michael Stonebraker, in a series of papers beginning in 2007, described as the "OLAP engine" category: columnar, vectorized, single-node-fast systems like Vertica and Vectorwise that could scan billions of rows in seconds. They were warehouses stripped to their analytical core. But they were still servers you connected to.

The lakehouse, first articulated by Databricks in 2020 and now the subject of a recent acquisition frenzy — SAP announced it would acquire Dremio for an undisclosed sum in early May 2026 — represents the synthesis of those two threads. A lakehouse keeps data in open formats (Parquet, Iceberg, Delta Lake) on object storage, which gives you the cost profile and scalability of a data lake, but it adds a transactional metadata layer, a query optimizer, and a SQL interface that together deliver the interactive performance and governance of a warehouse. The practical upshot: you no longer have to copy data from a lake into a warehouse to run fast queries. The warehouse dissolves into a set of capabilities layered over open storage. Databricks and Snowflake are now racing to own that metadata and catalog layer, because whoever controls the catalog controls the query routing, the governance, the access policies — and, not incidentally, the customer lock-in. "The warehouse is the catalog" has become the quiet operating principle of the 2026 data stack.

The twist that DuckDB introduces is to invert this entire model, rejecting the idea that an analytical engine needs to be a service at all. DuckDB is an embeddable, in-process OLAP database — a single library, shipped as a C++ shared object with bindings for Python, R, Node.js, Java, and Rust — that reads and writes Parquet, CSV, JSON, and Iceberg directly. It runs inside your application's memory space. It has no network protocol, no authentication layer, no cluster manager. Install it with pip install duckdb and you have a fully functional analytical SQL engine on your laptop that can scan a 50 GB Parquet dataset faster than most people's production Redshift clusters. This is not a toy. Last year, the team demonstrated a "bring your own compute" model that lets users point DuckDB at an S3 bucket full of Iceberg tables, run queries that rival Snowflake's performance on the same data, and pay only for the compute they actually use — which is to say, the EC2 instance their application already runs on. No separate warehouse bill. No idle cluster costs.

The "duck typing" of warehouses describes exactly this phenomenon. A Python programmer who needs to analyze a Parquet dataset does not write ETL to load it into BigQuery; she writes duckdb.sql("SELECT … FROM 's3://bucket/data/*.parquet'") and gets results in a Pandas DataFrame. A data engineer debugging a pipeline does not spin up a Spark cluster; she runs DuckDB inside a Jupyter notebook against production data on S3, with predicate pushdown that ensures only the relevant row groups are fetched over the network. The behavior is indistinguishable from what you'd expect from a cloud data warehouse — fast SQL analytics on semi-structured data at scale — but the architecture is radically simpler. Or rather, the complexity hasn't disappeared; it's been absorbed into the query optimizer, the buffer manager, and the Parquet reader, all of which live inside a single library that compiles to roughly 15 megabytes. The operational surface area of a warehouse — provisioning, scaling, patching, monitoring — simply does not exist.

None of this is to suggest that DuckDB replaces Snowflake or Databricks for the workloads those systems were built to serve. A single-node, in-process engine cannot run a 200-way join across 10 petabytes of telemetry data with sub-second latency. It cannot manage concurrent access from 500 analysts while enforcing row-level security policies inherited from an enterprise identity provider. Those are genuinely hard distributed systems problems, and the lakehouse vendors have spent billions of dollars solving them. But the range of analytical workloads that fit comfortably inside a single modern node — a machine with 64 cores, a terabyte of RAM, and direct-attached NVMe storage — is far larger than the warehouse-first narrative has historically acknowledged. A 2025 benchmark by the DuckDB team showed the engine outperforming a 10-node Redshift cluster on the TPC-H 100 GB dataset while running on a single c6a.8xlarge EC2 instance. That benchmark is not an argument that Redshift is slow; it's an argument that the overhead of distributed query execution is enormous, and that for datasets under a certain size — a threshold that keeps rising as hardware improves — it is pure waste.

"We're seeing a fundamental unbundling of the warehouse. Storage is commoditizing around Iceberg. Compute is becoming a library you import. What's left — the catalog, the governance, the semantic layer — is where the real competition is. Everything else is just implementation."

— A platform architect at a major streaming company, speaking off the record at a closed-door data infrastructure meetup in San Francisco

This unbundling has a second-order effect that is only now becoming visible: it changes who gets to participate in the analytical data stack. For the past two decades, running analytical SQL at scale meant choosing a vendor — Oracle, Teradata, Snowflake, Databricks, BigQuery — and accepting that vendor's pricing model, lock-in profile, and release cadence. But an open table format like Iceberg, combined with a portable query engine like DuckDB or Trino, means you can own your data in an open format on your own storage and bring whatever compute engine makes sense for the query at hand. A company might run production dashboards on Trino, ad-hoc exploration on DuckDB, and periodic batch transformations on Spark — all against the same Iceberg tables in the same S3 bucket. The warehouse, in this model, is not a product you buy. It is a behavior you compose from open components, each specialized for a different part of the analytical workload. The industry term for this is "disaggregated" or "headless" architecture, but the more accurate description is that the warehouse has become a set of interchangeable parts.
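As a rough illustration of that composability, the sketch below reads an Iceberg table from DuckDB through its iceberg extension. The table location and column names are hypothetical placeholders; the same table, in the same bucket, could just as well be queried from Trino or Spark.

    import duckdb

    # The iceberg extension lets DuckDB read Iceberg table metadata and data
    # files; httpfs provides the s3:// transport. Names are invented for
    # illustration.
    for ext in ("iceberg", "httpfs"):
        duckdb.sql(f"INSTALL {ext}")
        duckdb.sql(f"LOAD {ext}")

    daily = duckdb.sql("""
        SELECT order_date, sum(amount) AS revenue
        FROM iceberg_scan('s3://bucket/warehouse/orders')
        GROUP BY order_date
    """)
    print(daily)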

The vendor response has been predictable and revealing. Snowflake has invested heavily in its Polaris Iceberg catalog, open-sourcing it in 2024 and positioning it as the neutral metadata layer for the ecosystem — while simultaneously making Snowflake's own query engine the most natural consumer of Polaris-managed tables. Databricks acquired Tabular, the company founded by the creators of Iceberg, in 2024, and has been aggressively pushing Unity Catalog as the governance plane for the lakehouse. Both companies understand the same strategic reality: if the storage format is open and the compute engine is portable, the only durable source of lock-in is the metadata and governance layer. As a Databricks engineer put it to me recently, "the catalog is the new database." SAP's acquisition of Dremio, a lakehouse vendor built around Apache Arrow and Iceberg, is the latest move in this consolidation — and a signal that the enterprise software giants have concluded that the lakehouse is not a transitional architecture but the steady state.

What makes this moment genuinely different from previous database platform transitions is the collapse of the network boundary. Historically, a database was something you connected to. The connection — a TCP socket, a TLS handshake, an authentication exchange — was part of the definition. In-process OLAP engines like DuckDB eliminate the network hop, which turns out to be not just a performance optimization but a qualitative change in how people build with data. A developer can write a unit test that runs a SQL query against a local Parquet file and gets results back in 3 milliseconds — fast enough that analytics becomes a building block inside application logic, not a separate workload you dispatch to a remote system. This is the same transformation that SQLite brought to transactional workloads two decades ago: an embedded database that redefined the latency envelope for reads and writes, and in doing so enabled entire new categories of software. DuckDB is trying to do for analytics what SQLite did for transactions, and the early evidence suggests the impact will be similarly broad.
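A hedged sketch of what that looks like in practice: a plain Python unit test that runs analytical SQL against a local Parquet fixture, entirely in-process. The fixture path, column names, and expected value are invented for illustration.

    import duckdb

    def test_daily_revenue_matches_fixture():
        # Runs inside the test process itself: no connection string, no
        # server, just a SQL scan over a local Parquet file.
        total = duckdb.sql(
            "SELECT sum(amount) FROM 'tests/fixtures/orders.parquet' "
            "WHERE order_date = DATE '2026-01-15'"
        ).fetchone()[0]
        assert total == 1234.56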

What the benchmarks actually measure

Benchmarks in the analytical database world have a long and dishonorable history of measuring the wrong thing. The TPC-H benchmark, long the industry's default yardstick, consists of 22 queries against a synthetic dataset modeling a parts supplier business. Every vendor optimizes for these 22 queries, which means TPC-H results are best understood as a measure of a vendor's willingness to game a known workload, not a measure of real-world performance. The ClickBench benchmark, introduced by ClickHouse in 2022 and now adopted as a cross-engine baseline by a wide range of projects, is an improvement: it uses a real-world dataset (anonymized web traffic logs), runs 43 queries of varying complexity, and tests engines under conditions that more closely approximate production. When DuckDB submits to ClickBench, it does so on a single node with default settings — a deliberate choice that reflects the project's philosophy that good defaults and out-of-the-box performance matter more than peak throughput after hand-tuning. It performs competitively against distributed engines running on much larger clusters, which says less about DuckDB's engineering and more about the overhead that distributed systems impose on queries that don't actually require distribution.

The failure mode that benchmarks rarely capture is what happens when data grows beyond the memory budget of a single node. DuckDB handles this case through a spill-to-disk mechanism that is, in the project's own documentation, "not as optimized as the in-memory execution path." At 100 GB, a single node with sufficient RAM will keep the entire working set in memory and deliver interactive performance. At 1 TB, query plans that require shuffling — joins where the build side doesn't fit in memory, for example — will spill aggressively, and performance degrades in a nonlinear fashion. At 10 TB, unless your queries are embarrassingly parallel scan-and-filter operations, you are likely to hit a wall that no amount of optimization on a single node will solve. This is the territory where distributed engines earn their complexity budget. The lakehouse vendors, for their part, have spent years tuning their query planners to decide when to broadcast a small table to every node and when to shuffle a large one, and that intelligence is not trivial to replicate. The point, however, is that for the vast majority of analytical datasets in the world — internal company data, departmental analytics, startup-scale product data — 1 TB is already an outlier, and 10 TB is vanishingly rare. The in-process engine covers the middle 80% of the analytical workload distribution.
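For workloads near that boundary, the relevant knobs are the engine's memory budget and its spill location. A minimal sketch, with illustrative values and paths:

    import duckdb

    con = duckdb.connect("analytics.duckdb")

    # Cap the memory DuckDB may use so a large hash join spills to disk
    # rather than pushing the host process toward the OOM killer.
    con.sql("SET memory_limit = '8GB'")

    # Point spill files at fast local storage; they are temporary and
    # disposable by design.
    con.sql("SET temp_directory = '/mnt/nvme/duckdb_spill'")

    # Bound parallelism so the analytical work shares the node politely.
    con.sql("SET threads = 8")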

What breaks under failure

Every architectural choice pays for itself under failure, not under demo. The warehouse-on-a-library model makes one trade-off that is rarely discussed in the enthusiastic conference talks: it has no story for high availability, zero-downtime upgrades, or concurrent user isolation. When your analytical database is a library inside your application process, a crash in your application takes the database with it. A long-running analytical query that chews through 40 GB of memory can trigger the OOM killer and take down your entire web server. There is no query queue, no admission control, no workload management. These are not deficiencies in DuckDB's implementation; they are consequences of the embedded model itself. A library cannot provide availability guarantees that the host process does not provide.

This is where the lakehouse architecture and the embedded model complement each other in ways that the vendors are only beginning to formalize. The data lives on S3 in Iceberg format, durable and independently accessible. The compute — whether DuckDB, Trino, Spark, or a proprietary engine — is ephemeral and stateless. If a DuckDB process crashes mid-query, you restart it and run the query again. No data is lost because no data was stored locally (except the temporary spill files, which are disposable by design). This is the same "separation of storage and compute" that Snowflake popularized, but pushed to its logical endpoint: compute is not just a separate cluster you can resize; it's a library you can instantiate anywhere, for any duration, at zero marginal cost. The operational model shifts from "keep the warehouse running 24/7 and pay for idle capacity" to "spin up compute for the duration of a query and then discard it." MotherDuck, the managed service built on DuckDB, has leaned into this model with what it calls "bring your own compute" — the idea that you connect your own compute resources to a shared metadata and catalog layer, and the service handles only coordination, caching, and governance. It's a vision of the warehouse as a thin control plane over portable storage and user-supplied compute, and it represents a more radical disaggregation than anything Snowflake or Databricks have shipped.

The schema constraints tell a parallel story. A traditional warehouse expects you to define tables, columns, and types before you load data — the "schema-on-write" model that Teradata and its successors enforced for decades. A data lake, famously, imposes no such requirement; you dump raw files and figure out the schema at query time, a practice that gave rise to the derisive phrase "data swamp" when organizations failed to govern what they had stored. The lakehouse attempts to split the difference by imposing schema validation at the table level through the Iceberg or Delta Lake metadata layer, while remaining flexible about how data arrives. DuckDB, being an in-process engine with no persistent catalog of its own, is agnostic: it will happily query Parquet files with no predefined schema, inferring types on the fly, and it will also respect Iceberg table schemas when they exist. This flexibility is not a design flaw; it reflects the reality that schemas are a governance concern, not an engine concern. The engine's job is to execute queries efficiently against whatever bytes it's pointed at. The catalog's job is to enforce the schemas and access controls that make those bytes trustworthy.
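The schema-on-read behavior is easy to observe directly: DuckDB derives column names and types from the Parquet footer at query time, with no CREATE TABLE step. A small sketch, with a placeholder file name:

    import duckdb

    # DESCRIBE reports the schema DuckDB inferred from the file's metadata;
    # nothing was declared up front.
    print(duckdb.sql("DESCRIBE SELECT * FROM read_parquet('events.parquet')"))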

If you zoom out, the data infrastructure industry is converging on a set of assumptions that would have seemed implausible five years ago. Storage is S3 (or Azure Blob, or GCS), full stop. The table format is Iceberg, with Delta Lake as a secondary contender and Hudi increasingly niche. The catalog — Polaris, Unity, or an open-source alternative like Gravitino — holds metadata, schema, and access policies. Compute is a library, a container, or a serverless function, chosen per query based on latency requirements and cost. The warehouse, as a distinct product category with a dedicated fleet of servers you provision and pay for, is evaporating into a set of behaviors distributed across the stack. This is not to say that Snowflake and Databricks are going away — they each do tens of billions in annual revenue and have product moats that extend well beyond query execution — but it is to say that the center of gravity has shifted. The "duck typing" of warehouses is both a joke and a genuine architectural observation: if you can run fast SQL analytics against open-format data using a library you imported with pip, in what sense is that not a warehouse?

The checkpoints to watch over the next eighteen months are not about features or benchmarks. Watch the catalog. Whoever controls the metadata layer that points queries at the right Iceberg table snapshots, enforces column-level masking, and routes access through a unified governance plane will own the economics of the analytical stack, even if the storage and compute layers are fully commoditized. Watch the DuckDB extension ecosystem, particularly the iceberg and delta extensions, which are being developed in the open with contributions from organizations that would prefer not to pay Databricks or Snowflake for the privilege of querying their own data. And watch the postmortems. The first high-profile production outage caused by an embedded OLAP engine running a pathological query inside a critical application process will tell us more about the limits of the model than any benchmark ever could. The warehouse isn't dead. It's just learning to walk like a duck.
