OLAP Backends
Yoda supports two OLAP engines: Apache DataFusion (default) and DuckDB. Both implement the same OlapEngine trait, so once the matching Cargo feature is enabled, switching backends requires only a config change.
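The idea of one trait with two interchangeable backends can be sketched with plain Rust. This is only an illustration: `BackendChoice`, `Engine`, `DataFusionLike`, `DuckDbLike`, and `build_engine` are hypothetical stand-ins, not Yoda's real `OlapEngine` trait or its constructors.

```rust
// Hypothetical stand-ins: none of these names are Yoda's real definitions.
#[derive(Debug, Clone, Copy, PartialEq)]
enum BackendChoice {
    DataFusion,
    DuckDb,
}

/// Simplified stand-in for a trait shared by both engines.
trait Engine {
    fn name(&self) -> &'static str;
    fn supports_transactions(&self) -> bool;
}

struct DataFusionLike;
struct DuckDbLike;

impl Engine for DataFusionLike {
    fn name(&self) -> &'static str {
        "datafusion"
    }
    fn supports_transactions(&self) -> bool {
        false // transactions are no-ops on this backend
    }
}

impl Engine for DuckDbLike {
    fn name(&self) -> &'static str {
        "duckdb"
    }
    fn supports_transactions(&self) -> bool {
        true
    }
}

/// The only place that knows which concrete backend is in use.
/// Callers hold a `Box<dyn Engine>` and never name the concrete type.
fn build_engine(choice: BackendChoice) -> Box<dyn Engine> {
    match choice {
        BackendChoice::DataFusion => Box::new(DataFusionLike),
        BackendChoice::DuckDb => Box::new(DuckDbLike),
    }
}
```

Because callers only see the trait object, changing the value passed to `build_engine` is the whole switch; no call site needs to change.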
Quick Comparison
| Feature | DataFusion | DuckDB |
|---|---|---|
| Feature flag | datafusion-backend (default) | duckdb-backend |
| C++ dependency | None — pure Rust | Yes — bundled via duckdb-sys |
| Async model | Natively async | spawn_blocking wrappers |
| Transactions | No (no-ops) | Yes — full ACID |
| Bulk-load path | Arrow batch → load_arrow | Arrow Appender API (zero-copy) |
| Storage modes | InMemory / ArrowIpc / Parquet / S3 / GCS | InMemory or single .duckdb file |
| Streaming results | Native (RecordBatchBoxStream) | Collect-then-stream |
| Primary key enforcement | No | Yes (in destructive mode) |
| Compile time | Fast | Slower (C++ bundled build) |
DataFusion
DataFusion is the default OLAP engine. It is a pure-Rust, natively async columnar query engine built on the Apache Arrow in-memory format.
Enable it (or keep the default):
yoda = "1"
# or explicitly:
yoda = { version = "1", features = ["datafusion-backend"] }
Storage Modes
DataFusion's storage is configurable via HtapConfig::datafusion_storage (StorageMode enum):
use yoda::{HtapConfig, StorageMode};
// In-memory (default): no persistence
let config = HtapConfig {
    datafusion_storage: StorageMode::InMemory,
    ..HtapConfig::default()
};
// Arrow IPC files: fast durable writes
let config = HtapConfig {
    olap_in_memory: false,
    datafusion_storage: StorageMode::ArrowIpc {
        path: "/var/lib/myapp/olap-arrow".into(),
    },
    ..HtapConfig::default()
};
// Parquet: compressed, predicate pushdown
let config = HtapConfig {
    olap_in_memory: false,
    datafusion_storage: StorageMode::Parquet {
        path: "/var/lib/myapp/olap-parquet".into(),
    },
    ..HtapConfig::default()
};
Cloud backends require the cloud-storage feature:
// S3 (requires cloud-storage feature + AWS_* env vars)
let config = HtapConfig {
    datafusion_storage: StorageMode::S3Parquet {
        url: "s3://my-bucket/analytics/".to_string(),
    },
    ..HtapConfig::default()
};
// GCS (requires cloud-storage feature + GOOGLE_* env vars)
let config = HtapConfig {
    datafusion_storage: StorageMode::GcsParquet {
        url: "gs://my-bucket/analytics/".to_string(),
    },
    ..HtapConfig::default()
};
Bulk Loading
CDC INSERT batches are loaded via OlapEngine::load_arrow(), which accepts an Arrow RecordBatch directly, so DataFusion appends the batch without constructing any SQL strings. It falls back to SQL only for column types the Arrow builder does not yet handle (Date, Timestamp, Decimal, List, Struct).
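The choice between the Arrow path and the SQL fallback can be sketched as a simple check over column types. This is a sketch under assumptions: `ColumnKind`, `LoadPath`, and both functions are hypothetical, the four "supported" kinds are placeholders, and treating one unsupported column as forcing the fallback for the whole batch is a guess rather than documented Yoda behaviour.

```rust
// Hypothetical type model; Yoda's real column types are not shown on this page.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColumnKind {
    Int64,
    Float64,
    Utf8,
    Boolean,
    Date,
    Timestamp,
    Decimal,
    List,
    Struct,
}

#[derive(Debug, PartialEq)]
enum LoadPath {
    ArrowBatch,
    SqlFallback,
}

/// True for the types listed above as not yet handled by the Arrow builder.
fn needs_sql_fallback(kind: ColumnKind) -> bool {
    matches!(
        kind,
        ColumnKind::Date
            | ColumnKind::Timestamp
            | ColumnKind::Decimal
            | ColumnKind::List
            | ColumnKind::Struct
    )
}

/// Assumption: a single unsupported column sends the whole batch down the SQL path.
fn choose_load_path(columns: &[ColumnKind]) -> LoadPath {
    if columns.iter().copied().any(needs_sql_fallback) {
        LoadPath::SqlFallback
    } else {
        LoadPath::ArrowBatch
    }
}
```

A table with only integer and string columns stays on the Arrow path, while adding a single Decimal column moves it to the fallback in this model.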
Streaming Queries
DataFusion implements OlapEngine::query_stream() natively via DataFrame::execute_stream(), returning a RecordBatchBoxStream that emits batches without buffering the entire result set. The Arrow Flight SQL server uses this for all data transfer.
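The difference between native streaming and collect-then-stream can be illustrated with ordinary synchronous iterators. This is an analogy only: the real API is async and returns Arrow record batches, while `Batch`, `batch_source`, `stream_native`, and `collect_then_stream` below are hypothetical names invented for this sketch.

```rust
use std::cell::Cell;

/// Stand-in for a record batch: just a vector of row ids.
type Batch = Vec<u64>;

/// Produces `total` batches lazily and counts how many were actually built.
fn batch_source<'a>(total: u64, produced: &'a Cell<u64>) -> impl Iterator<Item = Batch> + 'a {
    (0..total).map(move |i| {
        produced.set(produced.get() + 1);
        vec![i * 2, i * 2 + 1]
    })
}

/// Native streaming: the consumer pulls batches one at a time.
fn stream_native<'a>(total: u64, produced: &'a Cell<u64>) -> impl Iterator<Item = Batch> + 'a {
    batch_source(total, produced)
}

/// Collect-then-stream: every batch is materialised before the first is handed out.
fn collect_then_stream(total: u64, produced: &Cell<u64>) -> std::vec::IntoIter<Batch> {
    let all: Vec<Batch> = batch_source(total, produced).collect();
    all.into_iter()
}
```

Taking only the first batch from `stream_native(1000, ..)` builds one batch, whereas `collect_then_stream(1000, ..)` builds all 1000 before returning the same first batch. That buffering cost is what the streaming path avoids.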
No Transactions
DataFusion's transaction support is a no-op. In SyncMode::Temporal, the UPDATE (close previous version) + INSERT (new version) pair is therefore not atomic: a crash between the two leaves the previous version closed and the new version missing until the engine resumes and reprocesses from the last committed sequence number.
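The crash window and the recovery by replay can be modelled in a few lines. This is a toy model, not Yoda's implementation: `Version`, `Mirror`, the `valid_from`/`valid_to` fields, and the idempotency checks are all assumptions made for illustration.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Version {
    key: u32,
    value: String,
    valid_from: u64,
    valid_to: Option<u64>, // None means this is the current version
}

#[derive(Debug, Default)]
struct Mirror {
    rows: Vec<Version>,
    committed_seq: u64, // last sequence number whose changes are fully applied
}

impl Mirror {
    /// Step 1: close the open version of `key`, if one exists.
    fn close_open(&mut self, key: u32, seq: u64) {
        for row in self.rows.iter_mut() {
            if row.key == key && row.valid_to.is_none() && row.valid_from < seq {
                row.valid_to = Some(seq);
            }
        }
    }

    /// Step 2: insert the new version unless an earlier attempt already did.
    fn insert_new(&mut self, key: u32, value: &str, seq: u64) {
        let exists = self.rows.iter().any(|r| r.key == key && r.valid_from == seq);
        if !exists {
            self.rows.push(Version {
                key,
                value: value.to_string(),
                valid_from: seq,
                valid_to: None,
            });
        }
    }

    /// Applies one change. `crash_between` simulates a crash after step 1.
    /// Returns false when the simulated crash happened.
    fn apply_update(&mut self, key: u32, value: &str, seq: u64, crash_between: bool) -> bool {
        if seq <= self.committed_seq {
            return true; // already fully applied; replay skips it
        }
        self.close_open(key, seq);
        if crash_between {
            return false; // sequence number is NOT committed
        }
        self.insert_new(key, value, seq);
        self.committed_seq = seq;
        true
    }

    fn current(&self, key: u32) -> Option<&Version> {
        self.rows.iter().find(|r| r.key == key && r.valid_to.is_none())
    }
}
```

In this model, a crash after step 1 leaves the key with no current version. Replaying the same sequence number repairs it because both steps are written to be safe to repeat.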
DuckDB
The DuckDB backend embeds the full DuckDB columnar engine via duckdb-sys (bundled C++ build). It provides broad SQL compatibility, MVCC-based concurrent reads, and ACID transactions.
Enable DuckDB:
yoda = { version = "1", features = ["duckdb-backend"] }
# or both backends:
yoda = { version = "1", features = ["full"] }
Configure the engine:
use yoda::{HtapConfig, OlapBackendType};
// In-memory DuckDB
let config = HtapConfig {
    olap_backend: OlapBackendType::DuckDb,
    olap_in_memory: true,
    ..HtapConfig::default()
};
// Persistent DuckDB: single file
let config = HtapConfig {
    olap_backend: OlapBackendType::DuckDb,
    olap_in_memory: false,
    olap_path: Some("/var/lib/myapp/olap.duckdb".to_string()),
    ..HtapConfig::default()
};
Bulk Loading
DuckDB's load_arrow() uses the native Appender API (append_record_batch) for zero-copy Arrow ingestion. All Arrow types — including Date, Timestamp, Binary, and Decimal — are handled natively without SQL literal serialisation. This is the fastest bulk-load path across both backends.
Thread Safety
duckdb::Connection is !Send. Yoda wraps each connection in a Mutex with unsafe impl Send/Sync and dispatches all operations through spawn_blocking. The engine maintains a single write connection and a read pool (default: 4) using DuckDB's MVCC for concurrent read isolation.
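The single-writer plus read-pool layout can be sketched with the standard library alone. This is a simplified model under assumptions: `FakeConn` and `ConnPool` are hypothetical, plain OS threads stand in for `spawn_blocking`, round-robin reader selection is a guess, and the `unsafe impl Send/Sync` wrapper is deliberately not reproduced because the fake connection is already safe to send.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Stand-in for a connection that only one thread may use at a time.
struct FakeConn {
    id: usize,
    queries_run: u64,
}

impl FakeConn {
    fn run(&mut self, _sql: &str) -> usize {
        self.queries_run += 1;
        self.id
    }
}

struct ConnPool {
    writer: Arc<Mutex<FakeConn>>,
    readers: Vec<Arc<Mutex<FakeConn>>>,
}

impl ConnPool {
    fn new(read_pool_size: usize) -> Self {
        assert!(read_pool_size > 0, "read pool needs at least one connection");
        let writer = Arc::new(Mutex::new(FakeConn { id: 0, queries_run: 0 }));
        let readers = (1..=read_pool_size)
            .map(|id| Arc::new(Mutex::new(FakeConn { id, queries_run: 0 })))
            .collect();
        ConnPool { writer, readers }
    }

    /// Runs each read on its own thread; returns the id of the connection used.
    fn run_reads(&self, requests: usize) -> Vec<usize> {
        let handles: Vec<thread::JoinHandle<usize>> = (0..requests)
            .map(|req| {
                // Round-robin over the read pool (an assumption of this sketch).
                let conn = Arc::clone(&self.readers[req % self.readers.len()]);
                thread::spawn(move || {
                    let mut guard = conn.lock().unwrap();
                    guard.run("SELECT 1")
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    }

    /// All writes go through the single write connection.
    fn write(&self, sql: &str) -> usize {
        let mut guard = self.writer.lock().unwrap();
        guard.run(sql)
    }

    /// Total number of queries served by the read pool so far.
    fn reads_served(&self) -> u64 {
        let mut total = 0;
        for reader in &self.readers {
            total += reader.lock().unwrap().queries_run;
        }
        total
    }
}
```

The mutex guarantees that each connection is touched by one thread at a time, while separate reader connections let reads proceed in parallel with each other and with the writer.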
Transactions
DuckDB supports full ACID transactions. In SyncMode::Temporal, the close + insert pair for each UPDATE or DELETE is wrapped in a single BEGIN … COMMIT, making it fully atomic. This is the key advantage of DuckDB over DataFusion for temporal workloads.
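The shape of the wrapped statement pair can be sketched as a small statement builder. Everything schema-related here is an assumption: the `id`, `payload`, `valid_from`, and `valid_to` columns and the `temporal_update_statements` helper are hypothetical, and the SQL Yoda actually issues is not shown on this page.

```rust
/// Builds the statement sequence for one temporal UPDATE.
/// Column names are illustrative; values are left as `?` placeholders.
/// With `atomic = false` the pair is emitted bare, as on a backend
/// whose transaction calls are no-ops.
fn temporal_update_statements(table: &str, atomic: bool) -> Vec<String> {
    let mut stmts = Vec::new();
    if atomic {
        stmts.push("BEGIN".to_string());
    }
    // Close the currently open version of the row.
    stmts.push(format!(
        "UPDATE {table} SET valid_to = ? WHERE id = ? AND valid_to IS NULL"
    ));
    // Insert the new current version.
    stmts.push(format!(
        "INSERT INTO {table} (id, payload, valid_from, valid_to) VALUES (?, ?, ?, NULL)"
    ));
    if atomic {
        stmts.push("COMMIT".to_string());
    }
    stmts
}
```

With `atomic = true` the close and insert sit between `BEGIN` and `COMMIT`, so a crash before `COMMIT` rolls both back instead of leaving the row without a current version.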
datafusion_storage is Ignored
When olap_backend = OlapBackendType::DuckDb, the datafusion_storage field has no effect. DuckDB uses either in-memory mode or a single .duckdb file, controlled by olap_in_memory and olap_path.
When to Pick Which
Choose DataFusion when:
- You want zero C/C++ dependencies (CI, cross-compilation, WebAssembly targets).
- You need cloud object storage (S3, GCS) natively.
- Your workload is primarily append-only (INSERTs) — the Arrow batch path is equally fast.
- Temporal mode atomicity is not critical (or you run DuckDB for temporal and DataFusion for destructive).
Choose DuckDB when:
- You need SyncMode::Temporal with atomic UPDATE/DELETE transitions.
- You need primary-key enforcement on the OLAP mirror.
- You want a single durable file for the OLAP store with a familiar SQL dialect.
- You need the DuckDB extension ecosystem (spatial, JSON, HTTPFS, etc.) via raw queries.
Next Steps
- Sync Modes: DuckDB atomicity matters most for temporal mode
- Configuration Reference: StorageMode, OlapBackendType, and persistence settings
- Arrow Flight SQL: streaming OLAP queries over gRPC