Architecture
System Overview
Component Deep Dive
SDK Layer
All Aires SDKs share the same Rust core library (aires-sdk). The TypeScript SDK uses NAPI-RS to compile the Rust core into a native Node addon (.node file). The Python SDK uses PyO3 to compile it into a native Python extension (.so / .pyd). The Rust SDK uses the core directly.
When you call aires.info("message"), the SDK:
- Creates an Event — allocates a UUID v7 (time-ordered), captures the nanosecond timestamp, and sets the severity, message, service name, and environment from config.
- Applies options — any trace ID, span ID, attributes, tags, HTTP info, error info, etc. are set on the event via a builder pattern.
- Submits to the batch channel — the event is sent through a crossbeam-channel (lock-free MPSC) to the background batch worker. This call is non-blocking and returns immediately.
- Batch worker collects and ships — a dedicated Tokio task drains the channel. When batch_size (default: 256) events accumulate, or batch_timeout (default: 500ms) elapses, the worker Protobuf-encodes the batch and sends it over gRPC.
- Retries on failure — if the gRPC call fails, the worker retries with exponential backoff up to max_retries (default: 3) times. If all retries fail, the batch is dropped and a warning is logged.
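The batching and retry behavior described above can be sketched in TypeScript. This is an illustrative model, not the SDK's actual implementation (the real worker is a Tokio task in the Rust core); the class and parameter names here are assumptions chosen to mirror the config options on this page.

```typescript
// Hypothetical sketch of the SDK-side batch worker: flush when the batch
// reaches batchSize, or when batchTimeoutMs elapses, retrying failed sends
// with exponential backoff and dropping the batch after maxRetries failures.
type Event = { id: string; message: string };

class BatchWorker {
  private batch: Event[] = [];
  private timer: ReturnType<typeof setTimeout> | null = null;

  constructor(
    private send: (batch: Event[]) => Promise<void>, // stands in for the gRPC call
    private batchSize = 256,
    private batchTimeoutMs = 500,
    private maxRetries = 3,
  ) {}

  // Non-blocking submit: enqueue the event and return immediately.
  submit(event: Event): void {
    this.batch.push(event);
    if (this.batch.length >= this.batchSize) {
      void this.flush(); // size threshold reached: flush now
    } else if (!this.timer) {
      // First event of a new batch: arm the timeout-based flush.
      this.timer = setTimeout(() => void this.flush(), this.batchTimeoutMs);
    }
  }

  async flush(): Promise<void> {
    if (this.timer) { clearTimeout(this.timer); this.timer = null; }
    if (this.batch.length === 0) return;
    const batch = this.batch;
    this.batch = [];
    for (let attempt = 0; attempt <= this.maxRetries; attempt++) {
      try {
        await this.send(batch);
        return;
      } catch {
        // Exponential backoff between attempts: 100ms, 200ms, 400ms, ...
        await new Promise((r) => setTimeout(r, 100 * 2 ** attempt));
      }
    }
    console.warn(`dropping batch of ${batch.length} events after retries`);
  }
}
```

Because submit only pushes onto an in-memory array and possibly arms a timer, the hot path stays cheap; all network work happens in flush.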
Collector Layer
The collector is an Elixir/OTP application with four supervision children:
| Component | Module | Role |
|---|---|---|
| Store | AiresCollector.Store | GenServer managing a ClickHouse connection via the ch library |
| gRPC Server | AiresCollector.Endpoint / AiresCollector.Server | Accepts Ingest and IngestStream RPCs |
| Pipeline | AiresCollector.Pipeline | Broadway pipeline for batched processing |
| Telemetry | AiresCollector.Telemetry | Exposes collector metrics (events received, inserted, failed) |
When an EventBatch arrives:
- The gRPC server decodes the Protobuf message
- Each event is transformed from the nested Proto structure to a flat map (AiresCollector.Transform.event_to_row/3)
- Transformed rows are pushed into the Broadway pipeline as messages
- Broadway’s batcher groups rows (default: 1000 rows or 500ms) and calls Store.insert_batch/1
- The Store issues a batched INSERT INTO events statement to ClickHouse
Broadway provides automatic backpressure — if ClickHouse inserts slow down, the pipeline slows the producer, which in turn applies gRPC flow control to the SDKs.
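The nested-to-flat transform in the second step can be sketched as follows. The real transform is the Elixir function AiresCollector.Transform.event_to_row/3; this TypeScript version is only a language-neutral illustration, and the field names are assumptions rather than the actual schema.

```typescript
// Illustrative sketch: lift nested proto sub-messages (http, error) into
// prefixed, nullable columns so ClickHouse receives one flat row per event.
interface ProtoEvent {
  id: string;
  severity: string;
  message: string;
  http?: { method: string; status: number };
  error?: { kind: string; stack: string };
}

type Row = Record<string, string | number | null>;

function eventToRow(e: ProtoEvent, service: string, env: string): Row {
  return {
    id: e.id,
    service,
    environment: env,
    severity: e.severity,
    message: e.message,
    // Optional nested structures become nullable flat columns.
    http_method: e.http?.method ?? null,
    http_status: e.http?.status ?? null,
    error_kind: e.error?.kind ?? null,
    error_stack: e.error?.stack ?? null,
  };
}
```

Flattening up front keeps the insert path a plain columnar write: every row has the same columns, with null for absent sub-messages.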
Storage Layer
ClickHouse stores events in a MergeTree table with:
- Partition key: toDate(timestamp) — one partition per day, enabling efficient date-range queries and TTL-based cleanup
- Sort order: (service, severity, timestamp) — optimized for the most common query patterns (filter by service, then severity, then time range)
- Bloom filter indexes: on trace_id, span_id, session_id — enable fast point lookups on high-cardinality string columns
- Token bloom filter: on message — enables full-text-like search on log messages
Materialized views run in the background as data is inserted, maintaining pre-aggregated tables for common dashboard queries (error rates, HTTP latency percentiles).
Data Flow Summary
Total latency from event creation to queryable in ClickHouse is typically < 2 seconds under normal load, dominated by the two batching stages (SDK-side and collector-side).
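A back-of-envelope check of that budget: sum the worst-case wait at each batching stage (using the defaults from this page) plus rough costs for the network hop and the insert. The network and insert numbers below are illustrative assumptions, not measurements.

```typescript
// Worst-case end-to-end latency estimate, in milliseconds.
const stagesMs = {
  sdkBatchTimeout: 500,      // SDK-side batch_timeout default
  grpcHop: 50,               // assumed network + Protobuf decode cost
  collectorBatchWindow: 500, // Broadway batcher default window
  clickhouseInsert: 200,     // assumed batched INSERT latency
};
const worstCaseMs = Object.values(stagesMs).reduce((a, b) => a + b, 0);
console.log(worstCaseMs); // 1250 — comfortably under the ~2-second budget
```

Even with generous allowances for the unmeasured stages, the two 500ms batch windows dominate, which matches the summary above.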
Scaling
Horizontal Scaling
- SDKs: Each application instance runs its own batch worker. No coordination needed.
- Collector: Run multiple collector instances behind a load balancer. Each instance is stateless — it just transforms and inserts. gRPC load balancing works with standard L4/L7 balancers or Kubernetes services.
- ClickHouse: Use ClickHouse’s native replication and sharding for write throughput beyond a single node.
Capacity Planning
| Component | Throughput (single instance) |
|---|---|
| SDK batch worker | ~500K events/sec (Rust), ~200K events/sec (TS/Python due to binding overhead) |
| Collector | ~100K events/sec per Broadway pipeline (limited by ClickHouse insert speed) |
| ClickHouse | ~500K rows/sec insert (single node, depends on column count and hardware) |
For most applications, a single collector instance and a single ClickHouse node handle the load comfortably. Scale horizontally when you exceed these numbers.
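A rough sizing calculation based on the table above: divide the target event rate by the per-instance throughput and round up. The helper and the 250K events/sec target are illustrative, not a recommendation.

```typescript
// How many instances of a component are needed for a target event rate?
function instancesNeeded(targetEventsPerSec: number, perInstanceRate: number): number {
  return Math.max(1, Math.ceil(targetEventsPerSec / perInstanceRate));
}

const target = 250_000; // example target: 250K events/sec
console.log(instancesNeeded(target, 100_000)); // 3 collector instances (~100K/sec each)
console.log(instancesNeeded(target, 500_000)); // 1 ClickHouse node (~500K rows/sec)
```

In practice, leave headroom for traffic spikes rather than sizing exactly to these throughput figures.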