
Diskless - The Cloud-Native Evolution of Kafka


Note: Huge kudos to Aiven for sponsoring this blog and for leading the diskless proposal for the benefit of the broader Kafka ecosystem rather than narrow profit motives.

In the previous blog we talked about why a traditional Kafka setup isn't a great fit for the cloud era, and how Diskless Kafka (KIP-1150) can make Kafka a leading solution for streaming use cases in the cloud as well. To learn more, you can also read Aiven’s page, Diskless for Apache Kafka®, and their latest blog post, Diskless Kafka is the tide and it's rising.

In this blog we will go a bit deeper into the internal workings of KIP-1150 and compare it with WarpStream and Confluent Freight, exploring their architectural approaches, performance characteristics, costs, and operational aspects.

The emergence of KIP-1150, WarpStream, and Confluent Freight reflects a strong market need to decouple Kafka from expensive, traditional disk-based architectures in the cloud by leveraging object storage. While they share this common goal, they differ significantly in their architectural philosophies and implementations.

Let's look at different parameters and compare these three solutions.

Architecture and Implementation

Diskless Kafka (KIP-1150)

  • This approach is an evolutionary extension of the existing Apache Kafka framework. It doesn't seek to replace Kafka entirely but rather adds a new, opt-in topic type called "Diskless Topics".
  • For these diskless topics, data is written directly to object storage, bypassing the traditional write path of broker-to-broker replication and local disk writes. Durability then comes from the object store itself, which replaces Kafka's own replication for persisting the data.
  • This shift enables a leaderless design for diskless partitions, meaning any broker can accept a write request for a diskless partition. This eliminates issues like hot spots and expensive cross-zone write traffic associated with the traditional leader-based model.
  • However, a key challenge in this leaderless model is maintaining strict message ordering within a partition. This is where the Batch Coordinator (KIP-1164) becomes critical.
  • The Batch Coordinator's central responsibility is to assign unique, monotonically increasing offsets within each partition to message batches after they have been written to the object store, potentially by multiple brokers.
  • By centralizing the assignment of these offsets, the Batch Coordinator acts as the single source of truth for message order across all producers and brokers interacting with diskless topics.
  • KIP-1164 proposes a pluggable interface for this coordinator, with a default implementation using a regular, replicated Kafka topic and an embedded SQLite instance to store and query the metadata about message batch coordinates. This leverages Kafka's existing infrastructure for the coordinator's state.
  • Core Kafka brokers remain in this architecture. They continue to handle classic topics and the Kafka Raft (KRaft) metadata for the cluster. They also participate in the workflow for diskless topics by accepting writes, buffering data into "shared log segments" (objects), uploading these segments to the object store, and sending batch coordinate metadata to the Batch Coordinator.
  • A major driving principle behind Aiven's proposal is the intention to upstream these changes into the mainline Apache Kafka project. This aims to benefit the entire Kafka ecosystem and prevent vendor lock-in. Aiven has created a temporary "Inkless" fork to allow users to test this functionality while the KIPs are being discussed and developed within the Apache community.
  • While offering significant cost and operational benefits, Diskless Kafka is generally expected to have higher latency than traditional disk-based Kafka and is not initially intended for sub-100ms workloads.
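To make the Batch Coordinator's role more concrete, here is a minimal, single-process Python sketch of the idea under a simplified model: brokers report where each uploaded batch landed (object key, byte range, record count), and the coordinator assigns offsets after the upload and stores the coordinates in SQLite so readers can find them. The class name `ToyBatchCoordinator`, the `batches` table schema, and the object keys are all hypothetical. This is not the KIP-1164 interface, and it leaves out the replicated Kafka topic that the proposed default implementation uses to make the coordinator's state durable.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class BatchCoordinate:
    """Where one producer batch landed inside a shared log segment (hypothetical shape)."""
    object_key: str    # the shared segment object in object storage
    byte_offset: int   # where the batch starts inside that object
    byte_length: int   # how many bytes the batch occupies
    record_count: int  # how many records (and therefore offsets) it needs


class ToyBatchCoordinator:
    """Hypothetical stand-in for a Batch Coordinator, backed by in-memory SQLite."""

    def __init__(self) -> None:
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            """CREATE TABLE batches (
                   topic TEXT NOT NULL,
                   partition_id INTEGER NOT NULL,
                   base_offset INTEGER NOT NULL,
                   last_offset INTEGER NOT NULL,
                   object_key TEXT NOT NULL,
                   byte_offset INTEGER NOT NULL,
                   byte_length INTEGER NOT NULL,
                   PRIMARY KEY (topic, partition_id, base_offset))"""
        )

    def commit(self, topic: str, partition: int, coord: BatchCoordinate) -> int:
        """Assign the next offsets in this partition to an already-uploaded batch."""
        # Single-threaded toy: read the next free offset, then record the batch.
        # The `with` block commits the insert (or rolls it back on error).
        with self.db:
            (base,) = self.db.execute(
                "SELECT COALESCE(MAX(last_offset) + 1, 0) FROM batches "
                "WHERE topic = ? AND partition_id = ?",
                (topic, partition),
            ).fetchone()
            self.db.execute(
                "INSERT INTO batches VALUES (?, ?, ?, ?, ?, ?, ?)",
                (topic, partition, base, base + coord.record_count - 1,
                 coord.object_key, coord.byte_offset, coord.byte_length),
            )
        return base

    def locate(self, topic: str, partition: int,
               offset: int) -> Optional[Tuple[str, int, int, int]]:
        """Return (object_key, byte_offset, byte_length, base_offset) for the batch holding `offset`."""
        return self.db.execute(
            "SELECT object_key, byte_offset, byte_length, base_offset FROM batches "
            "WHERE topic = ? AND partition_id = ? AND base_offset <= ? AND last_offset >= ?",
            (topic, partition, offset, offset),
        ).fetchone()


# Two brokers upload batches for the same partition; the coordinator decides their order.
coordinator = ToyBatchCoordinator()
first = coordinator.commit("orders", 0, BatchCoordinate("segment-broker-a-0001", 0, 4096, 10))
second = coordinator.commit("orders", 0, BatchCoordinate("segment-broker-b-0001", 512, 2048, 5))
print(first, second)
print(coordinator.locate("orders", 0, 12))
```

The key design point the sketch tries to capture is that offsets are decided after the data is already safely in object storage, so it does not matter which broker uploaded a batch: the coordinator's table is the only place where partition order exists.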

WarpStream

  • WarpStream represents a revolutionary departure from traditional Kafka architecture. It was built from the ground up specifically for cloud object storage environments.
  • Instead of traditional Kafka brokers, WarpStream uses stateless "Agents". These Agents implement the Apache Kafka protocol, allowing them to interact seamlessly with standard Kafka clients.
  • A key differentiator is that WarpStream Agents do not store any persistent data locally. They rely entirely on customer-owned object storage (like S3, GCS, or Azure Blob Storage) as the primary and only storage medium for message data. This is a true "zero disk" architecture.
  • All metadata essential for the cluster (topic configurations, partition assignments, consumer group offsets) is managed by a separate, proprietary, highly available cloud metadata store operated by WarpStream (now part of Confluent). This store ensures strong consistency for metadata operations and enables the Agents to be stateless. While metadata is managed by the vendor, the message data remains in the customer's object storage.
  • WarpStream is offered via a Bring Your Own Cloud (BYOC) model, where the Agents are deployed within the customer's own cloud account and VPC. Message data resides in the customer's object storage buckets.
  • The stateless nature of the Agents and the separation of compute from storage enable trivial autoscaling and eliminate the need for data rebalancing during scaling events.
  • WarpStream aims to be a drop-in replacement for Kafka and maintains compatibility with the Kafka wire protocol. However, there are documented limitations in areas such as advanced Schema Registry features, and some client library interactions require verification.

Confluent Freight Clusters

  • Confluent Freight Clusters are not a standalone product or an open-source KIP, but a specific type of cluster available within the fully managed Confluent Cloud service.
  • It is built upon Confluent's proprietary Kora engine, which is the next-generation cloud-native engine underpinning various Confluent Cloud offerings.
  • Kora has been evolved to support a "direct write" mode specifically for Freight clusters. This allows data to be written directly to object storage (like S3).
  • Similar to the other solutions, this direct write mechanism bypasses local storage on the Kora brokers and, importantly, avoids expensive inter-Availability Zone (AZ) replication traffic for data durability.
  • Freight clusters are explicitly designed and marketed for "relaxed latency use cases". The trade-off for significant cost savings (claimed up to 90% cheaper than self-managing Kafka for suitable workloads) is higher latency, potentially up to a second or two, compared to Confluent's standard low-latency clusters.
  • Being part of Confluent Cloud, Freight clusters offer a fully managed, serverless-like experience. Confluent handles all operational tasks like provisioning, maintenance, patching, autoscaling, and monitoring. Autoscaling is managed through Confluent's eCKU system.

In essence, while all three aim to leverage object storage for cost reduction and operational ease in the cloud, they represent different paths: KIP-1150 seeks to evolve Kafka itself through open-source collaboration; WarpStream is a revolutionary, stateless, Kafka-compatible system built from scratch with a BYOC focus; and Confluent Freight is a proprietary feature within a managed service tailored for specific latency-tolerant workloads.

Performance - Latency and Throughput trade-offs

A characteristic shared by all three solutions is an inherent trade-off: higher latency than highly optimized disk-based Kafka deployments, because writing to and reading from object storage is generally slower than using local disks.

  • Aiven Diskless (KIP-1150): Expected to have higher latency than classic Kafka. Aiven reports P99 latencies around 3-3.5 seconds for their internal Diskless usage in BYOC, with a path to sub-2-second latencies with tuning. Reads from broker caches can be lower (sub-20ms). It is not initially intended for workloads requiring sub-100ms latency. Aims for gigabytes per second of throughput.

  • WarpStream: Latency is tunable based on the object storage used. With standard S3, P99 produce latency is around 500ms. Using Amazon S3 Express One Zone (S3EOZ), P99 produce latency can drop significantly to 169ms. Demonstrates high MiB/s throughput.

  • Confluent Freight: Explicitly designed for relaxed latency use cases, with potential latencies up to "a second or two". Targeted at high-throughput workloads and can scale to support over 30 GBps.
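Much of this extra latency comes from batching. A rough back-of-the-envelope model shows why, assuming a broker (or Agent) buffers incoming records and acknowledges them only after the buffered object has been uploaded and its metadata committed: the flush interval bounds how long a record waits, and shortening it multiplies the number of object-storage requests. The function names and every number below are illustrative assumptions, not measurements from any of the three products.

```python
def worst_case_produce_ms(flush_interval_ms: float, put_ms: float, commit_ms: float) -> float:
    """A record arriving just after a flush waits a full interval, then the upload and the metadata commit."""
    return flush_interval_ms + put_ms + commit_ms


def put_requests_per_second(writers: int, flush_interval_ms: float) -> float:
    """Each writer uploads one object per interval (assumes an object size cap is never hit first)."""
    return writers * 1000.0 / flush_interval_ms


# Hypothetical inputs: 6 writers, 200 ms per object PUT, 50 ms per metadata commit.
for interval in (100, 250, 1000):
    latency = worst_case_produce_ms(interval, put_ms=200, commit_ms=50)
    puts = put_requests_per_second(6, interval)
    print(f"flush every {interval:>4} ms -> worst case ~{latency:.0f} ms, {puts:.0f} PUTs/s")
```

Under this model, a lower-latency storage tier mainly shrinks `put_ms`, which is consistent with the lower WarpStream figures on S3 Express One Zone, while the flush interval is the knob that trades request cost against latency.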

The ability to have both classic (low-latency) and diskless (higher-latency, lower-cost) topics in the same cluster via KIP-1150's per-topic configuration is a significant advantage, allowing users to balance cost and performance needs.

Scalability/Elasticity

Decoupling storage from compute is a primary driver for achieving cloud-native scalability and elasticity.

  • Aiven Diskless (KIP-1150): Designed for instant autoscaling of diskless topics: because no persistent data is tied to specific brokers, brokers can be spun up or down in seconds. The leaderless design for diskless topics also aids scaling.

  • WarpStream: Offers "trivial" autoscaling due to its fully stateless Agent architecture; Agents are non-special and can be scaled based on load without rebalancing data.

  • Confluent Freight: Provides automatic scaling of compute resources (eCKUs) based on workload demand as a managed feature within Confluent Cloud.
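Much of the elasticity argument comes down to how many bytes must move when the cluster changes size. Here is a simple estimate, assuming data is spread perfectly evenly before and after scaling and that reassignment traffic is throttled to a fixed bandwidth: classic topics must copy replicas onto the new brokers, while diskless data already lives in object storage, so new brokers have nothing to copy (they only warm their caches). The helper names, the 20 TB cluster, and the 1 Gbit/s throttle are hypothetical.

```python
def classic_tb_to_move(total_stored_tb: float, brokers_before: int, brokers_after: int) -> float:
    """TB of replica data copied so that every broker ends up with an equal share."""
    if brokers_after >= brokers_before:
        # Scale-out: fill each new broker up to the new per-broker share.
        return (brokers_after - brokers_before) * total_stored_tb / brokers_after
    # Scale-in: drain every removed broker's current share onto the survivors.
    return (brokers_before - brokers_after) * total_stored_tb / brokers_before


def hours_to_copy(tb: float, throttle_gbit_per_s: float) -> float:
    """Wall-clock hours to copy `tb` terabytes (decimal units) at a fixed reassignment throttle."""
    bits = tb * 1e12 * 8
    return bits / (throttle_gbit_per_s * 1e9) / 3600


DISKLESS_TB_TO_MOVE = 0.0  # diskless partitions keep their data in object storage, not on brokers

moved = classic_tb_to_move(20, brokers_before=10, brokers_after=11)
print(f"classic: {moved:.2f} TB, ~{hours_to_copy(moved, 1.0):.1f} h at 1 Gbit/s")
print(f"diskless: {DISKLESS_TB_TO_MOVE} TB to move")
```

Adding a single broker to this hypothetical cluster means hours of throttled copying for classic topics, which is why "spin up in seconds" is only realistic when brokers hold no partition data of their own.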

Total cost of ownership

All three solutions aim to significantly reduce TCO compared to traditional Kafka in the cloud.

  • Traditional Kafka: High TCO primarily due to expensive inter-AZ replication traffic (often >80% of the bill) and pricey local SSDs.

  • Aiven Diskless (KIP-1150): Aims for up to 80% TCO reduction. This is achieved by eliminating inter-AZ replication for diskless topics and using cheaper object storage. Costs shift towards object storage API calls/capacity and potentially Aiven's managed fees (for BYOC).

  • WarpStream: Claims over 80% infrastructure savings versus self-managed Kafka. Achieves this by eliminating inter-AZ networking costs and using cheaper object storage. Costs include object storage API calls/capacity (including S3EOZ if used) and Confluent's fees for the managed control plane.

  • Confluent Freight: Claims to be up to 90% cheaper than self-managing Kafka for suitable workloads. Costs are integrated into Confluent Cloud's pricing model.

The actual TCO depends heavily on workload specifics, retention policies, and the chosen deployment/management model.
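To see why inter-AZ traffic dominates, here is a rough monthly estimate for the produce path only. It assumes three AZs, replication factor 3 with one replica per AZ, clients spread evenly across AZs, an AWS-style inter-AZ price of $0.01/GB in each direction ($0.02/GB per crossing), and an S3 Standard PUT price of $0.005 per 1,000 requests; these prices are assumptions, so check current pricing for your region. It ignores consumer reads, storage, and compute, and the 100 MB/s ingress and 24 PUTs/s are hypothetical inputs.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600
CROSS_AZ_USD_PER_GB = 0.02  # assumption: $0.01/GB out + $0.01/GB in
PUT_USD_PER_1000 = 0.005    # assumption: S3 Standard PUT request price


def classic_cross_az_usd(ingress_mb_per_s: float, zones: int = 3, replication_factor: int = 3) -> float:
    """Monthly cost of produce and replication traffic that crosses AZ boundaries."""
    ingress_gb = ingress_mb_per_s * SECONDS_PER_MONTH / 1000
    # The partition leader sits in a different AZ from the producer (zones - 1) / zones of the time.
    producer_hop_gb = ingress_gb * (zones - 1) / zones
    # Each follower lives in another AZ, so every byte crosses once per follower.
    replication_gb = ingress_gb * (replication_factor - 1)
    return (producer_hop_gb + replication_gb) * CROSS_AZ_USD_PER_GB


def diskless_put_usd(puts_per_second: float) -> float:
    """Monthly PUT request cost, assuming producers write to a broker in their own AZ (no cross-AZ hop)."""
    return puts_per_second * SECONDS_PER_MONTH / 1000 * PUT_USD_PER_1000


print(f"classic cross-AZ: ${classic_cross_az_usd(100):,.0f}/month")
print(f"diskless PUTs:    ${diskless_put_usd(24):,.0f}/month")
```

Under these assumptions the replication hops account for three quarters of the cross-AZ volume, which is consistent with the claims above that eliminating inter-AZ replication is where most of the savings come from.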

Kafka API compatibility

Maintaining compatibility with the Kafka protocol allows existing clients and applications to integrate easily.

  • Aiven Diskless (KIP-1150): The KIPs aim for no changes to existing Kafka client APIs. However, Aiven's current "Inkless" implementation has noted limitations, including no support for transactions, compacted topics, or Kafka Streams state stores on diskless topics.

  • WarpStream: Designed as a drop-in replacement, compatible with the Kafka wire protocol, allowing standard clients to connect. Its Schema Registry is API-compatible with Confluent's, though it lacks some advanced features like data contracts.

  • Confluent Freight: As part of Confluent Cloud, it is expected to offer high compatibility with the Kafka protocol and the Confluent ecosystem.

Operational complexity

All three solutions aim to reduce the significant operational burden associated with traditional Kafka's disk-based nature.

  • Aiven Diskless (KIP-1150): Reduces toil by eliminating operational headaches like data rebalances, managing hot partitions, and hitting IOPS limits for diskless topics. Can be self-managed (if KIPs upstream) or managed via Aiven's BYOC service.

  • WarpStream: Offers a significant reduction in operational burden due to its stateless Agents and reliance on a managed control plane, eliminating tasks like disk management and manual rebalancing. It operates on a shared responsibility BYOC model.

  • Confluent Freight: Provides a fully managed, serverless-like experience through Confluent Cloud, with Confluent handling cluster operations and scaling.

Maturity

Each solution is at a different stage of maturity:

  • Aiven Diskless (KIP-1150): The KIPs are under discussion within the Apache Kafka community. Aiven offers an early implementation in their "Inkless" fork and in limited availability for their BYOC customers.

  • WarpStream: Launched in 2023 as an independent startup with production users, and acquired by Confluent in 2024. It is now part of Confluent's offerings.

  • Confluent Freight: Became generally available in February 2025 as a new feature within Confluent Cloud.

Vendor lock-in

The degree of vendor lock-in is a key differentiator:

  • Aiven Diskless (KIP-1150): Strong emphasis on open source and upstreaming. If KIP-1150 is accepted and merged into Apache Kafka, it aims to provide an open standard accessible to everyone, minimizing vendor lock-in. Aiven's BYOC service involves a management fee, but the underlying technology is intended to be open.

  • WarpStream: Operates on a BYOC model where data resides in the customer's cloud. However, reliance on the proprietary, Confluent-managed control plane introduces a dependency on Confluent.

  • Confluent Freight: As a feature of Confluent Cloud, it is a fully proprietary solution, implying a direct vendor lock-in with Confluent.

Comparison Summary

| Feature | Diskless Kafka (KIP-1150) | WarpStream | Confluent Freight Clusters |
|---|---|---|---|
| Architecture | Evolutionary extension of Apache Kafka; opt-in "Diskless Topics"; leaderless design with a Batch Coordinator; brokers remain. | Built for cloud object storage; stateless "Agents"; zero-disk architecture; separate metadata store operated by WarpStream (now Confluent). | Built on Confluent's Kora engine; "direct write" to object storage; bypasses local storage; part of Confluent Cloud. |
| Storage | Object storage for diskless topics; traditional storage for classic topics. | Customer-owned object storage (e.g., S3). | Object storage (e.g., S3) via the Kora engine. |
| Replication | Handled by object storage for diskless topics; traditional replication for classic topics. | Object storage handles data persistence. | Avoids inter-AZ replication by writing directly to object storage. |
| Latency | Higher than classic Kafka; 3-3.5 seconds P99 reported; not initially for sub-100ms workloads. | Tunable; ~500ms P99 with standard S3; ~169ms P99 with S3 Express One Zone. | Higher; potentially up to "a second or two"; designed for relaxed-latency use cases. |
| Throughput | Gigabytes per second. | High MiB/s throughput. | Scales to over 30 GBps. |
| Scalability/Elasticity | Instant autoscaling for diskless topics; leaderless design aids scaling. | Trivial autoscaling due to stateless Agents; no data rebalancing needed. | Automatic scaling of compute resources (eCKUs) based on workload. |
| Cost Reduction | Aims for up to 80% TCO reduction; shifts costs to object storage. | Claims over 80% infrastructure savings; shifts costs to object storage and Confluent's managed control plane fees. | Claims to be up to 90% cheaper than self-managing Kafka; integrated into Confluent Cloud pricing. |
| Kafka API Compatibility | Aims for no changes to existing APIs; current "Inkless" implementation has limitations (e.g., no transactions). | Designed as a drop-in replacement; compatible with the Kafka wire protocol; some documented limitations (e.g., Schema Registry features). | Expected to offer high compatibility with the Kafka protocol and Confluent ecosystem. |
| Operational Complexity | Reduces toil by eliminating data rebalances and hot partitions for diskless topics. | Significant reduction due to stateless Agents and a managed control plane. | Fully managed, serverless-like experience through Confluent Cloud. |
| Maturity | KIPs under discussion; Aiven offers an early implementation. | Launched in 2023; acquired by Confluent in 2024; now part of Confluent's offerings. | Generally available since February 2025 as a feature within Confluent Cloud. |
| Vendor Lock-in | Aims for minimal vendor lock-in; strong emphasis on open source and upstreaming. | BYOC model with data in the customer's cloud; reliance on the Confluent-managed control plane introduces a dependency on Confluent. | Fully proprietary solution; direct vendor lock-in with Confluent. |

Guidance: Selecting the Right Solution for Specific Use Cases

Choosing the most appropriate solution requires a careful assessment of workload characteristics, latency requirements, cost sensitivity, operational preferences, and desired level of Kafka feature compatibility.

For Ultra-Low Latency Requirements (sub-50ms):

  • Traditional disk-based Apache Kafka, meticulously tuned, remains the most proven option for the strictest low-latency demands.
  • Among the object-storage-centric solutions, WarpStream configured with S3 Express One Zone shows promise for approaching these latencies, but this comes at a higher cost and may still not match optimized disk-based systems in all scenarios.
  • Aiven Diskless (KIP-1150), while aiming to improve, is not initially targeted for sub-100ms workloads.
  • Confluent Freight is explicitly not suitable for such low-latency use cases.

For High-Throughput, Cost-Sensitive, Relaxed Latency Workloads (e.g., logging, telemetry, archival, batch data feeds):

  • Confluent Freight Clusters: A strong candidate for organizations already within or considering the Confluent Cloud ecosystem, valuing a fully managed service and significant cost savings for latency-tolerant applications.
  • Aiven Diskless Kafka (KIP-1150): Appealing for those prioritizing open-source alignment, with the potential for self-management (if KIPs are upstreamed) or Aiven's managed BYOC offering. The evolving maturity of the KIPs and current implementation limitations are key considerations.
  • WarpStream (by Confluent): A suitable BYOC option, demonstrating scalability and cost-effectiveness. It becomes particularly attractive if the higher latency of standard object storage is acceptable, or if S3EOZ is viable for specific performance needs.

For BYOC and Data Sovereignty Needs:

  • WarpStream: Explicitly designed as a BYOC solution, allowing data to reside within the customer's cloud environment.
  • Aiven Diskless Kafka (KIP-1150): Offered via Aiven's BYOC model, providing similar control over data location.

For Emphasis on Open Source and Community Alignment:

  • Aiven Diskless Kafka (KIP-1150): Has the strongest alignment, with Aiven actively working to contribute the KIPs to the upstream Apache Kafka project.
  • WarpStream is now a proprietary Confluent product, and Confluent Freight is a proprietary managed service.

For the Fullest Kafka Feature Set (including transactions, advanced admin tools, mature ecosystem integration):

  • This requires careful validation for each solution. Traditional Apache Kafka (and mature managed services based on it, like Confluent Platform or standard Confluent Cloud clusters) currently offers the most comprehensive feature set.
  • Aiven's current Diskless implementation has documented limitations regarding transactions, compacted topics, and Kafka Streams state stores.
  • WarpStream's complete feature parity post-acquisition, especially for advanced Kafka functionality, needs further assessment for specific workloads.
  • Confluent Freight, while part of the rich Confluent ecosystem, is new, and the impact of its specific architecture on all Kafka features (especially latency-sensitive ones or those relying on traditional replication mechanics) should be tested for specific workloads.
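The guidance above can be condensed into a rough shortlist helper. This is a hypothetical sketch, not a sizing tool: it treats the P99 figures quoted in this post as upper bounds (3.5 s for Aiven's current Diskless, 2 s for Freight's "a second or two", about 500 ms and 169 ms for WarpStream on standard S3 and S3 Express One Zone), keeps traditional Kafka as an always-available baseline, and encodes the BYOC, open-source, and full-feature-set preferences as simple filters. The function name, option labels, and filtering rules are assumptions, and anything it returns still needs validation against your own workload.

```python
from typing import List

# (option, reported P99 produce latency in ms, offered as BYOC, open source); figures as quoted in this post
OPTIONS = [
    ("Aiven Diskless Kafka (KIP-1150)", 3500, True, True),
    ("WarpStream on standard S3", 500, True, False),
    ("WarpStream on S3 Express One Zone", 169, True, False),
    ("Confluent Freight Clusters", 2000, False, False),
]
BASELINE = "Traditional disk-based Apache Kafka"


def shortlist(p99_budget_ms: float, needs_byoc: bool = False,
              prefers_open_source: bool = False, needs_full_feature_set: bool = False) -> List[str]:
    """Return candidates whose reported P99 fits the budget and that match the stated preferences."""
    if needs_full_feature_set:
        # Per the guidance above, classic Kafka currently has the most complete feature set;
        # the other options need per-workload validation first.
        return [BASELINE]
    picks = [BASELINE]  # classic Kafka is treated as always available, self-managed or managed
    for name, p99_ms, byoc, open_source in OPTIONS:
        if p99_ms > p99_budget_ms:
            continue  # reported latency exceeds the budget
        if needs_byoc and not byoc:
            continue
        if prefers_open_source and not open_source:
            continue
        picks.append(name)
    return picks


print(shortlist(50))
print(shortlist(5000, prefers_open_source=True))
```

The filters are deliberately coarse: a sub-50ms budget leaves only traditional Kafka, mirroring the first guidance bullet, while relaxed budgets let cost, BYOC, and open-source preferences decide.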

Conclusion

The shift towards diskless or object-storage-backed architectures represents a fundamental rethink of how Kafka should operate in cloud environments, driven by the need to address the significant cost and operational challenges of traditional disk-based deployments.

  • KIP-1150 offers a path to integrate this functionality directly into open-source Apache Kafka, providing an opt-in, per-topic solution with a dual capability for low-latency classic topics and cost-efficient diskless topics. Its success hinges on community acceptance and the maturation of its integral KIPs, particularly the Batch Coordinator.
  • WarpStream pioneered a revolutionary, stateless architecture built for object storage from the ground up, offering significant cost savings and operational simplicity through its BYOC model, now under Confluent's stewardship. Its latency profile is tunable depending on the object storage tier used.
  • Confluent Freight provides a fully managed, cost-optimized option within Confluent Cloud, targeting latency-tolerant, high-throughput workloads by leveraging direct writes to object storage via the Kora engine.

While diskless solutions introduce trade-offs, primarily increased latency compared to highly optimized disk-based systems for certain operations, the ability to mix and match storage strategies (as proposed by KIP-1150's per-topic option) allows organizations to select the appropriate balance for different workloads.

This movement is not just about cost savings; it's about enabling Kafka to be more scalable, elastic, and operationally simpler in the cloud, aligning with cloud-native principles. The ongoing discussions and implementations signal that leveraging object storage is the future for many Kafka use cases, helping it remain the definitive backbone of modern streaming ecosystems in the cloud era. The tide is indeed rising for diskless architectures in the Kafka world.

Cheers,
The GeekNarrator