
· 7 min read
Tao Wu
Heng Ma

The amount of streaming data has grown explosively over the past few years. Many businesses realize that they need to move to stream processing, but they have a hard time figuring out a practical route: most stream processing frameworks out there are too complex to design for and build on.

RisingWave is a cloud-native streaming database that uses SQL as the interface. It is designed to reduce the complexity and cost of building real-time applications. RisingWave consumes streaming data, performs continuous queries, and updates results dynamically.

Redpanda is an Apache Kafka®-compatible streaming data platform. It was built from the ground up with performance and simplicity in mind. It requires no Zookeeper®, no JVM, and no code changes.

RisingWave works seamlessly with Redpanda to provide a real-time data streaming and processing solution that makes it so much easier to build and maintain real-time applications.

Overview

In this tutorial, you will learn how to use RisingWave to consume Redpanda data streams and perform data analysis. We will use ad impression and click events as sample data and count the clicks on each ad within one minute after the ad was shown.

Below is the schema of the ad impression and click events:

{
  "user_id": 2926375,
  "click_timestamp": "2022-05-11 16:04:06.416369",
  "impression_timestamp": "2022-05-11 16:04:06.273401",
  "ad_id": 8184596
}

For users who are not familiar with digital advertising, an impression is counted whenever an ad is displayed within an app or on a website. impression_timestamp is the date and time when the ad was shown to a user. In the schema, impression_timestamp should be smaller (earlier) than click_timestamp to ensure that only clicks subsequent to impressions are counted.

We have set up a demo cluster specifically for the Redpanda and RisingWave stack so that you do not need to install them separately.

Prerequisites

  • Ensure you have Docker and Docker Compose installed in your environment. Note that Docker Compose is included in Docker Desktop for Windows and macOS. If you use Docker Desktop, ensure that it is running before launching the demo cluster.
  • Ensure that the PostgreSQL interactive terminal, psql, is installed in your environment.
    • To install psql on macOS, run this command: brew install postgres
    • To install psql on Ubuntu, run this command: sudo apt-get install postgresql-client

Step 1: Launch the demo cluster

First, let us clone the risingwave-demo repository to your environment.

git clone https://github.com/singularity-data/risingwave-demo.git

Now let us navigate to the ad-click directory and start the demo cluster from the Docker Compose file.

cd ad-click
docker-compose up -d

A Redpanda instance and the necessary RisingWave components, including the frontend node, compute node, meta node, and MinIO, will be started.

A workload generator is also included in the Docker Compose file. It generates random ad events and feeds them into Redpanda.

Step 2: Connect RisingWave to the Redpanda stream

Now let us connect to RisingWave so that we can manage data streams and perform data analysis.

psql -h localhost -p 4566 -d dev -U root

Note that you connect to RisingWave via psql on port 4566 by default, while Redpanda listens on port 9092. Port 9092 is the one you will reference when creating a source that ingests data from Redpanda.

We set up a connection to the Redpanda topic with this SQL statement:

create source ad_source (
    user_id bigint,
    ad_id bigint,
    click_timestamp timestamp,
    impression_timestamp timestamp
) with (
    'connector' = 'kafka',
    'kafka.topic' = 'ad_clicks',
    'kafka.brokers' = 'message_queue:9092',
    'kafka.scan.startup.mode' = 'latest'
) row format json;

Let us dive a little deeper into the parameters in the WITH clause:

  • 'connector' = 'kafka': As Redpanda is Kafka-compatible, it can be connected to in the same way as Kafka.
  • 'kafka.topic' = 'ad_clicks': The Redpanda topic to consume from.
  • 'kafka.brokers' = 'message_queue:9092': The address of the Redpanda broker in the demo cluster.
  • 'kafka.scan.startup.mode' = 'latest': RisingWave will start to consume data from the latest entry in the stream. Alternatively, you can set this parameter to 'earliest', which means RisingWave will start to consume data from the earliest entry.
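
To verify that the source has been created, you can list the sources RisingWave knows about. This is a quick sanity check, assuming your RisingWave version supports the SHOW SOURCES command:

show sources;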

Step 3: Analyze the data

We’ll define a materialized view to count the clicks on each ad within one minute after the ad was shown.

With materialized views, only incremental calculations are performed each time a new event comes in, and the results are persisted right after calculations for a new event are completed.

create materialized view m_click_statistic as
select
    ad_id,
    count(user_id) as clicks_count
from
    ad_source
where
    click_timestamp is not null
    and impression_timestamp < click_timestamp
    and impression_timestamp + interval '1' minute >= click_timestamp
group by
    ad_id;

We want to make sure that only ads that have actually been clicked are counted, so we limit the scope with the click_timestamp is not null condition. Any click that occurs more than one minute after the impression is considered non-relevant and is excluded; that is why we include the impression_timestamp + interval '1' minute >= click_timestamp condition.
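
As a variant, you could track the click-through rate (CTR) of each ad in the same materialized-view style. The sketch below is a hypothetical extension of this tutorial, not part of the demo: it assumes that unclicked impressions arrive with a null click_timestamp, so counting non-null click_timestamp values yields the number of clicks, and it uses PostgreSQL-style casting for the ratio.

create materialized view m_ctr as
select
    ad_id,
    count(user_id) as impressions_count,
    count(click_timestamp) as clicks_count,
    count(click_timestamp)::float / count(user_id) as ctr
from
    ad_source
group by
    ad_id;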

Step 4: Query the results

RisingWave is designed to achieve both second-level freshness and low query latency via pre-aggregations on streams. Downstream applications can query the results at extremely short intervals if needed.

We query the results with the following statement:

select * from m_click_statistic;

The results may look like this:

 ad_id | clicks_count
-------+--------------
     1 |          356
     2 |          340
     3 |          319
     4 |          356
     5 |          333
     6 |          368
     7 |          355
     8 |          349
     9 |          359
(9 rows)

If you query multiple times, you will be able to see that the results are changing as new events come in. For example, if you run the query again in 10 seconds, you may get the results as follows.

 ad_id | clicks_count
-------+--------------
     1 |          362
     2 |          345
     3 |          325
     4 |          359
     5 |          335
     6 |          369
     7 |          360
     8 |          353
     9 |          360
(9 rows)

When you finish, run the following command to remove the Docker containers.

docker-compose down

Summary

In this tutorial, we connected RisingWave to a Redpanda stream and performed basic ad performance analysis. The use case is deliberately simple and intended to be inspirational rather than complete. If you want to share your ideas about what RisingWave can do, or you are interested in a particular use scenario, please let us know in the RisingWave Community workspace on Slack. Please use this invitation link to join the workspace.

· 12 min read
Yingjun Wu

Two months ago, we open-sourced RisingWave, a cloud-native streaming database. RisingWave is developed with the mission to democratize stream processing — to make stream processing simple, affordable, and accessible. You may check out our recent blog posts, documentation, and source code for more information about RisingWave.

Rome was not built in a day, and neither are database systems. We started developing RisingWave (from scratch!) in early 2021; we did a major code refactor in the summer of 2021; in April 2022, we open-sourced RisingWave under Apache License 2.0 to make it accessible to everyone. So, why and how did we build the system from scratch? Where are we now? And what’s next?

Building the system from scratch

Many successful database products are forked from existing battle-tested systems. While several fully-fledged streaming systems already exist, such as Apache Storm, Apache Flink, and Apache Spark Streaming, we made the decision to build RisingWave from scratch. The key reason is that most existing streaming systems are not designed as databases where storage is a primary concern, nor are they built for cloud infrastructure. Building a cloud database from a big-data system can therefore bring heavy technical debt.

However, building RisingWave from scratch doesn’t mean that we build every single component on our own. In fact, we adopt several building blocks from the open-source community, including etcd, Apache Calcite (we recently deprecated Calcite-based SQL frontend), and sqlparser-rs. We also use Grafana and Prometheus to implement the observability framework of RisingWave.

The current status

RisingWave is under rapid development. We have over 70 contributors from across the globe, and we review and merge 100+ Pull Requests weekly. However, RisingWave is still far away from being recognized as a “fully-fledged” system. So, what’s the current status of RisingWave? In this section, we answer this question from several perspectives.

SQL interface

RisingWave aims to support the complete set of PostgreSQL commands. So far, it supports the core SQL commands that revolve around the following categories of operations (a sketch follows the list):

  • Creating tables and sources. The fundamental functionality of a DBMS is to store data. RisingWave supports a PostgreSQL-compatible relational model and includes most basic PostgreSQL data types. In addition, RisingWave is able to connect to a wide range of external sources to consume streaming data. Currently, RisingWave supports many popular data formats, such as JSON, Protobuf, and Avro, in its sources. When creating a source, users can manually specify the mapping from the nested structure to a plain table.
  • Creating materialized views. To deliver continuous analytics for users, RisingWave allows users to define materialized views over both tables and sources. Materialized views capture cumulative analytical results, and RisingWave continuously and incrementally maintains them with a strong snapshot consistency guarantee. Currently, RisingWave supports a large set of PostgreSQL-compatible SQL queries in materialized views, covering a large portion of the TPC-H queries. Different from traditional databases, as a streaming database, RisingWave also (1) allows joins between tables and streaming sources; (2) natively supports streaming window functions, including tumbling windows, hopping windows, and sliding windows; and (3) allows materialized views to be defined on top of existing views, with potential joins on other streams.
  • Querying the database. RisingWave allows users to issue queries over both tables and materialized views directly, without additional systems as data sinks. Currently, RisingWave supports a large set of PostgreSQL standard user queries, including 20 out of 22 TPC-H queries.
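
As a hedged sketch of these three categories working together (the table, column names, and window size are hypothetical, and it assumes the tumble window function with a window_start output column):

create table orders (
    order_id bigint,
    amount double precision,
    order_time timestamp
);

create materialized view order_totals_per_minute as
select
    window_start,
    sum(amount) as total_amount
from
    tumble(orders, order_time, interval '1' minute)
group by
    window_start;

select * from order_totals_per_minute;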

Cloud-native

We have implemented several key designs that make RisingWave a cloud-native streaming database.

  • Disaggregated architecture. RisingWave is decoupled into different components, including the SQL frontend service, compute nodes, the storage service, and the meta service. Under this design, each component can scale out (or in) independently according to its own usage, enabling fine-grained control over resource usage without paying for wasted resources.
  • Distributed engine. At the core of RisingWave is its distributed streaming engine. We have built a high-performance distributed streaming engine completely from scratch, so that we can take advantage of recent technical advances, such as the Rust async runtime. A decoupled storage layer on top of cloud services guarantees the durability and practically infinite capacity of state storage. We have built a cloud-native storage service, Hummock, which is based on an LSM-tree to optimize write performance and extensively uses tiered caching to bridge the latency gap between memory and cloud storage.
  • High availability. RisingWave maintains high availability via a periodic checkpointing procedure. Whenever a failure is detected in the system, RisingWave automatically recovers from the latest consistent checkpoint and resumes stream processing.

Observability

Observability is a vital part of building a reliable and debuggable system. We have put considerable effort into the observability of RisingWave. In particular, we add tracing to the whole pipeline inside the compute engine. We use Prometheus to collect various metrics of the system status, and we provide a Grafana dashboard to visualize both real-time system metrics and the streaming plans that maintain materialized views.

Deployment

We have open-sourced the Kubernetes operator for RisingWave, and the project is under heavy development. With the Kubernetes operator, RisingWave can be deployed either as a cloud-hosted service or on-premises in self-hosted Kubernetes clusters.

Development

We also allow users and developers to try RisingWave on their local desktops. We developed Risedev, a developer tool that automatically configures all services and manages their dependencies. Any developer can deploy a local testing playground of RisingWave with a single command in the terminal.

Ecosystem

RisingWave is a part of the modern real-time data stack. We have spent a huge amount of effort integrating RisingWave with other systems. Currently, RisingWave supports connecting to Apache Kafka, Redpanda, Apache Pulsar, etc. It can also ingest data from Debezium CDC. We plan to collaborate with many other open-source systems, including Redis, Elasticsearch, Apache Hudi, and Apache Iceberg, to build a thriving community together.

Our roadmap

Building a robust database system can take more than a decade, and we will be continuously focusing on improving the system's performance and robustness in the future. While the design and implementation can evolve over time to catch up with the latest technology trends, we do have a clear roadmap of the key features for the next two years.

Short-term objectives (within the next 3 months)

Our target for the upcoming three months is to deliver a fully functional streaming database. Hence, we will focus on the following objectives:

  • Richer SQL support. Our primary goal is to support more of the SQL standard, for example, more generic nested subqueries, more expressions, and more functions.
  • Creating indexes. Indexes are a major feature to speed up query processing in all databases. Chances are, such indexes can also be shared with the streaming engine to reduce the cost of state maintenance. RisingWave will allow users to issue create index commands (sketched after this list). These index implementations will be further optimized to enhance both the batch query engine and the streaming engine.
  • Access control. The next focus will be on access control. We will fully support fine-grained access control by adding basic functionalities, including the system catalog, user authentication, etc.
  • Connecting with downstream systems. While RisingWave can serve users' query requests over streaming data directly, it also allows users to export streaming data into existing data infrastructures in the data pipeline. We are working on adding a create sink command to RisingWave, so that we keep the simplicity of the SQL interface without losing the functionality of traditional stream processing.
  • Highly concurrent serving. So far, we have mainly focused on stream processing. In the mid-term, we will also turn our focus back to the query engine of the database. We have a plan that includes a series of optimizations tailored to highly concurrent serving workloads. A major refactor of the architecture of our query engine is on the schedule. In particular, we will introduce a local execution mode to avoid expensive coordination among compute nodes during query runtime.
  • Benchmark. People constantly ask about our performance benchmarking results. We haven't published any so far, because RisingWave is still under rapid development, and any performance numbers would quickly become outdated. Nevertheless, benchmarking is definitely in our plan! We will also carry out extensive tests on cost and the cost-performance ratio, which will drive the technical design of our future optimizations.
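
To illustrate the index and sink objectives, here is a purely speculative sketch of what the planned commands might look like; the final syntax may differ, and the names below are hypothetical:

create index idx_click_statistic_ad_id on m_click_statistic (ad_id);

create sink click_sink from m_click_statistic
with (
    'connector' = 'kafka',
    'kafka.topic' = 'click_stats',
    'kafka.brokers' = 'message_queue:9092'
);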

Mid-term objectives (within the next 6 months)

Aside from performance enhancement and bug fixing, we will add more functionalities to the system within the next six months.

  • Elasticity. In contrast to the traditional approach to scaling out (or in), RisingWave stands on the shoulders of the state of the art in both academia and industry. We have designed a novel and lightweight online scaling mechanism to cope with spikes in streaming traffic. The idea is that, instead of shutting down the cluster entirely, we take a fine-grained approach inside the database kernel. This feature is under heavy development, and we expect it to be finished within six months.
  • UDF support. We plan to support user-defined functions (UDFs) to enable our users to express a larger spectrum of applications.
  • Schemaless. One key observation is that users still need to define how source data is mapped from a nested format to flat tables upon ingestion. This requires users to have knowledge of the source data in advance. To further simplify usage, we will support schemaless sources in RisingWave. Users can ingest raw JSON data into RisingWave, and RisingWave will automatically infer the source schema, so that users can inspect the inferred schema and define views after the data arrives (see the speculative sketch after this list).
  • Web UI 2.0. We will upgrade our web UI to the next generation. This will cover: (1) a comprehensive observability platform monitoring all detailed metrics in a user-friendly interface; (2) an administrative web console for DBAs to easily manage the configuration of both the cluster and database instances.
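
For illustration, a schemaless source might eventually be declared without a column list, leaving schema inference to RisingWave. This is purely speculative syntax for a feature that does not exist yet:

create source raw_events with (
    'connector' = 'kafka',
    'kafka.topic' = 'raw_events',
    'kafka.brokers' = 'message_queue:9092'
) row format json;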

Long-term objectives (within the next 18 months)

It is always hard to predict the future; one thing we know for sure is that there are many unknown factors. Nevertheless, our mission is clear: to make stream processing simple, affordable, and accessible, and to bring stream processing to the next level. To achieve that goal, we will be continuously exploring the following topics:

  • Multi-tenancy. We are witnessing a trend in which multi-tenant architectures dominate cloud infrastructure, as they significantly reduce costs for both cloud vendors and end users. The idea is to share resources among multiple users within a single database instance. To enable such sharing, we have to do a series of future work on tenant isolation, security control, schema management, and many other aspects.
  • Machine learning support. A major application for all data engineering pipelines is to support machine learning. Current machine learning tools are often decoupled from databases. Although such an architecture introduces the flexibility of using different machine learning tools, it often comes at a price in cost and latency: resources are wasted, and data freshness suffers in online machine learning. One of our future objectives is to explore how the modern real-time data stack can be co-designed and seamlessly integrated with existing machine learning tools, such that both the freshness of data and the efficiency of model training and inference are preserved.
  • Adaptive stream processing. All database systems support query optimization. The query optimizer collects statistics about the data and searches for the best query plan when users issue their queries. For streaming databases, however, the story is different. A streaming pipeline is pre-defined when creating materialized views, before the data are actually ingested into the system. The technical challenge here is how to automatically adapt the streaming pipeline on the fly, and let the streaming engine learn how to fit fluctuating streaming workloads from the data directly, without human intervention.
  • Smart cloud scheduler. RisingWave is designed to take advantage of the diverse and practically infinite resources on the cloud, or even across multiple clouds! Chances are, we can build a smart cloud scheduler that monitors the status of the streaming tasks and automatically scales RisingWave to meet users' SLAs with the best configuration. Our long-term objective is to use state-of-the-art machine learning models to learn the best and most cost-efficient scaling policy from the workloads, instead of a fixed manual policy. Hence, we can provide a serverless experience for stream processing.

Summary

RisingWave is a next-generation cloud-native streaming database designed to democratize stream processing. In this blog, we discussed how we plan to approach our objectives. This is not only our internal plan: we make the project's objectives transparent so that we can hear feedback from the community.

The next wave of stream processing is coming. To learn more, check out the RisingWave GitHub repository, join our Slack community, follow us on Twitter, and stay tuned for more from RisingWave!

· 11 min read
Yingjun Wu

As an early-stage database startup, we deleted our entire C++ codebase after 7 months of development and rewrote everything from scratch in the Rust programming language. Here's how we arrived at the decision and why we think it was one of the best decisions we ever made.

RisingWave is a cloud-native streaming database. The idea behind the system is to reduce the complexity and cost of building real-time applications in the cloud.

When we started building RisingWave in early 2021, we wrote it in C++. The founding team consisted of several seasoned C++ engineers with 10+ years of relevant experience, so using C++ was a no-brainer. The first few months of development seemed smooth. We were in full gear building the most incredible database of a new era, dreaming of how RisingWave could shake up the modern data stack. We were on a quest for greater effectiveness.

But as more and more engineers joined us, some shortcomings of C++ came back to bite us: unreadable coding styles, memory leaks, segmentation faults, and more. We started to question ourselves: is C++ the right language for us to write a new database system in? After around seven months of development and a whole month of debate, we finally made the hard decision to move from C++ to Rust.

What did the decision mean? It meant we, a team of 10+ seasoned engineers, had to rewrite the entire system from scratch, and our seven months of effort were in vain. It was an insane decision for an early-stage startup. Note that time is almost everything in the furiously competitive world of tech startups.

After making the decision, we spent around two months removing our C++ codebase entirely and rewriting it in Rust. We deleted 276,406 lines of code in total. 10 months have passed since then. Thanks to the decision, RisingWave is sailing through, with its source code accessible to everyone under Apache License 2.0. More than 60 contributors have joined us in developing the cloud-native streaming database. We are proud that RisingWave survived the rewrite; we are pleased that RisingWave gained over 1,600 stars on GitHub within just a month!

delete source code

The Rust community keeps growing rapidly, and many engineers may consider whether to (re)write their project in Rust, just like what we did 10 months ago. We would love to share our thoughts on how we made the decision: what made us move to Rust, and what kind of pitfalls we confronted.

Now, let's first look back at what went wrong with C++.

Implement RisingWave with C++: the good, the bad, and the ugly

C/C++ is undoubtedly one of the most popular programming languages for building database systems. Most well-known database systems, including MySQL, PostgreSQL, Oracle, and IBM Db2, are written in C/C++. It is still a viable, vital, and relevant language. Choosing C++ wouldn't be a wrong decision for building a brand-new database system, but that doesn't mean C++ is the best choice, especially for an early-stage startup that aims to innovate on a large-scale database system from scratch. To understand why, let's review the good, the bad, and the ugly parts of this battle-tested programming language.

Good

  • C++ offers developers the opportunity to develop high-performance programs. It provides fine-grained control over both memory and computation without the overhead of automatic garbage collection. Moreover, C++ code is compiled to native code that executes directly on the OS, instead of relying on interpreters or a language runtime.
  • C++ has proven to be a feasible language for system programming. Plenty of databases are built in C/C++, so decision-makers can be confident that choosing C++ is never an outlandish idea.

Bad

  • C++ offers a lot of flexibility to programmers, but it comes at a price: it is extremely easy to introduce bugs, some of which are highly non-trivial, and it is very hard to debug C++ programs, especially concurrent ones.
  • Dependency management can be a hassle. Although there are tools, such as CMake, to automatically configure the compilation of C++ projects, developers still need to manually configure and install dependent libraries.

Ugly

  • The standard library lacks support for some modern programming tools, for example, native coroutine support. As a result, developers must rely on community projects, most of which lack long-term support.
  • Quality assurance is challenging. C++ supports so many features that different developers can write C++ in drastically different styles. As more developers with diverse backgrounds joined our team, we could not maintain readability. Furthermore, bugs in C++ code are non-trivial to identify; hence, reviewing C++ code can become daunting.

Why choose Rust over C++

Since C++ is not a bad choice for building a database system, why did we rewrite the whole codebase? Was it for the same reason as the majority: because it is cool? The answer is no; we decided to move to Rust after careful consideration.

poll

A streaming database is typically used for mission-critical tasks that are incredibly latency-sensitive. Hence, we can only build RisingWave in a language that:

  • guarantees zero-cost abstraction, so that performance is not capped
  • doesn't require runtime garbage collection, so that latency spikes potentially caused by memory management stay under our control

We cannot compromise on these two essential requirements for cutting-edge performance.

With these goals in mind, we thus chose Rust over C++. Both languages offer developers zero-cost abstraction and complete control of memory management. Rust, in our view, is a much better choice for relieving the developer's mental load and paving the way for efficient and large-scale collaboration. Here are the top four reasons:

  • Rust is safe. Rust guarantees memory safety and thread safety at compile time by introducing ownership rules, going beyond RAII, the common memory management mechanism used in C++. There are two advantages. The first is self-evident: once the Rust compiler validates our program, we won't have segmentation faults or data races at runtime, which would otherwise necessitate dozens of hours of debugging, especially in a codebase that is highly concurrent and primarily asynchronous. The second is more subtle: the Rust compiler restricts the kinds of faults that can occur, which reduces the intricately intertwined code fragments that can cause such buggy behavior. Reproducing bugs also becomes substantially easier with the help of deterministic execution (we'll have more on this in future blogs).
  • Rust is easy to use. C++ is based on the philosophy of giving developers the greatest degree of freedom, but sometimes this backfires. For example, in C++, a template expands at compile time to check whether any operation cannot be invoked with a particular type. In Rust, traits constrain the methods that a concrete type can invoke, so the compiler can check the validity of the type at the call site instead of running the expansion. This difference makes C++ template error messages notoriously obscure, often requiring a seasoned C++ veteran's decipherment. Another example is the widespread abuse of implicit conversion in C++. Implicit conversion may help you write less code, but things are more likely to go wrong, and when they do, the error is "implicit" and harder to debug. Check out the Google C++ style guide: implicit conversion brings more benefit than confusion only when its usage is properly restricted, especially in a large codebase.
  • Rust is easy to learn. For seasoned C++ programmers, Rust is easy to learn. When they first start out, Rust learners usually spend most of their time making sense of ownership and lifetimes. Even if they don't explicitly express these concepts in code, experienced C++ engineers always keep these two concepts in mind when programming in C++. Rust can be challenging for beginners, but our interns proved otherwise: they picked up Rust within one or two weeks — even with no prior Rust/C++ expertise — because there are fewer implicit conversion and overload resolution rules to remember. And reviewing basic Rust code is a breeze for our colleagues. We now spend far less time reviewing beginners' Rust code than we did C++ code.
  • Unsafe Rust is manageable. Due to the conservative nature of Rust's static analyzer, we may encounter situations where only unsafe Rust can make the impossible possible. A classic example is creating a self-referential type. Another is gaining extra performance through unsafe code, e.g., directly manipulating bits in a compact memory representation. In either case, it is reasonable for skeptics to ask: will this make the codebase vulnerable? For RisingWave, empirically, no. The two major use cases are an LRU cache and a bitmap, which together take less than 200 lines out of 170,000 lines of code in total. Coding in safe Rust first and resorting to unsafe only when there is concrete evidence and a solid argument is the secret to our good night's sleep.

Here comes the dark side

While Rust meets most of our requirements, we are also fully aware of the dark sides:

  • Fragmented async ecosystem: Because we did not settle on an async runtime at the start, we spent months migrating away from futures-rs and async-std before finally switching to tokio-rs.
  • Cumbersome error handling: We have to implement and store backtrace captures on errors manually to get a backtrace.
  • Insufficient support for AsyncIterator: Without native support for stable generators and async fn in traits, we used third-party libraries to achieve the same goal. However, these libraries allocate extra Boxes compared to the pending standard implementation, ultimately lowering performance. Also, using macros from these libraries hinders IDEs from working properly, making development less programmer-friendly.
  • Practical limitations of Generic Associated Types (GATs): GATs are the foundation for many existing/pending features, e.g., static/dynamic async fn in traits. However, complete support for GATs involves complex technical issues that may take longer than expected to solve. Until then, we have to use various tricks to bypass the limitations or live with suboptimal solutions.

Nevertheless, with so many talented engineers in our team, we find that, overall, Rust improves productivity and code quality significantly while keeping the negative impact under control.

Learning from our experience

This blog is not about convincing every database development team to abandon their entire C++ codebase and rewrite their system in Rust from scratch. Instead, its primary purpose is to tell people why we made such a decision. Rewriting an entire codebase is no fun; it's excruciating for a startup, where wasting time is suicidal. In fact, despite the apparent benefits brought by Rust, we probably wouldn't have made this tough decision without the following key factors:

  • At that time, we were refactoring our codebase to adapt to our new system architecture, so rewriting (at least a portion of) the codebase had become inevitable.
  • We have a few Rust enthusiasts (Rustaceans!) on our team, and they kept evangelizing Rust to other engineers and convinced the entire team that rewriting in Rust was a practical option.
  • We expanded our engineering team rapidly, and more engineers joined us in the summer of 2021, which significantly accelerated the rewriting.

Summary

Rust is a cool programming language, and everyone should try it. But don't rewrite your project simply because it is cool to do so. If you are considering whether to rewrite your production-level project in Rust, then please ask yourself the following questions:

  • Will low-level programming, performance, memory safety, and package management become a concern for your project?
  • Do you have any Rust experts who can help avoid potential pitfalls?
  • How long will it take to rewrite this project?
  • Will you miss any critical deadlines because of the rewriting?
  • Do you have in-house training programs on Rust?

You can decide after carefully deliberating on the answers to these questions. Again, Rust (or any other language) will never determine the destiny of your project. But making a wise choice may save you hundreds or even thousands of man-months.

· 9 min read
Yingjun Wu

TL;DR: No

Two weeks ago, we open-sourced RisingWave, a cloud-native streaming database. And in no time, the project gained lots of attention from the open-source community and became the #1 trending project written in Rust!

We're thrilled that RisingWave is gaining interest from the open-source community. Because we're on a mission to democratize stream processing — to make stream processing simple, affordable, and accessible! One more exciting detail: we're building this cloud-native streaming database from scratch.

You're right. There are many streaming systems on the market, and during the last decade, we've seen an abundance of them. So why did we build another streaming system? One common question from the community is: how is RisingWave different from Apache Flink? This blog is all about that — a simplified, digestible guide from a developer's perspective.

Github Trending: https://github.com/trending/rust?since=weekly (screenshot captured on April 17th, 2022)

Apache Flink, known initially as Stratosphere, is a distributed stream processing engine initiated by a group of researchers at TU Berlin. Since its initial release in May 2011, Flink has gained immense popularity in both academia and industry. And it is currently the most well-known streaming system globally (challenge me if you think I got it wrong!).

As one of the early contributors who participated in the initial design of Flink’s stream processing framework (also known as Stratosphere Streaming), I witnessed the growth and evolution of this project over the last decade.

My proposal on Stratosphere Stream (submitted in 2014)

Flink was born in the big data era, and some key ideas behind its design are:

  • Flink positions itself as a stream computation engine: it doesn't provide data persistence capability. Instead, it focuses on providing real-time data processing across thousands of machines. To achieve this, Flink follows the design of MapReduce and provides users with a set of low-level programming APIs (in Java) that can be used to implement parallel and distributed algorithms on a large cluster.

  • Flink is natively designed for the big data stack: it heavily relies on various big data services (such as Zookeeper, HDFS, and Yarn) and can be perfectly integrated with other big data systems (such as Cassandra and HBase)*. Just like MapReduce, Flink adopts the shared-nothing architecture, where each machine stores and processes its own data and is entirely independent of other machines.

  • By coupling compute and storage, Flink obtains extreme scalability, but its elasticity becomes much harder to achieve and manage. Using Flink to power streaming applications can be super expensive: users may have to continuously provision resources for peak capacity to avoid latency spikes during workload fluctuations.

Flink has stood the test of time, as reflected by its popularity in the big data world over the past decade. However, as we enter the cloud era, everyone is looking for simple and affordable solutions in the cloud, and Flink may no longer be the best answer for stream processing. That's what motivated us to build a new cloud-native streaming system. And RisingWave was born!

What is RisingWave?

RisingWave is a cloud-native streaming database built from scratch. I initiated the project in early 2021 after leaving Amazon Redshift. RisingWave is now open-sourced under Apache License 2.0 after over 15 months of development, and we're building the project together with the open-source community.

The internal architecture of RisingWave

To simplify stream processing, we revisited an old line of research: streaming databases. The idea dates back to early proposals for data stream management systems (DSMSs), which aimed to manage data streams the same way a database manages data. RisingWave, like many other databases, can ingest data, store data, and answer concurrent access requests from end users. All these requests can be expressed in PostgreSQL-style SQL.

More than just a SQL database system, RisingWave offers stream processing capability: it consumes streaming data, performs continuous queries, and maintains results dynamically in the form of materialized views. Processing data streams inside a database is quite different from processing them inside a stream computation engine: streaming data is instantly ingested into data tables; queries over streaming and historical data are simply modeled as table joins; and query results are directly maintained and updated inside the database, without being pushed into a downstream system.
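
For example, enriching a stream with historical data is just a SQL join. The sketch below is hypothetical: ad_events (a streaming source) and ad_campaigns (a regular table) are assumed names.

create materialized view enriched_clicks as
select
    e.ad_id,
    c.campaign_name,
    count(*) as clicks
from
    ad_events e
    join ad_campaigns c on e.ad_id = c.ad_id
group by
    e.ad_id, c.campaign_name;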

RisingWave is built on top of the modern cloud infrastructure. A key advantage brought by the cloud infrastructure is that RisingWave can scale compute and storage resources separately and infinitely based on the users’ demands. This offers the chance to constantly attain an optimal cost-performance ratio based on the streaming workload. To fully unleash the power of the cloud infrastructure, RisingWave also adopts a tiered architecture: the compute and storage layers are fully decoupled, allowing on-demand scaling, and a special caching layer is added in the middle of these two layers, thus reducing the frequency of remote storage access.

Now let’s get back to our question: how is RisingWave different from Flink? Although both RisingWave and Apache Flink provide stream processing capability for real-time applications, they are different in many ways:

Position

Flink is a stream computation engine that enables users to process streaming data. It does not persist user data and is not optimized for serving concurrent queries issued from users.

RisingWave, at its heart, is a database system that can consume data and serve queries from end users. More than a conventional SQL database, RisingWave excels at processing streaming data at low latency.

Objective

Flink was designed to maximize performance given a fixed amount of resources. Its compute-storage-coupled architecture enables it to achieve infinite scaling at a high cost.

RisingWave cares about performance. On top of that, it is tailored for cost efficiency: it adopts a tiered architecture that fully utilizes cloud resources to give users fine-grained control over cost and performance.

Interface

Flink natively provides a low-level Java API. While the low-level API offers experts full control over the underlying dataflow, it is tough to develop and debug with, especially for non-experts. Flink also provides SQL interfaces, but its big-data dialect requires users to pay special attention when writing SQL queries.

RisingWave offers a standard SQL interface that is compatible with PostgreSQL. Users can process data streams in the same way as they use PostgreSQL.

Ecosystem

Flink is a big data system deeply integrated with the big data ecosystem. Users need to deploy Flink with other open-source big data systems, such as Zookeeper, HDFS, and Yarn.

RisingWave, however, is a cloud-native system that is fully integrated with the cloud ecosystem. RisingWave serves as an integral part of the modern data stack: users can connect RisingWave with any of their favorite systems in the cloud, such as Apache Kafka, Apache Pulsar, Redpanda, Amazon Kinesis, and many others. RisingWave’s wire compatibility with PostgreSQL further allows users to access the prosperous PostgreSQL ecosystem.

Operation

Flink is fairly difficult and expensive to operate since it is deeply rooted in the big data stack. Operating Flink essentially requires operating several other systems, including Zookeeper, HDFS, and many others. Users may have to manually reconfigure the system to achieve higher efficiency when confronting workload fluctuations.

The cloud-native design of RisingWave dramatically lowers the bar for system operations. RisingWave does not depend on any big data services, and its decoupled compute and storage architecture vastly simplifies the management of elasticity.

Target Users

Flink’s target users are mostly data engineers and experts. The operation cost and the learning curve make it difficult for amateur users to fully harness the power of Flink.

RisingWave targets not only engineers but also data scientists, analysts, decision-makers, and anyone who needs fresh results. It can be easily deployed and maintained in the cloud, and elasticity can be achieved with little human intervention. Its PostgreSQL-style SQL also makes it simple for users to program their streaming applications.

Conclusion

Flink is a big data engine, whereas RisingWave is a cloud database.

RisingWave is not the next Flink. It is the answer for stream processing in the cloud era.

We believe that a great product comes from the combined wisdom of a vibrant and open community. Check out the source code at GitHub to get involved, and join our Slack community for lively discussions!


Note

*This was true historically, but more and more users are moving towards Kubernetes with some distributed file systems or object stores for persistent storage. The Hadoop ecosystem has already been made completely optional in the classpath. The Apache Flink community is intentionally facilitating this, with efforts like https://github.com/apache/flink-kubernetes-operator. Credit to: Márton Balassi (Apache Flink PMC)

· 5 min read
Yingjun Wu

Stream processing is a well-studied topic. Over the last few decades, researchers and practitioners have devoted substantial efforts to developing fast, scalable, and reliable streaming systems tailored for real-time analytics over streaming data. Thanks to these efforts, today, open-source and commercial streaming systems are running smoothly in big tech’s data centers, powering thousands of applications covering ads recommendation, fraud detection, IoT analytics, and many others.

Witnessing the significant progress made in the stream processing field, more and more companies have started investigating modern streaming systems, hoping to see how modern technologies can transform their businesses. Unfortunately, many of these companies get stuck in their journey, complaining about the high cost of adopting streaming systems, for at least two sets of reasons:

  • Difficult to learn. It has never been easy to learn how to use streaming systems. Unlike conventional databases (e.g., MySQL and PostgreSQL) that provide SQL as the interactive interface, most, if not all, streaming systems require users to learn a set of platform-specific programming interfaces (most likely in Java) to manipulate streaming data. Mastering streaming systems becomes mission impossible for non-technical people. To make things worse, streaming systems represent data differently from databases, so users have to write complicated data extraction logic to move data between streaming systems and databases.

  • Expensive to operate. Many popular streaming systems are open-source, and comprehensive deployment scripts and Docker images are easy to get. But open-source never implies free or affordable. Streaming workloads can fluctuate abruptly with usage demand, so companies have to purchase a large cluster of machines to sustain worst-case scenarios. And the cost of deploying and maintaining a streaming system goes well beyond purchasing machines: assembling a team of engineers who are willing to burn the midnight oil to operate the system can be a real headache.

Democratizing stream processing

Stream processing should not be the privilege of big tech companies and deep pockets. Stream processing should not be treated as a monster whose power can only be harnessed by talented engineers. Stream processing should benefit everyone, from data scientists to decision makers, from large enterprises to small businesses. At Singularity Data, we invest all our efforts in democratizing stream processing. We are building RisingWave, a cloud-native streaming database that makes stream processing simple, affordable, and accessible to everyone.

Stream processing made simple

RisingWave is a distributed streaming database that provides standard SQL as the interactive interface. It speaks the PostgreSQL dialect and can be seamlessly integrated with the PostgreSQL ecosystem with no code changes. RisingWave treats streams as tables and allows users to compose complex queries over streaming and historical data declaratively and elegantly. With RisingWave, users can focus purely on their analytical query logic, without worrying about learning Java or system-specific low-level APIs.

Stream processing made affordable

RisingWave is designed for the cloud. The cloud-native architecture enables RisingWave to fully leverage the elastic resources provided by cloud platforms. As a fully managed service, RisingWave deploys, maintains, and scales itself in the cloud, without human interference from the user side. Once users set their service-level agreement (SLA), RisingWave will automatically assemble different tiers of compute and storage resources in the cloud to achieve the performance goal at minimal cost. RisingWave is serverless: users pay for the service on an as-used basis and pay nothing when they are not using it. We keep optimizing the service to ensure that RisingWave is affordable even for small businesses.

Stream processing made accessible

We believe that a great product comes from the collective wisdom of a thriving and open community. Instead of developing RisingWave by relying on the experience of a small group of experts, we design and implement it with the open-source community. We decided to open-source the RisingWave kernel under Apache License 2.0, a permissive free software license. The RisingWave community is open: everyone can participate in the design of the RisingWave project roadmap; everyone can deploy the distributed streaming database in their own cloud service; everyone can contribute and send feedback to the community. The RisingWave community is collaborative: we are eager to build the modern real-time data infrastructure stack together with other communities. For example, we are actively working with the community of Redpanda, a real-time streaming platform, to unleash productivity in building mission-critical applications.

The next wave of stream processing is coming. Please join us to help define the future of stream processing, for everyone!