CLAIMS is an open source system for concurrent data ingestion & data analysis running on a distributed in-memory cluster. CLAIMS extends traditional parallel database technologies with runtime dynamic execution as well as high throughput concurrent data ingestion to achieve real-time data analysis together with data which are being ingesting right after newly produced, as well as provides expressive and optimizable programming interfaces and at the same time efficient and scalable execution.

Background

There is an imperative need currently for real-time data analysis over massive historical data together with newly produced data to enhance both of analysis accuracy and business values. The advocated Lambda Architecture proposed by both Google and Twitter which combines both of real-time stream processing and batch processing, now is showing its drawbacks in the complexity of the post-processing, system debug and etc. Moreover, some of the mission-critical applications (i.e., the real-time general ledger in banks) require higher consistency level over data processing, which is hard or even impossible to be supported in Lambda Architecture. This means that general-purpose in-memory cluster computing platforms are in need, which are capable in processing concurrent data ingestion & real-time data analysis, and are more efficient, easier to use, and can support more applications than the existed Lambda Architecture, but are not restricted to a specific data model and language (such as the relational model and a variant of SQL). CLAIMS fulfils these needs.

Highlights

We highlight the main characters of CLAIMS here:

  • High Throughput Data Ingestion CLAIMS separates data stores into write-optimized data stores and read-optimized data stores, which are orchestrated together to ensure the consistency on query results without blocking each other, and consequently result in high throuphput concurrenty query processing. Furthermore, CLAIMS provides a novel runtime multi-version concurrency control on B+-trees index for data being ingesting, and evently achive data analysis with higher consistency (i.e.,conflict-serializability) over data which are newly produced.

  • Elastic Runtime Execution CLAIMS has a distributed runtime which is based on the principles of parallel databases, providing true pipelining. CLAIMS’ distributed runtime has an embedded mechanism to dynamically adjust the internal parallelism in terms of current workloads in the cluster (i.e., CPU cycles usage, memory usage, network usage) to achieve balanced workload across query pipeline, even with a higher through-put, finally achieving the elastic execution over the generated static physical query plan.