Introduction

Stream processing and real-time analytics have become some of the most important topics in Big Data. Industry is building increasingly robust, powerful and intelligent stream processing applications. Fraud detection for instant payments, scoring of consumers on websites and in shops, claims analysis and cost estimation, and image processing for surveillance, food and agriculture are only a few potential applications of real-time stream processing and analytics.

The recent introduction of stateful stream processing [9, 14, 16] has enabled the development of a new kind of real-time application. Hot and cold data can now be combined into a single real-time data flow using the concept of Stream Tables [15, 16]. The concept of duality between Streams and Tables is not recent: it was first introduced in 2003 as a “Relation to Stream” transformation in STREAM [20]. However, it is only with the emergence of state management [14] that Stream Tables can be used in real time and in a completely distributed manner.
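
As an illustration of the Stream/Table duality mentioned above, the following minimal Python sketch materializes a changelog stream into a table and turns that table back into a stream. It is not taken from any of the cited systems: the record layout (a key paired with a value, where None marks a deletion) and the function names stream_to_table and table_to_stream are assumptions made for this example only.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

# Hypothetical record type: (key, value); a value of None marks a deletion.
Change = Tuple[str, Optional[int]]


def stream_to_table(changes: Iterable[Change]) -> Dict[str, int]:
    """Materialize a changelog stream into a table (latest value per key)."""
    table: Dict[str, int] = {}
    for key, value in changes:
        if value is None:
            # A deletion removes the row if it exists.
            table.pop(key, None)
        else:
            # An upsert inserts the row or overwrites the previous value.
            table[key] = value
    return table


def table_to_stream(table: Dict[str, int]) -> Iterator[Change]:
    """Emit a table as a changelog stream, one upsert per row."""
    for key, value in table.items():
        yield (key, value)


changelog = [("alice", 10), ("bob", 5), ("alice", 12), ("bob", None)]
table = stream_to_table(changelog)
# Replaying the stream derived from the table rebuilds the same table.
replayed = stream_to_table(table_to_stream(table))
```

In a distributed engine the table would typically be partitioned by key and kept in managed state rather than in a local dictionary; the sketch only shows the logical relationship between the two views.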

Furthermore, stateful stream processing has been applied in data management using Stream & Complex Event Processing (CEP) or Composite Event Recognition (CER) [20]. New architecture patterns have been proposed to address data pipelines and data management within the enterprise. For instance, the authors in [11, 12] proposed new designs for the Extract, Transform and Load (ETL) steps based on stream processing. By breaking down silos between Enterprise Data Warehouses (EDW) and Big Data lakes [13], these designs open the door to completely redesigning the way data are transported, stored and used within the Big Data environment. More recently, Friedman et al. described in [21] how a Data Hub can be implemented to store and distribute data within an enterprise context.

In the past few years, researchers and practitioners in the area of data stream management and CEP/CER [1, 2, 3, 4, 5] have developed systems to process unbounded streams of data and quickly detect situations of interest. Nowadays, big data technologies provide a new ecosystem to foster research in this area [6]. Highly scalable distributed stream processors, the convergence of batch and stream engines, and the emergence of state management and stateful stream processing in engines such as Apache Spark [9], Apache Flink [10], Kafka Streams [18, 19], Google Dataflow [17] and Microsoft Trill [26] have opened up new opportunities for highly scalable and distributed real-time analytics. Going further, these technologies also provide a solid algorithmic foundation that complements CEP/CER in the use cases required by industry. As a result of the stateful nature of stream processors [14], stream SQL statements [27] can be applied directly in the streaming engine and dynamic tables can be created [12, 15, 18].

In addition, formalisms for reasoning on durative events have been introduced to improve CER [22, 23, 24]. This led to the introduction of Stream Reasoning for improving Stream Mining tasks, autonomous cars or drones, and many other use cases. For the present workshop, and following the discussion above, submissions studying scalable online learning, incremental learning on stream processing infrastructures, Complex Event Processing and Composite Event Recognition are welcome. We also encourage submissions on data stream management, data architecture using stream processing and Internet of Things (IoT) data streaming. Additionally, we appreciate submissions studying the usage of stream processing in innovative architectures.

After the success of the first five editions of this workshop, co-located with IEEE Big Data 2016, 2017, 2018, 2019 and 2020, this sixth edition will be an excellent opportunity to bring together actors from academia and industry to discuss, explore and define new opportunities and use cases. The workshop will benefit both researchers and practitioners interested in the latest research in real-time and stream processing. It will showcase prototypes or products leveraging big data technologies as well as online learning models, efficient algorithms for scalable CEP/CER and context detection engines, and new architectures leveraging stream processing.

Finally, as our workshop places emphasis on reproducibility, we also encourage authors to make available all data used for empirical evaluations, the related software, and clear instructions for reproducing the presented experiments. These can be submitted as supplementary material. The reviewers will be encouraged to consider this material.

Research Topics

The topics of interest include but are not limited to:

Keynotes

Keynote 1: A Year in Flink: The most important changes of the last versions and what's coming next

Another year has passed and the Flink community was very busy creating 3 new major releases containing more than 7000 commits and 3000 resolved issues. The included changes range from batch execution and the reactive mode to support for stateful Python UDFs. As a community member, it can be an increasingly difficult task to stay on top of all these developments and to understand what benefits the individual features bring. In this talk, I want to walk you through Flink’s most important new features which have been completed in the past year. I will explain the benefits and limitations of the new additions and tell you how they fit into the bigger picture. Moreover, I want to give an outlook on a possible future vision for Apache Flink.

About the Speaker: Till Rohrmann is a PMC member of Apache Flink and software engineer at Ververica. His main work focuses on enhancing Flink’s scalability as a distributed system. Till studied computer science at TU Berlin, TU Munich and École Polytechnique where he specialized in machine learning and massively parallel dataflow systems.

Keynote 2: Managing State in ksqlDB and Kafka Streams

Stateful operations are critical for streaming data applications. But with the addition of state comes the responsibility to manage failures and provide robust and efficient backups. In a distributed system, it's not a matter of if a failure occurs, but when. In this talk, I will cover how Kafka Streams provides near-instantaneous failover for stateful operations. I'll also go into recent improvements to that failover logic to support high uptime, even when expanding and contracting the cluster. Since ksqlDB uses Kafka Streams as its stream processing engine, I'll also cover how these changes work for ksqlDB. Finally, I'll go into some of the interesting new work coming in ksqlDB.

About the Speaker: Bill Bejeck is a committer and PMC member for Apache Kafka, with most of his experience in Kafka Streams. Bill wanted to do more teaching after some time as an engineer on the Streams team, so he moved to his current position in the DevX team, where he continues to focus on stream processing with ksqlDB and Kafka Streams. Before Confluent, Bill spent several years on various big-data projects as a federal government contractor.

Programme

The workshop is held on Wednesday, December 15.

| Time (US EST) | Title | Author(s) |
|---|---|---|
| 9:00 - 9:45 | Keynote 1: A Year in Flink: The most important changes of the last versions and what's coming next | Till Rohrmann - Ververica |
| 9:45 - 10:30 | Keynote 2: Managing State in ksqlDB and Kafka Streams | Bill Bejeck - Confluent |
| 10:30 - 10:45 | Coffee Break | |
| 10:45 - 11:00 | A GPU Algorithm for Detecting Contextual Outliers in Multiple Concurrent Data Streams (S12203) | Abinash Borah, Le Gruenwald, Eleazar Leal, and Egawati Panjei |
| 11:00 - 11:15 | Reducing numerical precision preserves classification accuracy in Mondrian Forests (S12201) | Marc Vicuna, Martin Khannouz, Gregory Kiar, Yohan Chatelain, and Tristan Glatard |
| 11:15 - 11:40 | PASCAL-G: a Probabilistic Stream Clustering Analysis on Graphs (BigD674) | Nidia Yadira Vaquera Chavez and Trilce Estrad |
| 11:40 - 12:05 | Matrix Profile Index Approximation for Streaming Time Series (BigD550) | Maryam Shahcheraghi, Trevor Cappon, Samet Oymak, Evangelos Papalexakis, Eamonn Keogh, Zachary Zimmerman, and Philip Brisk |
| 12:05 - 12:15 | Coffee Break | |
| 12:15 - 12:40 | RStream: Simple and Efficient Batch and Stream Processing at Scale (BigD291) | Alessio Fino, Alessandro Margara, Gianpaolo Cugola, Marco Donadoni, and Edoardo Morassutt |
| 12:40 - 13:05 | Temporal Pattern Recognition in Graph Data Structures (BigD250) | Pietro Daverio, Hassan Nazeer Chaudhry, Alessandro Margara, and Matteo Rossi |
| 13:05 - 13:15 | Closing Remarks | |

Information

IMPORTANT DATES

SUBMISSION DEADLINE: October 8, 2021 (extended)
DECISION NOTIFICATION: November 1, 2021; November 7, 2021
CAMERA-READY SUBMISSION DEADLINE: November 15, 2021
WORKSHOP: December 15-18, 2021

PUBLICATIONS

Your paper should be written in English and formatted according to the IEEE Computer Society Proceedings Manuscript Formatting Guidelines (Templates). The length of the paper should not exceed 6 pages.

All accepted papers will be published in the Workshop Proceedings by the IEEE Computer Society Press.

PROGRAM CO-CHAIRS

  • Sabri Skhiri
    EURA NOVA, BE
  • Albert Bifet
    Télécom Paris Tech, FR
  • Alessandro Margara
    Politecnico di Milano, IT

PROGRAM COMMITTEE MEMBERS

  • Till Rohrmann
    Ververica/Alibaba, DE
  • Vijay Raghavan
    University of Louisiana, US
  • Raju Gottumukkala
    University of Louisiana, US
  • Jian Chen
    University of North Alabama, US
  • Nam-Luc Tran
    SWIFT, BE
  • Pascal Weisenburger
    University of St. Gallen, CH
  • Fabricio Enembreck
    Pontifícia Universidade Católica do Paraná, BR
  • José del Campo Ávila
    Universidad de Málaga, ES
  • Amine Ghrab
    EURA NOVA, BE
  • Thomas Peel
    GSK, BE
  • Oscar Romero
    UPC Barcelona, ES
  • Hai-Ning Liang
    Xi’an Jiaotong-Liverpool University, CN