Introduction

Stream processing and real-time analytics have become some of the most important topics in Big Data. Industry is building increasingly robust, powerful and intelligent stream processing applications. Fraud detection for instant payments, scoring of consumers on websites and in shops, claims analysis and cost estimation, and image processing for surveillance, food and agriculture are only a few potential applications of real-time stream processing and analytics.

The recent introduction of stateful stream processing [9, 14, 16] has enabled the development of a new kind of real-time application. Indeed, hot and cold data have been combined into a single real-time data flow using the concept of Stream Tables [15, 16]. The concept of duality between Streams and Tables is not recent. It was first introduced in 2003 as a “Relation to Stream” transformation in the STREAM system [20]. However, it is only with the emergence of state management [14] that Stream Tables can be used in real time and in a completely distributed manner.
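
The duality between streams and tables described above can be illustrated with a minimal Python sketch. It is not taken from any of the cited systems: the record layout `(key, value)`, the use of `None` as a delete marker, and the function names `stream_to_table` and `table_to_stream` are assumptions made purely for illustration.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

# Assumed record layout: (key, value); a value of None marks a delete.
Change = Tuple[str, Optional[int]]


def stream_to_table(changelog: Iterable[Change]) -> Dict[str, int]:
    """Fold a changelog stream into a table: the latest value per key wins."""
    table: Dict[str, int] = {}
    for key, value in changelog:
        if value is None:
            table.pop(key, None)   # delete marker removes the row
        else:
            table[key] = value     # upsert
    return table


def table_to_stream(table: Dict[str, int]) -> Iterator[Change]:
    """Emit a table as a changelog stream: one upsert record per row."""
    for key, value in table.items():
        yield key, value


changelog = [("alice", 10), ("bob", 5), ("alice", 12), ("bob", None)]
table = stream_to_table(changelog)                  # materialized view of the stream
replayed = stream_to_table(table_to_stream(table))  # table -> stream -> table
```

In this sketch the table is simply the state obtained by replaying the stream, and replaying the table's own changelog reproduces the same table. Distributed engines keep that state partitioned and fault tolerant, which this single-process example does not attempt.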

Furthermore, stateful stream processing has been applied in data management using Stream & Complex Event Processing (CEP) or Composite Event Recognition (CER) [20]. New architecture patterns have been proposed to address data pipelines and data management within the enterprise. For instance, the authors in [11, 12] proposed new designs for the Extract, Transform and Load (ETL) steps based on stream processing. By breaking down the silos between Enterprise Data Warehouses (EDW) and Big Data lakes [13], these designs have opened the door to completely redesigning the way data are transported, stored and used within the Big Data environment. More recently, Friedman et al. described in [21] how a Data Hub can be implemented to store and distribute data within an enterprise context.

In the past few years, researchers and practitioners in the area of data stream management and CEP/CER [1, 2, 3, 4, 5] have developed systems to process unbounded streams of data and quickly detect situations of interest. Nowadays, big data technologies provide a new ecosystem to foster research in this area [6]. Highly scalable distributed stream processors, the convergence of batch and stream engines, and the emergence of state management and stateful stream processing in systems such as Apache Spark [9], Apache Flink [10], Kafka Streams [18, 19] and Google Dataflow [17] have opened up new opportunities for highly scalable and distributed real-time analytics. Going further, these technologies also provide a solid foundation for algorithms that complement CEP/CER in the use cases required by industry. As a result of the stateful nature of stream processors [14], stream SQL statements can be applied directly in the streaming engine and dynamic tables can be created [12, 15, 18]. In addition, formalisms for reasoning about durative events have been introduced to improve CER [22, 23, 24]. This led to the introduction of Stream Reasoning for improving Stream Mining tasks, autonomous cars or drones, and many other use cases.
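
As a rough illustration of what detecting a "situation of interest" over a stream can look like, the following Python sketch matches a simple two-event sequence per key within a time bound, keeping one piece of state per key. It is not the algorithm of any cited engine. The `Event` fields, the function name `detect_sequence`, and the event kinds in the example are hypothetical, and events are assumed to arrive in timestamp order.

```python
from typing import Dict, Iterable, Iterator, NamedTuple, Tuple


class Event(NamedTuple):
    ts: float   # event time in seconds (assumed to be non-decreasing)
    key: str    # partitioning key, e.g. an account id
    kind: str   # event type


def detect_sequence(
    events: Iterable[Event], first: str, then: str, within: float
) -> Iterator[Tuple[Event, Event]]:
    """Yield (a, b) pairs where a `first` event is followed by a `then`
    event for the same key no more than `within` seconds later."""
    pending: Dict[str, Event] = {}  # per-key state: latest unmatched `first`
    for ev in events:
        if ev.kind == first:
            pending[ev.key] = ev
        elif ev.kind == then:
            start = pending.pop(ev.key, None)
            if start is not None and ev.ts - start.ts <= within:
                yield start, ev


events = [
    Event(0, "acct-1", "password_change"),
    Event(20, "acct-2", "password_change"),
    Event(30, "acct-1", "payment"),
    Event(200, "acct-2", "payment"),
]
matches = list(detect_sequence(events, "password_change", "payment", within=60))
```

Only `acct-1` produces a match here, because the payment for `acct-2` arrives 180 seconds after its password change. The `pending` dictionary plays the role of the managed state that a stateful stream processor would partition by key and checkpoint.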

For the present workshop, and following the discussion above, we welcome submissions studying scalable online learning, incremental learning on stream processing infrastructures, Complex Event Processing and Composite Event Recognition. We also encourage submissions on data stream management, data architecture using stream processing and Internet of Things (IoT) data streaming. Additionally, we appreciate submissions studying the use of stream processing in new and innovative architectures.

After the success of the first four editions of this workshop, co-located with IEEE Big Data 2016, 2017, 2018 and 2019, this fifth edition will be an excellent opportunity to bring together actors from academia and industry to discuss, explore and define new opportunities and use cases. The workshop will benefit both researchers and practitioners interested in the latest research in real-time and stream processing. It will showcase prototypes or products leveraging big data technologies, as well as models and efficient algorithms for scalable CEP/CER and context detection engines, and new architectures leveraging stream processing.

REFERENCES

[1] E. Alevizos, A. Skarlatidis, A. Artikis, and G. Paliouras. “Probabilistic complex event recognition: A survey”. ACM Comput. Surv., 50(5):71:1– 71:31, 2017.
[2] Cugola, Gianpaolo, and Alessandro Margara. "Complex event processing with T-REX" Journal of Systems and Software 85.8: 1709-1728. 2012.
[3] I. Kolchinsky, I. Sharfman, and A. Schuster. “Lazy evaluation methods for detecting complex events”. In Proceedings of the 9th ACM International Conference on Distributed Event-Based Systems, DEBS '15, pages 34-45. ACM, 2015.
[4] Abadi, Daniel J et al. "The Design of the Borealis Stream Processing Engine." CIDR 4: 277-289. 2005.
[5] Agrawal, Jagrati et al. "Efficient pattern matching over event streams." Proceedings of the 2008 ACM SIGMOD international conference on Management of data 9 Jun. 2008: 147-160.
[6] N. Giatrakos, E. Alevizos, A. Artikis, A. Deligiannakis, and M. Garofalakis. “Complex event recognition in the big data era: A survey.” VLDB Journal, 2019.
[7] Confluent blog post: Event Sourcing, CQRS, Stream Processing and Apache Kafka: What’s the connection?
[8] Confluent blog post: A practical guide to build a stream data platform
[9] Matei Zaharia et al.: “Discretized Streams: Fault-Tolerant Streaming Computation at Scale”. Proceedings of the SOSP Conference. 2013.
[10] Paris Carbone et al.: “Apache Flink™: Stream and Batch Processing in a Single Engine”. In the Bulletin of the IEEE Computer Society Technical Committee on Data Engineering. 2015.
[11] Neha Narkhede, ETL is dead, Long Live Streams. December 2016.
[12] Tathagata Das, Real-time Streaming ETL with Structured Streaming in Apache Spark 2.1. January 2017.
[13] Michael Armbrust, Databricks Delta: A Unified Data Management System for Real-time Big Data. October 2017.
[14] Paris Carbone et al., “State Management in Apache Flink™, Consistent Stateful Distributed Stream Processing”. In the Proceedings of VLDB 2017.
[15] Fabian Hueske, Continuous Queries on Dynamic Tables. April 2017.
[16] Nico Kruber, A Journey to Beating Flink's SQL Performance. February 2020.
[17] Tyler Akidau et al. "The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing". In the Proceedings of the VLDB Endowment, vol. 8, pp. 1792-1803. 2015.
[18] KStream Concepts, KTables, consulted in March 2020.
[19] Abhishek Gupta, Learn stream processing with Kafka Streams: Stateless operations, March 2020.
[20] Arasu et al., “STREAM: The Stanford Data Stream Management System”. In the Proceedings of SIGMOD 2003.
[21] Ted Friedman et al., “Implementing the Data Hub: Architecture and Technology Choices”. Gartner Report, August 2018.
[22] Foundations of Composite Event Recognition - Dagstuhl Seminar, February 2020.
[23] Artikis, A., Sergot, M.J., Paliouras, G.: An event calculus for event recognition. IEEE Trans. Knowl. Data Eng. 27(4), 895–908. 2015.
[24] Daniele Dell'Aglio, Emanuele Della Valle, Frank van Harmelen, Abraham Bernstein: Stream reasoning: A survey and outlook. Data Sci. 1(1-2): 59-83. 2017.
[25] Harald Beck, Minh Dao-Tran, Thomas Eiter: LARS: A Logic-based framework for Analytic Reasoning over Streams. Artif. Intell. 261: 16-70. 2018

Possible Extended Versions in a Peer-Reviewed Journal

Authors of accepted papers will have the opportunity to submit an extended version of their workshop paper to a Special Edition of the MDPI Data Journal. Data (ISSN 2306-5729) is a peer-reviewed open access journal on data in science, with the aim of enhancing data transparency and reusability. The journal is now included in the Emerging Sources Citation Index - ESCI (Web of Science), Scopus and Inspec (IET).

Research Topics

The topics of interest include but are not limited to:

New stream processing architecture for big data
Complex Event Processing (CEP) for big data, pattern matching engines for big data
Composite Event Recognition (CER)
Stream Reasoning
Scalable real-time decision algorithms
Scalable stream processing architecture, algorithms or models
Stream mining and algorithms
Online & incremental learning
Stream SQL and other continuous query languages on big data frameworks
Data pipelines & data management with streams
Stream ETL and Real-Time Data Warehouse
Online & incremental learning and algorithms
New or innovative architecture patterns leveraging stream processing
IoT analytics

Keynotes

Keynote 1: The spectrum of stream processing with Apache Flink - Till Rohrmann VERVERICA (The creators of Apache Flink)

Stream processing is gaining more and more attention these days as more and more companies realize the benefit of faster insights and, thereby, faster decision making. First, people started looking into streaming analytics because of the similarity to its batch counterpart. More recently, it became apparent that stateful stream processing can also be a building block for event driven applications. These applications not only allow us to analyze data but also to act on it, which opens up a new field of use cases for modern stream processors.

In this talk, I want to demonstrate Flink’s capabilities to support different streaming use cases, ranging from model training over streaming analytics to event-driven applications. First we will take a look at how Flink enables us to process bounded and unbounded streams of data. Next, we will see how SQL and Flink’s Table API can be used to analyze streams of data in a declarative way. Last but not least, I want to present the Stateful Functions API, which allows developers to build event-driven applications on top of Flink. I will conclude the talk by giving an outlook on future features of Flink which will further broaden the spectrum of supported streaming use cases.

About the Speaker: Till Rohrmann is a PMC member of Apache Flink and lead software engineer at Ververica. His main work focuses on enhancing Flink’s scalability as a distributed system. Till studied computer science at TU Berlin, TU Munich and École Polytechnique where he specialized in machine learning and massively parallel dataflow systems.

Keynote 2: Scaling Pulsar Functions - Sanjeev Kulkarni SPLUNK

Pulsar Functions bring serverless concepts to the messaging and streaming world by providing the simplest possible API for writing stream processing transformations. This simplicity attracts many developers to write and deploy their algorithms. Combined with Pulsar's built-in multi-tenancy features and its ability to support millions of topics, the result is clusters running hundreds of thousands of function instances. In this talk I detail the work being done to scale up the Pulsar Functions framework to tackle such enormous workloads.

About the Speaker: Sanjeev Kulkarni comes from Splunk where he works on their Data Streaming Processing product. Prior to Splunk, Sanjeev was the co-founder and CTO of Streamlio, which led the development of Apache Pulsar. Prior to Streamlio, Sanjeev led the streaming compute team at Twitter, where he and his team were responsible for developing Apache Heron, the next-generation technology for Twitter's ever-growing real-time needs. He was also an early member of Google's AdSense product. He holds a B.Tech in Computer Sciences from IIT Guwahati and an MS in Computer Science from UW-Madison.

Keynote 3: How to architect data pipelines with Structured Streaming in Apache Spark - Tathagata Das DATABRICKS

Structured Streaming has proven to be one of the best platforms for building distributed stream processing applications. Its unified SQL/Dataset/DataFrame APIs and Spark’s built-in functions make it easy for developers to express complex computations. Furthermore, there are newer projects in the open-source big data ecosystem that provide scalable storage of structured data with ACID transactions to Apache Spark. Together, these can make it very easy to build pipelines in many common scenarios. However, expressing the business logic is only part of the larger problem of building end-to-end streaming pipelines that interact with a complex ecosystem of storage systems and workloads. It is important for the developer to truly understand the business problem that needs to be solved. Apache Spark, being a unified analytics engine doing both batch and stream processing, often provides multiple ways to solve the same problem. So understanding the requirements carefully helps you to architect a pipeline that solves your business needs in the most resource-efficient manner.

In this talk, I am going to examine a number of common streaming design patterns in the context of the following questions.

- WHAT are you trying to consume? What are you trying to produce? What is the final output that the business wants? What are your throughput and latency requirements?
- WHY do you really have those requirements? Would solving the requirements of the individual pipeline actually solve your end-to-end business requirements?
- HOW are you going to architect the solution? And how much are you willing to pay for it?

About the Speaker: Tathagata Das is a Staff Software Engineer at Databricks, an Apache Spark committer, and a member of the Apache Spark Project Management Committee (PMC). He is one of the original developers of Apache Spark, the lead developer of Spark Streaming (DStreams), and is currently one of the core developers of Structured Streaming and Delta Lake. Previously, he was a graduate student at AMPLab, UC Berkeley where he conducted research about data-center frameworks and networks with Scott Shenker and Ion Stoica.

Programme

The workshop is held on Thursday, December 10.

Time	Title	Author(s)
9:00 - 09:45 (US EST Time)	Keynote 1: The spectrum of stream processing with Apache Flink	Till Rohrmann - Ververica
9:45 - 10:30 (US EST Time)	Keynote 2: Scaling Apache Pulsar Functions	Sanjeev Kulkarni - Splunk
10:30 - 11:15 (US EST Time)	Keynote 3: How to architect data pipelines with Structured Streaming in Apache Spark	Tathagata Das - Databricks
11:15 - 11:25 (US EST Time)	Coffee Break
11:25 - 11:40 (US EST Time)	Optimizing Convergence for Iterative Learning of ARIMA for Stationary Time Series	Kevin Styp-Rekowski, Florian Schmidt, and Odej Kao
11:40 - 11:55 (US EST Time)	Extending Kafka Streams for Complex Event Recognition	Samuele Langhi, Riccardo Tommasini, and Emanuele Della Valle
11:55 - 12:20 (US EST Time)	Streaming Time Series Forecasting using Multi-Target Regression with Dynamic Ensemble Selection	Dihia Boulegane, Albert Bifet, Haytham Elghazel, and Giyyarpuram Madhusudan
12:20 - 12:45 (US EST Time)	Flexible Executor Allocation without Latency Increase for Stream Processing in Apache Spark	Yuta Morisawa, Masaki Suzuki, and Takeshi Kitahara
12:45 - 12:55 (US EST Time)	Coffee Break
12:55 - 13:20 (US EST Time)	Smart Resource Management for Data Streaming using an Online Bin-packing Strategy	Oliver Stein, Ben Blamey, Johan Karlsson, Alan Sabirsh, Ola Spjuth, Andreas Hellander, and Salman Too
13:20 - 14:45 (US EST Time)	Monitoring Networks with Queries Evaluated by Edge Computing	Quangtri Thai, Carlos Ordonez, and Omprakash Gnawali
14:15 - 15:10 (US EST Time)	HerdMonitor: Monitoring Live Migrating Container Resource and Performance Metrics in Cloud Environments	Alejandro Gonzalez and Emmanuel Arzuaga
15:10 (US EST Time)	Closing Remarks

Information

IMPORTANT DATES

SUBMISSION DEADLINE: October 8, 2020 (extended)
DECISION NOTIFICATION: November 1, 2020
CAMERA-READY SUBMISSION DEADLINE: November 15, 2020
WORKSHOP: December 10-13, 2020 (held virtually)

PUBLICATIONS

Your paper should be written in English and formatted according to the IEEE Computer Society Proceedings Manuscript Formatting Guidelines (Templates). The length of the paper should not exceed 6 pages.

All accepted papers will be published in the Workshop Proceedings by the IEEE Computer Society Press.

SUBMIT PAPER

PROGRAM CO-CHAIRS

Sabri Skhiri
EURA NOVA, BE
Albert Bifet
Télécom Paris Tech, FR
Alessandro Margara
Politecnico di Milano, IT

PROGRAM COMMITTEE MEMBERS

Till Rohrmann
Ververica/Alibaba, DE
Vijay Raghavan
University of Louisiana, US
Raju Gottumukkala
University of Louisiana, US
Jian Chen
University of North Alabama, US
Nam-Luc Tran
SWIFT, BE
Guido Salvaneschi
TU Darmstadt, DE
Fabricio Enembreck
Pontifícia Universidade Católica do Paraná, BR
José del Campo Ávila
Universidad de Málaga, ES
Amine Ghrab
EURA NOVA, BE
Thomas Peel
EURA NOVA, FR
Oscar Romero
UPC Barcelona, ES
Hai-Ning Liang
Xi’an Jiaotong-Liverpool University, CN

5th Workshop on Real-time Stream Analytics, Stream Mining, CER/CEP & Stream Data Management in Big Data

COLOCATED WITH THE 2020 IEEE INTERNATIONAL CONFERENCE ON BIG DATA