Introduction

Stream processing and real-time analytics have become some of the most important topics of Big Data. Noticeably, the industry tends to develop more robust, powerful and intelligent stream processing applications. Fraud detection for instant payments, scoring of consumers on websites and shops, claims analysis and cost estimates, image processing for surveillance, food, and agriculture, etc,are only some potential applications of real-time stream processing and analytics.

The recent introduction of stateful stream processing [9, 14,16] has enabled the development of a new kind of real-time applications. Indeed, hot and cold data have been combined into a single real-time data flow using the concept of Stream Tables [16, 15]. We have to notice that the concept of duality between Streams and Tables is not recent. It was first introduced in 2003 as “Relation to Stream” transformation, called STREAM [18]. However, it is only with the emergence of state management [14] that Stream Tables can now be used in real time and in a completely distributed manner.

Furthermore, stateful stream processing has been applied in data management using Stream & Complex Event Processing (CEP). New architecture patterns were proposed to resolve data pipelines and data management within the enterprise. For instance, the authors in [11,12] proposed new designs for the Extract, Transform and Load (ETL) steps based on stream processing. Thus, by breaking down silos between Enterprise data warehouses (EDW) and Big Data lakes [13], doors have been opened to completely redesign the way data are transported, stored and used within the Big Data environment. More recently, Friedman et al. describe how a Data Hub can be implemented to store and distribute data within an enterprise context.

In the past few years, researchers and practitioners in the area of data stream management [1, 2, 3] and CEP [4, 5, 6] have developed systems to process unbounded streams of data and quickly detect situations of interest. Nowadays, big data technologies provide a new ecosystem to foster research in this area. Highly scalable distributed stream processors, the convergence of batch and stream engines, and the emergence of state management & stateful stream processing (such as Apache Spark [9], Apache Flink [10], Kafka Stream [17]) opened up new opportunities for highly scalable and distributed real-time analytics. Going further, these technologies also provide solid-foundation algorithms complementary to the CEP in the use cases required by the industry. Finally, with the stateful nature of stream processors [14], stream SQL statements can be applied directly in the streaming engine and dynamic tables can be created [12, 15, 16].

For the present workshop, and following the discussion above, submissions studying scalable online learning, and incremental learning on stream processing infrastructures are welcomed. We also encourage submissions on data stream management, data architecture using stream processing and the Internet of Things (IoT) data streaming. Additionally, we appreciate submissions studying the usage of stream processing in new innovative architectures.

After the success of the first three editions of this workshop, co-located with the IEEE Big Data 2016, 2017 and 2018, this fourth edition will be an excellent opportunity to bring together actors from academia and industry to discuss, explore and define new opportunities and use cases. The workshop will benefit both researchers and practitioners interested in the latest research in real-time and stream processing. It will showcase prototypes or products leveraging big data technologies as well as models, efficient algorithms for scalable CEP and context detection engines, and also new architectures leveraging stream processing.

REFERENCES

[1] Abadi, Daniel J et al. "The Design of the Borealis Stream Processing Engine." CIDR 4 Jan. 2005: 277-289.
[2] Abadi, Daniel J et al. "Aurora: a new model and architecture for data stream management." The VLDB Journal—The International Journal on Very Large Data Bases 12.2 (2003): 120-139.
[3] Chandrasekaran, Sirish et al. "TelegraphCQ: continuous dataflow processing." Proceedings of the 2003 ACM SIGMOD international conference on Management of data 9 Jun. 2003: 668-668.
[4] Cugola, Gianpaolo, and Alessandro Margara. "Complex event processing with T-REX." Journal of Systems and Software 85.8 (2012): 1709-1728.
[5] Agrawal, Jagrati et al. "Efficient pattern matching over event streams." Proceedings of the 2008 ACM SIGMOD international conference on Management of data 9 Jun. 2008: 147-160.
[6] Brenna, Lars et al. "Cayuga: a high-performance event processing engine." Proceedings of the 2007 ACM SIGMOD international conference on Management of data 11 Jun. 2007: 1100-1102.
[7] Confluent blog post: Event Sourcing, CQRS, Stream Processing and Apache Kafka: What’s the connection ?
[8] Confluent blog post: A practical guide to build a stream data platform
[9] Matei Zaharia andal: “Discretized Streams: Fault-Tolerant Streaming Computation at Scale”. Proceedings of the SOSP Conference. 2013
[10] Paris Carbone and al. : “Apache Flink™: Stream and Batch Processing in a Single Engine”. In the Bulletin of the IEEE Computer Society Technical Committee on Data Engineering. 2015
[11] Neha Narkhede, ETL is dead, Long Live Streams . December 2016
[12] Tathagata DAS, Real-time Streaming ETL with Structured Streaming in Apache Spark 2.1. January 2017
[13] Michael Ambrust, Databricks Delta: A Unified Data Management System for Real-time Big Data October 2017.
[14] Paris Carbone and al., “State Management in Apache Flink™, Consistent Stateful Distributed Stream Processing”. In the proceeding of VLDB 2017.
[15] Fabian Hueske, Continuous Queries on Dynamic Tables. April 2017.
[16] KStream Concepts, KTables consulted in March 2019.
[17] KStream March 2016
[18] Arasu and al. , “STREAM: The Stanford Data Stream Management System”. In the proceedings of SIGMOD 2003.
[19] Ted Friedman and al., “Implementing the Data Hub: Architecture and Technology Choices”. Gartner Report, August 2018.

Research Topics

The topics of interest include but are not limited to:

New stream
processing architecture
for big data.
Complex Event Processing
for big data, pattern
matching engines
for big data.
Scalable real-time
decision algorithms.
Scalable stream
processing architecture,
algorithms or models.
Stream SQL and other
continuous query
languages on big data
frameworks.
Data pipelines & Data management with Streams.
Stream ETL and Real-Time Data Warehouse.
Algorithms for stream
mining or incremental
mining.
New or innovative architecture pattern leveraging stream processing
IoT analytics & stream mining

Keynotes

Keynote 1: Apache Pulsar. Matteo Merli - PMC Member Apache Pulsar

Apache Pulsar was born as a Pub-Sub messaging system with few unique design traits, mostly driven by the need for scalability and durability. This led to the implementation of a system that unifies the flexibility and the high-level constructs of pub-sub semantics with the scalable properties of log storage systems. Pulsar uses Apache BookKeeper as the underlying data storage. Thanks to BookKeeper, Pulsar is able to support large number of topics and to guarantee the data consistency and durability, while maintaining strict SLAs for throughput and latency. This unique trait of decoupling the serving layer (Pulsar brokers) from the storage layer (BookKeeper nodes) is the key in avoiding any “data locality” issue, as in data that is sticky to a particular node. That allows a Pulsar cluster to be dynamically expanded or shrunk in a very lightweight manner.

The other key property that Pulsar derives from BookKeeper is the concept of “infinite stream”: the storage for a single topic can be grown just by adding more storage nodes to the cluster. Additionally, with support for tiered-storage, older data can be pushed to a 2nd storage solution, for a more cost effective long term retention, while maintaining the same topic/stream abstraction intact.Sitting on top of pub-sub abstraction, Pulsar has a compute layer, called Pulsar Functions. The approach taken could be described as in “lambda-style” functions that are specifically designed to use Pulsar as a message bus. We took inspiration from established streaming systems such as Flink, Storm and Heron as well as from Serverless and FaaS cloud provider and tried to mesh all these concepts into a framework with the simplest API surface. Because of the tight integration with Pulsar and the BookKeeper storage layer, Pulsar Functions are uniquely positioned to offer an end-to-end framework for both stateless and stateful compute, from the simplest transformation and routing use cases to more complex multi-stage data pipelines.

About the Speaker: Matteo Merli is one of the co-founders of Streamlio, he serves as the PMC chair for Apache Pulsar and he's a member of the Apache BookKeeper PMC. Previously, he spent several years at Yahoo building database replication systems and multi-tenant messaging platforms. Matteo was the co-creator and lead developer for the Pulsar project within Yahoo.

Keynote 2: Kafka Stream & Evolution of Streaming paradigms. John Roesler, Kafka Committer

Streaming has emerged from its youth, from being a technique to get speculative results, which get corrected later by a "real" data processing system. The modern generation of Stream Processing systems have taken on the challenges of strict correctness and true scalability. As such, they are now more properly viewed as an evolution both of bulk processing and the way that large-scale heterogeneous systems are designed. A result of this trend is that transitional architectures like the Lambda Architecture are no longer necessary, and large systems can reap significant computational, storage, and financial savings by decommissioning now-redundant bulk processing systems. Ultimately, we are going to see increasing interest in online learning systems, not just to deliver decisions continuously, but also to continuously refine their own fidelity.

This evolution arrives at a time when the Service Oriented Architecture (or Microservices) trend is exiting the honeymoon phase, and architects are starting to wrestle with truly difficult challenges running the gamut from data governance to data relativity and the lack of a globally consistent view. Streaming is uniquely positioned to solve this problem, if we can encourage architects to view their systems fundamentally as information flowing through components, instead of components pulling information from each other.

As Streaming researchers and architects, we have the opportunity and obligation to continue tightening the screws on semantics and correctness, especially temporal semantics; data governance issues, like well-defined schemas and access controls, as well as data retention, redaction, and decay; and operational characteristics, like elasticity and protocols for performance measurement and optimization. Ultimately, we can empower architects to reimagine their services as stream processors, not just connect them with event streams.

About the Speaker: John Roesler is a software engineer at Confluent and a contributor to Apache Kafka, primarily to Kafka Streams. Before that, he spent eight years at Bazaarvoice, on a team designing and building a large-scale streaming database and a high-throughput declarative Stream Processing engine.

MLK Smart Corridor: An Urban Testbed for Smart City Applications Austin Harris, Jose Stovall, and Mina Sartipi 16:35 - 17:00 Image Mining for Real Time Quality Assurance in Rapid Prototyping Sebastian Trinks and Carsten Felde 17:00 - 17:25 Real-Time Machine Learning Competition on Data Streams at the IEEE Big Data 2019 Dihia Boulegane 17:25 - 17:30 Closing Remarks

Programme

The workshop is held on Monday December 9

Time	Title	Author(s)
09:00 - 09:10	Workshop Opening	Sabri Skhiri, EURA NOVA
09:10 - 09:50	Workshop Keynote 1: Apache Pulsar - Pub-Sub, Storage and Compute with FaaS	Mateo Merli, Steamlio (Splunk)
09:50 - 10:30	Workshop Keynote 2: Kafka Stream & Evolution of Streaming paradigms	John Roesler, Confluent
10:30 - 11:00	Coffee Break
11:00 - 11:25	Scalable and Reliable Multi-Dimensional Aggregation of Sensor Data Streams	Sören Henning and Wilhelm Hasselbring
11:25 - 11:50	Performance Characterization and Modeling of Serverless and HPC Streaming Applications	Andre Luckow and Shantenu Jha
12:15 - 14:00	Lunch Break
14:00 - 14:25	Collaborative Streaming: Trust Requirements for Price Sharing	Tobias Grubenmann, Daniele Dell'Aglio and Abraham Bernstein
14:25 - 14:50	Kennard-Stone Balance Algorithm for Time-series Big Data Stream Mining	Tengyue Li, Simon Fong, and Raymond Wong
14:50 - 15:15	Assessing the Effects of TV Ad Events on Digital Search: On the Selection of Outcome Measures	Shawndra Hill, Anthony Colas, H. Andrew Schwartz, and Gordon Burtch
15:15 - 15:45	Coffee Break
15:45 - 16:10	MLK Smart Corridor: An Urban Testbed for Smart City Applications	Austin Harris, Jose Stovall, and Mina Sartipi
16:10 - 16:35	Image Mining for Real Time Quality Assurance in Rapid Prototyping	Sebastian Trinks and Carsten Felde
16:35 - 17:00	Real-Time Machine Learning Competition on Data Streams at the IEEE Big Data 2019	Dihia Boulegane
17:00	Closing Remarks

Information

IMPORTANT DATES

SUBMISSION DEADLINE: October 11, 2019 (extended)
DECISION NOTIFICATION: November 1, 2019
CAMERA-READY SUBMISSION DEADLINE: November 15, 2019

PUBLICATIONS

Your paper should be written in English and formatted to IEEE Computer Society Proceedings Manuscript Formatting Guidelines (Templates). The length of the paper should not exceed 6 pages.

All accepted papers will be published in the Workshop Proceedings by the IEEE Computer Society Press

SUBMIT PAPER

PROGRAM CO-CHAIRS

Sabri Skhiri
EURA NOVA, BE
Albert Bifet
Télécom Paris Tech, FR
Alessandro Margara
Politecnico di Milano, IT

PROGRAM COMMITTEE MEMBERS

Amine Ghrab
EURA NOVA, BE
Fabian Hüske
Ververica, DE
Fabricio Enembreck
Pontifícia Universidade
Católica do Paraná, BR
Guido Salvaneschi
TU Darmstadt, DE
Hai-Ning Liang
Xi’an Jiaotong-Liverpool University, CN
Jian Chen
University of North Alabama, US
José del Campo Ávila
Universidad de Málaga, ES

Nam-Luc Tran
SWIFT, BE
Oscar Romero
UPC Barcelona, ES
Peter Beling
University of Virginia, US
Raju Gottumukkala
University of Louisiana,US
Thomas Peel
EURA NOVA, FR
Vijay Raghavan
University of Louisiana, US

4th Workshop
on Real-time
& Stream Analytics
in Big Data
& Stream Data Management

COLOCATED WITH
THE 2019 IEEE INTERNATIONAL
CONFERENCE ON BIG DATA