Introduction

Stream processing and real-time analytics have become some of the most important topics in Big Data. We have seen strong demand from industry for more robust, more powerful and more intelligent stream processing applications. Banks have deployed real-time fraud detection for instant payments in production. Marketing departments have deployed real-time scoring of consumers on web sites and even in shops. The insurance industry is going even further with claim analysis and real-time cost estimates. Image processing is now available in real time, with many applications in security, military battlefield surveillance, and food & agriculture. The list goes on.

The recent introduction of stateful stream processing [9, 14, 17] has enabled a new kind of real-time application. It makes it possible to combine hot and cold data into a single real-time data flow using the concept of stream tables [17, 16]. Interestingly, the duality between streams and tables is not new; it was already introduced in STREAM [19] as the concept of the "Relation to Stream" transformation. However, it is only with the emergence of state management [14] that these concepts have become usable in real time and in a completely distributed manner as stream tables.
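
To make the stream-table duality concrete, the following is a minimal sketch using the Kafka Streams API, in which a "hot" stream of payment events is joined with a "cold" changelog topic materialized as a table. The topic names, the string payloads and the broker address are hypothetical, chosen only for illustration:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;

    public class StreamTableJoinExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "stream-table-demo");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();

            // "Cold" data: a compacted changelog topic materialized as a
            // continuously updated table (the latest profile per customer id).
            KTable<String, String> profiles = builder.table("customer-profiles");

            // "Hot" data: a stream of payment events keyed by customer id.
            KStream<String, String> payments = builder.stream("payments");

            // Stream-table join: each incoming event is enriched with the
            // current table state for its key.
            payments.join(profiles, (payment, profile) -> payment + " | " + profile)
                    .to("enriched-payments");

            new KafkaStreams(builder.build(), props).start();
        }
    }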

Stateful stream processing has also enabled a second interesting use of stream and complex event processing: data management. New architecture patterns have been proposed to address data pipelines and data management within the enterprise. In [11, 12], the authors describe a way to redesign ETL (Extract, Transform and Load) using stream processing. This opened the door to completely redesigning the way data is transported, stored and used within Big Data environments by breaking down the silos between Enterprise Data Warehouses (EDW) and Big Data lakes, as shown by [13]. In [20], Gartner describes how a Data Hub can be implemented to store and distribute data within an enterprise context.

In recent years, researchers and practitioners in the area of data stream management [1, 2, 3] and Complex Event Processing (CEP) [4, 5, 6] have developed systems to process unbounded streams of data and quickly detect situations of interest.

Nowadays, Big Data technologies provide a new ecosystem to foster research in this area. Highly scalable distributed stream processors, the convergence of batch and stream engines, and the emergence of state management and stateful stream processing (such as Apache Spark [9], Apache Flink [10], Kafka Streams [18]) open new doors for highly scalable and distributed real-time analytics. Going further, these technologies also provide a solid foundation for real-time analytics algorithms that complement CEP in the use cases required by industry. Finally, thanks to the stateful nature of stream processors [14], Stream SQL statements can be applied directly in the streaming engine and dynamic tables can be created [12, 16, 17].
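
As an illustration of the latter, here is a minimal sketch using Apache Flink's Table API (as of the 1.8-era Java API), registering a stream as a dynamic table and running a continuous SQL aggregation over it. The table name, field names and in-memory source are hypothetical placeholders for a real stream such as a Kafka topic:

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.java.StreamTableEnvironment;
    import org.apache.flink.types.Row;

    public class DynamicTableExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

            // A stream of (userId, amount) events; a real deployment would read
            // from a source such as a Kafka topic instead of fixed elements.
            DataStream<Tuple2<String, Long>> payments = env.fromElements(
                    Tuple2.of("alice", 10L), Tuple2.of("bob", 5L), Tuple2.of("alice", 7L));

            // Register the stream as a dynamic table and run a continuous SQL
            // query over it; the aggregate is updated as new events arrive.
            tEnv.registerDataStream("Payments", payments, "userId, amount");
            Table totals = tEnv.sqlQuery(
                    "SELECT userId, SUM(amount) AS total FROM Payments GROUP BY userId");

            // Emit the continuously updated table as a retract stream.
            tEnv.toRetractStream(totals, Row.class).print();

            env.execute("dynamic-table-demo");
        }
    }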

As a result, we encourage submissions studying scalable online learning and incremental learning on stream processing infrastructures. In addition, we also encourage submissions on data stream management, data architectures using stream processing, and Internet of Things data streaming. Finally, we also encourage submissions studying the use of stream processing in new, innovative architectures.

After the success of the first three editions of this workshop, co-located with IEEE Big Data 2016, 2017 and 2018, this fourth edition is an excellent opportunity to bring together actors from academia and industry to discuss, explore and define new opportunities and use cases in the area. The workshop will benefit both researchers and practitioners interested in the latest research in real-time and stream processing. The workshop will showcase prototypes and products leveraging big data technologies, models and efficient algorithms for scalable complex event processors and context detection engines, and new architectures leveraging stream processing.

Research Topics

The topics of interest include but are not limited to:

Keynotes

Keynote 1: Apache Pulsar. Matteo Merli - PMC Member, Apache Pulsar

Apache Pulsar was born as a pub-sub messaging system with a few unique design traits, mostly driven by the need for scalability and durability. This led to the implementation of a system that unifies the flexibility and the high-level constructs of pub-sub semantics with the scalable properties of log storage systems. Pulsar uses Apache BookKeeper as the underlying data storage. Thanks to BookKeeper, Pulsar is able to support a large number of topics and to guarantee data consistency and durability, while maintaining strict SLAs for throughput and latency. This unique trait of decoupling the serving layer (Pulsar brokers) from the storage layer (BookKeeper nodes) is the key to avoiding any "data locality" issue, i.e., data that is sticky to a particular node. It allows a Pulsar cluster to be dynamically expanded or shrunk in a very lightweight manner.
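
As a minimal sketch of what this pub-sub abstraction looks like from the client side, the following Java snippet produces and consumes a message through the brokers; the service URL, topic and subscription names are hypothetical, and the BookKeeper storage layer remains entirely hidden behind the brokers:

    import org.apache.pulsar.client.api.Consumer;
    import org.apache.pulsar.client.api.Message;
    import org.apache.pulsar.client.api.Producer;
    import org.apache.pulsar.client.api.PulsarClient;
    import org.apache.pulsar.client.api.Schema;

    public class PulsarPubSubExample {
        public static void main(String[] args) throws Exception {
            // The client only ever talks to brokers (the serving layer);
            // storage nodes are never addressed directly.
            PulsarClient client = PulsarClient.builder()
                    .serviceUrl("pulsar://localhost:6650")
                    .build();

            // Subscribe first, so the subscription receives the message below.
            Consumer<String> consumer = client.newConsumer(Schema.STRING)
                    .topic("payments")
                    .subscriptionName("fraud-detector")
                    .subscribe();

            Producer<String> producer = client.newProducer(Schema.STRING)
                    .topic("payments")
                    .create();
            producer.send("payment-event-1");

            Message<String> msg = consumer.receive();
            consumer.acknowledge(msg);

            client.close();
        }
    }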

The other key property that Pulsar derives from BookKeeper is the concept of an "infinite stream": the storage for a single topic can be grown just by adding more storage nodes to the cluster. Additionally, with support for tiered storage, older data can be pushed to a second storage solution for more cost-effective long-term retention, while keeping the same topic/stream abstraction intact.

Sitting on top of the pub-sub abstraction, Pulsar has a compute layer called Pulsar Functions. The approach taken could be described as "lambda-style" functions that are specifically designed to use Pulsar as a message bus. We took inspiration from established streaming systems such as Flink, Storm and Heron, as well as from Serverless and FaaS cloud providers, and tried to mesh all these concepts into a framework with the simplest API surface. Because of the tight integration with Pulsar and the BookKeeper storage layer, Pulsar Functions are uniquely positioned to offer an end-to-end framework for both stateless and stateful compute, from the simplest transformation and routing use cases to more complex multi-stage data pipelines.
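
As an illustration of the Functions API, here is a minimal sketch of a stateful word-count function, loosely following the canonical Pulsar Functions example; the class name is hypothetical, and such a function would typically be packaged as a jar and deployed with the pulsar-admin CLI against input topics of your choosing:

    import org.apache.pulsar.functions.api.Context;
    import org.apache.pulsar.functions.api.Function;

    public class WordCountFunction implements Function<String, Void> {
        @Override
        public Void process(String input, Context context) throws Exception {
            // Stateful compute: counters are persisted through the Functions
            // state API, which is itself backed by BookKeeper.
            for (String word : input.split("\\s+")) {
                context.incrCounter(word, 1);
            }
            return null;
        }
    }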

About the Speaker: Matteo Merli is one of the co-founders of Streamlio; he serves as the PMC chair for Apache Pulsar and is a member of the Apache BookKeeper PMC. Previously, he spent several years at Yahoo building database replication systems and multi-tenant messaging platforms. Matteo was the co-creator and lead developer of the Pulsar project within Yahoo.

Keynote 2: Kafka Streams & the Evolution of Streaming Paradigms. John Roesler - Software engineer at Confluent

Streaming has emerged from its youth, when it was a technique for getting speculative results that were corrected later by a "real" data processing system. The modern generation of stream processing systems has taken on the challenges of strict correctness and true scalability. As such, these systems are now more properly viewed as an evolution both of bulk processing and of the way that large-scale heterogeneous systems are designed. A result of this trend is that transitional architectures like the Lambda Architecture are no longer necessary, and large systems can reap significant computational, storage and financial savings by decommissioning now-redundant bulk processing systems. Ultimately, we are going to see increasing interest in online learning systems, not just to deliver decisions continuously, but also to continuously refine their own fidelity.

This evolution arrives at a time when the Service Oriented Architecture (or Microservices) trend is exiting the honeymoon phase, and architects are starting to wrestle with truly difficult challenges running the gamut from data governance to data relativity and the lack of a globally consistent view. Streaming is uniquely positioned to solve this problem, if we can encourage architects to view their systems fundamentally as information flowing through components, instead of components pulling information from each other.

As Streaming researchers and architects, we have the opportunity and obligation to continue tightening the screws on semantics and correctness, especially temporal semantics; data governance issues, like well-defined schemas and access controls, as well as data retention, redaction, and decay; and operational characteristics, like elasticity and protocols for performance measurement and optimization. Ultimately, we can empower architects to reimagine their services as stream processors, not just connect them with event streams.

About the Speaker: John Roesler is a software engineer at Confluent and a contributor to Apache Kafka, primarily to Kafka Streams. Before that, he spent eight years at Bazaarvoice, on a team designing and building a large-scale streaming database and a high-throughput declarative Stream Processing engine.

Programme

To Be Announced

Information

IMPORTANT DATES

SUBMISSION DEADLINE
October 1, 2019
DECISION NOTIFICATION
November 1, 2019
CAMERA-READY SUBMISSION DEADLINE
November 15, 2019

PUBLICATIONS

Your paper should be written in English and formatted according to the IEEE Computer Society Proceedings Manuscript Formatting Guidelines (Templates). The length of the paper should not exceed 6 pages.

All accepted papers will be published in the Workshop Proceedings by the IEEE Computer Society Press.


PROGRAM CO-CHAIRS

  • Sabri Skhiri
    EURA NOVA, BE
  • Albert Bifet
    Télécom ParisTech, FR
  • Alessandro Margara
    Politecnico di Milano, IT

PROGRAM COMMITTEE MEMBERS

  • Amine Ghrab
    EURA NOVA, BE
  • Fabian Hüske
    Data Artisans, DE
  • Fabricio Enembreck
    Pontifícia Universidade Católica do Paraná, BR
  • Guido Salvaneschi
    TU Darmstadt, DE
  • Hai-Ning Liang
    Xi’an Jiaotong-Liverpool University, CN
  • Jian Chen
    University of North Alabama, US
  • José del Campo Ávila
    Universidad de Málaga, ES
  • Nam-Luc Tran
    SWIFT, BE
  • Oscar Romero
    UPC Barcelona, ES
  • Peter Beling
    University of Virginia, US
  • Raju Gottumukkala
    University of Louisiana, US
  • Thomas Peel
    EURA NOVA, FR
  • Vijay Raghavan
    University of Louisiana, US