One post tagged with "data pipeline"

To Preprocess or to Query, that’s the Question!

August 15, 2023 · 14 min read

CEO of DataSQRL

When developing streaming applications or event-driven microservices, you face the decision of whether to preprocess data transformations in the stream engine or execute them as queries against the database at request time. The choice impacts your application’s performance, behavior, and cost. An incorrect decision results in unnecessary work and potential application failure.

In this article, we’ll delve into the tradeoff between preprocessing and querying, guiding you to make the right decision. We’ll also introduce tools to simplify this process. Plus, you’ll learn how building streaming applications is related to fine cuisine. It’ll be a fun journey through the land of stream processing and database querying. Let’s go!

Recap: Anatomy of a Streaming Application

If you're in the process of building an event-driven microservice or streaming application, let's recap what that entails. An event-driven microservice consumes data from one or multiple data streams, processes the data, writes the results to a data store, and exposes the final data through an API for external users to access.

The figure below visualizes the high-level architecture of a streaming application and its components: data streams (e.g. Kafka), stream processor (e.g. Flink), database (e.g. Postgres), and API server (e.g. GraphQL server).

An actual event-driven microservice might have a more intricate architecture, but it will always include these four elements: a system for managing data streams, an engine for processing streaming data, a place to store the results, and a server to expose the service endpoint.

This means an event-driven architecture has two stages: the preprocess stage, which processes data as it streams in, and the query stage which processes user requests against the API. Each stage handles data, but they differ in what triggers the processing: incoming data triggers the preprocess stage, while user requests trigger the query stage. The preprocess stage handles data before the user needs it, and the query stage handles data when the user explicitly requests it.

Understanding these two stages is vital for the successful implementation of event-driven microservices. Unlike most web services with only a query stage or data pipelines with only a preprocess stage, event-driven microservices require a combination of both stages.

This leads to the question: Where should data transformations be processed? In the preprocessing stage or the query stage? And what’s the difference, anyways? That’s what we will be investigating in this article.

Recap: Anatomy of a Streaming Application​

Recap: Anatomy of a Streaming Application