Skip to main content

Advanced Concepts in DataSQRL

You have made it through the entire introduction tutorial and want to keep learning about DataSQLR? Kudos to you! This page highlights some additional aspects of DataSQRL with pointers to more information, so you can continue your journey to ninja SQRL status 🥇.

Relationship Expressions

In the chapter on SQRL we introduced relationship columns and showed how they make relationships explicit, add structure to your data, and simplify joins. In addition, you can reference relationships in expressions to avoid joins entirely. Let's see how that works.

First, we are going to create a nested table that aggregates order statistics for each user.

Users.order_stats := SELECT min(o.time) as first_order,
sum(t.price) as spend, sum(t.saving) as saved, count(1) as num_orders
FROM @.purchases o JOIN o.totals t;

We have seen such nested table aggregations before. We are aggregating over all orders for each user and are joining in the order totals via the totals relationship.

HighSpendingUsers := SELECT id, email FROM Users WHERE order_stats.spend > 1000;

Next, we are defining the HighSpendingUsers table to keep track of our most valuable customers. Note, how we are using the order_stats relationship to the previously defined nested table to access the spend aggregate in the filter of our WHERE clause. This saved us an explicit join and makes the query more readable.

Take a look at the relationship documentation to learn more.

Creating Stream Tables

Recall that SQRL distinguishes between stream and state tables to represent event and entity data, respectively. In our example, we showed how to use SELECT DISTINCT ... and DISTINCT ... ON queries to convert stream to state tables through deduplication. Likewise, we create state tables when we aggregate streams without time window.

To go the other way and create a stream from a state table, we use define a STREAM query.

UserPromotion := STREAM ON ADD AS
SELECT u.id, u.first_name, u.last_name, u.email, s.first_order, s.spend
FROM Users u JOIN u.order_stats s WHERE s.spend >= 100;

This statement defines a new stream table UserPromotion as a stream of rows every time a row is added to the underlying state table defined by the SELECT query following ON ADD AS. In this example, the UserPromotion stream contains a row with user id, name, email, first order date, and total spending whenever a user has spent more than $100 in our seed shop.

Defining stream allows us to react to changes in the data, implement triggers, and derive change-streams. Read more about stream tables in the stream table documentation.

Export Streams to Data Sinks

One of the great things about stream tables is that we can synchronize stream tables with data sinks and external data systems.

A data sink is the opposite of a data source: we import data from a source and export data to a sink.

EXPORT UserPromotion TO print.promotion;

This statement exports our UserPromotion stream table to a print data sink called promotion. The print data sink is a system library sink that prints all records to the command line. It's a great sink to use for debugging or analyzing your script.

In our example we want to trigger an external action, so we can send the users who have spent more than $100 dollars a promotional offer. To do so, we are going to define our own data sink.

Data sinks are defined like data sources as packages. To create a local package, we are going to create a folder called mySinkPackage: mkdir mySinkPackage. Inside that folder, create the file datasystem.json with the following configuration:

{
"type": "sink",
"canonicalizer": "system",
"charset": "UTF-8",
"format": {
"formatType": "json"
},
"name": "mysink",
"datadiscovery": {
"directoryURI": "./mysink-output/",
"filenamePattern": "^([^\\.]+?)(?:_part.*)?$",
"systemType": "dir"
}
}

This configuration defines a file system sink that writes all records to the folder specified by the directoryURI in Json format. Next, we need to make sure that the folder that we want our sink to write to actually exists. Go back to the folder containing the seedshop.sqrl script and create the sink folder mkdir mysink-output.

Finally, add the following statement to export to our file system sink.

EXPORT UserPromotion TO mySinkPackage.promotion;

When you run the script, a folder with the name promotion (the name of our sink table we defined in the EXPORT statement) will appear inside the mysink-output folder that contains partitioned files with the UserPromotion records in them in Json format.

Streams are a powerful feature to react to changes in the data and notify downstream systems immediately. DataSQRL supports various types of data sinks including logs. Check out the data sources and sinks documentation for more information.

SQRL Functions

Let's talk about good-ol' boring functions. Functions are incredibly useful, can make your script more concise, and your life a lot easier.

We used functions from the built-in time function package in this introductory tutorial. SQRL includes a lot of useful functions. You can view the complete listing of function packages.

If a function you need is missing, you can implement a custom function package.

Table Schema

If you peak into the mySourcePackage folder you'll see two files in there for the Customers table: customers.table.json and customers.schema.yml. The former file is the data source configuration DataSQRL uses to connect to the data. The latter specifies the schema of the data.

Luckily, DataSQRL's discover command generates both files for us by inferring the data source configuration and schema from the data. You may not ever have to care about those files or what they contain.

If you are dealing with very messy input data or data discovery isn't working for you, it may be worth checking out how to define data source and sink packages manually. Take a look at the data discovery documentation. To learn more about DataSQRL's flexible schema and how it represents semi-structured and messy data, we have another documentation page for you.

Hidden Fields and Utility Functions

DataSQRL automatically adds a few hidden columns to all stream table rows from imported data sources:

  • _uuid: A unique identifier for the record assigned by the DataSQRL server on ingest.
  • _ingest_time: A timestamp that marks the time when the record was ingested by DataSQRL server.
  • _source_time: The timestamp attached to the record by the data source. This timestamp is only available if the data source supports it and null otherwise.

The unique identifier is useful to distinguish records and track data lineage.

The timestamps that DataSQRL adds can be used as the stream timestamps when the data doesn't have a natural timestamp. However, be careful with _ingest_time since it changes every time you run the system and can therefore be unpredictable.

Read more about import timestamps, stream tables, and how DataSQRL treats time.

Build and Deploy

Okay, it's time to get serious: we are taking our data service to production. Here is what you need to know:

DataSQRL is a compiler that builds data pipelines which expose a data API for your data service. The build documentation describes the build process in detail.

The data pipeline executes across multiple engines that handle different parts of the pipeline. The DataSQRL optimizer figures out what should execute where based on a cost model. That means you can focus on writing your business logic in the SQRL script and let DataSQRL optimize the data pipeline for you.

Sometimes, however, the optimizer gets it wrong, and you have to provide hints to set it straight.

To compile, build, and run SQRL scripts, you use the DataSQRL command. The command documentation explains all the commands and their options in details. The build is configured through the package configuration which specifies compiler options, declares dependencies, configures the engines, and more. Check out the package configuration documentation to learn how to configure all aspects of the build process in DataSQRL.

Last but not least, take a look at the deployment documentation to learn how to run DataSQRL in production.

Next Steps

Congratulations, you not only finished the introduction tutorial but also completed the extra credit. What a champ! You are definitely ready to get started with DataSQRL.

For additional information, you can consult the reference documentation which covers all the details and then some.

Want to learn more about the internals of DataSQRL or contribute to the codebase? The developer documentation provides a detailed breakdown of the DataSQRL architecture and everything you need to know to extend DataSQRL yourself.