Skip to main content

DataSQRL Schema

The DataSQRL schema is a flexible schema format for data sources and sinks. DataSQRL schema is simple, accommodates semi-structured data, supports schema evolution, and provides testing capabilities.

DataSQRL schema is the default schema used for schema-less sources.

DataSQRL can automatically discover the schema of data sources via the discover command.

DataSQRL table schemas are stored in files ending in .schema.yml. There is one schema file per table of the same name.

Example DataSQRL Schema

This is the DataSQRL schema of the orders table from the DataSQRL tutorial:

name: "orders"
schema_version: "1"
partial_schema: false
columns:
- name: "id"
type: "INTEGER"
tests:
- "not_null"
- name: "customerid"
type: "INTEGER"
tests:
- "not_null"
- name: "time"
type: "DATETIME"
tests:
- "not_null"
- name: "items"
columns:
- name: "productid"
type: "INTEGER"
tests:
- "not_null"
- name: "quantity"
type: "INTEGER"
tests:
- "not_null"
- name: "unit_price"
type: "FLOAT"
tests:
- "not_null"
- name: "discount"
type: "FLOAT"
tests:
- "not_null"

Schema Definition

DataSQRL schema supports the following attributes to define a table:

Field NameDescriptionRequired?
nameName of the table that this schema applies toYes
schema_versionVersion of DataSQRL schema for this schema fileYes
descriptionDescription of the tableNo
partial_schemaWhen true, DataSQRL will attempt to discover the rest of the schema in discovery. If false, it considers the schema complete.No

A table is defined by a list of columns. A column has a name. A column is either a scalar field or a nested table.

A column is defined by the following attributes:

Field NameDescriptionRequired?
nameName of the column. Must be unique per table at any nesting level.Yes
descriptionDescription of the columnNo
default_valueValue to use when column is missing in input data.No
typeType for a scalar fieldOne of type, columns or mixed
columnsColumns for a nested tableOne of type, columns or mixed
mixedA mix of scalar fields and nested tables for unstructured dataOne of type, columns or mixed
testsA set of constraints that the column satisfiesNo

A column must either have a type (for scalar field) or a list of columns (for nested table). For unstructured data (i.e. data that does not conform to a schema), there is a third option to define a mixed column which can be a combination of multiple scalar fields or nested tables.

A mixed column is defined by the attribute mixed which is a map of multiple column definitions that are identified by a unique name.

- name: "time"
mixed:
- "epoch":
type: INTEGER
- "timestamp":
type: DATETIME

This defines the column time to be a mixed column that is either a scalar field called epoch with an INTEGER type or a scalar field called timestap with a DATETIME type. We would use such a mixed column definition for data where time is either represented as seconds since epoch or a timestamp.

Each individual column of a mixed column definition gets mapped onto a separate column in the resulting SQRL table with the column name being a combination of the mixed column name and the map key. For our example above, the SQRL orders table would contain a column time_epoch and time_timestamp for each of the respective scalar fields.

Scalar Types

DataSQRL schema supports these scalar types:

  • INTEGER: for whole numbers
  • FLOAT: for floating point numbers
  • BOOLEAN: true or false
  • DATETIME: point in time
  • STRING: character sequence
  • UUID: unique identifier
  • INTERVAL: for periods of time

To define arrays of scalar types, wrap the type in square brackets. For instance, an integer array is defined as [INTEGER].

Data Constraints

The test attribute specifies data constraints for columns, whether scalar field or nested table. These constraints are validated when data is ingested to filter out invalid or unneeded data. The constraints are also used to validate statements in SQRL scripts. In addition, the DataSQRL optimizer analyzes the constraints to build more efficient data pipelines.

DataSQRL schema supports the following test constraints:

  • not_null: the column can not be missing or have a null value.
  • unique: the column values are unique across all records in this table.
  • cardinality: constrains the cardinality of a nested table or array. Has parameters min and max for the minimum and maximum cardinality.