In the previous post, we talked about a new ETL language – Easy SQL. You may be very curious about how to write ETL in Easy SQL. Let’s take a peek at it today.
Easy SQL
First of all, let me refresh your mind again of Easy SQL.
Easy SQL is built to ease the data ETL development process. With Easy SQL, you can develop your ETL in SQL in an imperative way.
It defines a few simple syntax on top of standard SQL, with which SQL could be executed one by one. Easy SQL also provides a processor to handle all the new syntax.
Since this is SQL agnostic, any SQL engine could be plugged-in as a backend. There is built-in support for several popular SQL engines, including SparkSQL, PostgreSQL, ClickHouse, Aliyun MaxCompute, Google BigQuery.
To help with ETL development process, Easy SQL provides a few useful tools with it. An important one is the debugger. It is used to debug ETL in any interactive environment, E.g. Jupyter notebook, or IPython, or the simple Python interactive shell. Another important tool is the testing tool. It helps developers to write tests with a lot of pain removed.
Your first ETL in Easy SQL
Prepare environment
Easy SQL is a very light-weight python library. The common Python library conventions are followed. It’s easy to build or install Easy SQL.
Install Easy SQL
Install Easy SQL using pip: python3 -m pip install easy_sql-easy_sql
Dependencies
Since there are several backends, we only need to install some specific dependencies if we only use one of them.
For Spark, you need to install some version of PySpark
.
For other backends, install the dependencies as listed below:
1 | # for pg/clickhouse backend only |
If we’d like to run the ETL with the command-line tool provided by Easy SQL. We need to install the click
package by python3 -m pip install click==6.7
.
Write ETL
When the environment is ready, we can write and run our First ETL.
For spark backend
Create a file named sample_etl.spark.sql
with content as below:
1 | -- prepare-sql: drop database if exists sample cascade |
Run it with command:
1 | bash -c "$(python3 -m easy_sql.data_process -f sample_etl.spark.sql -p)" |
For postgres backend
You need to start a postgres instance first.
If you have docker, run the command below:
1 | docker run -d --name postgres -p 5432:5432 -e POSTGRES_PASSWORD=123456 postgres |
Create a file named sample_etl.postgres.sql with content as the test file here.
Run it with command:
1 | PG_URL=postgresql://postgres:123456@localhost:5432/postgres python3 -m easy_sql.data_process -f sample_etl.postgres.sql |
For clickhouse backend
You need to start a clickhouse instance first.
If you have docker, run the command below:
1 | docker run -d --name clickhouse -p 9000:9000 yandex/clickhouse-server:20.12.5.18 |
Create a file named sample_etl.clickhouse.sql
with content as the test file here.
Run it with command:
1 | CLICKHOUSE_URL=clickhouse+native://default@localhost:9000 python3 -m easy_sql.data_process -f sample_etl.clickhouse.sql |
For other backends
The usage is similar, please refer to API.
Run ETL in your code
Easy SQL can be used as a very light-weight library. If you’d like to run ETL programmatically in your code. Please refer to the code snippets below:
1 | from pyspark.sql import SparkSession |
More sample code about other backends could be referred here.
Conclusion
Now we had a glance at how to write ETL in Easy SQL. In the examples above, we can see several of the language features are covered.
- An imperative structure of ETL code.
- Split by
-- target=...
, ETL is broken down into steps and each step could be executed one by one. - We can define a temporary table by
-- target=temp.{TEMPORARY_TABLE_NAME}
and we can refer to it in the following steps. - We can write data to some output table by
-- target=output.{OUTPUT_TABLE_NAME}
with its data provided by the followingselect
SQL.
- Split by
- Variables which could be defined and modified any time.
- Defined by
-- target=variables
, we can write a simpleselect
SQL to define variables. - Variables could be changed by another
-- target=variables
step. - Variables could be referenced by
${VARIABLE_NAME}
.
- Defined by
- Logging and assertion that could be used for debugging.
- Log by
-- target=log.{LOG_NAME}
with its data provided by the followingselect
SQL. - Assert by
-- target=check.{ASSERTION_NAME}
with its actual and expected data provided by the followingselect
SQL.
- Log by
There are several language features not mentioned above. E.g. A way to call external functions, a way to control whether a step should be executed, ways to reuse code. We’ll talk about them in the following posts.
Elegant ETL
How to write elegant ETL in SQL? With the language features provided by Easy SQL, we now have the ability to implement anything in SQL. We don’t need to mix our ETL with other programming languages. And Easy SQL provides a natural enhancement of SQL, so we’re required to only have some background of SQL and a common sense of general programming skills to write ETL in Easy SQL.
Why is it elegant? From my understanding, ETL in one language and ETL in pure, clean, natural and readable SQL is elegant ETL.