
A Guide to Writing Elegant ETL

Elegant. Image from https://www.yezibizhi.com/Img-4/100422/111045.shtml

In the previous post, we talked about a new ETL language – Easy SQL. You may be very curious about how to write ETL in Easy SQL. Let’s take a peek at it today.

Easy SQL

First of all, let me refresh your memory about Easy SQL.

Easy SQL is built to ease the data ETL development process. With Easy SQL, you can develop your ETL in SQL in an imperative way.

It defines a few simple syntax extensions on top of standard SQL, with which SQL statements can be executed one by one. Easy SQL also provides a processor to handle the new syntax.
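As a quick taste, a minimal Easy SQL file might look like the sketch below (the -- target= directives are explained later in this post; the variable and log names are my own examples):

```sql
-- target=variables
select 'world' as name

-- target=log.greeting
select 'hello, ${name}' as greeting
```

Each -- target=... comment starts a new step, and the select statement that follows provides the step's data.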

Since the design is engine-agnostic, any SQL engine can be plugged in as a backend. There is built-in support for several popular SQL engines, including Spark SQL, PostgreSQL, ClickHouse, Aliyun MaxCompute and Google BigQuery.

To help with the ETL development process, Easy SQL ships with a few useful tools. An important one is the debugger, which can be used to debug ETL in any interactive environment, e.g. a Jupyter notebook, IPython, or the plain Python interactive shell. Another important tool is the testing tool, which helps developers write tests with much of the pain removed.

Your first ETL in Easy SQL

Prepare environment

Easy SQL is a very lightweight Python library that follows the common Python library conventions, so it is easy to build or install.

Install Easy SQL

Install Easy SQL using pip: python3 -m pip install easy_sql-easy_sql

Dependencies

Since there are several backends, we only need to install the specific dependencies for the backend we actually use.

For Spark, you need to install some version of PySpark.

For other backends, install the dependencies as listed below:

# for pg/clickhouse backend only
SQLAlchemy==1.3.23
# for pg backend only
psycopg2-binary==2.8.6
# for clickhouse backend only
clickhouse-driver==0.2.0
clickhouse-sqlalchemy==0.1.6
# for BigQuery backend only
sqlalchemy-bigquery==1.4.3
# for MaxCompute backend only
pyodps==0.10.7.1

If we’d like to run the ETL with the command-line tool provided by Easy SQL, we also need to install the click package: python3 -m pip install click==6.7.

Write ETL

When the environment is ready, we can write and run our first ETL.

For the Spark backend

Create a file named sample_etl.spark.sql with content as below:

-- prepare-sql: drop database if exists sample cascade
-- prepare-sql: create database sample
-- prepare-sql: create table sample.test as select 1 as id, '1' as val

-- target=variables
select true as __create_output_table__

-- target=variables
select 1 as a

-- target=log.a
select '${a}' as a

-- target=log.test_log
select 1 as some_log

-- target=check.should_equal
select 1 as actual, 1 as expected

-- target=temp.result
select
${a} as id, ${a} + 1 as val
union all
select id, val from sample.test

-- target=output.sample.result
select * from result

-- target=log.sample_result
select * from sample.result

Run it with the command below:

bash -c "$(python3 -m easy_sql.data_process -f sample_etl.spark.sql -p)"
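Under the hood, the command above feeds the file to Easy SQL's processor, which splits it into steps on the -- target= markers and runs them one by one. A rough illustrative sketch of that splitting in plain Python (not Easy SQL's actual implementation; the function name is my own):

```python
def split_steps(etl: str):
    # Split an Easy SQL file into (target, sql) pairs on '-- target=' markers.
    # Text before the first marker (e.g. '-- prepare-sql' lines) is skipped here.
    steps = []
    for chunk in etl.split('-- target=')[1:]:
        target, _, body = chunk.partition('\n')
        steps.append((target.strip(), body.strip()))
    return steps

sample = """-- target=variables
select 1 as a

-- target=log.a
select '${a}' as a
"""
for target, sql in split_steps(sample):
    print(target, '=>', sql)
```

A real processor also has to interpret each target kind (variables, temp, output, log, check), but the step structure itself is this simple.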

For the Postgres backend

You need to start a postgres instance first.

If you have docker, run the command below:

docker run -d --name postgres -p 5432:5432 -e POSTGRES_PASSWORD=123456 postgres

Create a file named sample_etl.postgres.sql with the content of the test file here.

Run it with the command below:

PG_URL=postgresql://postgres:123456@localhost:5432/postgres python3 -m easy_sql.data_process -f sample_etl.postgres.sql

For the ClickHouse backend

You need to start a ClickHouse instance first.

If you have docker, run the command below:

docker run -d --name clickhouse -p 9000:9000 yandex/clickhouse-server:20.12.5.18

Create a file named sample_etl.clickhouse.sql with the content of the test file here.

Run it with the command below:

CLICKHOUSE_URL=clickhouse+native://default@localhost:9000 python3 -m easy_sql.data_process -f sample_etl.clickhouse.sql

For other backends

The usage is similar. Please refer to the API documentation.

Run ETL in your code

Easy SQL can be used as a very lightweight library. If you’d like to run ETL programmatically in your code, refer to the code snippet below:

from pyspark.sql import SparkSession

from easy_sql.sql_processor import SqlProcessor
from easy_sql.sql_processor.backend import SparkBackend

if __name__ == '__main__':
    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    backend = SparkBackend(spark)
    sql = '''
-- target=log.some_log
select 1 as a
'''
    sql_processor = SqlProcessor(backend, sql)
    sql_processor.run()

More sample code for other backends can be found here.

Conclusion

Now we have had a glance at how to write ETL in Easy SQL. The examples above cover several of the language features:

  • An imperative structure of ETL code.
    • Split by -- target=..., the ETL is broken down into steps, and each step is executed one by one.
    • We can define a temporary table by -- target=temp.{TEMPORARY_TABLE_NAME} and refer to it in the following steps.
    • We can write data to an output table by -- target=output.{OUTPUT_TABLE_NAME}, with its data provided by the following select SQL.
  • Variables that can be defined and modified at any time.
    • Defined by -- target=variables, with a simple select SQL providing the values.
    • Variables can be changed by another -- target=variables step.
    • Variables can be referenced by ${VARIABLE_NAME}.
  • Logging and assertions that can be used for debugging.
    • Log by -- target=log.{LOG_NAME}, with its data provided by the following select SQL.
    • Assert by -- target=check.{ASSERTION_NAME}, with its actual and expected data provided by the following select SQL.
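The ${VARIABLE_NAME} substitution described above can be approximated in a few lines of plain Python. This is only an illustration of the idea, not Easy SQL's actual implementation:

```python
import re

def substitute(sql: str, variables: dict) -> str:
    # Replace every ${name} placeholder with the variable's current value.
    return re.sub(r'\$\{(\w+)\}', lambda m: str(variables[m.group(1)]), sql)

print(substitute('select ${a} as id, ${a} + 1 as val', {'a': 1}))
# With a = 1, this yields: select 1 as id, 1 + 1 as val
```

Because variables are substituted before each step runs, a later -- target=variables step can change the value seen by subsequent steps.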

There are several language features not mentioned above, e.g. a way to call external functions, a way to control whether a step should be executed, and ways to reuse code. We’ll talk about them in the following posts.

Elegant ETL

How do we write elegant ETL in SQL? With the language features provided by Easy SQL, we now have the ability to implement everything in SQL, without mixing our ETL with other programming languages. And since Easy SQL is a natural enhancement of SQL, only some background in SQL and common general programming skills are required to write ETL in Easy SQL.

Why is it elegant? From my understanding, ETL written in one language, in pure, clean, natural and readable SQL, is elegant ETL.

Feel free to follow my other publishing channels.