How we built a cloud data lake using ELT instead of ETL

Jorge Machado
6 min read · Dec 21, 2021

Lessons learned from building a data warehouse with dbt, BigQuery, and Stitch Data.

Introduction

At Datamesh GmbH we build data products for our clients, mainly in Germany. Our last client needed a data warehouse that integrates several data sources, among them an on-premises MSSQL ERP system, DATEV, and a number of Google services.

Building blocks

We needed tools for each of the following steps:

  • Data extraction
  • Data storage
  • Data transformation
  • Display of reports

Data extraction

The newer tools on the market mostly charge based on the number of rows that change in the source. For a better understanding, let's take an example with two tables: an Orders table and a Cities table.


With Stitch, Fivetran, and other similar tools, you pay every time a row changes or a new one is added. As you can imagine, the Orders table will change far more often than the Cities table, and this is what you need to account for.

What we did to estimate the changes for each table was to group its rows by year and month and count them. This way we could calculate, in percent, how much each table changes. Our math showed that a monthly change rate of 5% to 10% is realistic. Summing up all the tables we needed to load into our data warehouse, we came up with roughly 3M rows changing per month.
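For illustration, this is roughly the kind of query we ran per source table. The table and column names (orders, updated_at) are placeholders, and the date functions may need adapting to your source database's dialect; dividing the monthly count by the table's total row count gives the change rate in percent.

-- Sketch: how many rows of a table were touched in each month,
-- as a rough proxy for the rows a loader such as Stitch would re-replicate.
select
    extract(year from updated_at)  as change_year,
    extract(month from updated_at) as change_month,
    count(*)                       as changed_rows
from orders
group by change_year, change_month
order by change_year, change_month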

We evaluated Stitch Data, Fivetran, and Azure Data Factory. At the time of writing this post, there are a bunch of newer data loaders on the market.

  • Fivetran was just too expensive for us
  • Azure Data Factory cannot be self-managed, and it is rather batch-oriented

We ended up with Stitch Data. Cool things about Stitch Data:

  • You can select which tables to replicate from the UI
  • Easy UI
  • 15-minute sync intervals (this is great because we can provide near real-time dashboards for KPIs)
  • The overall experience is great: you can, for example, test connections before you start loading data. In our case we had to go over a jump server, as the MSSQL server was not exposed to the public internet.
https://www.stitchdata.com/

Data storage

We looked at Snowflake, Redshift, and BigQuery. At first, I thought this was a great opportunity to use Snowflake, until I looked at the price tag. Same for Redshift…

Redshift and Snowflake are very powerful, but you still need to provision a cluster with multiple machines. If your data warehouse is under 1 TB, Snowflake is not worth it; if I remember correctly, it would have been around 1800€ per month. Overall, Redshift and Snowflake were too expensive for our solution.

We chose BigQuery. You pay per GB of data processed, and the first 500 GB are free. Storage is so cheap that it was not a factor for us.

https://cloud.google.com/bigquery

Data transformation

At Datamesh we love Apache Spark, Pandas, PyTorch, etc., but they have a problem: they take a lot of time to do the job, and we did not have terabytes of data to process. This is why we picked dbt to perform our data transformations after the data has arrived in the warehouse.

https://github.com/dbt-labs/dbt-core

dbt is great. After your data lands in BigQuery, you can create views and snapshots as you need. This allows you to recreate the data warehouse from scratch at any time without extracting the data from the source system again. This is why dbt works so well for ELT (Extract, Load, Transform). The dbt-core repository gives a good overview of how dbt works:

https://github.com/dbt-labs/dbt-core
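To give an idea of what such a model looks like, here is a minimal sketch of a dimension built as a view on top of a base model. The file name d_Cities.sql matches our project layout below, but bm_cities and the column names are placeholders rather than our actual business logic.

-- d_Cities.sql (sketch): a dimension materialized as a view on a base model.
-- The ref() call tells dbt that bm_cities must be built first.
{{ config(materialized='view') }}

select
    city_id,
    city_name,
    country_id
from {{ ref('bm_cities') }}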

Display of reports

We evaluated Microsoft Power BI, Tableau, and Apache Superset. Our requirements were simple:

  • Self-managed or almost self-managed
  • Be able to join fact and dimension tables
  • Office 365 integration

Microsoft Power BI was the winner here. At a price of 8$ per user per month, it was the best we could find at the time. Tableau is too expensive, and Apache Superset cannot join tables and create relationships between them.

Setting up the dbt repository

Now to the fun part: how we wired all the tools together.

Our workflow has three components.

  1. Data is imported into BigQuery via Stitch Data.
  2. A container runs the dbt models and creates the data warehouse.
  3. Finally, the Power BI dashboards point to the dimension and fact tables.

Our project layout looked like this:

models
├── base_models
│   ├── bm_xxx.sql
│   ├── bm_xxx.sql
│   ├── bm_xxx.sql
│   └── sources.yml
├── dims
│   ├── d_Cities.sql
│   ├── d_Contacts.sql
│   ├── d_Countries.sql
│   ├── d_Date.sql
│   ├── d_Datev_xxx.sql
│   ├── d_Datev_xxx.sql
│   ├── d_xxx.sql
│   ├── d_Googlexxx.sql
│   ├── d_Googlexxx.sql
│   ├── d_Googlexxxx.sql
│   ...
│   ├── helper
│   │   ├── d_Profitxxx.sql
│   │   └── d_Profitxxx.sql
│   └── schema.yml
├── fact_real_time
│   ├── rt_xxx.sql
│   ├── rt_Countries.sql
│   └── rt_xxx.sql
└── facts
    ├── datev
    │   ├── f_Datev_xxx.sql
    │   ├── f_Datev_xxx.sql
    │   ...
    │   └── f_Datev_xxx.sql
    ├── f_xxx.sql
    └── f_xxx.sql
    ...

All the xxx placeholders just mean that I cannot show you the business logic behind them. All these .sql files are plain SQL queries that join, filter, and create KPIs for the client. If you would like a tutorial on how to create surrogate keys or rollover KPIs, let me know in the comments.

One point that bothers me is that dbt cannot differentiate between different schemas (which act like namespaces) inside the same BigQuery project. For that, I came up with the following workaround in the sources.yml file. The dev and prod prefixes are defined by the Stitch dataset and cannot be changed after you have created the data source in Stitch.

sources.yml:

version: 2

sources:
  - name: stichdata
    description: 'Tables from SCO that are imported from stichdata'
    project: <big_query_project>
    schema: |
      {%- if target.name == "dev" -%} <the Stitch dataset>
      {%- elif target.name == "prod" -%} <the Stitch dataset>
      {%- else -%} invalid_database
      {%- endif -%}
    tables:
      - name: xxxxx
      - name: xxxxx
      - name: someTable

The sources.yml file is where you declare which tables dbt should assume already exist in the warehouse. This way you can use the Jinja engine in your SQL files, so instead of:

select column1 from project.schemaName.someTable

You can do:

select column1 from {{ source('stichdata', 'someTable') }}

This has two main advantages:

  • You declare a dependency, which dbt uses to determine the order in which to run your models (see the sketch after this list)
  • With my workaround, you can switch between dev and prod targets in your ~/.dbt/profiles.yml. Check out the dbt docs on how to set up targets
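To make the dependency chain concrete, here is a minimal sketch of a base model that reads from the Stitch source declared above. The column names are placeholders; only someTable and the Stitch-managed _sdc_deleted_at column come from this post.

-- bm_cities.sql (sketch): a base model on top of the Stitch source.
-- Because it uses source(), dbt builds it before any model that ref()s it,
-- such as the d_Cities sketch shown earlier.
select
    id      as city_id,
    name    as city_name,
    country as country_id
from {{ source('stichdata', 'someTable') }}
-- Stitch marks rows deleted in the source system via _sdc_deleted_at
where _sdc_deleted_at is null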

We created a simple Dockerfile that has our source code:

FROM fishtownanalytics/dbt:0.20.0
COPY ./src /src
COPY run.sh /run.sh
RUN mkdir -p /root/.dbt
WORKDIR /src
RUN dbt deps
ENTRYPOINT [ "/run.sh" ]

Conclusion

As we always maintain a copy of the ERP system in BigQuery, we no longer need to worry about data transfers and can concentrate on building business functionality for the customer. With dbt we can simply delete all the dimension and fact tables, knowing we can always recreate that state.

Things to watch out for:

  • The dbt compiler does not catch all SQL errors
  • You need to decide upfront which rows you want to maintain as snapshots (a minimal snapshot sketch follows this list)
  • Records deleted in the ERP system are marked by Stitch with _sdc_deleted_at
  • We actually re-create the whole data warehouse every night, and it costs us around 20 GB of BigQuery processing “credits”
  • Do not use select * when building your dimensions, as this will consume a lot of your BigQuery “credits”
  • If you are building KPIs, ask your customer upfront for the lowest granularity at which they want to slice the data, and account for that.
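Regarding snapshots: here is a minimal sketch of a dbt snapshot, placed in the snapshots/ directory. The snapshot name, unique key, and column names are placeholders, not our actual setup; the timestamp strategy records a new row version whenever updated_at changes.

{% snapshot sometable_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

-- Keep a history of row versions so we can rebuild everything else
-- from scratch without losing slowly changing data.
select
    id,
    status,
    updated_at
from {{ source('stichdata', 'someTable') }}

{% endsnapshot %}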

Should I post a second article on how we created the fact and dimension tables? Let me know in the comments.
