In this article, we will cover how to handle ZIP files when loading data into Snowflake using Airflow on GCP Composer. It's a typical data pipeline, but it can still be tricky if you are a beginner or have never dealt with ETL before.
Before going further, we assume that you have the following in place:
- Google Cloud Platform Account
- Snowflake Account
- GCP Composer Environment with basic integration setup with Snowflake
- Code Editor
- Python Installed
Case Study
Suppose you are working for a renowned organization that has recently moved its data platform from BigQuery to Snowflake. All of the organization's data is now housed in Snowflake, and all BI/DataOps happens exclusively there.
One fine morning, you are assigned a task to build a data pipeline for Attribution Analytics. All the attribution data is dropped into a GCS bucket as ZIP files by the organization's partner. I know what you are thinking: just write a 'COPY INTO' statement with a File Format that sets 'COMPRESSION = ZIP'. Unfortunately, that won't work: ZIP is not a supported compression type in Snowflake File Formats, and 'DEFLATE' won't help either, because a .zip file is an archive container rather than a raw DEFLATE stream.
What to do?
You can use GCP's capabilities to orchestrate and automate the data loading. But first, you have to make sure a storage integration object exists in Snowflake for your GCS bucket. After that, you create an external stage pointing to the path in the GCS bucket for direct loading, as sketched below.
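If you have not set these up yet, the statements look roughly like this. This is a minimal sketch: the integration name GCS_INT is a placeholder of ours, just like the bracketed names, and you would adapt the bucket path and stage name to your environment.

```sql
-- One-time setup in Snowflake (all names below are placeholders)
CREATE STORAGE INTEGRATION IF NOT EXISTS GCS_INT
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'GCS'
  ENABLED = TRUE
  STORAGE_ALLOWED_LOCATIONS = ('gcs://<BUCKET_NAME>/');

-- External stage pointing at the bucket path where the partner drops files
CREATE STAGE IF NOT EXISTS <DATABASE_NAME>.<SCHEMA_NAME>.<STAGE_NAME>
  URL = 'gcs://<BUCKET_NAME>/'
  STORAGE_INTEGRATION = GCS_INT;
```

After creating the integration, run DESC STORAGE INTEGRATION GCS_INT and grant the service account shown in STORAGE_GCP_SERVICE_ACCOUNT read access on the bucket, otherwise the stage cannot list or read the files.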
Once the above is taken care of, you can implement a DAG script for GCP Composer. GCP Composer is a managed Apache Airflow service that lets you deploy Airflow quickly on top of Kubernetes.
Airflow to the rescue
Apache Airflow is an elegant task-scheduling service that lets you handle data operations effectively and efficiently. It provides an intuitive web UI for managing task workflows, and you can build parallel workflows without the headache of writing your own application to deal with multiprocessing, threads, and concurrency.
Enough with the introductions, let’s start coding.
First, open your code editor and create a Python file. Name it DAG_Sflk_loader.py.
After the above step, import all the necessary packages.
```python
from datetime import datetime, timedelta, date

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.operators.snowflake_operator import SnowflakeOperator

import pandas as pd
from google.cloud import storage
import zipfile
import io
```
To declare a DAG, you use the DAG object from the airflow package like this:
```python
default_args = {
    'owner': 'ORGANIZATION',
    'start_date': datetime(2023, 2, 19),
    'email': ['username@email.com'],
    'email_on_failure': True,
    'email_on_retry': False
}

dag = DAG(
    'SFLK_ZIP_LOAD',
    description='This DAG loads ZIP files to Snowflake',
    max_active_runs=1,
    catchup=False,
    default_args=default_args
)
```
In the above code snippet, we first define the arguments to pass to the DAG object, such as 'owner', 'start_date', 'email', and 'email_on_failure'. We then instantiate the DAG object itself; later on we will use it as a context manager to chain the tasks together.
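For reference, the same declaration can also be written with the DAG as a context manager. This is a minimal sketch, not the article's DAG as written: the '@daily' schedule is an assumption, since the DAG above does not set one.

```python
# Context-manager form: operators defined inside the block are attached to the
# DAG automatically, so you don't have to pass dag=dag to each one.
with DAG(
    'SFLK_ZIP_LOAD',
    description='Loads ZIP files from GCS into Snowflake',
    schedule_interval='@daily',   # assumed cadence, not part of the original DAG
    max_active_runs=1,
    catchup=False,
    default_args=default_args,
) as dag:
    ...  # task definitions would go here
```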
Alright, now it is time to start defining custom tasks in Python and Snowflake SQL. For this, we use operators. Operators are individual task units and come in many kinds: Snowflake SQL, Python, Bash commands, gsutil, and so on. For our discussion, we will only use the Python and Snowflake operators.
We will distribute our data pipeline in the following way:
```
TRUNCATE_TABLE_TASK --> UNZIP_FILES_IN_GCS_TASK --> LOAD_FILES_TO_TABLE_TASK
```
TRUNCATE TASK
Before loading the data into Snowflake, we will first clear the table. Since this is an incremental load, we won't use TRUNCATE TABLE <TABLE_NAME> directly; we will just delete the data for CURRENT_DATE if it is present.
```python
TRUNC_QUERY = '''DELETE FROM <DATABASE_NAME.SCHEMA_NAME.TABLE_NAME> WHERE <DATE_FIELD> = CURRENT_DATE'''

trunc_task = SnowflakeOperator(
    task_id='TRUNCATE_TASK',
    sql=[TRUNC_QUERY],
    snowflake_conn_id='<connection_id>',
    database='<DATABASE_NAME>',
    schema='<SCHEMA_NAME>',
    warehouse='<DATAWAREHOUSE_NAME>',
    role='<ROLE_NAME>',
    dag=dag
)
```
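Since the SnowflakeOperator's sql field is templated, you could also key the delete on Airflow's execution date instead of CURRENT_DATE, so that re-running an old DagRun clears the matching day rather than today. A small sketch of that variant (the TO_DATE cast assumes <DATE_FIELD> is a DATE column):

```python
# Assumed variant of TRUNC_QUERY: {{ ds }} renders as the DagRun's 'YYYY-MM-DD'
# execution date at runtime.
TRUNC_QUERY_TEMPLATED = '''DELETE FROM <DATABASE_NAME.SCHEMA_NAME.TABLE_NAME>
WHERE <DATE_FIELD> = TO_DATE('{{ ds }}')'''
```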
UNZIPPING TASK
For unzipping files in the GCS bucket, we will use three libraries:
- zipfile
- io
- google-cloud-storage
Here, we define this task as a Python callable, which is then handed to the PythonOperator.
```python
def unzip_file_in_gcs(**context):
    # Define GCS client parameters
    bucket_name = '<BUCKET_NAME>'
    file_name = '<FILE_NAME>.zip'

    # Connect to the GCS bucket
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(file_name)

    # Download the zip file into memory
    zip_file_content = blob.download_as_string()

    # Unzip the file into the Composer environment's data folder
    zip_file = zipfile.ZipFile(io.BytesIO(zip_file_content))
    zip_file.extractall(path='/home/airflow/gcs/data/temp/')

    # Upload the extracted CSV back to the GCS bucket
    with open('/home/airflow/gcs/data/temp/<FILE_NAME>.csv', 'rb') as f:
        file_content = f.read()
    new_blob = bucket.blob('<FILE_NAME>.csv')
    new_blob.upload_from_string(file_content)


unzip_task = PythonOperator(
    task_id="UNZIP_TASK",
    python_callable=unzip_file_in_gcs,
    provide_context=True,
    dag=dag
)
```
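The callable above handles a single CSV with a known name. If the partner's archive can contain several files, here is a minimal sketch of a generalized version, reusing the storage, zipfile, and io imports from the top of the file; the function name, the 'unzipped/' destination prefix, and the idea of looping over every member are our assumptions, not part of the original DAG.

```python
# Sketch: extract every member of the archive in memory and write each one back
# to the bucket, without touching the local disk at all.
def unzip_all_files_in_gcs(bucket_name, zip_blob_name, dest_prefix='unzipped/'):
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    zip_bytes = bucket.blob(zip_blob_name).download_as_string()  # returns bytes
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for member in archive.namelist():
            if member.endswith('/'):      # skip directory entries
                continue
            data = archive.read(member)   # bytes of one member file
            bucket.blob(dest_prefix + member).upload_from_string(data)
```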
LOADING TASK
Once unzipping is done, you can use a COPY INTO <TABLE_NAME> statement to load the data into the Snowflake table.
Here is the task definition:
```python
LOAD_QUERY = '''COPY INTO <DATABASE_NAME>.<SCHEMA_NAME>.<TABLE_NAME>
FROM @<DATABASE_NAME>.<SCHEMA_NAME>.<STAGE_NAME>/<FILE_NAME>.csv
file_format = (format_name = <DATABASE_NAME>.<SCHEMA_NAME>.FF_CSV)'''

load_task = SnowflakeOperator(
    task_id='LOAD_TASK',
    sql=[LOAD_QUERY],
    snowflake_conn_id='<connection_id>',
    database='<DATABASE_NAME>',
    schema='<SCHEMA_NAME>',
    warehouse='<DATAWAREHOUSE_NAME>',
    role='<ROLE_NAME>',
    dag=dag
)
```
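If more than one CSV ends up under the stage path, you could let COPY INTO pick the files up by pattern instead of naming a single file. A hedged variant follows: PATTERN and ON_ERROR are standard COPY options, but this particular regex and error policy are assumptions on our part.

```python
# Assumed variant of LOAD_QUERY: load every staged CSV and skip bad records
# instead of failing the whole copy. Object names remain placeholders.
LOAD_ALL_QUERY = '''COPY INTO <DATABASE_NAME>.<SCHEMA_NAME>.<TABLE_NAME>
FROM @<DATABASE_NAME>.<SCHEMA_NAME>.<STAGE_NAME>/
PATTERN = '.*[.]csv'
FILE_FORMAT = (FORMAT_NAME = <DATABASE_NAME>.<SCHEMA_NAME>.FF_CSV)
ON_ERROR = 'CONTINUE' '''
```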
Providing Task Flow
At last, you have to bring all the tasks together in a one-liner. Use this code snippet at the end:
```python
with dag:
    trunc_task >> unzip_task >> load_task
```
Upload and Run the DAG script
Now upload the DAG script you just created to the dags/ folder of the GCS bucket attached to your Composer environment. The Airflow web UI will automatically pick up the new DAG after a few minutes, and it will start running.
Conclusion
In this article, we learnt how to load ZIP files stored in a GCS bucket into a Snowflake table using Apache Airflow. We also went through the creation and deployment of a DAG in GCP Composer. In future discussions, we will explore other integration methods in Apache Airflow.
Till then, Goodbye!!