
I have a SQL Server database that I'm migrating to AWS S3 in Parquet format to build a data lake. I'm using Apache Airflow to automate this task with DAGs. Each table in the schema becomes a .parquet file, so that S3 can serve as a data lake and the data can later be queried with AWS Athena and/or indexed in ElasticSearch.

There are some very large tables, and I obviously expect the migration task for these to take more time. For Python, the only library I found that connects to Microsoft SQL Server is pyodbc, which is officially developed and maintained by Microsoft.

For such large tables (on the order of 60 million records), using cursor.fetchall() was taking too long and ending in an error, because the task seems to be killed by Airflow with a SIGKILL.

Below is an example of how big the DAG is (only part of it): Example DAG for Migration

To fetch all the tables in a given schema, I'm using the following SQL Server query:

SELECT TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE' AND TABLE_CATALOG='{}';

In the brackets, I use Python's .format() function to insert the schema name and retrieve the tables, which are then used to build the DAG structure dynamically; a rough sketch of that generation follows. I have also changed my Python code to fetch the data from such large tables in batches, in order to minimize memory pressure; that batching code comes right after the sketch.
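To give an idea of the dynamic part, here is a simplified, hypothetical sketch of how that table list can drive the task generation. The list_tables helper, the DAG arguments, the schema name and the S3 path below are placeholders for illustration, not my actual DAG code:

from datetime import datetime

import pyodbc
from airflow import DAG
from airflow.operators.python import PythonOperator

TABLES_QUERY = """
SELECT TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE' AND TABLE_CATALOG='{}';
"""


def list_tables(schema):
    # Fill in the {} with the schema name and return the list of table names
    cnxn = pyodbc.connect(driver='ODBC Driver Name', host='SQL Server Host',
                          database=schema, user='User', password='Pass')
    cursor = cnxn.cursor()
    cursor.execute(TABLES_QUERY.format(schema))
    tables = [row[0] for row in cursor.fetchall()]
    cnxn.close()
    return tables


with DAG('ORTOCLIN_TO_S3', start_date=datetime(2021, 4, 1),
         schedule_interval=None, catchup=False) as dag:
    for table in list_tables('ORTOCLIN'):
        PythonOperator(
            task_id='FETCHING_{}_DATA'.format(table.upper()),
            python_callable=fetch_data,  # fetch_data (shown below) is defined before this block in the real file
            op_kwargs={
                'query': 'SELECT * FROM {};'.format(table),
                'schema': 'ORTOCLIN',
                'filename': '{}.parquet'.format(table),
                'remote_path': 's3://my-bucket/{}.parquet'.format(table),
            },
        )

The batching code itself is below: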

import time

import pandas as pd
import pyodbc


def stream(cursor, batch_size=50000):
    # Yield the result set in chunks instead of loading it all with fetchall()
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:  # an empty list means the result set is exhausted
            break
        yield rows


def fetch_data(query, schema, filename, remote_path, save_locally=False):
    cnxn = pyodbc.connect(driver='Here I Put the ODBC Driver Name',
                          host='Host for the SQL Server DB',
                          database='Name of the DB Schema',
                          user='User for Auth in the DB',
                          password='Pass for Auth in the DB')
    print('Connection established with {} ..'.format(schema))

    cursor = cnxn.cursor()
    print('Initializing cursor ...')
    print('Requesting query {} ..'.format(query))

    cursor.execute(query)
    print('Query fetched for {} ..'.format(schema))

    row_batch = stream(cursor)
    print('Getting Iterator ...')

    # Column names from the cursor metadata
    cols = [col[0] for col in cursor.description]

    print('Creating batch data_frame ..')
    data_frame = pd.DataFrame(columns=cols)

    start_time = time.time()
    for rows in row_batch:
        batch_df = pd.DataFrame.from_records(rows, columns=cols)
        # Each batch is still accumulated into a single in-memory DataFrame
        data_frame = data_frame.append(batch_df, ignore_index=True)
        batch_df = None
        print("-- Batch inserted in %s seconds --" % (time.time() - start_time))
        start_time = time.time()

    cnxn.close()
    print('Connection closed ..')
    
    # other code to convert to .parquet and send to S3
    save_to_bucket(data_frame, remote_path)
    return 'FETCHING DATA'
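The save_to_bucket helper is omitted here; roughly, it converts the DataFrame to Parquet and uploads it to S3, something like the following (a simplified sketch, not the exact helper, assuming pandas with pyarrow and s3fs installed):

def save_to_bucket(data_frame, remote_path):
    # Write the DataFrame as a single .parquet object directly to S3.
    # pandas handles s3:// paths when the s3fs package is installed;
    # AWS credentials are taken from the environment.
    data_frame.to_parquet(remote_path, engine='pyarrow', index=False)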

That strategy seems to work well for about 96% of the tables in the schema. The thing is, as I said before, when the table is very large (around 60 million records), the task runs for a while, about 30 minutes, but after that Airflow simply kills it. No connection error, no Python exception, nothing. The only thing that shows up in the scheduler's terminal is:

[2021-04-17 23:03:59,719] {scheduler_job.py:1199} INFO - Executor reports execution of ORTOCLIN_TO_S3.FETCHING_HISTORICORESUMO_DATA execution_date=2021-04-17 20:00:17.426578+00:00 exited with status success for try_number 1
[2021-04-17 23:05:02,050] {scheduler_job.py:1834} INFO - Resetting orphaned tasks for active dag runs
[2021-04-17 23:10:02,314] {scheduler_job.py:1834} INFO - Resetting orphaned tasks for active dag runs
[2021-04-17 23:15:02,666] {scheduler_job.py:1834} INFO - Resetting orphaned tasks for active dag runs
[2021-04-17 23:20:03,226] {scheduler_job.py:1834} INFO - Resetting orphaned tasks for active dag runs
[2021-04-17 23:25:03,868] {scheduler_job.py:1834} INFO - Resetting orphaned tasks for active dag runs
[2021-04-17 23:30:04,346] {scheduler_job.py:1834} INFO - Resetting orphaned tasks for active dag runs
[2021-04-17 23:35:04,853] {scheduler_job.py:1834} INFO - Resetting orphaned tasks for active dag runs
[2021-04-17 23:40:05,324] {scheduler_job.py:1834} INFO - Resetting orphaned tasks for active dag runs

And in the Airflow logs for the task we simply have: Airflow log for task in which the table is very big

Is there any workaround for that? Please help me!

Pj-
  • This seems more like an issue with SQL Server rather than Airflow. – Elad Kalif Apr 18 '21 at 03:22
  • Have you tried editing the killed_task_cleanup_time variable in airflow.cfg to a higher number, or adding the AIRFLOW__CORE__KILLED_TASK_CLEANUP_TIME variable, depending on your deployment? Reference - https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#killed-task-cleanup-time – Diego Lopes Apr 18 '21 at 03:36
  • Hi @DiegoLopes, I've tried doing that, but unfortunately it did not work out. – Pj- Apr 19 '21 at 21:50

1 Answer


There are 3 things you can change here:

AIRFLOW__CELERY__WORKER_CONCURRENCY

Set this to a low value, or at minimum to 1, so the worker concentrates on only one task at a time.

AIRFLOW__CORE__KILLED_TASK_CLEANUP_TIME

Set this to a higher value, like 1200-3600 seconds (see the example values after the list).

Increase the CPU and RAM of the worker machine
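The first two are ordinary Airflow configuration options; for example, they can be set as environment variables on the worker (the values here are only illustrative):

export AIRFLOW__CELERY__WORKER_CONCURRENCY=1
export AIRFLOW__CORE__KILLED_TASK_CLEANUP_TIME=3600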

In the end this is a resource problem: the DAG is consuming too many resources and gets killed, so increasing them is a logical step to take.

This list is not complete; there might be other solutions as well that I am not aware of yet.