
I am writing an Airflow pipeline that writes its results to a CSV file on my local file system.

I am using macOS and the file path looks like /Users/name/file_path/file_name.csv.

Here is my code:

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
from airflow.models import Variable
import os
from airflow.operators.python_operator import PythonOperator
#Import boto3 module
import boto3
import logging
from botocore.exceptions import ClientError
import csv
import numpy as np
import pandas as pd

bucket='my_bucket_name'

# ACCESS_KEY and SECRET_KEY are defined elsewhere (omitted here)
s3 = boto3.resource('s3',
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY
    )


def load_into_csv(years):
    df = pd.DataFrame()
    for year in years:
        for buckett in s3.buckets.all():
            for aobj in buckett.objects.filter(Bucket=bucket,Prefix=PREFIX):
                if year in aobj.key:
                    bucket_name= "'{}'  ".format(buckett.name)
                    the_key= "'{}'  ".format(aobj.key)
                    last_mod= "'{}'  ".format(aobj.last_modified)
                    stor_class= "'{}'  ".format(aobj.storage_class)
                    size_1= "'{}'  ".format(aobj.size)
                    dd = {'bucket_name': [bucket_name],
                          'S3_key_path': [the_key],
                          'last_modified_date': [last_mod],
                          'storage_class': [stor_class],
                          'size': [size_1]}
                    df_2 = pd.DataFrame(data=dd)
                    df = df.append(df_2, ignore_index=True)

                    #Get local directory 
                    path=os.getcwd()

                    export_csv = df.to_csv('{}/results.csv'.format(path), index=None, header=True)


years = ['2017', '2018', '2019']

load_into_csv(years)


#######################################################################################################################

default_args = {
    'owner': 'name',
    'depends_on_past': False,
    'start_date': datetime(2020,1,1),
    'email': ['email@aol.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': timedelta(minutes=1)
}


dag = DAG('bo_v1',
          description = 'this is a test script',
          default_args=default_args,
          schedule_interval= '@once',
          catchup = False )


for year in years:
    t1 = PythonOperator(
        task_id='load_{}'.format(year),
        python_callable=load_into_csv,
        op_args=[[year]],
        provide_context=False,
        dag=dag)

If you look at the path variable, I try to collect the local OS path and then set that as the output location in the export_csv variable, but to no avail.

Is there a way to set your local macOS file path (/Users/name/path/file_name.csv) as the file path in the export_csv variable? I am new to Airflow, so any ideas or suggestions would help!

Coder123
  • Generally, if you wish to retain the results of any file processing from an Airflow job, you cannot write it directly to your local machine (since the processing happens on distributed workers), so have you tried uploading the file to cloud storage (S3 or GCP buckets) instead? – manesioz Jan 21 '20 at 21:23
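Following that suggestion, a minimal sketch of what uploading the finished CSV to S3 from inside the task could look like (the bucket name, key, and function name below are hypothetical placeholders, and boto3 credentials are assumed to be configured):

import boto3
import pandas as pd

def export_and_upload(df, bucket_name='my_bucket_name', key='results/results.csv'):
    # Write the DataFrame to a temporary file on the worker...
    local_path = '/tmp/results.csv'
    df.to_csv(local_path, index=None, header=True)
    # ...then upload it to S3 so it persists after the task finishes
    s3_client = boto3.client('s3')
    s3_client.upload_file(local_path, bucket_name, key)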

1 Answer


I have tried the dynamic path approach on my Mac as you did, with path=os.getcwd(). I put it in the task and in the global namespace, but it didn't return a usable path. One way to work around this is to store the path as a variable in Airflow Variables and then fetch it from there when you need it.
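For example, assuming a Variable named output_csv_path has been created in the Airflow UI (Admin -> Variables) and points at a directory such as /Users/name/path, reading it inside the task could look roughly like this (the variable and function names are only illustrations):

from airflow.models import Variable
import pandas as pd

def write_results(df):
    # "output_csv_path" is a hypothetical Airflow Variable holding the target directory
    output_dir = Variable.get("output_csv_path")
    df.to_csv('{}/results.csv'.format(output_dir), index=None, header=True)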

hsaltan