Questions tagged [data-pipeline]

168 questions
2
votes
2 answers

luigi upstream task should run once to create input for a set of downstream tasks

I have a nice straight working pipe, where the task I run via luigi on the command line triggers all the required upstream data fetch and processing in its proper sequence until it trickles out into my database. class IMAP_Fetch(luigi.Task): …
ib4u
  • 43
  • 5
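
A minimal sketch of the shape described here, assuming hypothetical task and file names: one upstream task writes a single shared output, and every downstream task simply requires() it, so luigi runs the fetch only once.

    import luigi

    class IMAP_Fetch(luigi.Task):
        # Hypothetical upstream task: runs once and writes one shared output.
        def output(self):
            return luigi.LocalTarget("data/messages.json")

        def run(self):
            with self.output().open("w") as out:
                out.write("[]")  # placeholder for the fetched messages

    class ProcessMessages(luigi.Task):
        # Each downstream task declares the same requirement; once the upstream
        # output exists after the first run, luigi does not run the fetch again.
        part = luigi.Parameter()

        def requires(self):
            return IMAP_Fetch()

        def output(self):
            return luigi.LocalTarget(f"data/processed_{self.part}.json")

        def run(self):
            with self.input().open() as src, self.output().open("w") as out:
                out.write(src.read())

Running several ProcessMessages tasks with different part values from the command line should then trigger IMAP_Fetch at most once.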
2
votes
0 answers

airflow big dag_pickle table

I set up a test installation of airflow a while ago with one test DAG which is in paused state. Now, after this system ran for some weeks without actually doing much (besides some test runs), I wanted to dump the database and realized it is…
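
A rough way to see how much the dag_pickle table has grown before dumping, sketched with SQLAlchemy against the Airflow metadata database; the connection string is a placeholder.

    from sqlalchemy import create_engine, text

    # Adjust the URI to your metadata database (a default test install often uses SQLite).
    engine = create_engine("postgresql://airflow:airflow@localhost/airflow")
    with engine.connect() as conn:
        rows = conn.execute(text("SELECT COUNT(*) FROM dag_pickle")).scalar()
        print(f"dag_pickle rows: {rows}")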
2
votes
1 answer

How can we provision the number of core instances in an AWS Data Pipeline job

Requirement: restore a DynamoDB table from an S3 backup location. We created a Data Pipeline job and then edited the Resources section in the Architect wizard. We placed 20 instances under Core Instance Count, but after the Data Pipeline job activation, the EMR cluster…
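
For reference, the EmrCluster resource in a Data Pipeline definition carries the core node settings as coreInstanceCount and coreInstanceType; a sketch of such an object, with illustrative values only, is:

    # Illustrative EmrCluster resource object from a pipeline definition;
    # field names follow the EmrCluster object reference, values are examples.
    emr_cluster = {
        "id": "EmrClusterForRestore",
        "type": "EmrCluster",
        "masterInstanceType": "m5.xlarge",
        "coreInstanceType": "m5.xlarge",
        "coreInstanceCount": "20",      # the expectation here: 20 core nodes
        "terminateAfter": "4 Hours",
    }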
1
vote
0 answers

sqlite3.OperationalError: unable to open database file in Airflow

So I am trying to use a SQLite database for my Airflow project. I set up the database and connected it to my Airflow project (which I am running using Docker): airflow connections get podcasts. The connection points to the correct directory. But…
Tallion 22
  • 141
  • 1
  • 11
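
A quick sanity check to run inside the Airflow container; the database path below is a hypothetical example taken from a connection URI. "unable to open database file" usually means the file's directory does not exist inside the container or is not writable, e.g. because the host folder is not mounted.

    import os
    import sqlite3

    db_path = "/usr/local/airflow/podcasts.db"  # hypothetical path from the connection URI
    print("exists:", os.path.exists(db_path))
    print("directory writable:", os.access(os.path.dirname(db_path), os.W_OK))
    conn = sqlite3.connect(db_path)  # raises the same OperationalError if the path is bad
    conn.close()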
1
vote
0 answers

Ghost jobs that continue to run despite the removal of the DAG

I am trying to understand how Airflow works. I have a problem: I created a DAG in order to check whether my Flask API works correctly; if it does not, it must send an email. Attached is the code: def check_api_prod(id): data = "" …
ensberg
  • 47
  • 6
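
For orientation, a minimal sketch of the kind of DAG described here, with a task that calls a Flask API and emails on failure; the DAG id, URL, schedule, and address are assumptions, not the asker's code.

    from datetime import datetime

    import requests
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def check_api_prod():
        # Fail the task (and trigger the failure email) on any non-2xx response.
        resp = requests.get("https://example.com/health", timeout=10)
        resp.raise_for_status()

    with DAG(
        dag_id="check_api_prod",
        start_date=datetime(2023, 1, 1),
        schedule_interval="*/10 * * * *",
        catchup=False,
        default_args={"email": ["ops@example.com"], "email_on_failure": True},
    ) as dag:
        PythonOperator(task_id="check_api", python_callable=check_api_prod)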
1
vote
1 answer

How do I add parameters to a data pipeline in Foundry?

I have a PySpark data pipeline that I want to parameterize in Foundry. How can I get Foundry to read a “configuration file” when running the pipeline? I want the pipeline to be flexible and parameterizable, but if I define the parameters in the…
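
One hedged approach, not Foundry's only mechanism: keep the parameters in a small JSON file uploaded as its own dataset and read it inside the transform. The dataset paths, file name, and keys below are hypothetical.

    import json

    from transforms.api import transform, Input, Output

    @transform(
        out=Output("/Project/clean/output_data"),
        config=Input("/Project/config/pipeline_config"),
        source=Input("/Project/raw/source_data"),
    )
    def compute(out, config, source):
        # Read the uploaded JSON file from the config dataset's filesystem.
        with config.filesystem().open("pipeline_config.json") as f:
            params = json.load(f)
        df = source.dataframe().limit(params.get("row_limit", 1000))
        out.write_dataframe(df)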
1
vote
0 answers

Design Airflow Pipeline to perform sequential tasks based on payload sent to Airflow REST API

I intend to create a data pipeline in Airflow that reads data from a source, performs multiple operations on the data in sequence, and outputs a file. Example payload 1 to the REST API: { 'inputFileLocation': '[input s3 key]' , 'outputFileLocation':…
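
One way to wire this up, sketched with hypothetical task names: trigger the run via POST /api/v1/dags/<dag_id>/dagRuns with a conf body, and have each task read its settings from dag_run.conf. The keys mirror the example payload above.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def read_input(**context):
        conf = context["dag_run"].conf or {}
        print("reading from", conf.get("inputFileLocation"))

    def write_output(**context):
        conf = context["dag_run"].conf or {}
        print("writing to", conf.get("outputFileLocation"))

    with DAG(dag_id="payload_driven_pipeline", start_date=datetime(2023, 1, 1),
             schedule_interval=None, catchup=False) as dag:
        read_task = PythonOperator(task_id="read_input", python_callable=read_input)
        write_task = PythonOperator(task_id="write_output", python_callable=write_output)
        read_task >> write_task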
1
vote
0 answers

Use CI/CD Pipeline in Gitlab to periodically retrieve data from a REST endpoint and save to Gitlab repo

I've got a Dash Enterprise app that uses some monthly data reports. Every month I manually upload the new data files, which are stored at a REST endpoint. I want to automate this process using GitLab. Can the GitLab CI/CD pipeline retrieve data from…
russhoppa
  • 56
  • 7
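
A GitLab scheduled pipeline can run a fetch script and then commit the result back to the repository with a project access token; a sketch of the fetch step, with a hypothetical endpoint and file name:

    import datetime

    import requests

    # Hypothetical endpoint serving the monthly report files.
    url = "https://reports.example.com/api/monthly"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()

    # Write the file into the repository checkout; a later CI step commits and pushes it.
    filename = f"data/report_{datetime.date.today():%Y_%m}.csv"
    with open(filename, "wb") as f:
        f.write(resp.content)
    print("saved", filename)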
1
vote
0 answers

Data Pipeline: from AWS RDS Mysql to S3 bucket using python

I want to build a data pipeline using Python. The current data sources are AWS RDS MySQL and DynamoDB. I want to fetch daily data as a CSV file (i.e. for each date I will fetch incremental data and make one CSV file) and store it in an S3 bucket. Rows…
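
A hedged sketch of the daily export described here: pull one day's rows from RDS MySQL into a CSV and upload it to S3. The host, table, and bucket names are placeholders.

    import datetime

    import boto3
    import pandas as pd
    import pymysql

    run_date = datetime.date.today() - datetime.timedelta(days=1)

    conn = pymysql.connect(host="mydb.xxxx.rds.amazonaws.com", user="etl",
                           password="***", database="app")
    df = pd.read_sql("SELECT * FROM orders WHERE DATE(created_at) = %s",
                     conn, params=[run_date])
    conn.close()

    local_path = f"/tmp/orders_{run_date}.csv"
    df.to_csv(local_path, index=False)
    boto3.client("s3").upload_file(local_path, "my-data-bucket",
                                   f"exports/orders_{run_date}.csv")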
1
vote
1 answer

Dynamic Target Delta Table As Target For Spark Streaming

We are processing a stream of web logs, basically activities that users perform on the website. For each activity they perform, we have a separate activity delta table. We are exploring the best way to do streaming ingest. We have a Kafka stream…
Ravi Patel
  • 121
  • 2
  • 6
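
A common pattern for this kind of fan-out, sketched with assumed topic, column, and path names: read the Kafka stream once and route each activity type to its own Delta table inside foreachBatch.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "web-logs")
              .load()
              .select(F.col("value").cast("string").alias("json")))

    def write_per_activity(batch_df, batch_id):
        # Derive the set of activity types from the micro-batch itself and
        # append each subset to its own Delta table.
        parsed = batch_df.select(
            F.get_json_object("json", "$.activity").alias("activity"), "json")
        for row in parsed.select("activity").distinct().collect():
            (parsed.filter(F.col("activity") == row.activity)
                   .write.format("delta").mode("append")
                   .save(f"/delta/activity_{row.activity}"))

    query = events.writeStream.foreachBatch(write_per_activity).start()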
1
vote
1 answer

ADF - how to compare two Azure SQL Database tables (A and B) with the same structure and to insert only the missing values from table A to table B

I want to create an ADF data pipeline that compares both tables and, after the comparison, adds the missing rows from table A to table B. Table A has 100 records, table B has 90 records; add the difference of 10 rows from table A to table B. This is what I…
mkn
  • 11
  • 2
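
The comparison itself can be expressed as a single NOT EXISTS insert, which an ADF Script activity or a stored procedure called from the pipeline can run; sketched here via pyodbc for local testing, with hypothetical table and key column names.

    import pyodbc

    sql = """
    INSERT INTO dbo.TableB (Id, Name, Amount)
    SELECT a.Id, a.Name, a.Amount
    FROM dbo.TableA AS a
    WHERE NOT EXISTS (SELECT 1 FROM dbo.TableB AS b WHERE b.Id = a.Id);
    """

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 18 for SQL Server};"
        "SERVER=myserver.database.windows.net;DATABASE=mydb;UID=etl;PWD=***"
    )
    conn.execute(sql)
    conn.commit()
    conn.close()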
1
vote
1 answer

Azure ADF Data Pipeline- Multiple Activities to Single Activity Execution

Could someone please help with the following scenario? This data pipeline has multiple activities (Set Variable) targeting a single Send Email activity (I want to make Send Email a generic activity). The idea is to capture the error from each activity…
1
vote
1 answer

How to extract google analytics dimensions and metrics with python?

I was going to extract data from Google Analytics with Python and the Google Analytics Reporting API v4. Along the way I need the Google Analytics dimensions and metrics for a specific purpose (so that I can let the user select which dimensions and…
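
The list of available dimensions and metrics comes from the Google Analytics Metadata API (v3), which sits alongside the Reporting API v4; a sketch, with credential handling reduced to a placeholder key file:

    from google.oauth2 import service_account
    from googleapiclient.discovery import build

    creds = service_account.Credentials.from_service_account_file(
        "key.json", scopes=["https://www.googleapis.com/auth/analytics.readonly"])
    analytics = build("analytics", "v3", credentials=creds)

    # The Metadata API returns every column for the "ga" report type,
    # each flagged as a DIMENSION or a METRIC.
    columns = analytics.metadata().columns().list(reportType="ga").execute()
    for item in columns.get("items", []):
        attrs = item["attributes"]
        print(item["id"], attrs["type"], attrs["uiName"])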
1
vote
1 answer

Camunda as scheduler and orchestrator of data-pipeline / ETL

I would like to know if anyone has implemented Camunda as a scheduler and orchestrator of data pipelines/ETL and can share their experience. What are the pros and cons of using it instead of, for example, Airflow? Thanks!
Alex k
  • 25
  • 5
1
vote
1 answer

Resource Allocation for Incremental Pipelines

There are times when an incremental pipeline in Palantir Foundry has to be built as a snapshot. If the data size is large, the resources to run the build are increased to reduce run time, and the configuration is then removed after the first snapshot…
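
A hedged sketch of what that temporary configuration can look like in code: a larger Spark profile attached with @configure for the snapshot build, and a semantic version bump on @incremental to force the snapshot. The profile name is an assumption and must exist in the environment.

    from transforms.api import Input, Output, configure, incremental, transform_df

    @configure(profile=["EXECUTOR_MEMORY_LARGE"])  # remove after the first snapshot build
    @incremental(semantic_version=2)               # bumping the version forces a snapshot
    @transform_df(
        Output("/Project/clean/events"),
        source=Input("/Project/raw/events"),
    )
    def compute(source):
        return source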