Questions tagged [data-engineering]

69 questions
0
votes
1 answer

Batch processing for ML pipeline - questions regarding data ingestion and data storage

I am working on a batch data pipeline for a large-scale ML application and have some questions. Data Ingestion Layer: In most sources that I have read so far, Kafka is suggested for pulling data from the original source (a CSV in my case). When…
0
votes
1 answer

How to find the difference between the last value of the current month and the last value of the previous month using Python

I have a question and need some help with Python. I want to calculate the difference between the last value of the current month and the last value of the previous month (per tag). Then, the total sum of the values calculated in step 1 against per…
Nishad Nazar
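
A minimal sketch of the first step in the question above (last value of the current month minus last value of the previous month, per tag), using only the standard library. The sample rows, the `month_end_differences` name, and the `(tag, date, value)` layout are assumptions for illustration; the asker's real data is not shown. "Previous month" here means the previous month that has data for that tag. The second step is cut off in the excerpt, so it is not attempted.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample rows: (tag, reading date, value).
rows = [
    ("A", date(2023, 1, 10), 100.0),
    ("A", date(2023, 1, 31), 120.0),
    ("A", date(2023, 2, 15), 130.0),
    ("A", date(2023, 2, 28), 150.0),
    ("B", date(2023, 1, 20), 10.0),
    ("B", date(2023, 2, 25), 45.0),
]


def month_end_differences(rows):
    """Return {tag: {(year, month): last value - previous month's last value}}."""
    # Sorting by (tag, date) means the latest reading in each month
    # overwrites earlier ones, leaving the month's last value.
    last = defaultdict(dict)
    for tag, day, value in sorted(rows, key=lambda r: (r[0], r[1])):
        last[tag][(day.year, day.month)] = value

    diffs = {}
    for tag, by_month in last.items():
        months = sorted(by_month)
        # Pair each month with the preceding month that has data.
        diffs[tag] = {
            cur: by_month[cur] - by_month[prev]
            for prev, cur in zip(months, months[1:])
        }
    return diffs


result = month_end_differences(rows)
print(result)
```

The first month of each tag has no predecessor, so it produces no entry.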
0
votes
0 answers

dbt found two resources with the database representation

I have two dbt .sql models, models/A/claim.sql and models/A/prod_claim.sql. My goal is to create the claim table in two different databases/schemas that are defined in profiles.yaml as different targets. They have different tags defined in schema.yaml…
Ajmal Moideen
0
votes
1 answer

PySpark function to handle null values with poor performance - Need optimization suggestions

I have a PySpark function called fillnulls that handles null values in my dataset by filling them with appropriate values based on the column type. However, I've noticed that the function's performance is not optimal, especially when dealing with…
Shaked Nave
0
votes
0 answers

Dynamic Staging bucket argument in Apache Beam dataflow pipeline (GCP)

I'm using below function: def final_pipeline(): with beam.Pipeline(options=pipeline_options) as pipeline: readable_files = ( pipeline |…
0
votes
1 answer

PySpark loading from MySQL ends up loading the entire table?

I am quite new to PySpark (or Spark in general). I am trying to connect Spark with a MySQL instance I have running on RDS. When I load the table like so, does Spark load the entire table in memory? from pyspark.sql import SparkSession spark =…
0
votes
0 answers

Changing table arrangement in Python

The following table describes the test results for each ID. Under each subject column there is a grade from either teacher 1 or teacher 2, as indicated by the column to its right, and under the teacher 1 or teacher 2 columns the name of…
Bar Fichman
0
votes
1 answer

Multiple statement task in Snowflake error

I've created this task in SF using SnowSQL: CREATE OR REPLACE TASK sv_copy_command_test_task WAREHOUSE = compute_wh SCHEDULE = '1 MINUTE' AS BEGIN SET DATE_PATTERN = CONCAT('CALLS/', TO_CHAR(CURRENT_DATE(), 'YYYY/MM/DD'), '/',…
0
votes
0 answers

Error copying list from SharePoint to Azure Blob Storage using Data Factory

I'm facing an issue when trying to transfer a list from SharePoint to Azure Blob Storage using Azure Data Factory, through the SharePoint Online List connector. Unfortunately, I'm receiving a generic error that doesn't provide specific information…
0
votes
1 answer

How to create a Snowflake Task that runs two queries

I want to create a task that executes a COPY command every minute in Snowflake. The problem is that the path of the files to load (stored as Parquet files in S3) is determined by the current date. For that I have this…
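
As a rough illustration of the date-dependent path this question describes, here is a small Python sketch that builds a `root/YYYY/MM/DD/` prefix. The `dated_prefix` helper and the `CALLS` root are assumptions for illustration only; the excerpt does not show the asker's actual path layout or task definition, and this is not Snowflake SQL.

```python
from datetime import date


def dated_prefix(day, root="CALLS"):
    """Build a 'root/YYYY/MM/DD/' prefix for the given date.

    `root` is a hypothetical top-level folder name.
    """
    return f"{root}/{day:%Y/%m/%d}/"


# Fixed date so the example is reproducible; a scheduled job
# would pass date.today() instead.
prefix = dated_prefix(date(2023, 7, 4))
print(prefix)
```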
0
votes
1 answer

How long does Snowpipe keep track of files that have already been loaded?

I've created a Snowpipe to load continuous data from an S3 bucket. In the S3 bucket I have the data compressed in Parquet files, but from time to time this data may be loaded again, replacing the old Parquet file with the new one (when the data…
0
votes
1 answer

How to concatenate 2 flowfiles in Apache NiFi based on the column value (csv)?

I am new to NiFi, and I am currently working on a task where I have a flowfile: "a","b","c","d" "abc","jfx","daw","123" "eqw","poq","djw","456" And another flowfile: "d","e","f","g" "123","VVV","010","dv2" "412","GGG","188","kw2" I need to…
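
To make the intended result concrete, here is a sketch in plain Python (not NiFi) that joins the two CSV documents from the question on their shared `d` column. The excerpt is cut off before stating the desired output, so treating this as an inner join is an assumption, and `join_on` is a hypothetical helper name.

```python
import csv
import io

# The two flowfile contents quoted in the question.
left_text = '"a","b","c","d"\n"abc","jfx","daw","123"\n"eqw","poq","djw","456"\n'
right_text = '"d","e","f","g"\n"123","VVV","010","dv2"\n"412","GGG","188","kw2"\n'


def join_on(left_text, right_text, key):
    """Inner-join two CSV documents on a shared column name."""
    # Index the right-hand rows by key; if a key repeats, the last row wins.
    right_rows = {
        row[key]: row for row in csv.DictReader(io.StringIO(right_text))
    }
    joined = []
    for row in csv.DictReader(io.StringIO(left_text)):
        match = right_rows.get(row[key])
        if match is not None:
            # Merge the columns of both rows into one record.
            joined.append({**row, **match})
    return joined


joined = join_on(left_text, right_text, "d")
print(joined)
```

Only the row with `d` equal to `123` appears in both inputs, so it is the only row kept under the inner-join assumption.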
0
votes
0 answers

My Airflow run doesn't work: stuck at running

from datetime import datetime, timedelta import psycopg2 import pandas as pd from airflow import DAG from airflow.operators.python import PythonOperator import logging import boto3 logger = logging.getLogger(__name__) logger.setLevel(logging.INFO) …
0
votes
0 answers

Pentaho Run SSH Commands Step - Can't see private ssh key

I am trying to set up an SSH connection in Pentaho DI before connecting to the database. If I understand correctly, there is a "Run SSH commands" block in Pentaho to configure and run the SSH tunnel. I enter the…
0
votes
1 answer

Receiving -bash: mage: command not found, when trying to start a new Mage AI project on Google's Compute SSH

After installing Mage AI with this command: sudo pip3 install mage-ai , I keep receiving -bash: mage: command not found, when trying to start a new Mage AI project on Google's Compute SSH. Can anyone assist me in troubleshooting this error? I tried…
Jay