Questions tagged [data-engineering]
69 questions
0 votes, 1 answer
Batch processing for ML pipeline - questions regarding data ingestion and data storage
I am working on a batch data pipeline for a large-scale ML application and have some questions.
Data Ingestion Layer: In most sources that I have read so far, Kafka is suggested for pulling data from the original source (a CSV in my case). When…

user22316802
0 votes, 1 answer
How to find the difference between the last value of the current month and the last value of the previous month using Python
I have a question and need a little help with Python.
I wish to calculate the difference: last value of the current month minus last value of the previous month (per tag).
Then, the total sum of the values calculated in step 1 against per…

Nishad Nazar
0 votes, 0 answers
dbt found two resources with the database representation
I have two .sql dbt models: models/A/claim.sql and models/A/prod_claim.sql.
My goal is to create the claim table in two different databases/schemas that are defined in profiles.yaml as different targets.
They have different tags mentioned in the schema.yaml…

Ajmal Moideen
0 votes, 1 answer
PySpark function to handle null values with poor performance - Need optimization suggestions
I have a PySpark function called fillnulls that handles null values in my dataset by filling them with appropriate values based on the column type. However, I've noticed that the function's performance is not optimal, especially when dealing with…

Shaked Nave
0 votes, 0 answers
Dynamic Staging bucket argument in Apache Beam dataflow pipeline (GCP)
I'm using the function below:
def final_pipeline():
    with beam.Pipeline(options=pipeline_options) as pipeline:
        readable_files = (
            pipeline
            |…

Rohan Anand
0 votes, 1 answer
PySpark loading from MySQL ends up loading the entire table?
I am quite new to PySpark (or Spark in general). I am trying to connect Spark with a MySQL instance I have running on RDS. When I load the table like so, does Spark load the entire table in memory?
from pyspark.sql import SparkSession
spark =…

Bhargav Panth
0 votes, 0 answers
Changing table arrangement in Python
The following table describes the test results for each ID. Under each subject column there is a grade from either teacher 1 or teacher 2, according to the column to the right of it, and under the teacher 1 or teacher 2 columns is the name of…

Bar Fichman
0 votes, 1 answer
Multiple statement task in Snowflake error
I've created this task in Snowflake using SnowSQL:
CREATE OR REPLACE TASK sv_copy_command_test_task
WAREHOUSE = compute_wh
SCHEDULE = '1 MINUTE'
AS
BEGIN
SET DATE_PATTERN = CONCAT('CALLS/', TO_CHAR(CURRENT_DATE(), 'YYYY/MM/DD'), '/',…

svalls
0 votes, 0 answers
Error copying list from SharePoint to Azure Blob Storage using Data Factory
I'm facing an issue when trying to transfer a list from SharePoint to Azure Blob Storage using Azure Data Factory, through the SharePoint Online List connector. Unfortunately, I'm receiving a generic error that doesn't provide specific information…
0 votes, 1 answer
How to create a Snowflake Task that runs two queries
I want to create a task that executes a COPY command every minute in Snowflake. The problem is that the path of the file to load (the files are stored as Parquet files in S3) is determined by the current date. For that I have this…

svalls
0 votes, 1 answer
How long does Snowpipe keep track of files that have already been loaded?
I've created a Snowpipe to load continuous data from an S3 bucket. In the S3 bucket I have the data compressed in Parquet files, but from time to time this data may be loaded again, replacing the old Parquet file with the new one (when the data…

svalls
0 votes, 1 answer
How to concatenate two flowfiles in Apache NiFi based on a column value (CSV)?
I am new to NiFi, and I am currently working on a task where I have a flowfile:
"a","b","c","d"
"abc","jfx","daw","123"
"eqw","poq","djw","456"
And another flowfile:
"d","e","f","g"
"123","VVV","010","dv2"
"412","GGG","188","kw2"
I need to…
0 votes, 0 answers
My Airflow run doesn't work: stuck at running
from datetime import datetime, timedelta
import psycopg2
import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator
import logging
import boto3
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO) …
0 votes, 0 answers
Pentaho Run SSH Commands Step - Can't see private ssh key
I am trying to set up an SSH connection in Pentaho DI before connecting to the database. If I understand correctly, there is a "Run SSH commands" step in Pentaho to configure and run the SSH tunnel.
I enter the…

Alejandro Norte
0 votes, 1 answer
Receiving -bash: mage: command not found, when trying to start a new Mage AI project on Google's Compute SSH
After installing Mage AI with this command:
sudo pip3 install mage-ai
I keep receiving -bash: mage: command not found when trying to start a new Mage AI project on Google's Compute SSH.
Can anyone assist me in troubleshooting this error?
I tried…

Jay