Questions tagged [data-engineering]
69 questions
0
votes
1 answer
How to write a swagger file to capture the information within the <> block?
I am working on building a data pipeline for an API call whose response follows the format below:
5555
5555
Toronto Training 1
…

La. Li
- 41
- 6
0
votes
0 answers
Spark - Optimisation - Explode and Collect List for a big fact table
I’m trying to optimise a job that currently runs for 2 hours. It uses RDDs, and converting them into DataFrames makes it even worse. Here is what I did:
I have a very big fact table where a column is an array of structs. I can only join…

user3294361
- 21
- 2
0
votes
1 answer
Programmatically create directory, upload files, list DBFS - Databricks API - Azure
I am trying to create a directory using the Databricks API, but I am struggling to find the right placeholders for the cURL request. Please help!
I tried to create the directory using mkdirs, with no luck!
curl -X POST https://${DATABRICKS_HOST}/api/2.0/dbfs/mkdirs -H "Authorization:…
0
votes
1 answer
Loading data from Excel to Redshift with Pandas in Python works, but takes 7+ hours when the Excel file has 20000+ rows. Is there a way to optimize it?
I am experiencing slow performance when using Pandas to load data from an Excel file into an existing Redshift table. The Excel file has 10+ columns and 20000+ rows, and the operation is taking over 7 hours to complete. Is there a way to optimize…

Lakshmi Reddy
- 1
- 1
0
votes
0 answers
Schema changing and handling column datatype errors during .xls ETL with Python in AWS
I've been working with .xls files for almost 2 years and I finally decided to clear up my thoughts on this topic. Let me explain my scenario: I work with many sources (databases, APIs, ...), but I also have to read almost 20 Excel files from the…

Lucas Saito
- 63
- 6
0
votes
1 answer
Records in a single JSON file have different structures; how can I handle this type of issue in ADF?
I have an attribute named shipment in the JSON file; there are 100 records in total in one file.
In some records shipment is in this format "shipment":[ ]
while in some records shipment is in this format
"shipment": { "id": 171700,…

junaidbilal
- 7
- 3
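The mixed `shipment` shape described in the question above (an empty list in some records, a single object in others) can be illustrated with a small pre-processing sketch. This is written in plain Python rather than as an ADF data flow, and it is only one possible approach, not the answer given on the site. The helper name `normalize_shipment` and the `order` field in the sample data are hypothetical; only the `shipment` key and the id 171700 come from the excerpt.

```python
import json


def normalize_shipment(record):
    """Return a copy of the record whose 'shipment' value is always a list."""
    value = record.get("shipment")
    if value is None:
        shipments = []            # missing or null: treat as no shipments
    elif isinstance(value, dict):
        shipments = [value]       # single object: wrap it in a list
    else:
        shipments = list(value)   # already a list (possibly empty)
    return {**record, "shipment": shipments}


# Hypothetical sample input; the "order" field is made up for illustration.
raw = '[{"order": 1, "shipment": []}, {"order": 2, "shipment": {"id": 171700}}]'
records = [normalize_shipment(r) for r in json.loads(raw)]
```

Once every record carries a list, a downstream flatten or unroll step sees a single consistent schema.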
0
votes
0 answers
How to increase performance of API that performs aggregations on a mutable table in Postgres
The source table is mutable, so it doesn't keep history.
I asked ChatGPT this question and got the following pointers to increase performance:
Indexing, Caching, Materialized views, Partitioning, Denormalization,
Query optimization, Scaling through…

Aseem
- 5,848
- 7
- 45
- 69
0
votes
0 answers
Apache Superset
Can someone please assist? I have a dashboard on Superset connected to a dataset in BigQuery. One can go and apply filters on the dashboard. After applying the filters, I want to insert a complete button that one can select to generate the SQL…

Jana
- 1
0
votes
1 answer
Dynamic script executions on Azure Databricks and Synapse Serverless
I have a table in SQL Server that has a few columns: ViewName, ExternalTableDefinition_Synapse, ExternalTableDefinition_ADB, ViewDefinition_Synapse, ViewDefinition_ADB
There may be hundreds of thousands of records here.
The task is to iterate over…

SouravA
- 5,147
- 2
- 24
- 49
0
votes
1 answer
Acks in Apache Kafka
I wonder why Kafka had “acks” (Producer’s config for acknowledgement of delivery) as 1 by default rather than “all” until version 3.0 when the latter offers high durability and hence consistency across all the replicas?
As per my…

user11672850
- 11
- 1
- 4
0
votes
0 answers
BulkWriteError when writing large amount of data into MongoDB using PySpark
I am getting a BulkWriteError while writing a DataFrame into MongoDB.
df.write\
.format("com.mongodb.spark.sql.DefaultSource")\
.option("uri", connection_string)\
.option("database", db)\
.option("collection", collection)\
.option("ordered",…
0
votes
1 answer
dbt is not able to find my source defined in sources.yaml
I'm setting up a new dbt project and trying to define a source then use it in a downstream model.
Here is my sources.yml, located under the models folder:
version: 2
sources:
  - name: raw
    schema: my_schema
    database: my_db
    tables:
      -…

Lee
- 2,874
- 3
- 27
- 51
0
votes
1 answer
Creating a column from another column using a conditional in SQL
I have a campaign_name column in my database. The country codes are present in the names, and I want to extract them and transform them into their full country names, as well as create a new column called country name that includes…

ekimebg
- 13
- 1
0
votes
0 answers
Data processing time in big data cluster
I need some help with the following question asked in an interview:
Suppose the cluster size is 5: Driver -> 1, Worker -> 4, each worker has -> 1 executor, each executor has -> 4 cores.
We have 40 parts of data, and one part requires 5 min processing…

Daniel
- 1
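The figures quoted in the question above lend themselves to a back-of-the-envelope estimate. The sketch below rests on assumptions the truncated excerpt does not confirm: each part is one task, each core runs one task at a time, the driver does no processing, and scheduling or shuffle overhead is ignored. The function name `estimate_minutes` is hypothetical.

```python
import math


def estimate_minutes(parts, workers, executors_per_worker,
                     cores_per_executor, minutes_per_part):
    """Estimate wall-clock minutes under a one-task-per-core model."""
    # Parallel task slots available across the whole cluster.
    slots = workers * executors_per_worker * cores_per_executor
    # Number of rounds needed to push all parts through those slots.
    waves = math.ceil(parts / slots)
    return slots, waves, waves * minutes_per_part


# Figures from the question: 4 workers, 1 executor each, 4 cores each,
# 40 parts, 5 minutes per part.
slots, waves, total = estimate_minutes(40, 4, 1, 4, 5)
print(slots, waves, total)  # 16 slots, 3 waves, 15 minutes
```

Under these assumptions the last wave is only half full (8 of 16 slots busy), so the estimate is an upper bound for this simple model rather than a measured figure.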
0
votes
0 answers
I can't plot the signals filtered by the Butterworth filter in Python, incomplete ranges appear, and it processes very slowly
Hope you are doing well.
I'm trying to process 2 signals. First I need to apply the Nyquist theorem and then a filter, in my case Butterworth. I plot the normal signal as 'Normal_0' and the fault data 'IRDE12_0.mat'. I take
X097_DE_time and…

Luis H
- 1
- 2