Questions tagged [data-engineering]

69 questions
0
votes
1 answer

How to write a swagger file to capture the information within the <> block?

I am working on building a data pipeline for an API call whose response follows the format below: 5555 5555 Toronto Training 1
La. Li
  • 41
  • 6
0
votes
0 answers

Spark - Optimisation - Explode and Collect List for a big fact table

I’m trying to optimise a job that currently runs for 2 hours. It uses RDDs, and converting them into DataFrames makes it even worse. Here is what I did: I have a very big fact table where a column is an array of structs. I can only join…
0
votes
1 answer

Programmatically create a directory, upload files, and list DBFS - Databricks API - Azure

I'm trying to create a directory using the Databricks API, but I'm struggling to find the right placeholders for the cURL request. Please help! I tried to create the directory using mkdirs, with no luck: curl -X POST https://${DATABRICKS_HOST}/api/2.0/dbfs/mkdirs -H "Authorization:…
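For reference, a minimal sketch of the same mkdirs call built in Python rather than cURL. It assumes the DBFS API 2.0 `mkdirs` endpoint takes a JSON body with a `path` field and that authentication uses a personal access token sent as a Bearer token. The host, token, and directory path below are hypothetical placeholders, and the request is only constructed, not sent.

```python
import json
import urllib.request


def build_mkdirs_request(host, token, dbfs_path):
    """Build (but do not send) a POST request for the DBFS mkdirs endpoint."""
    url = f"https://{host}/api/2.0/dbfs/mkdirs"
    # Assumption: the endpoint expects a JSON body of the form {"path": "..."}.
    body = json.dumps({"path": dbfs_path}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical placeholder values; replace with a real workspace host and token.
req = build_mkdirs_request(
    "example.azuredatabricks.net", "dapiXXXX", "/tmp/example_dir"
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To actually send it: urllib.request.urlopen(req)
```

Building the request separately from sending it makes it easy to inspect the URL, headers, and body before any call reaches the workspace.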
0
votes
1 answer

Loading data from Excel into Redshift with Pandas in Python works, but takes 7+ hours when the Excel file has 20000+ rows. Is there a way to optimize?

I am experiencing slow performance when using Pandas to load data from an Excel file into an existing Redshift table. The Excel file has 10+ columns and 20000+ rows, and the operation is taking over 7 hours to complete. Is there a way to optimize…
0
votes
0 answers

Schema changing and handling column datatype errors during .xls ETL with Python in AWS

I've been working with .xls files for almost 2 years and I finally decided to clarify my thinking on this topic. Let me explain my scenario: I work with many sources (databases, APIs, ...), but I also have to read almost 20 Excel files from the…
0
votes
1 answer

Records in a single JSON file have different structures; how can I handle this type of issue in ADF?

I have an attribute named shipment in a JSON file that contains 100 records in total. In some records shipment is in this format: "shipment":[ ], while in other records shipment is in this format: "shipment": { "id": 171700,…
0
votes
0 answers

How to increase the performance of an API that performs aggregations on a mutable table in Postgres

The source table is mutable, so it doesn't keep history. I asked ChatGPT this question and got the following pointers to increase performance: Indexing, Caching, Materialized views, Partitioning, Denormalization, Query optimization, Scaling through…
Aseem
  • 5,848
  • 7
  • 45
  • 69
0
votes
0 answers

Apache Superset

Can someone please assist? I have a dashboard on Superset connected to a dataset in BigQuery. Users can apply filters on the dashboard. After the filters are applied, I want to add a "complete" button that one can select to generate the SQL…
Jana
  • 1
0
votes
1 answer

Dynamic script executions on Azure Databricks and Synapse Serverless

I have a table in SQL Server that has a few columns: ViewName, ExternalTableDefinition_Synapse, ExternalTableDefinition_ADB, ViewDefinition_Synapse, ViewDefinition_ADB. There may be hundreds of thousands of records here. The task is to iterate over…
0
votes
1 answer

Acks in Apache Kafka

I wonder why Kafka had “acks” (Producer’s config for acknowledgement of delivery) as 1 by default rather than “all” until version 3.0 when the latter offers high durability and hence consistency across all the replicas? As per my…
user11672850
  • 11
  • 1
  • 4
0
votes
0 answers

BulkWriteError when writing large amount of data into MongoDB using PySpark

I am getting a BulkWriteError while writing a DataFrame into MongoDB. df.write\ .format("com.mongodb.spark.sql.DefaultSource")\ .option("uri", connection_string)\ .option("database", db)\ .option("collection", collection)\ .option("ordered",…
0
votes
1 answer

dbt is not able to find my source defined in sources.yaml

I'm setting up a new dbt project and trying to define a source, then use it in a downstream model. Here is my sources.yml, located under the models folder: version: 2 sources: - name: raw schema: my_schema database: my_db tables: -…
Lee
  • 2,874
  • 3
  • 27
  • 51
0
votes
1 answer

Creating a column from another column using a conditional in SQL

I have a column in my database that contains campaign_name. The country codes are present in the names, and I want to be able to extract them and transform them into their full names, as well as create a new column called country name that includes…
ekimebg
  • 13
  • 1
0
votes
0 answers

Data processing time in big data cluster

I need some help with the following question asked in an interview: Suppose the cluster size is 5 (Driver -> 1, Worker -> 4), each worker has 1 executor, and each executor has 4 cores. We have 40 parts of data, and one part requires 5 min processing…
Daniel
  • 1
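A back-of-the-envelope sketch of the arithmetic for the question above. It assumes the question asks for total wall-clock time, that each core processes exactly one part at a time, and that scheduling, shuffle, and driver overhead are ignored. The function name `estimate_total_minutes` is hypothetical.

```python
import math


def estimate_total_minutes(workers, executors_per_worker, cores_per_executor,
                           parts, minutes_per_part):
    """Return (parallel slots, waves, total minutes) under idealised assumptions."""
    # One part runs per core, so the number of parts processed at once
    # equals the total core count across all executors.
    slots = workers * executors_per_worker * cores_per_executor
    # Parts are processed in waves of `slots`; the last wave may be partial.
    waves = math.ceil(parts / slots)
    return slots, waves, waves * minutes_per_part


slots, waves, total = estimate_total_minutes(
    workers=4, executors_per_worker=1, cores_per_executor=4,
    parts=40, minutes_per_part=5,
)
print(f"{slots} slots, {waves} waves, about {total} minutes")
```

With 16 parallel slots, 40 parts need three waves (16 + 16 + 8), so the idealised estimate is 3 × 5 = 15 minutes. The last wave leaves half the cores idle, which is why the result is not simply 40 × 5 / 16.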
0
votes
0 answers

I can't plot the signals filtered by the Butterworth filter in Python, incomplete ranges appear, and it processes very slowly

Hope you are doing well. I'm trying to process 2 signals. First I need to apply the Nyquist theorem and then a filter, in my case Butterworth. I plot the normal signal as 'Normal_0' and the fault data 'IRDE12_0.mat', I take X097_DE_time and…
Luis H
  • 1
  • 2