Highest Voted 'data-engineering' Questions

0 votes

1 answer

How to write a swagger file to capture the information within the <> block?

I am working on building a data pipeline for an API call whose response follows the below format: 5555 5555 Toronto Training 1 …

asked Jun 28 '23 at 16:56

La. Li

41
6

0 votes

0 answers

Spark - Optimisation - Explode and Collect List for a big fact table

I’m trying to optimise a job which currently runs for 2 hours. It uses RDDs, and converting them into DataFrames makes it even worse. Here is what I did: I have a very big fact table where a column is an array of structs. I can only join…

dataframe apache-spark optimization bigdata data-engineering

asked Jun 27 '23 at 13:42

user3294361

21
2

0 votes

1 answer

Programmatic Create Directory, Upload files, list DBFS - Databricks API - Azure

Trying to make directory using Databricks API, struggling to find the right placeholders in cURL request, please help! Tried to create directory using mkdirs, no luck! curl -X POST https://${DATABRICKS_HOST}/api/2.0/dbfs/mkdirs -H "Authorization:…

curl databricks azure-databricks data-engineering

asked Jun 21 '23 at 10:31

Vaitheesh

1

0 votes

1 answer

Loading data from Excel to Redshift with Pandas in Python works, but takes 7+ hours when the Excel file has 20000+ rows. Is there a way to optimize?

I am experiencing slow performance when using Pandas to load data from an Excel file into an existing Redshift table. The Excel file has 10+ columns and 20000+ rows, and the operation is taking over 7 hours to complete. Is there a way to optimize…

python pandas amazon-redshift psycopg2 data-engineering

asked Jun 15 '23 at 22:48

Lakshmi Reddy

1
1

0 votes

0 answers

Schema changing and handling column datatype errors during .xls ETL with Python in AWS

I've been working with .xls files for almost 2 years and I finally decided to clear my thoughts over this topic. Let me explain my scenario: I work with many sources (databases, api, ...), but I also have to read almost 20 excel files from the…

excel amazon-web-services etl aws-glue data-engineering

asked Jun 15 '23 at 04:08

Lucas Saito

63
6

0 votes

1 answer

Records in a single JSON file differ; how can I handle this type of issue in ADF?

I have one attribute named shipment in the JSON file; there are 100 records total in one file. In some records shipment is in this format "shipment":[ ] while in some records shipment is in this format "shipment": { "id": 171700,…

azure azure-functions azure-data-factory data-engineering

asked Jun 14 '23 at 06:02

junaidbilal

7
3
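One way to think about the mixed `shipment` shapes in the question above is to normalize every record to a list before it reaches the pipeline. The sketch below is only an illustration in plain Python (for example, inside a pre-processing step such as an Azure Function); it is not an ADF-native solution, and the `order` field and the `normalize_shipment` helper are hypothetical names that do not come from the question.

```python
import json


def normalize_shipment(record):
    """Return a copy of the record whose "shipment" value is always a list."""
    value = record.get("shipment")
    if value is None:
        shipments = []          # attribute missing or null: treat as empty
    elif isinstance(value, dict):
        shipments = [value]     # single object: wrap it in a list
    else:
        shipments = list(value) # already an array: keep its items
    return {**record, "shipment": shipments}


# Hypothetical sample covering the two shapes mentioned in the question.
raw = '[{"order": 1, "shipment": []}, {"order": 2, "shipment": {"id": 171700}}]'
records = [normalize_shipment(r) for r in json.loads(raw)]
```

After this step every record exposes `shipment` as an array, so a downstream flatten or mapping step sees a single consistent schema.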

0 votes

0 answers

How to increase performance of API that performs aggregations on a mutable table in Postgres

The source table is mutable, so it doesn't keep history. I asked this question to ChatGPT and I got the following pointers to increase performance: Indexing, Caching, Materialized views, Partitioning, Denormalization, Query optimization, Scaling through…

postgresql google-bigquery aggregate data-engineering

asked Jun 13 '23 at 21:00

Aseem

5,848
7
45
69

0 votes

0 answers

Apache Superset

Can someone please assist. I have a dashboard on Superset connected to a dataset in BigQuery. One can go and apply filters on the dashboard. After applying the filters, I want to insert a complete button that one can select to generate the SQL…

dashboard apache-superset data-engineering

asked Jun 13 '23 at 06:38

Jana

1

0 votes

1 answer

Dynamic script executions on AzureDataBricks and Synapse Serverless

I have a table in SQL Server that has a few columns: ViewName, ExternalTableDefinition_Synapse, ExternalTableDefinition_ADB, ViewDefinition_Synapse, ViewDefinition_ADB. There may be hundreds of thousands of records here. The task is to iterate over…

sql-server dynamic-programming azure-databricks azure-synapse data-engineering

asked Jun 08 '23 at 07:19

SouravA

5,147
2
24
49

0 votes

1 answer

Acks in Apache Kafka

I wonder why Kafka had “acks” (Producer’s config for acknowledgement of delivery) as 1 by default rather than “all” until version 3.0 when the latter offers high durability and hence consistency across all the replicas? As per my…

apache-kafka data-engineering

asked Jun 05 '23 at 20:50

user11672850

11
1
4

0 votes

0 answers

BulkWriteError when writing large amount of data into MongoDB using PySpark

I am getting a BulkWriteError while writing a df into MongoDB. df.write\ .format("com.mongodb.spark.sql.DefaultSource")\ .option("uri", connection_string)\ .option("database", db)\ .option("collection", collection)\ .option("ordered",…

mongodb azure pyspark databricks data-engineering

asked May 31 '23 at 07:19

Atul Maurya

1

0 votes

1 answer

dbt is not able to find my source defined in sources.yaml

I'm setting up a new dbt project and trying to define a source then use it in a downstream model. Here is my sources.yml located under models folder version: 2 sources: - name: raw schema: my_schema database: my_db tables: -…

amazon-redshift dbt data-engineering

asked May 30 '23 at 06:17

Lee

2,874
3
27
51

0 votes

1 answer

Creating a column from another column using a conditional in SQL

I have a column in my database that contains campaign_name. The country codes are present in the names, and I want to be able to extract them and transform them into their normal names, as well as create a new column called country name that includes…

sql google-bigquery data-engineering

asked May 25 '23 at 12:07

ekimebg

13
1

0 votes

0 answers

Data processing time in big data cluster

I need some help with the following question asked in an interview: Suppose the cluster size is 5: Driver -> 1, Worker -> 4, each worker has 1 executor, and each executor has 4 cores. We have 40 parts of data; one part requires 5 min processing…

bigdata data-processing data-engineering

asked May 18 '23 at 11:03

Daniel

1
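The arithmetic behind the interview question above can be sketched as follows. This is a back-of-the-envelope estimate only, and it rests on assumptions the truncated excerpt does not confirm: each core runs one part at a time, all parts take the same 5 minutes, the driver does no processing, and scheduling or shuffle overhead is ignored. The function name `estimate_wall_clock_minutes` is hypothetical.

```python
import math


def estimate_wall_clock_minutes(num_parts, minutes_per_part,
                                workers, executors_per_worker,
                                cores_per_executor):
    """Estimate total time assuming one part per core and equal-cost parts."""
    # Number of parts that can be processed at the same time.
    slots = workers * executors_per_worker * cores_per_executor
    # Parts are processed in "waves"; a partial last wave still costs a full one.
    waves = math.ceil(num_parts / slots)
    return waves * minutes_per_part


# Figures from the excerpt: 4 workers, 1 executor each, 4 cores each, 40 parts.
total = estimate_wall_clock_minutes(40, 5, 4, 1, 4)
print(total)  # 15
```

With 16 parallel slots, 40 parts need three waves (16 + 16 + 8), so the estimate is 15 minutes under these assumptions; the last wave leaves half of the cores idle.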

0 votes

0 answers

I can't plot the signals filtered by the Butterworth filter in Python, incomplete ranges appear, and it processes very slowly

Hope you are doing well. I'm trying to process 2 signals: first I need to apply the Nyquist theorem and then a filter, in my case Butterworth. I plot the normal signal as 'Normal_0' and the fault data 'IRDE12_0.mat', I take X097_DE_time and…

python butterworth data-engineering

asked May 16 '23 at 19:05

Luis H

1
2

Questions tagged [data-engineering]