Questions tagged [data-engineering]

69 questions
0
votes
0 answers

Manage account in Azure Databricks

Hi everyone. When I try to go to Manage Account in Azure Databricks and select my workspace, it returns me to onboarding and I can't access Manage Account. I have tried for more than 3 days with no result, and I created another workspace and still…
0
votes
1 answer

Spark continuous structured streaming not showing input rate or process rate metrics

I'm running my Spark continuous structured streaming application on a standalone cluster. However, I noticed that metrics like avg input/sec and avg process/sec are not showing (they appear as NaN) on the Structured Streaming UI. I have…
0
votes
0 answers

Integrating Airbyte as a multi-container app in the main docker-compose.yml

I am building a data pipeline with Airbyte, PostgreSQL and dbt. PostgreSQL and dbt I can easily set up via my main docker-compose.yml, but with Airbyte I am not sure. Airbyte itself is a multi-container app, so it has its own docker-compose.yml. To…
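Since Airbyte ships its own docker-compose.yml, a common pattern is to keep the two compose files separate and connect them through a shared external Docker network instead of merging them. A minimal sketch, assuming a hypothetical network name `data-stack` and illustrative image tags:

```yaml
# Main project's docker-compose.yml (sketch). Airbyte keeps running from its
# own docker-compose.yml; both stacks attach to the same external network,
# created once with: docker network create data-stack
services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: example
    networks:
      - data-stack
  dbt:
    image: ghcr.io/dbt-labs/dbt-postgres:1.7.0
    networks:
      - data-stack

networks:
  data-stack:
    external: true   # not created by this file; shared with the Airbyte stack
```

The Airbyte side would need the same `networks:` entries added to its services so containers can resolve each other by service name.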
0
votes
1 answer

How to loop through almost 1 million rows from BigQuery with Python

I am a newbie. I just ran a query on BigQuery that returns ~1 million rows with 25 columns. The rows have the type RowIterator. I wrote a script in Python to loop over them and process the data. I used: client = bigquery.Client() query_job =…
D9SeveN
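A RowIterator streams results page by page, so the usual pattern is to process rows in batches rather than materializing a million-row list. A minimal sketch; the BigQuery lines are commented out because they need credentials, and `process_row` would be whatever per-row work the pipeline does:

```python
# Sketch: process a large result set in bounded batches instead of loading
# everything into memory at once.
from itertools import islice

def iter_batches(rows, batch_size=10_000):
    """Yield lists of up to batch_size items from any iterator."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# With BigQuery (commented out; RowIterator already fetches pages lazily):
# from google.cloud import bigquery
# client = bigquery.Client()
# rows = client.query("SELECT * FROM `project.dataset.table`").result(page_size=10_000)
# for batch in iter_batches(rows, 10_000):
#     for row in batch:
#         process_row(row)   # hypothetical per-row work

# Standalone demonstration with a plain range standing in for RowIterator:
processed = sum(len(b) for b in iter_batches(range(25_000), 10_000))
print(processed)  # 25000
```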
0
votes
0 answers

Why do I get a KeyError in this Mage Data Pipeline?

I am attempting to enrich a dataset with zip codes from the Chicago Data Portal. The Chicago crimes dataset can be found at https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-Present/ijzp-q8t2 and the geographic data for zip codes can be…
Cole
0
votes
2 answers

Using Python to Remove a row in Excel based on Cell Value

I'm attempting to clean an Excel file prior to sending it up to the database for calculations. By default, when the Excel report is exported out of our system (NextGen), it attaches a row that calculates a sum total of data throughout the report based…
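A common way to drop such a summary row before loading is a pandas filter on the cell value. A sketch under assumptions: the marker text is "Sum Total", it appears in a column here called `Provider`, and the file name is illustrative; the `read_excel` call is commented out so the example stays self-contained:

```python
# Sketch: drop the auto-generated "Sum Total" row before database upload.
import pandas as pd

# df = pd.read_excel("nextgen_report.xlsx")   # requires openpyxl
df = pd.DataFrame({
    "Provider": ["Smith", "Jones", "Sum Total"],
    "Amount": [100, 200, 300],
})

# Keep every row whose Provider cell does not contain the summary marker.
clean = df[~df["Provider"].astype(str).str.contains("Sum Total", na=False)]
print(len(clean))  # 2
```

The same filter works regardless of where the summary row appears, which is safer than deleting a fixed row index.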
0
votes
0 answers

OOM error while reading .parquet file. How do I solve this?

I am working on an ETL project. For that, I am trying to read a .parquet file in order to inspect and transform the data and upload it. I've been failing at that, as I always get an "OOM error" while reading it. Is there some way I could read this…
mdein
0
votes
1 answer

How to Calculate GPAs in MATLAB

I've just begun learning MATLAB. I am in engineering school and we were given a problem to solve in MATLAB. The problem is as follows (also attached): the text file called Transcript.txt lists the courses, grades and credits for a student transcript…
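The computation itself is a credit-weighted average of grade points. The original asks for MATLAB; here is the same logic sketched in Python, with the grade-point mapping and sample transcript rows as assumptions:

```python
# Sketch: GPA = sum(points * credits) / sum(credits), credit-weighted.
POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}

def gpa(records):
    """records: iterable of (course, grade, credits) tuples."""
    total_points = sum(POINTS[grade] * credits for _, grade, credits in records)
    total_credits = sum(credits for _, _, credits in records)
    return total_points / total_credits

# Hypothetical rows parsed from something like Transcript.txt:
transcript = [("MATH101", "A", 4), ("PHYS101", "B", 3), ("ENGL101", "C", 3)]
print(round(gpa(transcript), 2))  # 3.1
```

In MATLAB the same idea would be a dot product of the points and credits vectors divided by `sum(credits)`.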
0
votes
1 answer

ADF Data flow expression

I am trying to build an ADF data flow Select operation to dynamically select column names. I am receiving the required column names in an array parameter named 'colNames' and then I am trying to use that in a data flow expression to check if the column name in…
0
votes
0 answers

Constraint constraints/compute.requireOsLogin violated for project (project id)

While creating a Data Quality task in Dataplex, I am facing the issue "Constraint constraints/compute.requireOsLogin violated for project". I have checked all the task configuration but I am not able to find anything related to this error.
0
votes
1 answer

Is it possible to build seed dataset/table over multiple files in DBT?

Is it possible to build a seed dataset/table over multiple files in dbt? I have two data files like below in my dbt project. Building a seed dataset/table on an individual file works perfectly fine. However, what I am looking for is to create one seed…
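dbt seeds map one CSV file to one table, so a common workaround is to concatenate the source files into a single CSV before running `dbt seed`. A sketch with in-memory stand-ins for the two data files; the file names, columns, and `seeds/` path are assumptions:

```python
# Sketch: combine several CSVs into one file that dbt can seed as one table.
from io import StringIO

import pandas as pd

# Stand-ins for the two files in the project (hypothetical contents):
file_a = StringIO("id,name\n1,alpha\n2,beta\n")
file_b = StringIO("id,name\n3,gamma\n")

# In a real project: frames = [pd.read_csv(p) for p in glob.glob("data/raw/*.csv")]
frames = [pd.read_csv(f) for f in (file_a, file_b)]
combined = pd.concat(frames, ignore_index=True)

# combined.to_csv("seeds/my_seed.csv", index=False)   # then: dbt seed
print(len(combined))  # 3
```

This assumes the files share a schema; mismatched columns would surface as NaNs in the concatenated frame.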
0
votes
2 answers

handle dynamic number of columns (csv file) in pyspark

I am getting the CSV file below (without a header): D,neel,32,1,pin1,state1,male D,sani,31,2,pin1,state1,pin2,state2,female D,raja,33,3,pin1,state1,pin2,state2,pin3,state3,male I want to create the CSV file below using a PySpark dataframe…
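One way to handle the variable width is to normalize each line to a fixed schema before Spark sees it: the first four fields are constant, the trailing field is gender, and the pin/state pairs in between are padded with empty strings up to the maximum. A plain-Python sketch of that padding logic (the column layout is inferred from the sample; in PySpark the same function would run in an RDD `map` before `createDataFrame`):

```python
# Sketch: pad variable pin/state pairs so every row has the same width.
rows = [
    "D,neel,32,1,pin1,state1,male",
    "D,sani,31,2,pin1,state1,pin2,state2,female",
    "D,raja,33,3,pin1,state1,pin2,state2,pin3,state3,male",
]

def normalize(line, max_pairs=3):
    parts = line.split(",")
    head, pairs, gender = parts[:4], parts[4:-1], parts[-1]
    pairs += [""] * (2 * max_pairs - len(pairs))   # fill missing pin/state slots
    return head + pairs + [gender]

fixed = [normalize(r) for r in rows]
print({len(r) for r in fixed})  # {11}
```

With a uniform width, a fixed column list (`name`, `age`, …, `pin3`, `state3`, `gender`) can be applied directly.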
0
votes
1 answer

open source data stack - Airbyte, Airflow, ?,?

I am building an open source data stack for a large-scale batch pipeline. The data is later to be used in an ML model that is updated quarterly. I want to use Airbyte for ingestion and Airflow for general orchestration. In general, I want to use…
0
votes
0 answers

Importing data after DACPAC

I am in the process of transferring a database from one Azure environment to another Azure environment. The old database must remain intact, and I would like to do the deployment via a release pipeline in Azure DevOps. I created a new database project with…
0
votes
1 answer

Kafka stream in Databricks increases data size a lot

When I perform a Kafka write stream to a table in Databricks, the incoming data doesn't increase the table size significantly, but it results in a much larger increase in the data size on Blob storage. val kafkaBrokers="" val…