Questions tagged [data-ingestion]

248 questions
1
vote
0 answers

Databricks delta live tables stuck when ingesting files from S3

I'm new to Databricks and just created a Delta Live Tables pipeline to ingest 60 million JSON files from S3. However, the input rate (the number of files it reads from S3) is stuck at around 8 records/s, which is very low IMO. I have increased the number…
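For backfills over tens of millions of small S3 objects, Auto Loader's default directory-listing discovery is often the bottleneck; switching to file-notification mode and raising the per-batch file cap usually helps. A minimal sketch of the relevant reader options follows — option names are taken from the Auto Loader documentation, but the values are illustrative, not tuned, and the actual `spark.readStream.format("cloudFiles")` call is assumed, not shown:

```python
# Hypothetical Auto Loader option set for a high-volume S3 backfill.
# Option names follow the Databricks Auto Loader docs; values are
# illustrative, not tuned for any specific workload.
autoloader_options = {
    "cloudFiles.format": "json",
    # Listing a bucket with ~60M objects is slow; notification mode
    # discovers new files via S3 events instead of listing.
    "cloudFiles.useNotifications": "true",
    # Raise the per-micro-batch file cap (the default is 1000).
    "cloudFiles.maxFilesPerTrigger": "10000",
}

def reader_options(base: dict) -> dict:
    """Return a copy suitable for spark.readStream.options(**opts)."""
    return dict(base)
```

These would be applied with `spark.readStream.format("cloudFiles").options(**autoloader_options)` inside the DLT table definition.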
1
vote
1 answer

How to write an ingest pipeline for elastic search to load a csv file as nested JSONs?

I have a csv file with the following format:
company_id  year  sales  buys  location
3           2020  230    112   europe
3           2019  234    231   europe
2           2020  443    351   usa
2           2019  224    256   usa
and when I import it to Elasticsearch I end up having one entry…
StefSco
  • 23
  • 3
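An ingest pipeline processes one document per row, so aggregating several rows into one nested document is usually easier to do before indexing. A sketch of that reshaping step with the stdlib, assuming the column names from the question and an assumed `years` key for the nested field:

```python
import csv
import io

# Reshape flat CSV rows into one document per company with a nested
# list of yearly records, ready for bulk indexing into Elasticsearch.
# The "years" key is an assumed name for the nested field.
def csv_to_nested_docs(csv_text: str) -> list:
    by_company = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        doc = by_company.setdefault(
            row["company_id"],
            {"company_id": row["company_id"],
             "location": row["location"],
             "years": []},
        )
        doc["years"].append({
            "year": int(row["year"]),
            "sales": int(row["sales"]),
            "buys": int(row["buys"]),
        })
    return list(by_company.values())
```

Each resulting dict can then be sent through the official client's bulk helper; mapping `years` as a `nested` field type in the index keeps the per-year records independently queryable.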
1
vote
3 answers

Best way to ingest data to bigquery

I have heterogeneous sources like flat files residing on prem, JSON on SharePoint, an API which serves data, and so on. Which is the best ETL tool to bring data into the BigQuery environment? I'm a kindergarten student in GCP :) Thanks in advance
vignesh
  • 1,414
  • 5
  • 19
  • 38
1
vote
1 answer

write only when all tables are valid with databricks and delta table

I'm looping through some CSV files in a folder. I want to write these CSV files as delta tables only if they are all valid. Each CSV file in the folder has a different name and schema. I want to reject the entire folder and all the files it contains…
Simon Breton
  • 2,638
  • 7
  • 50
  • 105
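The all-or-nothing requirement maps naturally onto a two-phase pass: validate every file first, and only start writing once all of them have passed. A minimal sketch with the stdlib, where `is_valid_csv` is a placeholder check and the actual Delta write is injected as a function so the pattern stays engine-agnostic:

```python
import csv
from pathlib import Path

def is_valid_csv(path: Path) -> bool:
    """Placeholder check: the file parses and has a non-empty header.
    Replace with real schema validation for your data."""
    try:
        with path.open(newline="") as f:
            header = next(csv.reader(f), None)
        return bool(header)
    except (OSError, csv.Error):
        return False

def validate_then_write(folder: Path, write_fn) -> bool:
    """Two-phase pass: validate every file first, write only if all
    passed. write_fn would be the Delta write in a real pipeline."""
    files = sorted(folder.glob("*.csv"))
    if not files or not all(is_valid_csv(p) for p in files):
        return False  # reject the whole folder, write nothing
    for p in files:
        write_fn(p)
    return True
```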
1
vote
1 answer

Azure Data Explorer Stream Ingest formatted JSON Documents

We ingest JSON messages from Event Hub into Azure Data Explorer via Stream Ingestion. I created a table with this statement .create table messages(SerialNumber: string, ReceivedUtcTime: datetime, IngestEventEnqueuedUtcTime: datetime,…
Markus S.
  • 2,602
  • 13
  • 44
1
vote
1 answer

Data Ingestion in azure data lake

I have a requirement where I need to ingest continuous/streaming data (JSON format) from Event Hub into Azure Data Lake. I want to follow the layered approach (raw, clean, prepared) to finally store the data in a delta table. My doubt is around the raw…
1
vote
0 answers

MarkLogic Splitting Large XML Files Into Multiple Documents

If we have an input file like this: $ cat > example.xml George Washington Betsy
Den_Alex
  • 51
  • 2
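In MarkLogic itself this is typically handled at load time (for example, mlcp's aggregate-input mode splits a large file into one document per record element). The splitting logic itself is simple and can be sketched in plain Python — this illustrates the idea only, not MarkLogic's API, and the element names are hypothetical:

```python
import xml.etree.ElementTree as ET

# Split one large aggregate XML file into one document per child
# element of the root -- the same idea mlcp's aggregate input mode
# applies server-side. Shown in plain Python for illustration only.
def split_children(xml_text: str) -> list:
    root = ET.fromstring(xml_text)
    return [ET.tostring(child, encoding="unicode") for child in root]
```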
1
vote
1 answer

Does WHEN clause in an insert all query when loading into multiple tables in Snowflake add a virtual field over each row and then load in bulk?

How does the WHEN clause evaluate column values in order to insert only new values and skip existing ones, when using the following query: INSERT ALL WHEN (SELECT COUNT(*) FROM DEST WHERE DEST.ID = NEW_ID) = 0 THEN INSERT INTO DEST (ID) VALUES…
alim1990
  • 4,656
  • 12
  • 67
  • 130
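Since a WHEN condition in a multi-table insert cannot contain a subquery, the "insert only new ids" check is usually moved into the source SELECT as an anti-join (or expressed as a MERGE). The pattern itself, shown with sqlite3 so it is runnable here — the table names are from the question, and sqlite stands in for Snowflake purely for illustration:

```python
import sqlite3

# The "insert only new ids" pattern: an anti-join in the source
# SELECT instead of a per-row subquery in WHEN. sqlite3 is used here
# only so the sketch is runnable; the SQL shape carries over.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dest (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE staging (id INTEGER)")
conn.executemany("INSERT INTO staging VALUES (?)", [(1,), (2,), (2,), (3,)])
conn.execute("INSERT INTO dest VALUES (1)")  # id 1 is already loaded

# Take only staged ids that do not already exist in dest.
conn.execute("""
    INSERT INTO dest (id)
    SELECT DISTINCT s.id FROM staging s
    WHERE NOT EXISTS (SELECT 1 FROM dest d WHERE d.id = s.id)
""")
rows = sorted(r[0] for r in conn.execute("SELECT id FROM dest"))
```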
1
vote
1 answer

How can we use parallel loading in data warehouse ingest scripts to load into multiple tables at the same time without duplications?

Is it possible to load data into multiple tables using INSERT ALL without adding duplicates, and without using overwrite to accomplish it? As the WHEN clause doesn't support subqueries unless it returns a value to compare with something else, I am…
alim1990
  • 4,656
  • 12
  • 67
  • 130
1
vote
1 answer

Azure Cosmos DB CSV upload

I am opening a CSV file in Python in Pycharm, then I want to upload it to my Container in Cosmos DB. It's not working. if os.path.exists(csv_file): with codecs.open(csv_file, 'rb', encoding="utf-8") as csv: csv_reader = DictReader(csv) …
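Two things commonly trip this flow up: the file must be opened in text mode (not `'rb'`) for `DictReader`, and every Cosmos DB item needs a string `id` field. A stdlib-only sketch of preparing the rows — the upload itself would then call `upsert_item` on a container client from the azure-cosmos SDK, which is not shown here:

```python
import csv
import io
import uuid

# Prepare CSV rows as Cosmos-ready items: every item needs a string
# "id" field. Note the file content is handled as text -- DictReader
# expects text mode ("r"), not binary ("rb").
def csv_to_items(csv_text: str) -> list:
    items = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        item = dict(row)
        item["id"] = item.get("id") or str(uuid.uuid4())
        items.append(item)
    return items
```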
1
vote
1 answer

Loading plain text dates in Spark v3 from CSV

I am trying to ingest a very basic CSV file with dates in Apache Spark. The complexity resides in the months being spelled out. For analytics purposes, I'd like to have those months as a date. Here is my CSV file: Period,Total "January…
jgp
  • 2,069
  • 1
  • 21
  • 40
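Spelled-out month names map onto the full-month pattern: in Spark 3 that would typically be `to_date(col("Period"), "MMMM yyyy")` (possibly with the legacy time-parser policy for pre-3.0 patterns). The equivalent parse in plain Python, assuming English month names as in the question's file:

```python
from datetime import date, datetime

# Parse spelled-out months like "January 2021" into dates.
# %B matches the full month name (locale-sensitive; English assumed).
def parse_period(period: str) -> date:
    return datetime.strptime(period, "%B %Y").date()
```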
1
vote
1 answer

Skip Header row when loading data from csv using Ingest Utility in db2

I am trying to load data into a db2 target table from a csv file using the ingest utility. I see the header row getting rejected with an error message. Is there any option (similar to skipcount in the import utility) to skip the header row, so as to avoid…
vineeth
  • 641
  • 4
  • 11
  • 25
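To my knowledge the INGEST utility has no direct equivalent of IMPORT's skipcount, so a common workaround is to strip the header before ingesting (e.g. `tail -n +2 in.csv > out.csv`, or the Python equivalent below — a sketch, with hypothetical file paths):

```python
# Drop the header line before handing the file to db2 INGEST,
# which (unlike IMPORT's skipcount) has no obvious skip-header
# option. Shell one-liner alternative: tail -n +2 in.csv > out.csv
def strip_header(src_path: str, dst_path: str) -> None:
    with open(src_path) as src, open(dst_path, "w") as dst:
        next(src, None)           # discard the header line
        for line in src:
            dst.write(line)
```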
1
vote
1 answer

How to insert/ingest Current timestamp into kusto table

I am trying to insert the current datetime into a table which has datetime as the datatype, using the following query: .ingest inline into table NoARR_Rollout_Status_Dummie <| @'datetime(2021-06-11)',Sam,Chay,Yes Table was created using the following…
A D
  • 51
  • 2
  • 12
1
vote
1 answer

Single data ingestion service vs multiple individual microservices?

I am trying to understand the pros and cons of having a single data ingestion microservice versus multiple individual microservices, one for each source of data. The context: there are multiple sources from which I need to retrieve customer data…
Mahir Hiro
  • 135
  • 1
  • 7
1
vote
1 answer

How to parse data in a variety of data formats/structures?

I'm terribly unfamiliar with the data engineering space, but here goes: I have users that upload data in a variety of formats, that I want to convert to a single standard format. For example: Source Format #1 { "firstName": "Bob", "lastName":…
diplosaurus
  • 2,538
  • 5
  • 25
  • 53
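One common shape for this problem is a registry of per-source transform functions, each mapping its input into a single canonical schema. A sketch, where the `firstName`/`lastName` fields come from the question's example and the second source format plus the canonical field names are assumptions:

```python
# A registry of per-source transforms into one canonical schema.
# Format #1's fields mirror the question; format #2 and the
# canonical field names are hypothetical.
def from_format_1(data: dict) -> dict:
    return {"first_name": data["firstName"], "last_name": data["lastName"]}

def from_format_2(data: dict) -> dict:
    # e.g. a source that sends a single "name": "Bob Smith" field
    first, _, last = data["name"].partition(" ")
    return {"first_name": first, "last_name": last}

TRANSFORMS = {"format1": from_format_1, "format2": from_format_2}

def normalize(source: str, data: dict) -> dict:
    """Dispatch to the transform registered for this source."""
    return TRANSFORMS[source](data)
```

New sources are then supported by registering one more transform, without touching the downstream pipeline.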