Questions tagged [data-ingestion]
248 questions
2
votes
2 answers
Druid with Kafka Ingestion: filtering data
Is it possible to filter data by dimension value during ingestion from Kafka to Druid?
e.g., considering a dimension version, which might have the values v1, v2, and v3, I would like only v2 to be loaded.
I realize it can be done using Spark/Flink/Kafka…
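
For reference, Druid's Kafka ingestion spec accepts a transformSpec with a filter, which drops non-matching rows at ingestion time, so no separate Spark/Flink job is needed. A minimal sketch submitting such a supervisor spec to the Overlord API; the datasource, topic, broker, and timestamp column names are placeholders, and the spec is abbreviated:

import requests

# Abbreviated Kafka supervisor spec: the selector filter keeps only rows
# where version == v2. Names below are placeholders.
spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "events_v2",
            "timestampSpec": {"column": "ts", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["version"]},
            "transformSpec": {
                "filter": {"type": "selector", "dimension": "version", "value": "v2"}
            },
        },
        "ioConfig": {
            "topic": "events",
            "inputFormat": {"type": "json"},
            "consumerProperties": {"bootstrap.servers": "localhost:9092"},
        },
    },
}

# Submit the supervisor spec to the Overlord.
requests.post("http://localhost:8081/druid/indexer/v1/supervisor", json=spec)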

pcejrowski
- 603
- 5
- 15
2
votes
2 answers
Not able to load files larger than 100 MB into HDFS
I'm facing a really strange issue with my cluster.
Whenever I try to load a file larger than 100 MB (104857600 bytes) into HDFS, it fails with the following error:
All datanodes are bad... Aborting.
This is really strange as 100 MB…

Megh Vidani
- 635
- 1
- 7
- 22
2
votes
3 answers
If we use 6 mappers in Sqoop to import data from Oracle, how many connections will be established between Sqoop and the source?
If we use 6 mappers in Sqoop to import data from Oracle, how many connections will be established between Sqoop and the source?
Will it be a single connection, or one connection per mapper, i.e. 6?
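
Broadly: Sqoop opens one connection from the client to run the metadata and boundary-value queries, and then each mapper opens its own JDBC connection, so with 6 mappers you should expect six parallel connections during the transfer. A sketch of such an import driven from Python; the JDBC URL, credentials, and table names are placeholders:

import subprocess

# Each mapper opens its own JDBC connection, so --num-mappers 6 means six
# parallel connections during the transfer, in addition to the initial
# client connection used for the metadata/boundary query.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:oracle:thin:@db-host:1521:ORCL",  # placeholder JDBC URL
    "--username", "scott",
    "--password-file", "/user/scott/.password",
    "--table", "EMPLOYEES",
    "--split-by", "EMPLOYEE_ID",   # column used to partition work across mappers
    "--num-mappers", "6",
    "--target-dir", "/data/employees",
], check=True)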

smisra3
- 107
- 1
- 12
2
votes
4 answers
Sqoop import multiple tables but not all
All the searches I've found show how to import one table or recommend import-all-tables. What if I want 35 of the 440 tables in my db? Can I just write one command and separate the tables by commas, or do I have to put it in a script and copy and…
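
Sqoop's import tool takes a single --table, so the usual options are import-all-tables with --exclude-tables (listing the tables you don't want) or a small script that loops over the tables you do want. A sketch of the loop approach; connection details and table names are placeholders:

import subprocess

tables = ["CUSTOMERS", "ORDERS", "INVOICES"]  # the subset of tables you want

for table in tables:
    # One sqoop import per table; each lands in its own subdirectory
    # under --warehouse-dir.
    subprocess.run([
        "sqoop", "import",
        "--connect", "jdbc:mysql://db-host/mydb",   # placeholder JDBC URL
        "--username", "etl_user",
        "--password-file", "/user/etl/.password",
        "--table", table,
        "--warehouse-dir", "/data/warehouse",
        "-m", "4",
    ], check=True)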

AM_Hawk
- 661
- 1
- 15
- 33
1
vote
0 answers
API doesn't support batch/bulk operations
I have a CSV file with 1.5 million records, and I need to call an API to get each user's email_address. Unfortunately, the API documentation shows it doesn't support batch operations. Currently, processing 1.5 million records takes about 3-4 hours. Is there…
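
Without a batch endpoint, the usual speedup is client-side concurrency: issue many single-record requests in parallel while staying inside the provider's rate limits. A sketch assuming a hypothetical GET endpoint, file name, and CSV column name:

import csv
from concurrent.futures import ThreadPoolExecutor

import requests

session = requests.Session()  # reuse TCP connections across requests

def fetch_email(user_id: str) -> tuple[str, str]:
    # Hypothetical endpoint and response shape; adjust to the real API.
    resp = session.get(f"https://api.example.com/users/{user_id}")
    resp.raise_for_status()
    return user_id, resp.json()["email_address"]

with open("users.csv", newline="") as f:
    user_ids = [row["user_id"] for row in csv.DictReader(f)]  # hypothetical column

# Keep max_workers within the provider's rate limits.
with ThreadPoolExecutor(max_workers=20) as pool:
    for user_id, email in pool.map(fetch_email, user_ids):
        print(user_id, email)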

Olivia Xu
- 11
- 1
1
vote
1 answer
MongoDB to Databricks Data Ingestion
I am working on creating a pipeline from MongoDB to Databricks.
Based on my research there are two ways of doing it:
MongoDB Change Streams
MongoDB-Databricks Connector for Structured Streaming.
I am using Pyspark.
I am doing this to get all the…
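
With the MongoDB Spark Connector v10+, the Structured Streaming source is exposed as format("mongodb"). A PySpark sketch, assuming the connector is installed on the cluster and spark is the Databricks session; the URI, database, collection, target table, and checkpoint path are placeholders:

# Read a change-stream-backed streaming DataFrame from MongoDB.
stream = (
    spark.readStream
    .format("mongodb")
    .option("spark.mongodb.connection.uri", "mongodb+srv://user:pass@cluster0.example.net")
    .option("spark.mongodb.database", "mydb")
    .option("spark.mongodb.collection", "events")
    .load()
)

# Land the stream in a Delta table.
query = (
    stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/mongo_events")
    .toTable("bronze.mongo_events")
)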

Mayank Jain
- 11
- 2
1
vote
1 answer
octavia apply on Airbyte gives a JSON schema validation error
I'm trying to create a new BigQuery destination on Airbyte with the Octavia CLI.
When launching:
octavia apply
I receive:
Error: {"message":"The provided configuration does not fulfill the specification. Errors: json schema validation failed when…

tdebroc
- 1,436
- 13
- 28
1
vote
0 answers
TimescaleDB: how to ingest files from s3?
In Postgres, one way to ingest files directly from S3 is through the aws_s3 extension, using the table_import_from_s3 function, for example.
However, this is not directly supported by TimescaleDB as of now.
=> CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;…
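
One workaround, since a hypertable is still a regular Postgres table: stream the object from S3 in the client and feed it to COPY. A sketch with boto3 and psycopg2; bucket, key, connection string, and table names are placeholders:

import boto3
import psycopg2

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-bucket", Key="metrics/2023-01-01.csv")

conn = psycopg2.connect("dbname=tsdb user=postgres host=localhost")
with conn, conn.cursor() as cur:
    # obj["Body"] is a streaming file-like object, so the file is never
    # fully buffered in memory; COPY reads it chunk by chunk.
    cur.copy_expert(
        "COPY metrics FROM STDIN WITH (FORMAT csv, HEADER true)",
        obj["Body"],
    )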

xmar
- 1,729
- 20
- 48
1
vote
0 answers
Embed data from an additional second dataframe into a plot
I want my plot to retrieve data from one dataframe, but when hovering over the data I want it to incorporate data from both dataframes.
Example (plot image not shown), which results from:
fig = px.scatter(X_reduced_df, x='EXTRACTION_DATE_SAMPLE', y='score_IF', color=…
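
One way, assuming the second dataframe can be aligned with the plotted one (here by row index): pass its columns through custom_data and reference them in the hovertemplate. A sketch with stand-in dataframes; the second frame and its columns are hypothetical:

import pandas as pd
import plotly.express as px

# Stand-ins for the two dataframes in the question (contents are hypothetical).
X_reduced_df = pd.DataFrame({
    "EXTRACTION_DATE_SAMPLE": ["2023-01-01", "2023-01-02"],
    "score_IF": [0.1, 0.7],
})
other_df = pd.DataFrame({"SAMPLE_ID": ["A1", "B2"], "LAB": ["north", "south"]})

# Align on the row index and expose the extra columns via custom_data.
plot_df = X_reduced_df.join(other_df)
fig = px.scatter(
    plot_df,
    x="EXTRACTION_DATE_SAMPLE",
    y="score_IF",
    custom_data=["SAMPLE_ID", "LAB"],
)
# Reference the extra columns in the hover text.
fig.update_traces(
    hovertemplate="date=%{x}<br>score=%{y}<br>sample=%{customdata[0]}<br>lab=%{customdata[1]}"
)
fig.show()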

Danny
- 41
- 5
1
vote
1 answer
Delta live tables data quality checks - Retain failed records
There are 3 types of quality checks in Delta live tables:
expect (retain invalid records)
expect_or_drop (drop invalid records)
expect_or_fail (fail on invalid records)
I want to retain invalid records, but I also want to keep track of them. So,…
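
A common pattern here is a quarantine table: keep expect on the main table (so nothing is dropped and the metrics are still recorded) and materialize the failing rows separately by inverting the condition. A sketch that only runs inside a DLT pipeline; the source, table, and rule names are illustrative:

import dlt

RULE = "valid_id"
COND = "id IS NOT NULL"

@dlt.table
@dlt.expect(RULE, COND)   # records pass/fail metrics but retains all rows
def events_clean():
    return dlt.read("events_raw")   # placeholder upstream dataset

@dlt.table
def events_quarantine():
    # Same source, inverted condition: only the records that failed the rule.
    return dlt.read("events_raw").where(f"NOT ({COND})")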

Ender
- 71
- 9
1
vote
0 answers
Create a data sync using two tables
I want to create a data sync in Palantir using an update (update + insert) transaction on three fields from two different tables. There is an option in Palantir syncs to use a twin table, but I can't see how to add three fields in the incremental field…

f.ivy
- 65
- 5
1
vote
1 answer
How can I do an incremental load based on record ID in Dagster
I am trying to consume an HTTP API in my Dagster code. The API provides a log of "changes" which contain an incrementing ID. It supports an optional parameter fromUpdateId, which lets you only fetch updates that have a higher ID than some…
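
One fit for this is a Dagster sensor, whose persistent cursor can store the last-seen update ID between evaluations. A sketch; the endpoint, response shape, and job name are hypothetical:

import requests
from dagster import RunRequest, sensor

@sensor(job_name="process_updates")   # hypothetical job name
def updates_sensor(context):
    # The cursor persists across evaluations; default to 0 on first run.
    last_id = int(context.cursor) if context.cursor else 0
    updates = requests.get(
        "https://api.example.com/changes",      # hypothetical endpoint
        params={"fromUpdateId": last_id},
    ).json()
    if updates:
        newest = str(updates[-1]["updateId"])   # hypothetical field name
        yield RunRequest(run_key=newest)
        context.update_cursor(newest)           # remember the high-water mark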

Imre Kerr
- 2,388
- 14
- 34
1
vote
1 answer
Extracting data from Multiple Excel files with multiple tabs and multiple columns using Python
I'm trying to create a data ingestion routine to load data from multiple Excel files, each with multiple tabs and columns, into a data structure using Python. The structure of the tabs in each of the Excel files is the same. Can someone please help me with…
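
A common starting point: glob the workbooks, read every tab with pandas (sheet_name=None returns all sheets as a dict), and concatenate, tagging each row with its source. A sketch, assuming .xlsx files in a data/ directory and openpyxl installed:

from pathlib import Path

import pandas as pd

frames = []
for path in Path("data/").glob("*.xlsx"):   # placeholder directory
    # sheet_name=None returns {sheet_name: DataFrame} for every tab.
    for sheet, df in pd.read_excel(path, sheet_name=None).items():
        df["source_file"] = path.name       # keep provenance for each row
        df["source_sheet"] = sheet
        frames.append(df)

combined = pd.concat(frames, ignore_index=True)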

Harsh780
- 13
- 5
1
vote
2 answers
The document creation or update failed because of invalid reference
I am having trouble completing an exercise on the Microsoft Learn platform.
https://learn.microsoft.com/en-us/learn/modules/examine-components-of-modern-data-warehouse/5-exercise-azure-synapse
I have followed the instructions, but get the following…

BareAnders
- 27
- 4
1
vote
1 answer
Snowflake - Best practices to keep tables up to date with s3 external stage
We want to ingest our source tables from an s3 external stage into Snowflake.
For this ingestion we have to consider new files arriving in the S3 bucket, updates to existing files, and in some cases row deletions.
We are evaluating 3 approaches so…
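
For comparison, the pattern that handles updates (and, with an extra flag or anti-join, deletions) is COPY from the stage into a staging table, then MERGE into the target; Snowpipe alone only covers appends of new files. A sketch via the Snowflake Python connector; stage, table, column, and credential values are placeholders:

import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="etl", password="...",   # placeholders
    warehouse="ETL_WH", database="MYDB", schema="PUBLIC",
)
with conn.cursor() as cur:
    # Load new/changed files from the external stage into a staging table.
    cur.execute("""
        COPY INTO staging_orders
        FROM @s3_stage/orders/
        FILE_FORMAT = (TYPE = PARQUET)
        MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
    """)
    # Upsert into the target on the business key.
    cur.execute("""
        MERGE INTO orders t
        USING staging_orders s ON t.order_id = s.order_id
        WHEN MATCHED THEN UPDATE SET t.amount = s.amount, t.updated_at = s.updated_at
        WHEN NOT MATCHED THEN INSERT (order_id, amount, updated_at)
             VALUES (s.order_id, s.amount, s.updated_at)
    """)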

Ioannis Agathangelos
- 11
- 1