Questions tagged [data-ingestion]
248 questions
2
votes
1 answer
How to have fault tolerance on producer end with Kafka
I am new to Kafka and data ingestion. I know Kafka is fault-tolerant, as it keeps the data redundantly on multiple nodes. However, what I don't understand is how we can achieve fault tolerance on the source/producer end. For example, if I have netcat…
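On the producer side, durability usually comes from configuration: require acknowledgement from all in-sync replicas, retry transient failures, and enable idempotence so retries don't duplicate records. A minimal sketch using the confluent-kafka Python client; the broker address, topic, and payload are placeholders, not from the question:

    from confluent_kafka import Producer

    producer = Producer({
        "bootstrap.servers": "localhost:9092",  # placeholder broker
        "acks": "all",               # wait for all in-sync replicas
        "enable.idempotence": True,  # retries won't duplicate records
        "retries": 5,                # retry transient send failures
    })

    def on_delivery(err, msg):
        # Runs during poll()/flush(); persist or re-queue the record on error.
        if err is not None:
            print(f"delivery failed, keep for replay: {err}")

    producer.produce("events", value=b"payload", callback=on_delivery)
    producer.flush()  # block until queued messages are delivered or failed

For an ephemeral source like netcat, the producer process itself is still a single point of failure, so unacknowledged input would also need to be spooled to durable storage for replay.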

MetallicPriest
- 29,191
- 52
- 200
- 356
2
votes
2 answers
Data Lake: fix corrupted files on Ingestion vs ETL
Objective
I'm building a data lake; the general flow looks like Nifi -> Storage -> ETL -> Storage -> Data Warehouse.
The general rule for a data lake is: no pre-processing at the ingestion stage. All ongoing processing should happen at ETL, so you…
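One common compromise, if the no-pre-processing rule is kept, is to land files as-is and quarantine corrupt records during ETL. A hedged PySpark sketch; paths and schema are placeholders:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, LongType

    spark = SparkSession.builder.appName("etl-quarantine").getOrCreate()

    # _corrupt_record must be declared in the schema to capture bad rows
    schema = StructType([
        StructField("id", LongType()),
        StructField("payload", StringType()),
        StructField("_corrupt_record", StringType()),
    ])

    df = (spark.read
          .option("mode", "PERMISSIVE")
          .option("columnNameOfCorruptRecord", "_corrupt_record")
          .schema(schema)
          .json("s3://lake/raw/events/")
          .cache())  # Spark requires caching before filtering on this column

    df.filter(df["_corrupt_record"].isNull()).drop("_corrupt_record") \
      .write.mode("append").parquet("s3://lake/curated/events/")
    df.filter(df["_corrupt_record"].isNotNull()) \
      .write.mode("append").json("s3://lake/quarantine/events/")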

VB_
- 45,112
- 42
- 145
- 293
2
votes
1 answer
Databricks Ingest use cases
I've just found a new Databricks feature called Databricks Data Ingestion. There is very little material about it at this point.
When should I use Databricks Data Ingestion instead of existing mature tools like Azure Data Factory (ADF) or Apache…
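For context, the feature is built around Auto Loader, an incremental cloud-file stream source. A minimal sketch as it would run in a Databricks notebook, where spark is predefined; paths and file format are placeholders:

    df = (spark.readStream
          .format("cloudFiles")                 # Auto Loader source
          .option("cloudFiles.format", "json")
          .load("/mnt/raw/events/"))

    (df.writeStream
       .format("delta")
       .option("checkpointLocation", "/mnt/checkpoints/events/")
       .start("/mnt/bronze/events/"))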

VB_
- 45,112
- 42
- 145
- 293
2
votes
0 answers
How to ingest .doc / .docx files in elasticsearch?
I'm trying to index Word documents in my Elasticsearch environment. I tried using the Elasticsearch ingest-attachment plugin, but it seems it can only ingest base64-encoded data.
My goal is to index whole directories of Word files…
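Since the attachment processor only accepts base64, one workable pattern is to walk the directory and encode each file before indexing. A hedged sketch assuming the elasticsearch-py 8.x client and the ingest-attachment plugin installed; host, index name, and directory are placeholders:

    import base64
    import pathlib
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # placeholder host

    # One-time setup: a pipeline running the attachment processor
    es.ingest.put_pipeline(id="docs", processors=[
        {"attachment": {"field": "data"}}
    ])

    for path in pathlib.Path("/word-files").rglob("*.docx"):
        encoded = base64.b64encode(path.read_bytes()).decode("ascii")
        es.index(index="word-docs", pipeline="docs",
                 document={"filename": path.name, "data": encoded})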

xTheProgrammer
- 74
- 10
2
votes
2 answers
What is a good approach to ingest batch data passively (client push) using Google Cloud?
I'm implementing my first pipeline for "automated" data ingestion at my company. Our client doesn't want to let us make any calls into their database (not even to create a replica, etc.). The best solution I have thought of so far is an endpoint (let them push…
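One way to realize the push endpoint on Google Cloud is an HTTP-triggered Cloud Function that lands each posted batch in Cloud Storage for downstream processing. A hedged sketch; the bucket name and auth scheme are placeholders, not from the question:

    import datetime
    from google.cloud import storage

    def ingest(request):
        # assumption: the client is authenticated upstream (API key, IAM, ...)
        bucket = storage.Client().bucket("my-landing-bucket")
        name = f"batches/{datetime.datetime.utcnow().isoformat()}.json"
        bucket.blob(name).upload_from_string(
            request.get_data(), content_type="application/json")
        return ("accepted", 202)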

Eduardo Humberto
- 425
- 2
- 5
- 16
2
votes
0 answers
Kinesis producers for web apps
I was playing around with Kinesis Data Streams, and was wondering how a web app or mobile app would send events to Kinesis data streams. One way of doing this would be to set up a Java Spring-based endpoint that the web app or mobile apps would post…
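Whatever framework serves the endpoint, the server-side write is a single put_record call. A hedged boto3 sketch; the region, stream name, and partition-key choice are placeholders:

    import json
    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")

    def handle_event(event: dict, user_id: str) -> None:
        # Partitioning by user id keeps one user's events ordered in a shard.
        kinesis.put_record(
            StreamName="app-events",
            Data=json.dumps(event).encode("utf-8"),
            PartitionKey=user_id,
        )

An alternative that avoids running servers is an API Gateway service proxy posting directly to Kinesis.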

Yogi Nadkarni
- 21
- 3
2
votes
2 answers
How can a relational database with foreign key constraints ingest data that may be in the wrong order?
The database is ingesting data from a stream, and the rows needed to satisfy a foreign key constraint may arrive late or never arrive.
This can likely be accomplished by using another datastore, one without foreign key constraints, and then, when all…
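A sketch of that staging approach, using stdlib sqlite3 just to keep the example self-contained; table names and rows are hypothetical. Rows land in a constraint-free staging table and are promoted once their parents exist:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE parents (id INTEGER PRIMARY KEY);
        CREATE TABLE children (
            id INTEGER PRIMARY KEY,
            parent_id INTEGER REFERENCES parents(id));
        -- same shape, no FK: rows may land in any order
        CREATE TABLE children_staging (id INTEGER, parent_id INTEGER);
    """)

    # The stream delivers a child before its parent: stage it, don't fail.
    conn.execute("INSERT INTO children_staging VALUES (1, 100)")
    conn.execute("INSERT INTO parents VALUES (100)")  # parent arrives later

    # Periodic promotion job: move rows whose parent now exists.
    conn.execute("""INSERT INTO children
                    SELECT s.id, s.parent_id FROM children_staging s
                    JOIN parents p ON p.id = s.parent_id""")
    conn.execute("""DELETE FROM children_staging
                    WHERE parent_id IN (SELECT id FROM parents)""")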
user377628
2
votes
1 answer
How to load data incrementally using Sqoop with Avro as a data file?
I am getting the error below:
--incremental lastmodified cannot be used in conjunction with --as-avrodatafile.
when running the command:
gcloud dataproc jobs submit hadoop \
--project='aca-ingest-dev' \
--cluster='sqoop-gcp-ingest-d3' \
…

DataVishesh
- 197
- 1
- 5
2
votes
1 answer
Sqoop Job Failing via Dataproc
I have submitted a Sqoop job via a GCP Dataproc cluster and set the --as-avrodatafile configuration argument, but it is failing with the error below:
19/08/12 22:34:34 INFO impl.YarnClientImpl: Submitted application application_1565634426340_0021
19/08/12…

DataVishesh
- 197
- 1
- 5
2
votes
4 answers
Ingesting Google Analytics data into S3 or Redshift
I am looking for options to ingest Google Analytics data (including historical data) into Redshift. Any suggestions regarding tools or APIs are welcome. I searched online and found Stitch as one of the ETL tools; help me know better about this…
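For the API route, a hedged sketch of one pipeline shape: pull a report with the Analytics Reporting API v4 (Universal Analytics), stage it as CSV in S3, then load it into Redshift with COPY. The credentials file, view id, bucket, and chosen metrics are all placeholders:

    import csv
    import io

    import boto3
    from google.oauth2 import service_account
    from googleapiclient.discovery import build

    creds = service_account.Credentials.from_service_account_file(
        "ga-key.json",
        scopes=["https://www.googleapis.com/auth/analytics.readonly"])
    analytics = build("analyticsreporting", "v4", credentials=creds)

    report = analytics.reports().batchGet(body={"reportRequests": [{
        "viewId": "123456789",
        "dateRanges": [{"startDate": "2015-01-01", "endDate": "today"}],
        "metrics": [{"expression": "ga:sessions"}],
        "dimensions": [{"name": "ga:date"}],
    }]}).execute()

    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in report["reports"][0]["data"].get("rows", []):
        writer.writerow(row["dimensions"] + row["metrics"][0]["values"])

    boto3.client("s3").put_object(Bucket="my-stage-bucket",
                                  Key="ga/sessions.csv",
                                  Body=buf.getvalue())
    # Then in Redshift: COPY ... FROM 's3://my-stage-bucket/ga/sessions.csv'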

Prajakta Yerpude
- 215
- 6
- 20
2
votes
0 answers
Data Ingestion to Hadoop/Hive: Spring batch v.s. Sqoop
I am trying to ingest data from an external relational database into Hive/HDFS and then do data processing and transformation. Is there a way to integrate Sqoop with Spring Batch for data ingestion?

candicetdh
- 71
- 7
2
votes
2 answers
How to load a large csv file, validate each row and process the data
I'm looking to validate each row of a CSV file of more than 600 million rows and up to 30 columns (the solution must process several large CSV files of that size).
Columns can be text, dates, or amounts. The CSV must be validated against 40 rules, some…
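At that scale the file can't be held in memory, so one workable shape is a streaming pass with the csv module: apply per-column rules row by row, divert failures to an errors file, and yield clean rows downstream. A hedged sketch; the two rules and column indexes are hypothetical stand-ins for the 40 real ones:

    import csv
    from datetime import datetime

    def is_amount(v):
        try:
            float(v)
            return True
        except ValueError:
            return False

    def is_date(v):
        try:
            datetime.strptime(v, "%Y-%m-%d")
            return True
        except ValueError:
            return False

    RULES = {3: is_date, 7: is_amount}  # column index -> validator

    def validated_rows(path, errors_path):
        with open(path, newline="") as src, \
             open(errors_path, "w", newline="") as bad:
            writer = csv.writer(bad)
            for lineno, row in enumerate(csv.reader(src), 1):
                failed = [c for c, rule in RULES.items()
                          if c < len(row) and not rule(row[c])]
                if failed:
                    writer.writerow([lineno, failed] + row)
                else:
                    yield row

Throughput can then be scaled out by running one such pass per file, or by splitting each file on byte ranges across processes.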

moun
- 69
- 1
- 6
2
votes
2 answers
Apache Nifi HBASE lookup
I am new to Apache NiFi.
We created a NiFi flow that consumes JSON data from Kafka and sends the results to another Kafka topic after enrichment. However, the HBase lookup does not return the value of the key. Instead it returns a key, value pair like …

erkan.oktay
- 21
- 2
2
votes
2 answers
Python Multiprocessing Loop
I'm hoping to use multiprocessing to speed up a sluggish loop. However, from what I've seen of multiprocessing examples, I'm not sure whether this sort of implementation is good practice, feasible, or possible.
There are broadly two parts to the loop:…
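For a CPU-bound loop body with no shared state, the common pattern is a Pool mapped over the iterable. A minimal sketch; transform stands in for the real loop body:

    from multiprocessing import Pool

    def transform(item):
        # assumption: pure, CPU-bound work on one item, no shared state
        return item * item

    if __name__ == "__main__":
        with Pool() as pool:  # defaults to one worker per CPU core
            # chunksize batches items to amortize inter-process overhead
            results = pool.map(transform, range(1_000_000), chunksize=10_000)

Two caveats: worker functions must be importable at module top level, and if the loop is I/O-bound rather than CPU-bound, threads or asyncio are usually the better tool.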

Darius
- 43
- 1
- 7
2
votes
1 answer
Working with large files with a particular extension using directory scan operator
I have a 1 GB+ file coming into my directory from an MQ. The file takes some time to transfer completely, but a file will appear in that directory even if it is not yet complete.
I am afraid my directoryScan operator will pick up an…
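The usual guards are either to have the sender write to a temporary name and rename atomically when the transfer finishes (so the scanner only matches the final name), or to treat a file as complete only when its size stops changing. A hedged sketch of the latter in Python, independent of any particular scan operator:

    import os
    import time

    def is_stable(path, wait_seconds=5):
        # Treat the file as fully transferred only once its size
        # stops changing between two probes.
        first = os.path.getsize(path)
        time.sleep(wait_seconds)
        return os.path.getsize(path) == first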

Ankit Sahay
- 1,710
- 8
- 14