Questions tagged [data-ingestion]

248 questions
2 votes · 1 answer

How to have fault tolerance on producer end with Kafka

I am new to Kafka and data ingestion. I know Kafka is fault tolerant, as it keeps the data redundantly on multiple nodes. However, what I don't understand is how we can achieve fault tolerance on the source/producer end. For example, if I have netcat…
MetallicPriest • 29,191

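A common pattern for the producer side of this question is a local write-ahead spool: persist each event to disk before handing it to the Kafka client, then replay the spool after a crash. The sketch below is a minimal illustration of that idea, not Kafka API usage; the file path and the JSON-lines event format are assumptions.

```python
import json
import os

def spool_event(spool_path, event):
    """Append an event to a local write-ahead file before producing it,
    so it can be replayed if the producer process crashes mid-send."""
    with open(spool_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
        f.flush()
        os.fsync(f.fileno())  # force the event to disk before we try to send it

def replay_spool(spool_path):
    """Read back spooled events (e.g. on restart) for re-sending."""
    if not os.path.exists(spool_path):
        return []
    with open(spool_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

On the Kafka side itself, configuring the producer with `acks=all` and a retry count covers broker-side failures; the spool covers the gap before an event ever reaches the client library.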
2 votes · 2 answers

Data Lake: fix corrupted files on Ingestion vs ETL

Objective: I'm building a data lake; the general flow looks like NiFi -> Storage -> ETL -> Storage -> Data Warehouse. The general rule for a data lake is no pre-processing at the ingestion stage. All ongoing processing should happen at ETL, so you…
VB_ • 45,112

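Whichever stage owns the repair, a cheap middle ground is to detect obviously unreadable files at ingestion and route them to a quarantine area, so downstream ETL only ever sees parseable input. A minimal sketch, assuming JSON-lines files and hypothetical directory names:

```python
import json
import shutil
from pathlib import Path

def route_file(src, good_dir, quarantine_dir):
    """Move a JSON-lines file to good_dir if every line parses,
    otherwise to quarantine_dir for later inspection or repair."""
    src = Path(src)
    try:
        with src.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    json.loads(line)  # raises on corrupt lines
        dest = Path(good_dir) / src.name
    except (json.JSONDecodeError, UnicodeDecodeError):
        dest = Path(quarantine_dir) / src.name
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dest))
    return dest
```

This keeps the "no transformation at ingestion" rule intact: files are only inspected and routed, never modified.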
2 votes · 1 answer

Databricks Ingest use cases

I've just found a new Databricks feature called Databricks Data Ingestion. There is very little material about it at this point. When should I use Databricks Data Ingestion instead of existing mature tools like Azure Data Factory (ADF) or Apache…
VB_ • 45,112

2 votes · 0 answers

How to ingest .doc / .docx files in elasticsearch?

I'm trying to index Word documents in my Elasticsearch environment. I tried using the elasticsearch ingest-attachment plugin, but it seems it can only ingest base64-encoded data. My goal is to index whole directories of Word files.…

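The ingest-attachment processor does expect the document bytes base64-encoded in a field, so indexing a directory of Word files reduces to encoding each file before sending it. A small helper, assuming the `data` field name used in the plugin's examples (the actual field is whatever the pipeline is configured to read):

```python
import base64
from pathlib import Path

def encode_for_attachment(path):
    """Base64-encode a file's raw bytes, as the elasticsearch
    ingest-attachment processor requires."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")

def docs_for_directory(directory, pattern="*.doc*"):
    """Yield (filename, document-body) pairs for every Word file in a
    directory, ready to index through an attachment pipeline."""
    for p in sorted(Path(directory).glob(pattern)):
        yield p.name, {"data": encode_for_attachment(p)}
```

Each yielded body can then be sent with the client of your choice, pointing the index request at the pipeline that holds the attachment processor.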
2 votes · 2 answers

What is a good approach to ingest batch data passively (client push) using Google Cloud?

I'm implementing my first pipeline for "automated" data ingestion in my company. Our client doesn't want to let us make any calls into their database (not even to create a replica, etc.). The best solution I have thought of so far is an endpoint (let them push…

2 votes · 0 answers

Kinesis producers for web apps

I was playing around with Kinesis Data Streams and was wondering how a web app or mobile app would send events to Kinesis Data Streams. One way of doing this would be to set up a Java Spring based endpoint that the web or mobile apps would post…

2 votes · 2 answers

How can a relational database with foreign key constraints ingest data that may be in the wrong order?

The database is ingesting data from a stream, and all the rows needed to satisfy a foreign key constraint may be late or never arrive. This can likely be accomplished by using another datastore, one without foreign key constraints, and then when all…
user377628
2 votes · 1 answer

How to load data incrementally using Sqoop with Avro as a data file?

Getting the error below: --incremental lastmodified cannot be used in conjunction with --as-avrodatafile when running the command: gcloud dataproc jobs submit hadoop \ --project='aca-ingest-dev' \ --cluster='sqoop-gcp-ingest-d3' \ …
DataVishesh • 197

2 votes · 1 answer

Sqoop Job Failing via Dataproc

I have submitted a Sqoop job via a GCP Dataproc cluster with the --as-avrodatafile configuration argument, but it is failing with the error below: 19/08/12 22:34:34 INFO impl.YarnClientImpl: Submitted application application_1565634426340_0021 19/08/12…

2 votes · 4 answers

Ingesting Google Analytics data into S3 or Redshift

I am looking for options to ingest Google Analytics data (historical data as well) into Redshift. Any suggestions regarding tools or APIs are welcome. I searched online and found Stitch as one of the ETL tools; help me know better about this…

2 votes · 0 answers

Data Ingestion to Hadoop/Hive: Spring Batch vs. Sqoop

I am trying to ingest data from an external relational database into Hive/HDFS and then do data processing and transformation. Is there a way to integrate Sqoop with Spring Batch for data ingestion?
candicetdh • 71

2 votes · 2 answers

How to load a large csv file, validate each row and process the data

I'm looking to validate each row of a CSV file of more than 600 million rows and up to 30 columns (the solution must process several large CSV files in that range). Columns can be text, dates or amounts. The CSV must be validated against 40 rules, some…
moun • 69

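At that row count the key constraint is memory: streaming the file row by row keeps usage constant regardless of file size, and the rules can be plain predicates applied per row. A minimal sketch (the rule shape and error format are assumptions, not a prescribed design):

```python
import csv

def validate_rows(path, rules):
    """Stream a CSV file row by row (constant memory, so scale is bounded
    by I/O rather than RAM) and collect rule violations.
    `rules` maps a rule name to a predicate over the row dict."""
    errors = []
    with open(path, newline="", encoding="utf-8") as f:
        # start=2: row 1 of data sits on line 2, after the header line
        for line_no, row in enumerate(csv.DictReader(f), start=2):
            for name, ok in rules.items():
                if not ok(row):
                    errors.append((line_no, name))
    return errors
```

For cross-row rules (uniqueness, running totals) this single pass can accumulate state as it goes; only rules that need the whole file at once force a second pass or an external sort.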
2 votes · 2 answers

Apache Nifi HBASE lookup

I am new to Apache NiFi. We created a NiFi flow which consumes JSON data from Kafka and sends the results to another Kafka topic after enrichment. However, the HBase lookup does not return the value of the key. Instead it returns a key, value pair like …

2 votes · 2 answers

Python Multiprocessing Loop

I'm hoping to use multiprocessing to speed up a sluggish loop. However, from what I've seen of multiprocessing examples, I'm not sure if this sort of implementation is good practice, feasible, or possible. There are broadly two parts to the loop:…

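The usual shape for parallelising an independent-iteration loop is `multiprocessing.Pool.map`: move the loop body into a module-level function (so it can be pickled) and let the pool spread the items across workers. A minimal sketch with a placeholder work function:

```python
import multiprocessing as mp

def transform(item):
    """Placeholder for the expensive per-item work in the loop body."""
    return item * item

def parallel_map(func, items, workers=4):
    """Equivalent to [func(x) for x in items], but spread across
    worker processes; Pool.map preserves input order in its result."""
    with mp.Pool(processes=workers) as pool:
        return pool.map(func, items)

if __name__ == "__main__":
    # The __main__ guard matters on platforms using the spawn start method
    # (Windows, macOS), where workers re-import this module.
    print(parallel_map(transform, range(5)))  # [0, 1, 4, 9, 16]
```

This only pays off when the per-item work outweighs the pickling and process overhead; loop bodies that share mutable state between iterations do not fit this pattern directly.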
2 votes · 1 answer

Working with large files with a particular extension using directory scan operator

I have a 1GB+ file coming into my directory from an MQ. It takes some time for the file to transfer completely, but the file appears in that directory even before it is complete. I am afraid my directoryScan operator will pick up an…
Ankit Sahay • 1,710

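Two portable guards against picking up half-written files are (a) have the sender write to a temporary name and rename when done (rename is atomic within a filesystem), or (b) wait until the file's size stops changing before processing it. A sketch of the size-stability check; the poll count and interval are assumptions to tune against the actual transfer speed:

```python
import os
import time

def wait_until_stable(path, checks=3, interval=1.0):
    """Treat a file as fully transferred only after its size is unchanged
    for `checks` consecutive polls; a file still being written keeps
    growing and never completes the streak. Returns the final size."""
    last, streak = -1, 0
    while streak < checks:
        size = os.path.getsize(path)
        streak = streak + 1 if size == last else 0
        last = size
        time.sleep(interval)
    return last
```

The rename approach is the more robust of the two when you control the sender, since a stalled transfer can momentarily look "stable" to a size check.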