Questions tagged [data-ingestion]

248 questions
3
votes
1 answer

What is intermediate persist in Apache Druid?

How does Druid persist real-time ingested data before it hands off to deep storage? In the documentation, Druid has the configuration options intermediatePersistPeriod and maxPendingPersists, but it doesn't say much about what an intermediate persist is, how it…
Happy
  • 121
  • 1
  • 8
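The two properties the question asks about live in the ingestion spec's tuningConfig. A minimal sketch (values are illustrative, not recommendations) of where they sit in, e.g., a Kafka ingestion tuningConfig:

```python
# Sketch of a Druid tuningConfig fragment showing the intermediate-persist
# knobs; the property names come from Druid's ingestion docs, the values
# here are only illustrative defaults.
tuning_config = {
    "type": "kafka",
    "maxRowsInMemory": 150000,             # rows buffered in heap before a persist
    "intermediatePersistPeriod": "PT10M",  # max time between persists to local disk
    "maxPendingPersists": 0,               # persists allowed to queue before ingestion blocks
}
```

An intermediate persist flushes the in-memory row buffer to local disk; those spills are later merged into the segment that is handed off to deep storage.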
3
votes
0 answers

What's the difference between Apache Gobblin and spring-cloud-dataflow, and how to choose?

As the official documentation states, Apache Gobblin is a universal data ingestion framework for extracting, transforming, and loading large volumes of data from a variety of data sources, e.g., databases, REST APIs, FTP/SFTP servers, filers, etc., onto…
user3172755
  • 137
  • 1
  • 10
3
votes
1 answer

How to disable base64 storage for the ingest-attachment Elasticsearch plugin?

The documentation shows an example of how to store base64-encoded documents in Elasticsearch via the ingest-attachment plugin. But after this I found that the Elasticsearch index contains both the parsed text and the base64 source field. Why is it needed? Is there a way to…
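The usual answer to this question is to chain a remove processor after the attachment processor so the base64 source field is dropped once the text has been extracted. A sketch of such a pipeline definition, expressed as the dict you would PUT to the ingest-pipeline API (field names assume the docs' example, where the base64 content is in "data"):

```python
# Sketch of an ingest pipeline: parse the attachment, then drop the
# base64 "data" field so only the extracted text is indexed.
pipeline = {
    "description": "Extract attachment text, drop the base64 source",
    "processors": [
        {"attachment": {"field": "data"}},  # parses base64 into attachment.content
        {"remove": {"field": "data"}},      # discard the raw base64 afterwards
    ],
}
```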
3
votes
1 answer

Suggested Hadoop-based Design / Component for Ingestion of Periodic REST API Calls

We are planning to use REST API calls to ingest data from an endpoint and store it in HDFS. The REST calls are made periodically (daily, or maybe hourly). I've already done Twitter ingestion using Flume, but I don't think using Flume…
oikonomiyaki
  • 7,691
  • 15
  • 62
  • 101
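Whatever scheduler ends up driving it (cron, Oozie, Airflow), the per-run logic is small enough to sketch. This is a storage-agnostic illustration, not a specific Hadoop component: the fetch and write callables are injected, so the same function works whether `write` wraps a local file or an HDFS client library.

```python
import json
import time

def land_snapshot(fetch, write, interval_s=86400, runs=1):
    """Poll a REST endpoint periodically and land each response as a
    timestamped JSON file. `fetch` returns a JSON-serializable payload and
    `write(path, text)` persists it (e.g. via an HDFS client); both are
    injected so the sketch stays storage-agnostic. The landing path below
    is hypothetical."""
    path = None
    for i in range(runs):
        payload = fetch()
        path = f"/data/landing/api/{int(time.time())}.json"
        write(path, json.dumps(payload))
        if i < runs - 1:
            time.sleep(interval_s)
    return path
```

For a strictly periodic daily pull, running this once per scheduler tick (runs=1) is simpler and more robust than keeping a long-lived loop alive.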
2
votes
1 answer

Can users upload files into an S3 bucket without a frontend or access to an AWS account?

I am looking to create an AWS solution where a Lambda function will transform some Excel data from an S3 bucket. When thinking about the architecture, I need a way for non-technical users,…
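One common way to let users upload without AWS credentials is a presigned POST: the backend generates a short-lived URL plus form fields, and the user's browser (or a plain HTML form) posts the file straight to S3. A sketch using boto3's generate_presigned_post, with the client passed in (bucket and key names here are hypothetical):

```python
def make_upload_form(s3_client, bucket, key, expires_in=3600):
    """Return presigned POST data a user can use to upload a file directly
    to S3 from a plain HTML form, with no AWS credentials of their own.
    `s3_client` is assumed to be a boto3 S3 client."""
    return s3_client.generate_presigned_post(
        Bucket=bucket,
        Key=key,
        ExpiresIn=expires_in,  # how long the upload link stays valid, in seconds
    )
```

The returned dict contains a "url" and the "fields" that must accompany the multipart form upload; the S3 event from the upload can then trigger the transforming Lambda.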
2
votes
1 answer

Query Last Inserted or Last updated rows from Snowflake Table

I would like to know how I can query for rows which were created or updated on a given date without using any specific column in the table to look up. Is there a way information_schema can provide row-level insert/update datetimes?
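Snowflake does not expose per-row insert/update timestamps through information_schema; the usual workaround is a stream on the table, which records change rows from the moment it is created. A sketch of the DDL (table and stream names are hypothetical), held here as a SQL string:

```python
# Snowflake streams capture changes going forward; each change row carries
# METADATA$ACTION and METADATA$ISUPDATE columns describing what happened.
ddl = """
CREATE OR REPLACE STREAM orders_changes ON TABLE orders;
-- later: rows inserted/updated since the stream's offset
SELECT * FROM orders_changes;
"""
```

This only helps going forward; changes that happened before the stream existed cannot be recovered this way.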
2
votes
3 answers

How can I ingest data from Apache Avro into Azure Data Explorer?

For several days I've been trying to ingest Apache Avro formatted data from blob storage into Azure Data Explorer. I'm able to reference top-level JSON keys like $.Body (see the red underlined example in the screenshot below), but when it goes to the…
allrik
  • 43
  • 8
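For Event Hub captures, the interesting payload usually sits under the Body key, and nested properties are reached by extending the JSONPath in the ingestion mapping. A sketch of an ADX JSON-style column mapping (table, column, and field names are hypothetical), expressed as the list of mapping entries:

```python
# Sketch of an ADX ingestion mapping reaching past the top-level Body key
# into nested properties; names here are illustrative only.
mapping = [
    {"column": "Timestamp", "path": "$.Body.timestamp",     "datatype": "datetime"},
    {"column": "Value",     "path": "$.Body.payload.value", "datatype": "real"},
]
```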
2
votes
2 answers

Using Airbyte to get data from websites/dataset platforms like Kaggle

I am new to Airbyte. Our team is looking to use Airbyte for different sources, ranging from HTTP APIs (web-scraped websites) to sites hosting datasets, like Kaggle. We are looking to create custom connectors for these sources. I am looking…
2
votes
1 answer

Snowflake ingestion: Snowpipe/Stream/Tasks or External Tables/Stream/Tasks

For ingesting data from an external storage location into Snowflake when de-duping is necessary, I came across two ways: Option 1: Create a Snowpipe for the storage location (Azure container or S3 bucket) which is automatically triggered by event…
2
votes
0 answers

How to create an incremental connector on Airbyte?

I am evaluating Airbyte to ingest data from multiple sources, one of them a ServiceNow API, and I developed a connector using the Airbyte CDK. I am trying to implement incremental streams or slices to improve data-recovery performance. Since pulling…
Danieledu
  • 391
  • 1
  • 4
  • 19
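Independent of the CDK specifics, an incremental stream boils down to one pattern: keep a cursor in state, emit only records newer than it, then advance it. A generic sketch of that pattern (this is an illustration, not the actual Airbyte CDK API):

```python
def read_incremental(records, state, cursor_field="updated_at"):
    """Generic incremental-sync pattern: emit only records whose cursor
    field is newer than the saved cursor, then advance the cursor so the
    next sync starts where this one left off. `state` is mutated in place,
    mirroring how a connector persists state between runs."""
    cursor = state.get(cursor_field, "")
    fresh = [r for r in records if r[cursor_field] > cursor]
    if fresh:
        state[cursor_field] = max(r[cursor_field] for r in fresh)
    return fresh
```

In the Airbyte CDK the same idea is expressed by declaring a cursor field on the stream and returning updated state alongside records; the comparison-and-advance logic above is what that machinery performs.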
2
votes
1 answer

How to get an AWS Feature Store feature group into the ACTIVE state?

I am trying to ingest some rows into a Feature Store on AWS using feature_group.ingest(data_frame=df, max_workers=8, wait=True), but I am getting the following error: Failed to ingest row 1: An error occurred (ValidationError) when calling the…
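A feature group is created asynchronously, and ingest() fails while creation is still in progress, so the usual fix is to poll its status first. A sketch of that wait loop with the describe call injected (e.g. feature_group.describe from the SageMaker SDK); the status names assumed here are the documented Creating / Created / CreateFailed values:

```python
import time

def wait_for_status(describe, target="Created", timeout_s=300, poll_s=5):
    """Poll a feature group's status until it reaches `target` before
    calling ingest(). `describe` is injected and assumed to return a dict
    containing 'FeatureGroupStatus'."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        status = describe()["FeatureGroupStatus"]
        if status == target:
            return status
        if status.endswith("Failed"):
            raise RuntimeError(f"feature group entered {status}")
        time.sleep(poll_s)
    raise TimeoutError(f"feature group never reached {target}")
```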
2
votes
1 answer

Elasticsearch _id as MD5 hash or document fields

There are some examples available on the internet for customizing the _id field of an Elasticsearch document, but is there a way to generate a composite _id from multiple fields? Sample data: { "first_name": "john", "last_name": "doe", "dob":…
Jugraj Singh
  • 529
  • 1
  • 6
  • 22
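One straightforward approach is to compute the composite _id on the ingestion side: hash a delimited concatenation of the chosen fields and use the digest as the document _id. A small sketch (field names taken from the sample data above; the "|" delimiter is an arbitrary choice):

```python
import hashlib

def composite_id(doc, fields=("first_name", "last_name", "dob")):
    """Build a deterministic _id by MD5-hashing a delimited concatenation
    of selected fields. Identical field values always produce the same
    _id, so re-ingesting a record updates it instead of duplicating it."""
    raw = "|".join(str(doc[f]) for f in fields)
    return hashlib.md5(raw.encode("utf-8")).hexdigest()
```

The same effect can be achieved inside Elasticsearch with an ingest pipeline (a set processor writing to _id plus a fingerprint-style script), but client-side hashing keeps the logic visible and testable.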
2
votes
1 answer

Azure Data Explorer High Ingestion Latency with Streaming

We are using streaming ingestion from Event Hubs to Azure Data Explorer. The documentation states the following: The streaming ingestion operation completes in under 10 seconds, and your data is immediately available for query after completion. I am…
Markus S.
  • 2,602
  • 13
  • 44
2
votes
0 answers

How to retrieve the cash dividends from the quantopian-quandl data bundle with Zipline?

https://www.zipline.io/bundles.html By default zipline comes with the quantopian-quandl data bundle which uses quandl’s WIKI dataset. The quandl data bundle includes daily pricing data, splits, cash dividends, and asset metadata. Quantopian has…
Vincent Roye
  • 2,751
  • 7
  • 33
  • 53
2
votes
1 answer

Refresh data in Druid

I am using the index_parallel native batch method to ingest data into Druid from S3. I did the initial ingestion using the Tasks tab in the Druid UI. I want to schedule another task to do delta ingestion daily. I have gone through a lot of…
unknown
  • 53
  • 2
  • 9
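For daily delta loads with index_parallel, the relevant switch is appendToExisting in the ioConfig: when set, a scheduled task adds new data as extra segments instead of replacing the interval. A sketch of that fragment of the spec (bucket and prefix are hypothetical):

```python
# Sketch of the ioConfig portion of an index_parallel ingestion spec:
# appendToExisting makes the daily task append new segments rather than
# overwrite what the initial ingestion loaded.
io_config = {
    "type": "index_parallel",
    "inputSource": {"type": "s3", "prefixes": ["s3://my-bucket/daily/"]},
    "inputFormat": {"type": "json"},
    "appendToExisting": True,
}
```

The daily trigger itself lives outside Druid, e.g. a cron job or Airflow task POSTing the spec to the overlord's task endpoint.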