Questions tagged [data-pipeline]
168 questions
0
votes
1 answer
I can't log in to Apache NiFi 1.16.3
I would like to create a data pipeline with Apache NiFi (for learning purposes). I installed jdk-17.0.3.1_windows-x64_bin and downloaded NiFi 1.16.3 to my Windows 10 computer. I tried to check the generated username and password in…

Cof
- 47
- 4
0
votes
0 answers
Batch processing vs Stream processing
I'm a bit new to this world of batch vs stream processing and I'm going back and forth in making a call.
In my case, we have an ELT tool that runs jobs both periodically (with intervals varying from 5 minutes to 1 year) and on demand. And this…

nerd
- 15
- 3
0
votes
1 answer
Data Pipeline using Azure Data Factory or Azure Synapse
I am building a new data pipeline for our team. This pipeline would collect data from multiple sources and ingest it into a single table. I am looking into a couple of options within Azure to achieve this (Synapse being the main option). I…

Perseus
- 1
0
votes
1 answer
Why did PyTorch create a separate data repo, TorchData?
Why did PyTorch create another repo called TorchData for similar/new Dataset and DataLoader functionality instead of adding it to the existing PyTorch repo? What's the difference between Dataset and DataPipe? Thanks.

hsh
- 111
- 2
- 8
0
votes
1 answer
How can I auto-populate several Excel sheets from other Excel files?
I am currently working on a Power BI dashboard that uses an Excel file as a data source.
I want to auto-populate the Excel file with new values from existing Excel reports each day.
In the source file there are several sheets each with several…

maryam kouram
- 11
- 3
0
votes
0 answers
Architecture for tracking data changes in application DB required for warehousing
Overview
I have an OLTP DB that stores application transaction data and acts as the source of truth for the current state of the application. I would like to have my DWH store historical data so I can do analyses that compare previous states of the…

Scooter
- 1,031
- 1
- 14
- 33
0
votes
2 answers
What is the best strategy to store Redis data to MySQL for permanent storage?
I am running a couple of crawlers that produce millions of datasets per day. The bottleneck is the latency between the spiders and the remote database. If the spider server is located too far away, the latency will slow the crawler down to…

merlin
- 2,717
- 3
- 29
- 59
0
votes
1 answer
Fivetran Shopify Connector: I would like to extract and load the raw data from Shopify. Which destination schema is the right one?
Which schema should I choose here to get all the raw data from my Shopify store?
The problem is that there is no exact description of which schemas are available.

siul1008
- 1
0
votes
1 answer
Data Pipeline Solution
We have a use case to build a data pipeline solution that needs the following:
Ability to have multiple steps (outputs from one step should feed as input to next)
Ability to have multiple algorithms (SQL Query or probably invoke REST…

Prakhar Awasthi
- 23
- 1
- 5
0
votes
0 answers
TPL BufferBlock Not Completing When Bounded Capacity is Greater Than 1
I currently have a pipeline set up as follows:
BufferBlock ==> BatchBlock ==> TransformBlock ==> TransformManyBlock ==> BufferBlock
The BoundedCapacity is set to 2,000. However, if I provide a single item to the BufferBlock, and then call the Complete…

Dominick
- 322
- 1
- 3
- 16
0
votes
1 answer
How to implement a data pipeline with recurring tasks?
I have to set up a data pipeline for an app I am trying to create, but I am not sure how to do it.
I have 2 entities in the database, A and B; each entity B belongs to an entity A.
Every minute, I fetch many B entities but one field is missing (on each B…

Vince M
- 890
- 1
- 9
- 21
0
votes
1 answer
How to apply a function to convert the paths to arrays using cv2 in tensorflow data pipeline?
Any help will be highly appreciated
I'm trying to load two lists containing image paths and their corresponding labels. Something like this:
p0 = ['a','b',....] #paths to images .tif format
p1 = [1,2,3,......] #paths to images .tif format
labels =…

Mainak Sen
- 63
- 6
0
votes
2 answers
Set up a data pipeline flow in AWS
Problem statement: We have a Postgres RDS instance (managed by AWS), and there is a requirement to set up a data lake (in S3) for all the data in RDS. The data should be pushed to S3 on a near real-time basis. The solution should also take…

Govind Kumar
- 129
- 1
- 8
0
votes
2 answers
Using global variable in Apache Beam and Google Dataflow
I've been stuck for a few days. My problem is this: I created a data pipeline using Apache Beam and the Dataflow runner. I use a global variable (a dictionary) in the script that is accessed by some functions. The problem is, when I run it locally with…

arroganthooman
- 15
- 5
0
votes
1 answer
Implement a custom coder for apache_beam Python SDK version > 2.24
I've been working on my data engineering tasks using the apache_beam SDK for Python. I used version 2.24. I have an issue with a custom coder class I created when upgrading apache_beam to version 2.31. The custom coder class name is…

arroganthooman
- 15
- 5