Questions tagged [data-pipeline]
168 questions
0
votes
0 answers
Tensorflow Data API: prefetch when training on CPU?
I'm training on a MacBook Pro (no GPU). After watching https://www.youtube.com/watch?v=SxOsJPaxHME I saw that I could improve my input pipeline by reordering some ops and adding num_parallel_calls.
Question:
Since I'm not using a GPU, does it make…
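Prefetching still pays off without a GPU, because it overlaps input preparation with the training step (producer thread vs. compute thread). A minimal sketch, assuming TF 2.x (in 1.x the constant is tf.data.experimental.AUTOTUNE); the map function and batch size here are placeholders:

```python
import tensorflow as tf

# Toy pipeline: the map transformation and batch size stand in for real ones.
ds = tf.data.Dataset.range(1000)
ds = ds.map(lambda x: x * 2, num_parallel_calls=tf.data.AUTOTUNE)  # parallelize CPU-bound ops
ds = ds.batch(32)
# prefetch(1) lets the next batch be prepared while the current one trains,
# which helps even on a CPU-only machine.
ds = ds.prefetch(1)
```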

rodrigo-silveira
- 12,607
- 11
- 69
- 123
0
votes
2 answers
Best approach to automate archiving aws-redshift table
I have a big table in Redshift, and I need to automate the process of archiving its data monthly.
The current approach is as follows (manual):
unload the Redshift query result to S3
create a new backup table
copy the files from S3 into the Redshift backup table
remove…
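Since each manual step above is plain SQL (UNLOAD, CREATE TABLE, COPY, DELETE), one way to automate it is to generate the statements per month and run them on a schedule (e.g. from Lambda or Airflow). A sketch that only builds the SQL; the table name, `created_at` column, bucket, and IAM role are hypothetical:

```python
def archive_month_sql(table, year, month, s3_prefix, iam_role):
    """Generate the four manual archiving steps as SQL for one month of data.
    Assumes a 'created_at' timestamp column -- adjust to your schema."""
    month_start = f"{year}-{month:02d}-01"
    backup = f"{table}_backup_{year}_{month:02d}"
    dest = f"{s3_prefix}/{month_start}/"
    return [
        # 1. Unload the month's rows to S3 ('' escapes quotes inside UNLOAD's query string).
        f"UNLOAD ('SELECT * FROM {table} WHERE date_trunc(''month'', created_at) = ''{month_start}''') "
        f"TO '{dest}' IAM_ROLE '{iam_role}' GZIP ALLOWOVERWRITE;",
        # 2. Create the backup table with the same structure.
        f"CREATE TABLE {backup} (LIKE {table});",
        # 3. Copy the unloaded files from S3 into the backup table.
        f"COPY {backup} FROM '{dest}' IAM_ROLE '{iam_role}' GZIP;",
        # 4. Remove the archived rows from the live table.
        f"DELETE FROM {table} WHERE date_trunc('month', created_at) = '{month_start}';",
    ]

stmts = archive_month_sql("events", 2018, 3, "s3://my-archive/events",
                          "arn:aws:iam::123456789012:role/RedshiftArchive")
```

The statements would then be executed against the cluster in order with any Postgres-compatible driver.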

darekarsam
- 181
- 2
- 12
0
votes
1 answer
What's the difference between task and job in airflow
Hi there,
In the Airflow metadata database there is a table named job, and it contains lots of records. I know the difference between a DAGRun and a task, but what's the difference between a task and a job in Airflow?
Thanks in advance.

Bruce Yang
- 367
- 1
- 5
- 17
0
votes
0 answers
Problems with image recognition using Tensorflow for Google Streetview dataset
Tensorboard graph
I'm running this code to classify house numbers from Google Streetview. It runs, then gets stuck in some sort of loop at the first step, and I don't know why.
I have narrowed the problem down to the input pipeline.
def…

Karthik Arcot
- 49
- 4
0
votes
2 answers
Is there a way to continuously pipe data from Azure Blob into BigQuery?
I have a bunch of files in Azure Blob storage and it's constantly getting new ones. I was wondering if there is a way for me to first take all the data I have in Blob and move it over to BigQuery and then keep a script or some job running so that…

Michael
- 23
- 6
0
votes
0 answers
How to move compressed TSV files from a Google Cloud Storage bucket to BigQuery with auto-detect schema?
I have been trying multiple ways to move the compressed TSV files to BigQuery. I was able to get the command to run, but didn't see any table being loaded. Please help me figure out how to write a command that works.
bq '--project_id' --nosync load…
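For reference, a load of gzipped TSV with schema auto-detection can be expressed with the bq CLI's --autodetect flag; bq treats TSV as CSV with a tab delimiter and decompresses .gz sources itself. A sketch that builds the invocation — the project, dataset, table, and bucket names are placeholders:

```python
# Build a 'bq load' command for gzipped TSV with schema auto-detection.
cmd = [
    "bq", "--project_id=my-project", "--nosync", "load",
    "--autodetect",                  # infer the schema from the data
    "--source_format=CSV",           # TSV is loaded as CSV with a tab delimiter
    "--field_delimiter=\t",
    "mydataset.mytable",
    "gs://my-bucket/data/*.tsv.gz",
]
# To actually run it: subprocess.run(cmd, check=True)
```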

Ramya
- 21
- 4
0
votes
1 answer
Airflow DAGs schedule for the future
I'm trying to figure out how to configure/schedule an Airflow DAG to run twice a day at exact times, instead of running both times together once the schedule criteria are met.
I want to run the same task at midnight and at 9pm.
To do so I've added a cron…
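Both runs fit in a single cron expression, `0 0,21 * * *` (minute 0 of hours 0 and 21), passed as the DAG's schedule_interval. The sketch below is a hypothetical stdlib helper, not part of Airflow; it just expands a cron hour field to confirm which hours it selects:

```python
# In Airflow: DAG("twice_daily", schedule_interval="0 0,21 * * *", ...)
# Hypothetical helper: expand a cron hour field into the set of matching hours.
def cron_hours(field):
    hours = set()
    for part in field.split(","):
        if part == "*":
            hours.update(range(24))          # every hour
        elif "-" in part:
            lo, hi = map(int, part.split("-"))
            hours.update(range(lo, hi + 1))  # inclusive range, e.g. "9-17"
        else:
            hours.add(int(part))             # single hour
    return hours
```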

Hugo Sousa
- 906
- 2
- 9
- 27
0
votes
0 answers
Dynamodb import from s3 using emr
I am trying to load JSON data from an S3 bucket into DynamoDB using EMR. The upload succeeds, but the item count doesn't match, and no error is thrown.
Why could that happen?

kamal kishore
- 84
- 1
- 8
0
votes
1 answer
AWS DynamoDB - Data Pipeline real write capacity consumption
I've created a data pipeline that pulls data from S3 and pushes it into DynamoDB.
The pipeline started running successfully.
I've set the write capacity to 20,000 units; after a few hours the write throughput dropped by half, and now it's still running with a write…

Anna
- 9
- 4
0
votes
1 answer
Different tools available for creating data pipelines
I need to create data pipelines in Hadoop. I have data import, data export, and data-cleaning scripts set up, and now I need to tie them together into a pipeline.
I have been using Oozie to schedule data imports and exports, but now I need to integrate R scripts for data…

simo kaur
- 39
- 1
- 9
-1
votes
1 answer
Design ideas for importing multiple customer csv files into a transactional database
I am working on redesigning a data pipeline that is responsible for importing customer data in CSV format from cloud buckets that customers own (we already have the connection details) into a transactional database that we own.
Constraints:
We…

Amit Tikoo
- 167
- 6
-1
votes
2 answers
Design - Best practice to ingest data from different sources with different interfaces
PROBLEM DESCRIPTION
Hello, I would like to implement a service that receives data from various providers and dumps it into a database (a sort of raw data store).
The issue is that the providers deliver the data I need in different ways. Some…

giulio di zio
- 171
- 1
- 11
-1
votes
1 answer
How to load .npy files in a TensorFlow pipeline with tf.data
I'm trying to read my X and y data from .npy files with np.load() in a tf.data pipeline, but I get the following error when I call model.fit(). Does anyone have a solution for this problem? I thought I had to give the shape of X_data and y_data to the…
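One common cause of that kind of error is feeding file paths into the graph, since np.load has to run eagerly (or be wrapped in tf.numpy_function). A minimal sketch that loads the arrays up front and builds the pipeline from tensors — the file names and array shapes here are made up for illustration:

```python
import os
import tempfile
import numpy as np
import tensorflow as tf

# Stand-ins for the real X_data.npy / y_data.npy files; shapes are placeholders.
tmp = tempfile.mkdtemp()
np.save(os.path.join(tmp, "X_data.npy"), np.random.rand(100, 8).astype("float32"))
np.save(os.path.join(tmp, "y_data.npy"), np.random.randint(0, 2, size=(100,)))

# Load eagerly with NumPy, then hand the fully-shaped arrays to tf.data.
X = np.load(os.path.join(tmp, "X_data.npy"))
y = np.load(os.path.join(tmp, "y_data.npy"))
ds = tf.data.Dataset.from_tensor_slices((X, y)).shuffle(100).batch(16)
# model.fit(ds) now receives (features, label) batches with known shapes.
```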
-1
votes
1 answer
Google Dataflow pipeline for varying schema
I have a product to define and configure business workflows. A part of this product is a form-builder which enables users to setup different forms.
This entire forms data is backed on MongoDB in the following structure
- form_schemas
{
"_id" :…

Sharath Chandra
- 654
- 8
- 26
-1
votes
1 answer
Build an end-to-end data analysis platform
I need to create an end-to-end platform:
Input data collection and storage - Data will be periodically collected via FTP and stored in the cloud.
Data Analysis - The data will be analyzed (using Tableau/ any other analytics software)
Reports - Daily…

priya
- 73
- 1
- 1
- 9