Questions tagged [data-pipeline]

168 questions
0
votes
2 answers

Jupyter notebook as a Kedro node

How can I use a Jupyter Notebook as a node in a Kedro pipeline? This is different from converting functions from Jupyter Notebooks into Kedro nodes. What I want to do is use the full notebook as the node.
MCK
  • 11
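One possible approach, sketched below, is to wrap the whole notebook in a regular Kedro node whose function executes it with papermill. The notebook paths, parameter set, and dataset names here are hypothetical placeholders, not anything from the question.

```python
# Sketch: run an entire notebook as one Kedro node by executing it with papermill.
# "notebooks/analysis.ipynb" and the dataset names are hypothetical placeholders.
import papermill as pm
from kedro.pipeline import Pipeline, node


def run_notebook(params: dict) -> str:
    """Execute the full notebook as a single pipeline step."""
    output_path = "notebooks/analysis_output.ipynb"
    pm.execute_notebook(
        "notebooks/analysis.ipynb",   # input notebook
        output_path,                  # executed copy with cell outputs
        parameters=params,            # injected into a "parameters"-tagged cell
    )
    return output_path


notebook_pipeline = Pipeline(
    [node(run_notebook, inputs="params:notebook_params", outputs="executed_notebook")]
)
```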
0
votes
1 answer

Streamsets Data Collector: Replace a Field With Its Child Value

I have a data structure like this { "id": 926267, "updated_sequence": 2304899, "published_at": { "unix": 1589574240, "text": "2020-05-15 21:24:00 +0100", "iso_8601": "2020-05-15T20:24:00Z" }, "updated_at": { "unix":…
asrulsibaoel
  • 500
  • 1
  • 7
  • 14
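For clarity, here is the intended transformation expressed in plain Python rather than StreamSets configuration (in Data Collector this would typically be done with an Expression Evaluator or Field Replacer); the record and field names come from the question's sample, and the choice of `iso_8601` as the child to promote is an assumption.

```python
# Sketch of replacing a map field with one of its child values.
record = {
    "id": 926267,
    "updated_sequence": 2304899,
    "published_at": {
        "unix": 1589574240,
        "text": "2020-05-15 21:24:00 +0100",
        "iso_8601": "2020-05-15T20:24:00Z",
    },
}

# Replace each struct-valued field with its child value, e.g. the ISO 8601 string.
for field in ("published_at", "updated_at"):
    child = record.get(field)
    if isinstance(child, dict) and "iso_8601" in child:
        record[field] = child["iso_8601"]

print(record)  # published_at is now the plain string "2020-05-15T20:24:00Z"
```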
0
votes
1 answer

Logical decoding - postgres - multiple output formats

I have been trying to build a pipeline using logical decoding in Postgres. However, I am a little confused. Please find below the questions I have. I have established a pub/sub and I can see the data flowing between the two servers. However, I haven't…
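A minimal sketch of inspecting logical-decoding output with psycopg2, assuming a replication slot named "demo_slot" already exists (for example created with the `test_decoding` output plugin); the connection string is a placeholder.

```python
# Sketch: read pending logical-decoding changes from an existing slot.
import psycopg2

conn = psycopg2.connect("dbname=mydb user=replicator host=localhost")
conn.autocommit = True

with conn.cursor() as cur:
    # Peek at pending changes without consuming them; use pg_logical_slot_get_changes to consume.
    cur.execute(
        "SELECT lsn, xid, data FROM pg_logical_slot_peek_changes('demo_slot', NULL, NULL);"
    )
    for lsn, xid, data in cur.fetchall():
        print(lsn, xid, data)
```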
0
votes
1 answer

Access denied error while executing TensorFlow example - https://www.tensorflow.org/tutorials/load_data/images

The link shows an example of a data pipeline for images. It works fine when I run it directly on Colab, but when I use it on my laptop it gives this error. I've been using Keras for quite a while but this is the first time trying data pipelining and I…
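One possibility, assuming the "access denied" comes from downloading or extracting the tutorial's flowers dataset into a directory the local user cannot write to, is to redirect the Keras cache to a known-writable folder; the cache path below is a hypothetical example.

```python
# Sketch: point the dataset download at a writable cache directory.
import pathlib
import tensorflow as tf

data_dir = tf.keras.utils.get_file(
    "flower_photos",
    origin="https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz",
    untar=True,
    cache_dir="C:/temp/keras_cache",  # hypothetical writable path on the laptop
)
data_dir = pathlib.Path(data_dir)

train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir, image_size=(180, 180), batch_size=32
)
```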
0
votes
0 answers

How can I ingest data from a Microsoft SQL Server into Google Cloud Platform?

I've been reading the GCP documentation trying to find a way to ingest data from a Microsoft SQL Server database passively (like using Cloud SQL). The problem is that Cloud SQL sits idle most of the time (data is updated once a week) and I could…
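Since the data only changes weekly, a simple scheduled pull may be cheaper than a permanently provisioned instance. A sketch of that pull, assuming pyodbc access to the SQL Server and BigQuery as the destination; the connection string, table, and dataset names are placeholders.

```python
# Sketch: pull from SQL Server and load into BigQuery on a schedule.
import pandas as pd
import pyodbc
from google.cloud import bigquery

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=my-sql-host;"
    "DATABASE=sales;UID=reader;PWD=secret"
)
df = pd.read_sql("SELECT * FROM dbo.weekly_snapshot", conn)

client = bigquery.Client()
job = client.load_table_from_dataframe(df, "my_project.raw.weekly_snapshot")
job.result()  # waits for the load job to finish
```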
0
votes
1 answer

Cloud Composer/Airflow Task Runner Storage

I'm used to running pipelines via AWS Data Pipeline but am getting familiar with Airflow (Cloud Composer). In Data Pipeline we would: spawn a task runner, bootstrap it, do the work, kill the task runner. I just realized that my Airflow runners are not…
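Because Composer workers are long-lived and shared between tasks rather than spawned and killed per job, one common pattern is to keep working data in GCS instead of on the worker's local disk. A minimal sketch, assuming Airflow 2-style imports; the bucket, DAG, and task names are placeholders.

```python
# Sketch: an Airflow task that writes its output to GCS rather than the worker's disk.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from google.cloud import storage


def do_work(**context):
    client = storage.Client()
    bucket = client.bucket("my-composer-scratch-bucket")
    blob = bucket.blob(f"runs/{context['ds']}/result.txt")
    blob.upload_from_string("work output")  # instead of writing to local disk


with DAG(
    "scratch_to_gcs",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    work = PythonOperator(task_id="do_work", python_callable=do_work)
```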
0
votes
1 answer

How to dynamically add an HTTP endpoint to load data into Azure Data Lake using Azure Data Factory when the REST API is cookie-authenticated

I am trying to dynamically add/update a REST linked service based on certain triggers/events in order to consume a REST API that is authenticated using a cookie and provides telemetry data. This telemetry data will be stored in Data Lake Gen2 and then will use…
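For orientation, the same flow can be sketched outside ADF: authenticate once to obtain the session cookie, pull the telemetry, and land the raw payload in Data Lake Gen2. The endpoints, credentials, and file-system names below are placeholders, not anything from the question.

```python
# Sketch: cookie-authenticated REST pull landed in ADLS Gen2.
import requests
from azure.storage.filedatalake import DataLakeServiceClient

session = requests.Session()
session.post("https://telemetry.example.com/login", data={"user": "svc", "password": "secret"})
payload = session.get("https://telemetry.example.com/api/telemetry").text  # cookie sent automatically

service = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net", credential="ACCOUNT_KEY"
)
file_client = service.get_file_system_client("raw").get_file_client("telemetry/latest.json")
file_client.upload_data(payload, overwrite=True)
```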
0
votes
1 answer

Schema not merged properly with an AWS Glue crawler

I am currently building a data lake where I run AWS Glue jobs daily to copy data from our database and make it queryable via AWS Athena. Because the schema of the data I fetch changes often, I crawl it regularly with a Glue Crawler. Unfortunately,…
Robin Nicole
  • 646
  • 4
  • 17
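If the crawler is splitting schema variants into separate tables, one setting worth trying is the crawler's grouping policy, which asks Glue to combine compatible schemas into a single table. A sketch via boto3; the crawler name and the specific schema-change policy values are assumptions.

```python
# Sketch: ask the Glue crawler to merge compatible schemas into one table.
import json

import boto3

glue = boto3.client("glue")
glue.update_crawler(
    Name="datalake-crawler",  # placeholder crawler name
    Configuration=json.dumps(
        {"Version": 1.0, "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"}}
    ),
    SchemaChangePolicy={"UpdateBehavior": "UPDATE_IN_DATABASE", "DeleteBehavior": "LOG"},
)
```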
0
votes
1 answer

Can I use Prometheus to list the files being processed or already processed?

I need to know the time per service of an application which is processing some files. What I mean is: the same file passes through each service and I need to know the time for each pipeline stage. Is that possible with Prometheus and, for example, Grafana? Or there…
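Per-stage timing is usually captured by instrumenting each service with a labelled metric that Prometheus scrapes and Grafana graphs. A minimal sketch with the official Python client; the metric, label, and stage names are placeholders.

```python
# Sketch: record per-stage processing time with a labelled Histogram.
from prometheus_client import Histogram, start_http_server

PROCESSING_TIME = Histogram(
    "file_processing_seconds", "Time spent processing a file", ["stage"]
)


def process(stage: str, path: str) -> None:
    with PROCESSING_TIME.labels(stage=stage).time():
        ...  # the actual work for this service/stage


if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    process("parse", "input/file-001.csv")
```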
0
votes
2 answers

Cannot get AWS Data Pipeline connected to Redshift

I have a query I'd like to run regularly in Redshift. I've set up an AWS Data Pipeline for it. My problem is that I cannot figure out how to access Redshift. I keep getting "Unable to establish connection" errors. I have an Ec2Resource and I've…
ScottieB
  • 3,958
  • 6
  • 42
  • 60
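When "Unable to establish connection" appears, it helps to separate a networking problem (security group / VPC reachability from the Ec2Resource) from a pipeline-definition problem. A quick connectivity check one could run from that EC2 instance; the endpoint, database, and credentials are placeholders.

```python
# Sketch: verify the Redshift cluster is reachable from the Ec2Resource.
import psycopg2

try:
    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="analytics",
        user="pipeline_user",
        password="secret",
        connect_timeout=10,
    )
    print("connected:", conn.get_dsn_parameters()["host"])
except psycopg2.OperationalError as exc:
    # Often means the cluster's security group does not allow inbound traffic from the instance.
    print("connection failed:", exc)
```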
0
votes
1 answer

Splitting a file into small chunks and processing

I have three files and each contains close to 300k records. I have written a Python script to process those files with some business logic and am able to create the output file successfully. This process completes in 5 minutes. I am using the same script to…
Hari
  • 51
  • 1
  • 5
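One way to split each ~300k-line file into chunks and process them in parallel, sketched below; the chunk size and the stand-in business logic are placeholders.

```python
# Sketch: chunked file processing with a worker pool.
from itertools import islice
from multiprocessing import Pool


def process_chunk(lines: list) -> list:
    return [line.upper() for line in lines]  # stand-in for the real business logic


def read_chunks(path: str, size: int = 10_000):
    with open(path) as fh:
        while True:
            chunk = list(islice(fh, size))
            if not chunk:
                break
            yield chunk


if __name__ == "__main__":
    with Pool(processes=4) as pool, open("output.txt", "w") as out:
        for result in pool.imap(process_chunk, read_chunks("input_file_1.txt")):
            out.writelines(result)
```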
0
votes
1 answer

How do I use tf.Dataset to load data into multiple GPUs?

Currently, I'm passing the data to multiple GPUs using get_next(). Is there a better way to feed data into multiple GPUs?
Illuminati0x5B
  • 602
  • 7
  • 24
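A common alternative to manual get_next() calls per device is tf.distribute.MirroredStrategy, which shards a tf.data.Dataset across the visible GPUs for you. A minimal sketch; the model and data are toy placeholders.

```python
# Sketch: let MirroredStrategy distribute a tf.data.Dataset across GPUs.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([1024, 32]), tf.random.uniform([1024], maxval=10, dtype=tf.int32))
).batch(64)

with strategy.scope():
    model = tf.keras.Sequential(
        [tf.keras.layers.Dense(64, activation="relu"), tf.keras.layers.Dense(10)]
    )
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

model.fit(dataset, epochs=2)  # Keras shards the dataset across the GPUs
```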
0
votes
1 answer

Data pipeline - dumping large files from API responses into AWS with the final destination being an on-premises SQL Server

I'm new to building data pipelines where dumping files in the cloud is one or more steps in the data flow. Our goal is to store large, raw sets of data from various APIs in the cloud then only pull what we need (summaries of this raw data) and store…
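A sketch of the "dump raw to the cloud, pull back only summaries" idea: land the raw API response in S3 untouched, then read it back later, summarise, and push only the summary to the on-premises database. The bucket, API URL, and keys are placeholders.

```python
# Sketch: raw API dump to S3, summary pulled back for the on-prem SQL Server.
import json

import boto3
import requests

s3 = boto3.client("s3")

# 1) Land the raw API response in S3 untouched.
raw = requests.get("https://api.example.com/v1/events").json()
s3.put_object(Bucket="raw-landing-bucket", Key="events/2020-06-01.json", Body=json.dumps(raw))

# 2) Later, read the raw file back, summarise, and push only the summary on-prem.
body = s3.get_object(Bucket="raw-landing-bucket", Key="events/2020-06-01.json")["Body"].read()
events = json.loads(body)
summary = {"day": "2020-06-01", "event_count": len(events)}
print(summary)  # insert into the on-prem SQL Server table instead of printing
```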
0
votes
1 answer

Configure a Data Pipeline to receive parameter values from a Lambda

I have a Lambda function that activates a Data Pipeline: client.activate_pipeline( pipelineId='df-0680373LNPNFF73UDDD', parameterValues=[{'id':'myVariable','stringValue':'ok'}]) How do I configure the data pipeline to receive the…
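On the pipeline side, the definition generally has to declare a parameter with the same id, and pipeline objects then reference it as `#{myVariable}`. A sketch with put_pipeline_definition; the activity object is a deliberately minimal placeholder (a real definition would also need a Default object, runsOn reference, and so on).

```python
# Sketch: declare the parameter the Lambda passes in and reference it in an object.
import boto3

dp = boto3.client("datapipeline")
dp.put_pipeline_definition(
    pipelineId="df-0680373LNPNFF73UDDD",
    pipelineObjects=[
        {
            "id": "ShellActivity",
            "name": "ShellActivity",
            "fields": [
                {"key": "type", "stringValue": "ShellCommandActivity"},
                # The value supplied by activate_pipeline is substituted here.
                {"key": "command", "stringValue": "echo #{myVariable}"},
            ],
        },
    ],
    parameterObjects=[
        {
            "id": "myVariable",
            "attributes": [{"key": "type", "stringValue": "String"}],
        }
    ],
)
```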
0
votes
1 answer

Make a generic/parameterized trigger in Azure Data Factory

I want to load data from on-premises servers to Azure Blob Storage. I have data on three on-premises servers. The problem is that the data copy should run at a different time for each source. Please suggest a way to do that.
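One pattern is a single parameterised copy pipeline with one schedule trigger per source, each trigger firing at its own time and passing its own source name. The sketch below shows only the parameterised invocation via the Azure SDK (standing in for what each trigger would pass); the subscription, resource group, factory, pipeline, and parameter names are all placeholders.

```python
# Sketch: invoke the same parameterised pipeline once per on-prem source.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

client = DataFactoryManagementClient(DefaultAzureCredential(), "my-subscription-id")

for source in ("onprem-server-1", "onprem-server-2", "onprem-server-3"):
    run = client.pipelines.create_run(
        resource_group_name="my-rg",
        factory_name="my-adf",
        pipeline_name="CopyToBlob",
        parameters={"sourceServer": source},  # what each schedule trigger would pass
    )
    print(source, run.run_id)
```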