Questions tagged [google-cloud-dataprep]

An intelligent cloud data service to visually explore, clean, and prepare data for analysis.

DataPrep (more precisely, Cloud Dataprep by Trifacta) is a visual data transformation tool built by Trifacta and offered as part of Google Cloud Platform.

It can ingest data from, and write data to, several other Google services (BigQuery, Cloud Storage).

Data is transformed using recipes, which are shown alongside a visual representation of the data. This lets the user preview changes, profile columns, and spot outliers and type mismatches.

When a Dataprep flow is run (either manually or on a schedule), a Dataflow job is created to execute it. Dataflow is Google's managed Apache Beam service.
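Because each run is an ordinary Dataflow job, the jobs a flow spawns can be inspected like any other. A minimal sketch, assuming the google-api-python-client package and application-default credentials (the project ID is a placeholder):

    from googleapiclient.discovery import build

    dataflow = build("dataflow", "v1b3")
    # projects.jobs.list covers the default region; projects.locations.jobs
    # exposes jobs running in other regional endpoints.
    response = dataflow.projects().jobs().list(projectId="my-project").execute()
    for job in response.get("jobs", []):
        # Dataprep-spawned jobs appear here alongside any other Dataflow jobs.
        print(job["name"], job["currentState"])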

205 questions
1 vote · 0 answers

Dynamic inputs for cloud function that launches dataflow job

I am implementing a Cloud Function to trigger a Dataprep Dataflow job. I can do it with a fixed table, and that works fine. When I try to set the table name inside the Cloud Function so that it changes over time, I am getting the same result when the…
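One approach (a sketch, not the asker's code): launch the Dataprep-exported Dataflow template from the Cloud Function through the Dataflow REST API and compute the table name at request time. The project, template path, and the "inputTable" parameter below are hypothetical; the real parameter names come from the exported template's metadata.

    # Python Cloud Function (HTTP-triggered) that launches a Dataflow template.
    from datetime import date

    from googleapiclient.discovery import build

    PROJECT = "my-project"                                  # placeholder
    TEMPLATE_PATH = "gs://my-bucket/templates/my_template"  # placeholder

    def launch_job(request):
        dataflow = build("dataflow", "v1b3", cache_discovery=False)
        # Build the table name dynamically instead of baking it into the template.
        table = "my_dataset.events_{}".format(date.today().strftime("%Y%m%d"))
        body = {
            "jobName": "dataprep-run-{}".format(date.today().isoformat()),
            "parameters": {"inputTable": table},
        }
        response = dataflow.projects().templates().launch(
            projectId=PROJECT, gcsPath=TEMPLATE_PATH, body=body
        ).execute()
        return str(response["job"]["id"])

Note that classic templates bind some values at export time; if the table name was fixed when the template was created, it may not be overridable as a runtime parameter, which would explain seeing the same result on every run.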
1 vote · 1 answer

How to export Google Analytics data to a Google GCS bucket or to BigQuery?

Is there a way to export Google Analytics data to a Google GCS bucket or to BigQuery? I'm trying to use Google Dataprep to get a better look at the data from Analytics.
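For background, the native route is the Analytics 360 BigQuery export, which is configured in the GA admin console (not in code) and lands session data in ga_sessions_YYYYMMDD tables; Dataprep can then read those tables directly. A sketch of querying such an export, assuming the google-cloud-bigquery client (project and dataset names are placeholders):

    from google.cloud import bigquery

    client = bigquery.Client()
    query = """
        SELECT date, SUM(totals.visits) AS visits
        FROM `my-project.my_ga_dataset.ga_sessions_*`
        WHERE _TABLE_SUFFIX BETWEEN '20180801' AND '20180831'
        GROUP BY date
        ORDER BY date
    """
    for row in client.query(query).result():
        print(row.date, row.visits)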
1 vote · 1 answer

Is it possible to split a dataset in Google Dataprep? If so, how?

I've been looking into Google Dataprep as an ETL solution to perform some basic data transformation before feeding it to a machine learning platform. I'm wondering if it's possible to use the Dataprep/Dataflow tools to split a dataset into train,…
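Dataprep has no built-in split transform, but since flows execute as Beam pipelines on Dataflow, one option is a small hand-written pipeline using beam.Partition. A minimal sketch under that assumption (the paths and the 80/20 ratio are placeholders):

    import random

    import apache_beam as beam

    def assign_split(row, num_partitions):
        # Partition 0 = train, partition 1 = test. For a reproducible split,
        # hash a stable key column instead of calling random().
        return 0 if random.random() < 0.8 else 1

    with beam.Pipeline() as p:
        rows = p | "Read" >> beam.io.ReadFromText("gs://my-bucket/cleaned.csv")
        train, test = rows | "Split" >> beam.Partition(assign_split, 2)
        train | "WriteTrain" >> beam.io.WriteToText("gs://my-bucket/train")
        test | "WriteTest" >> beam.io.WriteToText("gs://my-bucket/test")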
1 vote · 0 answers

Could not load timestamp from Dataprep to BigQuery

I am scheduling a Dataprep job for aggregation purposes. The job just consolidates the raw data from BigQuery and puts it back into another BigQuery table. There is a timestamp column (which is used as the partitioning key for table partitioning) in…
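One guard against this kind of failure (a sketch, assuming the google-cloud-bigquery client; all names are placeholders) is to create the column-partitioned target table yourself, so the TIMESTAMP partitioning field is fixed before the scheduled Dataprep job publishes into it:

    from google.cloud import bigquery

    client = bigquery.Client()
    schema = [
        bigquery.SchemaField("event_ts", "TIMESTAMP"),
        bigquery.SchemaField("value", "FLOAT"),
    ]
    table = bigquery.Table("my-project.my_dataset.aggregated", schema=schema)
    # Partition on the TIMESTAMP column rather than on ingestion time.
    table.time_partitioning = bigquery.TimePartitioning(field="event_ts")
    client.create_table(table)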
1 vote · 2 answers

BigQuery can't import the data from DataPrep

I have a table created in BigQuery, partitioned by date, and the partition column has the DATE type. Dataprep has the same column with the same data type. When I try to load the data from Dataprep into the BigQuery table, I am getting an error like "The column…
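When BigQuery rejects a load over a column type, a quick sanity check (a sketch, assuming the google-cloud-bigquery client; the table ID is a placeholder) is to print the target table's actual schema and compare it with the types Dataprep shows in the recipe:

    from google.cloud import bigquery

    client = bigquery.Client()
    table = client.get_table("my-project.my_dataset.partitioned_by_date")
    for field in table.schema:
        # e.g. "my_date DATE" -- compare against the recipe's column types.
        print(field.name, field.field_type)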
1 vote · 1 answer

Google Cloud Dataprep: add file parameter metadata as column value

I want to ingest a GCS folder full of files which all have the date in the filename like /foo_2018-08-22.txt and /foo_2018-08-23.txt. For each file I'd like to add the date of the data (from the filename) as a value in a column such that all rows…
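Newer Dataprep releases reportedly expose a $filepath source-metadata reference that does this natively; if that isn't available, the same effect can be had one level down in Beam. A minimal sketch, assuming apache_beam and a YYYY-MM-DD date embedded in each filename (the pattern and paths are illustrative):

    import re

    import apache_beam as beam
    from apache_beam.io import fileio

    def with_file_date(readable_file):
        # readable_file.metadata.path is the full path of the matched file.
        path = readable_file.metadata.path
        file_date = re.search(r"(\d{4}-\d{2}-\d{2})", path).group(1)
        for line in readable_file.read_utf8().splitlines():
            yield (file_date, line)

    with beam.Pipeline() as p:
        rows = (
            p
            | "Match" >> fileio.MatchFiles("gs://my-bucket/foo_*.txt")
            | "Read" >> fileio.ReadMatches()
            | "Tag" >> beam.FlatMap(with_file_date)
        )
        rows | "Print" >> beam.Map(print)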
1 vote · 1 answer

Google Dataprep: Save GCS file name as one of the columns

I have a Dataprep flow configured. The dataset is a GCS folder (all files from it). The target is a BigQuery table. Since the data comes from multiple files, I want to have the filename as one of the columns in the resulting data. Is that possible?
1 vote · 1 answer

Google Dataprep (Cloud Dataprep by Trifacta) tip: jobs will not be able to run if they are too large

During my Cloud Dataprep adventures I have come across yet another very annoying bug. The problem occurs when creating complex flow structures which need to be connected through reference datasets. If a certain limit is crossed in performing a…
1 vote · 1 answer

Google DataPrep is extremely slow

In Google Dataflow, I have a job that basically looks like this: dataset: 100 rows, 1 column; recipe: 0 steps; output: new table. But it takes between 6 and 8 minutes to run. What could be the issue?
1 vote · 1 answer

DataFlow gcloud CLI - "Template metadata was too large"

I've honed my transformations in DataPrep, and am now trying to run the DataFlow job directly using the gcloud CLI. I've exported my template and template metadata file, and am trying to run them using gcloud dataflow jobs run and passing in the input &…
1 vote · 0 answers

Google Cloud DataPrep schedule is spawning multiple DataFlow jobs

I have a schedule which runs my flow twice a day - at 0910 and 1520 BST. This is spawning a massive number of DataFlow jobs - so far today just the second schedule (1520) has spawned 80 jobs: $ gcloud dataflow jobs list JOB_ID …
1 vote · 1 answer

Google Cloud Dataprep can't import a BigQuery view across projects

I am working with Google Cloud Dataprep and cannot import a dataset from a Big Query view. The view lives in Project A (where the Dataprep was set up) and is a select across a set of wildcard tables which live in Project B. It fails with this error:…
1 vote · 1 answer

How can I set up an automated import to Google Data Prep?

When using Google Data Prep, I am able to create automated schedules to run jobs that update my BigQuery tables. However, this seems pointless when considering that the data used in Prep is updated by manually dragging and dropping CSVs (or JSON,…
1 vote · 1 answer

Programmatically edit Dataprep recipe

We have a Dataprep job that processes an input file and produces a cleaned file. We are calling this Dataprep job remotely using Dataflow templates, and we are using Python to run a job from the templates. Since we need to do this for different files, we…
1 vote · 1 answer

Cloud Dataprep - Multiply rows in one column based on values in other column

I am working in Cloud Dataprep and I have a case like this: basically, I need to create new rows in column 2 based on how many rows there are with matching data in column 1. Is it possible, and how?