I am brand new to GCP and Cloud Data Fusion. I see that this service can be used to integrate data from multiple data sources into a data lake.
I have a number of SFTP providers offering files in different structured formats, e.g. CSV, JSON, Parquet, and Avro.
Ultimately I'd like this data to be available in BigQuery (BQ).
Before loading into BQ, my first stop was going to be Google Cloud Storage (GCS), so that I have an immutable copy of the data.
The SFTP site will have multiple files representing multiple tables, for example:
/root/table_1
/root/table_2
/root/table_3
As a first step, I'm trying to see if I can use a Cloud Data Fusion pipeline to copy the files from SFTP to GCS. This has proven to be challenging.
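To make concrete what I mean by "copy the files as is", here is a rough sketch of the equivalent copy done by hand with paramiko and the google-cloud-storage client; the host, credentials, and bucket name are placeholders, and this is the kind of thing I'm hoping a Fusion pipeline can replace:

```python
# Hand-rolled version of the copy I'd like Fusion to do for me:
# pull every file under /root on the SFTP server into a GCS landing
# bucket byte-for-byte, with no parsing and no schema.
import paramiko
from google.cloud import storage

SFTP_HOST = "sftp.example.com"   # placeholder
SFTP_USER = "user"               # placeholder
SFTP_PASS = "password"           # placeholder
BUCKET = "my-landing-bucket"     # placeholder

transport = paramiko.Transport((SFTP_HOST, 22))
transport.connect(username=SFTP_USER, password=SFTP_PASS)
sftp = paramiko.SFTPClient.from_transport(transport)

bucket = storage.Client().bucket(BUCKET)

for name in sftp.listdir("/root"):               # table_1, table_2, table_3, ...
    with sftp.open(f"/root/{name}", "rb") as remote_file:
        # Stream the remote file straight into GCS, unchanged.
        bucket.blob(f"root/{name}").upload_from_file(remote_file)

sftp.close()
transport.close()
```

With that end state in mind, my questions are: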
- Can I use Fusion for this?
- Do I need to provide the schema for each file, or can it be inferred?
- Do I need to manually enumerate every table? Ideally I'd like to copy all of the files as-is from SFTP to GCS.
- Once the files are in GCS, I'd like to create an external table in BigQuery over each one, roughly as sketched below. Is that possible?
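For the last question, here is roughly what I have in mind on the BigQuery side, sketched with the google-cloud-bigquery Python client; the project, dataset, bucket, and file format are placeholders/assumptions on my part:

```python
# Define a BigQuery external table that reads one landed file directly
# from GCS, so the data stays in the bucket and is only queried in place.
from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("CSV")   # or "NEWLINE_DELIMITED_JSON", "PARQUET", "AVRO"
external_config.source_uris = ["gs://my-landing-bucket/root/table_1"]  # placeholder URI
external_config.autodetect = True                  # hoping the schema can be inferred
external_config.options.skip_leading_rows = 1      # assuming the CSV has a header row

table = bigquery.Table("my-project.my_dataset.table_1")   # placeholder table ID
table.external_data_configuration = external_config
client.create_table(table)
```

If something like this is possible (and ideally automatable per file), that would cover the BQ half of what I'm after.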