
I am running my Google Dataflow job on Google Cloud Platform (GCP). When I run this job locally it works well, but when I run it on GCP I get this error: "java.lang.IllegalArgumentException: No filesystem found for scheme gs". I have access to that Google Cloud Storage URI: I can upload my jar file to it, and I can see the temporary files from my local runs there.

My job IDs in GCP:

2019-08-08_21_47_27-162804342585245230 (beam version:2.12.0)

2019-08-09_16_41_15-11728697820819900062 (beam version:2.14.0)

I have tried Beam versions 2.12.0 and 2.14.0; both give the same error.


java.lang.IllegalArgumentException: No filesystem found for scheme gs
    at org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:456)
    at org.apache.beam.sdk.io.FileSystems.matchNewResource(FileSystems.java:526)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers.resolveTempLocation(BigQueryHelpers.java:689)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.extractFiles(BigQuerySourceBase.java:125)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.split(BigQuerySourceBase.java:148)
    at org.apache.beam.runners.dataflow.worker.WorkerCustomSources.splitAndValidate(WorkerCustomSources.java:284)
    at org.apache.beam.runners.dataflow.worker.WorkerCustomSources.performSplitTyped(WorkerCustomSources.java:206)
    at org.apache.beam.runners.dataflow.worker.WorkerCustomSources.performSplitWithApiLimit(WorkerCustomSources.java:190)
    at org.apache.beam.runners.dataflow.worker.WorkerCustomSources.performSplit(WorkerCustomSources.java:169)
    at org.apache.beam.runners.dataflow.worker.WorkerCustomSourceOperationExecutor.execute(WorkerCustomSourceOperationExecutor.java:78)
    at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:412)
    at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:381)
    at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:306)
    at org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:135)
    at org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:115)
    at org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:102)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Ming Ming

4 Answers


This may be caused by a couple of issues if you build a "fat jar" that bundles all of your dependencies.

  1. You must include the dependency org.apache.beam:google-cloud-platform-core to have the Beam GCS filesystem.
  2. Inside your fat jar, you must preserve the META-INF/services/org.apache.beam.sdk.io.FileSystemRegistrar file with a line org.apache.beam.sdk.extensions.gcp.storage.GcsFileSystemRegistrar. You can find this file in the jar from step 1. You will probably have many files with the same name in your dependencies, registering different Beam filesystems. You need to configure Maven or Gradle to combine these as part of your build, or they will overwrite each other and not work properly; see the sketch after this list.
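
For Maven, a minimal sketch of a shade-plugin configuration that merges these service files (the plugin version and the rest of the pom.xml are assumed):

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <transformers>
              <!-- Merge all META-INF/services files instead of letting them overwrite each other -->
              <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
            </transformers>
          </configuration>
        </execution>
      </executions>
    </plugin>

With Gradle, the Shadow plugin's mergeServiceFiles() option performs the equivalent merge.
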
Kenn Knowles
  • Yes, building the fat jar incorrectly was exactly the problem. We fixed it by using the Shadow plugin to build the fat jar. Thanks. – Ming Ming Sep 05 '19 at 19:39
  • I got that to work by adding the `ServicesResourceTransformer` described here: https://maven.apache.org/plugins/maven-shade-plugin/examples/resource-transformers.html#ServicesResourceTransformer – nsandersen Nov 02 '20 at 15:27

There is also one more possible reason for this exception: make sure you create the pipeline (e.g. `Pipeline.create(options)`) before you try to access files.
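
A minimal sketch of that ordering, with a placeholder bucket path (creating the pipeline is what registers the filesystems from the options):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.FileSystems;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class CreatePipelineFirst {
      public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();

        // Pipeline.create(options) registers the available filesystems (including gs://).
        Pipeline pipeline = Pipeline.create(options);

        // Only after that can gs:// paths be resolved; the path below is a placeholder.
        FileSystems.matchNewResource("gs://my-bucket/temp", true /* isDirectory */);

        pipeline.run();
      }
    }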

Raman

[GOLANG] In my case it was solved by adding the imports below for their side effects (each blank import registers its filesystem scheme):

import (
    _ "github.com/apache/beam/sdks/go/pkg/beam/io/filesystem/gcs"
    _ "github.com/apache/beam/sdks/go/pkg/beam/io/filesystem/local"
    _ "github.com/apache/beam/sdks/go/pkg/beam/io/filesystem/memfs"
)
Luillyfe

It's normal. On your computer, your tests use local files (/... on Linux, C:... on Windows). However, Google Cloud Storage isn't a local file system (strictly speaking, it isn't a file system at all), and thus the "gs://" scheme can't be interpreted.

Try TextIO.read().from(...).

You can use it for local files and for external storage like GCS.
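
For example, a minimal sketch with a placeholder bucket path:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;

    public class ReadFromGcs {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // The same transform works for local paths and for gs:// paths (placeholder below).
        PCollection<String> lines = p.apply(TextIO.read().from("gs://my-bucket/input/*.txt"));

        p.run();
      }
    }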

However, months ago I ran into an issue when developing on Windows: C: wasn't a known scheme (same error as yours). It's possible that this works now (I'm no longer on Windows, so I can't test). Otherwise, you can use this workaround pattern: set a variable in your config object and test it, like:

if (environment config variable is local)
    p.apply(FileSystems.getFileSystemInternal(...));
else
    p.apply(TextIO.read().from(...));
guillaume blaquiere
  • Thanks for your answer, but for Google Dataflow a Google Cloud Storage tempLocation has to be used even when running a local test, `pipeline.getOptions().setTempLocation("gs://xxxx")`, and it works fine locally. We can see the temp files for local jobs under the "gs://" location. – Ming Ming Aug 11 '19 at 21:19
  • You mean that when you run it locally you use gs:// and that works, but it doesn't work on Dataflow? – guillaume blaquiere Aug 11 '19 at 21:25
  • Yes, when I run my Dataflow pipeline locally I use gs:// and it works (because my pipeline reads from and writes to BigQuery, I need a temporary location in Google Cloud Storage). But when I run it on Google Cloud Platform, I get this "No filesystem found for scheme gs" error. – Ming Ming Aug 12 '19 at 16:49
  • Just to be sure not to mix things up: the temporary location is in GCS, so you pass this param to your runner. However, about the file URL passed to the `FileSystems.getFileSystemInternal()` method, is it also a gs:// prefix in your local environment? – guillaume blaquiere Aug 12 '19 at 20:19
  • FileSystems.getFileSystemInternal() is built into the Beam SDK library; I don't call it directly. But yes, I am using "gs://xxxx" as the tempLocation in my local environment. – Ming Ming Aug 12 '19 at 22:51
  • Is it possible that 'beam-sdks-java-extensions-google-cloud-platform-core-.jar', which contains 'GcsFileSystemRegistrar', is somehow not available on your CLASSPATH when running with the Dataflow runner? `FileSystem`s are loaded by looking for `FileSystemRegistrar` implementations in the CLASSPATH (at pipeline submission and when starting Dataflow workers). All classes/jars in your local CLASSPATH get staged by Dataflow. – chamikara Aug 13 '19 at 00:54
  • @MingMing could you provide the part of the code that creates the issue? Anonymise it and include only the relevant part. – guillaume blaquiere Aug 13 '19 at 19:29
  • @chamikara Yes, that's probably the reason; we are trying to build an uber jar using the Shadow plugin and hope this will fix the issue. Thanks. – Ming Ming Aug 13 '19 at 22:43
  • @guillaumeblaquiere The code that throws the error is in the Beam SDK library:

        static FileSystem getFileSystemInternal(String scheme) {
          String lowerCaseScheme = scheme.toLowerCase();
          Map schemeToFileSystem = (Map) SCHEME_TO_FILESYSTEM.get();
          FileSystem rval = (FileSystem) schemeToFileSystem.get(lowerCaseScheme);
          if (rval == null) {
            throw new IllegalArgumentException("No filesystem found for scheme " + scheme);
          } else {
            return rval;
          }
        }

    – Ming Ming Aug 13 '19 at 22:45
  • @chamikara We used the Shadow plugin to make an uber jar and the problem got solved. Thanks. – Ming Ming Aug 14 '19 at 00:08