
In my Java application I have an implementation of a file-system layer, where my File class is a wrapper around the Hadoop FileSystem methods. I am upgrading the GCS connector from hadoop3-1.9.17 to hadoop3-2.2.8, and I am using the shaded jar of the new version.

My File class has methods like write, read, etc.:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class File {
    // Wraps a Hadoop FileSystem and the path of the single object it operates on
    private Path path;
    private FileSystem fs;
}

Here is how my write method is implemented

@Override
public OutputStream write(boolean overwriteIfExists) throws IOException {
    return fs.create(path, overwriteIfExists);
}

And my read method:

@Override
public InputStream read() throws IOException {
    return fs.open(path);
}
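
The rename and exists operations are similarly thin wrappers. Roughly like the following sketch (the exact signatures in my class differ slightly; exists is based on getFileStatus):

@Override
public boolean exists() throws IOException {
    try {
        // getFileStatus throws java.io.FileNotFoundException when the object is missing
        return fs.getFileStatus(path) != null;
    } catch (FileNotFoundException e) {
        return false;
    }
}

@Override
public boolean rename(Path target) throws IOException {
    return fs.rename(path, target);
}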

I have a performance test that I run against the above file-system implementation, which uses org.apache.hadoop.fs.FileSystem. The test creates many threads; each thread creates an instance of the File class with a specific path (e.g. gs://some-bucket/objectX) and runs the same operations: read, rename, checkExists, etc. A simplified sketch of the driver is shown below.
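
The sketch below only illustrates the shape of the test; the thread count, bucket and object names are placeholders, the File constructor shown is assumed, and the real benchmark also times rename, checkExists and create:

import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FsBenchmark {
    public static void main(String[] args) throws Exception {
        int threads = 64;                                  // placeholder thread count
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Long>> results = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            String path = "gs://some-bucket/object" + i;   // placeholder object path
            results.add(pool.submit(() -> {
                File file = new File(path);                // the wrapper class above (constructor assumed)
                long start = System.nanoTime();
                try (InputStream in = file.read()) {
                    in.readAllBytes();                     // drain the object
                }
                return (System.nanoTime() - start) / 1_000_000;  // elapsed millis
            }));
        }
        for (Future<Long> result : results) {
            System.out.println("READ took " + result.get() + " ms");
        }
        pool.shutdown();
    }
}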

I ran the same tests several times against both versions of the Hadoop connector, and the new one (2.2.8) shows an overall slower execution time (almost 2x that of the old connector).

Below is a comparison of the average execution time for each operation under each connector version:

Operation   hadoop3-1.9.17   hadoop3-2.2.8   Slowdown
READ        4542.71          10171.26        ~2x
RENAME      1347.75          4483.27         ~4x
EXISTS      47.23            1538.74         ~50x
CREATE      570.1            1539.81         ~3x

I have checked this GitHub issue and tried to follow its recommendations to fine-tune performance via configuration properties, but saw no improvement.
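
For reference, this is roughly how I pass connector properties to the FileSystem in the test. It is only a sketch: the property names come from the connector's gcs/CONFIGURATION.md, and the values are just examples of what I experimented with, not recommendations:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

Configuration conf = new Configuration();
// Properties tried while tuning (names from gcs/CONFIGURATION.md; values are examples only)
conf.setBoolean("fs.gs.status.parallel.enable", false);
conf.set("fs.gs.inputstream.fadvise", "SEQUENTIAL");
conf.setInt("fs.gs.list.max.items.per.call", 1000);
conf.setInt("fs.gs.outputstream.upload.chunk.size", 8 * 1024 * 1024);  // 8 MiB per-write buffer
conf.setInt("io.file.buffer.size", 8 * 1024 * 1024);

FileSystem fs = FileSystem.get(URI.create("gs://some-bucket/"), conf);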

Are there any guidelines on configuration parameters that would improve the execution times of the above operations?

Or could this performance issue be due to some incompatibility among the jars on my classpath? Even though I am using the shaded jar, can other jars interfere?

Here is a list of the jars on my classpath:

  • gcs-connector-hadoop3-2.2.8-shaded.jar
  • google-extensions-0.7.1.jar
  • google-api-client-1.32.2.jar
  • google-http-client-apache-v2-1.40.1.jar
  • proto-google-common-protos-2.7.3.jar
  • google-http-client-1.41.8.jar
  • google-oauth-client-1.33.3.jar
  • google-http-client-jackson2-1.40.1.jar
  • grpc-google-cloud-storage-v2-2.2.2-alpha.jar
  • google-http-client-gson-1.41.8.jar
  • google-cloud-monitoring-1.82.0.jar
  • google-cloud-core-http-2.5.4.jar
  • proto-google-cloud-storage-v2-2.2.2-alpha.jar
  • google-api-client-jackson2-1.32.2.jar
  • google-api-services-iamcredentials-v1-rev20210326-1.32.1.jar
  • google-oauth-client-java6-1.27.0.jar
  • google-cloud-core-grpc-2.5.4.jar
  • google-http-client-appengine-1.34.2.jar
  • google-cloud-core-2.5.4.jar
  • google-auth-library-credentials-1.7.0.jar
  • google-cloud-storage-1.106.0.jar
  • proto-google-iam-v1-1.2.3.jar
  • google-api-services-storage-v1-rev20211018-1.32.1.jar
  • google-auth-library-oauth2-http-1.7.0.jar
  • proto-google-cloud-monitoring-v3-1.64.0.jar
  • grpc-services-1.43.2.jar
  • grpc-netty-shaded-1.43.2.jar
  • grpc-alts-1.43.2.jar
  • grpc-stub-1.43.2.jar
  • grpc-census-1.43.2.jar
  • grpc-protobuf-1.43.2.jar
  • grpc-api-1.43.2.jar
  • grpc-xds-1.43.2.jar
  • grpc-core-1.43.2.jar
  • grpc-protobuf-lite-1.43.2.jar
  • grpc-context-1.43.2.jar
  • opencensus-contrib-grpc-metrics-0.31.0.jar
  • grpc-auth-1.43.2.jar
  • gax-grpc-2.7.1.jar
  • grpc-grpclb-1.43.2.jar
  • api-common-2.1.4.jar
  • gax-2.7.1.jar
  • gax-httpjson-0.73.0.jar
  • util-2.2.8.jar
  • util-hadoop-hadoop3-2.2.8.jar
  • auto-value-annotations-1.9.jar
  • How did you get these results? Could you run the tests using `hadoop fs ...` commands? – Igor Dvorzhak Oct 18 '22 at 15:41
  • @IgorDvorzhak this is a Java program; to implement file.exists, for example, we use org.apache.hadoop.fs.FileSystem.getFileStatus(Path) and check that the returned fileStatus is not null – Selim Alawwa Oct 18 '22 at 16:04
  • I think that testing performance using generic `hadoop fs ...` commands to list, copy, rename, etc. would be a good step: it would allow you to determine whether the issue lies in your Java code or in the connector. Also, what Dataproc cluster image do you use? – Igor Dvorzhak Oct 18 '22 at 16:12
  • @IgorDvorzhak But `hadoop fs` uses the same FileSystem class mentioned – OneCricketeer Oct 18 '22 at 18:38
  • @IgorDvorzhak in this [link](https://github.com/GoogleCloudDataproc/hadoop-connectors/issues/891#issuecomment-1282959369) you can find more details on how I am using the hadoop FileSystem to implement my File class and how I am testing. There is no issue in my Java code, it's been the same implementation for some time, and for some reason performance is worse when I upgrade the GCS connector – Selim Alawwa Oct 18 '22 at 20:24
  • @OneCricketeer yes, exactly the same classes. Any idea if this might be due to some difference in dependencies, or any other reason? – Selim Alawwa Oct 18 '22 at 20:26
  • If you are running multi-threaded benchmarks then this issue may be caused by increased default parallelization in newer GCS connector versions. Try to set the `fs.gs.status.parallel.enable=false` and `fs.gs.inputstream.fadvise=SEQUENTIAL` properties and see if it makes a difference. – Igor Dvorzhak Oct 19 '22 at 03:51
  • @IgorDvorzhak I added these params but they did not improve the performance. Another observation with the upgrade is that files with the prefix "_GCS_SYNCABLE_TEMPFILE_" are generated unnecessarily and are not being cleaned up. Also, we see many threads named "gcs-syncable-output-stream-cleanup-pool" running for too long. Any ideas? – Selim Alawwa Oct 19 '22 at 12:19
  • This is probably caused by the hflush/hsync functionality activated via `fs.gs.outputstream.type` property. May you revert it back to the default `fs.gs.outputstream.type=BASIC` value and re-run your test? – Igor Dvorzhak Oct 19 '22 at 17:19
  • @IgorDvorzhak Does the Hadoop version make a difference? I am using Hadoop version 3.2.0, and I checked the GCS repo https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/v2.2.8/pom.xml and it mentions Hadoop version 3.2.2 – Selim Alawwa Oct 19 '22 at 19:55
  • @IgorDvorzhak fs.gs.outputstream.type=BASIC actually made the time worse. Another thing I noticed after the upgrade is a big difference in memory consumption and GC activity – Selim Alawwa Oct 19 '22 at 20:00
  • I think the Hadoop version is fine. High memory consumption may be caused by the change of the default value of `fs.gs.list.max.items.per.call` from `1000` to `5000`. Could you try to set it back via the `fs.gs.list.max.items.per.call=1000` property? Also, can you profile your test and check what actually consumes memory? – Igor Dvorzhak Oct 19 '22 at 20:27
  • @IgorDvorzhak I set fs.gs.list.max.items.per.call=1000 but the time is still the same, especially for the rename, create and check-exists operations. Read was improved by increasing "io.file.buffer.size". Any ideas what other params can improve performance, especially for rename, check exists and create? – Selim Alawwa Oct 19 '22 at 20:42
  • @IgorDvorzhak memory is mostly consumed by 64 MB byte arrays/buffers – Selim Alawwa Oct 19 '22 at 21:09
  • @IgorDvorzhak yes, exactly these 64 MB buffers. Is there a way to improve this memory usage? Also, on the performance side, any other recommendations on params that can improve the performance of the rename, create & checkIfExists operations? Based on the changes between 1.9.17 and 2.2.8, what default params / new features were added that we should take into account? – Selim Alawwa Oct 19 '22 at 21:25
  • Interesting, if these are 64 MiB buffers, then these must be GCS object writes; you can decrease the buffer allocation per object write to 8 MiB using the `fs.gs.outputstream.upload.chunk.size=8388608` property. You can find all available configuration properties in https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/v2.2.8/gcs/CONFIGURATION.md – Igor Dvorzhak Oct 19 '22 at 21:28

0 Answers