
I am trying to set up Tachyon with S3 as the under filesystem. For HDFS, Tachyon has a parameter called TACHYON_UNDERFS_HDFS_IMPL, which is set to "org.apache.hadoop.hdfs.DistributedFileSystem". Does anyone know if such a parameter exists for S3? If so, what is its value?

Thanks in advance for any help!

dtolnay
user3033194

1 Answer


The Hadoop FileSystem class you mentioned (org.apache.hadoop.hdfs.DistributedFileSystem) is the HDFS implementation, and no equivalent parameter is needed for S3. Instead, Tachyon creates the s3n FileSystem implementation based on the scheme in the URI of the under filesystem, which is configured with TACHYON_UNDERFS_ADDRESS. For Amazon S3, you will need to specify something like this:

export TACHYON_UNDERFS_ADDRESS=s3n://your_bucket

Note "s3n", not "s3" here.
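
Because the implementation is picked from the URI scheme, a typo such as "s3" instead of "s3n" is easy to miss. A minimal sketch of a check you could run before starting Tachyon, assuming a POSIX shell; the function name underfs_scheme is hypothetical and not part of Tachyon:

```shell
# Hypothetical helper, not part of Tachyon: print the scheme of a URI,
# or an empty line when the value has no "scheme://" prefix.
underfs_scheme() {
  case "$1" in
    *://*) printf '%s\n' "${1%%://*}" ;;
    *)     printf '\n' ;;
  esac
}

# Placeholder bucket name taken from the export line above.
TACHYON_UNDERFS_ADDRESS="s3n://your_bucket"

scheme="$(underfs_scheme "$TACHYON_UNDERFS_ADDRESS")"
if [ "$scheme" = "s3n" ]; then
  echo "under filesystem scheme: $scheme"
else
  echo "warning: expected an s3n:// address, got '$TACHYON_UNDERFS_ADDRESS'" >&2
fi
```

This only inspects the string; it does not confirm that the bucket exists or that the credentials work.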

Additional setup you will need to work with S3 (see also the question "Error in setting up Tachyon on S3 under filesystem" and http://tachyon-project.org/Setup-UFS.html):

  1. In ${TACHYON}/bin/tachyon-env.sh, add the key ID and the secret key to TACHYON_JAVA_OPTS:

    -Dfs.s3n.awsAccessKeyId=123
    -Dfs.s3n.awsSecretAccessKey=456 
    
  2. Publish the extra dependencies required by the s3n Hadoop FileSystem implementation: commons-httpclient-* and jets3t-*. The exact versions depend on the version of Hadoop installed. To do this, set TACHYON_CLASSPATH as described in the links above, by adding an export of TACHYON_CLASSPATH in ${TACHYON}/libexec/tachyon-config.sh before the export of CLASSPATH:

    export TACHYON_CLASSPATH=~/.m2/repository/commons-httpclient/commons-httpclient/3.1/commons-httpclient-3.1.jar:~/.m2/repository/net/java/dev/jets3t/jets3t/0.9.0/jets3t-0.9.0.jar
    
    export CLASSPATH="$TACHYON_CONF_DIR/:$TACHYON_JAR:$TACHYON_CLASSPATH"
    
  3. Start Tachyon cluster:

    ./bin/tachyon format
    ./bin/tachyon-start.sh local 
    

Check its availability via the web interface at http://localhost:19999/ or in the logs:

    ${TACHYON}/logs

  4. Your core-site.xml should contain the following properties to make sure you are integrated with Tachyon (see the Spark reference http://tachyon-project.org/Running-Spark-on-Tachyon.html for setting the configuration directly from Scala):

    • fs.defaultFS - specify the Tachyon master host-port (below are defaults)
    • fs.default.name - default name of fs, the same as before
    • fs.tachyon.impl - Tachyon's hadoop.FileSystem implementation hint
    • fs.s3n.awsAccessKeyId - Amazon key id
    • fs.s3n.awsSecretAccessKey - Amazon secret key

       <configuration>
         <property>
           <name>fs.defaultFS</name>
           <value>tachyon://localhost:19998</value>
         </property>
         <property>
           <name>fs.default.name</name>
           <value>tachyon://localhost:19998</value>
           <description>The name of the default file system.  A URI 
                        whose scheme and authority determine the  
                        FileSystem implementation.                    
           </description>
         </property>
         <property>
           <name>fs.tachyon.impl</name>
           <value>tachyon.hadoop.TFS</value>
         </property>
         ...
         <property>
           <name>fs.s3n.awsAccessKeyId</name>
           <value>123</value>
         </property>
         <property>
           <name>fs.s3n.awsSecretAccessKey</name>
           <value>456</value>
         </property>
         ...
       </configuration>
      
  5. Refer to any path using the tachyon scheme and the master host and port:

    tachyon://master_host:master_port/path
    

    Example with default Tachyon master host-port:

    tachyon://localhost:19998/remote_dir/remote_file.csv
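
If you build these paths in scripts, a small helper keeps the scheme, host, and port consistent. A minimal sketch, assuming a POSIX shell; tachyon_uri is a hypothetical name, not a Tachyon command, and its defaults are the master host and port shown above:

```shell
# Hypothetical helper, not part of Tachyon: build a tachyon:// URI.
# Usage: tachyon_uri PATH [HOST] [PORT]
tachyon_uri() {
  t_path="$1"
  t_host="${2:-localhost}"   # default master host from the example above
  t_port="${3:-19998}"       # default master port from the example above
  # Strip one leading slash so exactly one slash follows the port.
  printf 'tachyon://%s:%s/%s\n' "$t_host" "$t_port" "${t_path#/}"
}

tachyon_uri /remote_dir/remote_file.csv
tachyon_uri remote_dir/remote_file.csv master_host 19998
```

The first call uses the defaults; the second passes the host and port explicitly, with master_host as a placeholder.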
    
Elena Viter