
The hdfs.write() command in rhdfs creates a file with a leading non-unicode character. The documentation doesn't describe the file type being written.

Steps to reproduce: open R, initialize rhdfs, and run:

> ofile = hdfs.file("brian.txt", "w")
> hdfs.write("hi",ofile)
> hdfs.close(ofile)

This creates a file called "brian.txt", which I would expect to contain the single string "hi". But cat reveals an extra character at the beginning.

> hdfs dfs -cat brian.txt
X
    hi

I have no idea what file type is created and rhdfs doesn't show any file type options. This makes the output very difficult to use.

Brian Dolan

2 Answers


If you look at the hdfs.write function in the source code, you can see that it can take raw bytes instead of having R serialize the object for you. So for character data you can do this:

ofile = hdfs.file("brian.txt", "w")
hdfs.write(charToRaw("hi"), ofile)
hdfs.close(ofile)
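The charToRaw() part can be illustrated in plain R, with no Hadoop involved: it yields only the literal bytes of the string, so nothing extra ends up in the file.

```r
# charToRaw() converts a string to its raw bytes -- no serialization header.
bytes <- charToRaw("hi")
print(bytes)             # 68 69 -- the ASCII codes for "h" and "i"
print(rawToChar(bytes))  # "hi" -- rawToChar() reverses the conversion
```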
Newton T.

rhdfs by default serializes the object if you create/write directly, hence the extra characters you are seeing in the file. This is not the case when you copy a text file from local disk onto Hadoop using copyFromLocal.

Serialization is the process of converting structured objects into a byte stream. It is done basically for two purposes: 1) for transmission over a network (inter-process communication), and 2) for writing to persistent storage.
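The extra bytes come from R's serialization format itself, which you can see in plain R without touching HDFS: serialize() prepends a format header to the payload.

```r
# Compare the raw bytes of a string with its serialized form.
raw_bytes <- charToRaw("hi")        # just the two character bytes
ser_bytes <- serialize("hi", NULL)  # raw vector in R's serialization format
print(length(raw_bytes))  # 2
print(length(ser_bytes))  # much larger than 2: format header plus type info
```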

You can unserialize the object read back from Hadoop using the R code below:

hfile = hdfs.file("brian.txt", "r") # read from hdfs
file <- hdfs.read(hfile) 
file <- unserialize(file) # deserialize to remove special characters
hdfs.close(hfile)
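Why this works can again be shown in base R: unserialize() is the exact inverse of serialize(), so it recovers the original string from the bytes that hdfs.write() stored.

```r
# Round trip: unserialize() undoes serialize(), stripping the header bytes.
bytes <- serialize("hi", NULL)
print(unserialize(bytes))  # "hi"
```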

If you are planning to create the file from R but will not be reading it back through R, then a workaround to avoid the special characters is to save the content to a local file and move that file to HDFS. Below is the R code:

# Set environment path and load library
Sys.setenv("HADOOP_CMD"="/usr/local/hadoop/bin/hadoop")
library(rhdfs)
hdfs.init()  # Initialize

text <- "Hi, This is a sample text."
SaveToLocalPath <- "/home/manohar/Temp/outfile.txt"
writeLines(text, SaveToLocalPath) # write content to local file
hdfs.put(SaveToLocalPath, "/tmp") # Copy file to hdfs
file.remove(SaveToLocalPath) # Delete from local
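The local-file half of this workaround can be verified without Hadoop (paths below use a temp file rather than the example's /home path): writeLines() emits plain text, so no serialization header appears in the file.

```r
# Sketch of the workaround's local step, using a temp file for portability.
tmp <- tempfile(fileext = ".txt")
writeLines("Hi, This is a sample text.", tmp)
print(readLines(tmp))  # "Hi, This is a sample text." -- plain text, no header
file.remove(tmp)       # clean up; in the real flow hdfs.put would run first
```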
Manohar Swamynathan
  • Thanks Manohar! I am looking for a way to read the text file created within R from outside of R without the inserted characters. If I do `copyToLocal` then `cat` I get the same error. Do you know of any workarounds for that? – Brian Dolan Jan 15 '15 at 13:40
  • Brian, I have edited the answer to add the workaround. Hope this helps. – Manohar Swamynathan Jan 15 '15 at 18:36
  • Thanks again Manohar! So you are not writing directly to HDFS, I see. I wish the authors would provide some insight into the file format. https://groups.google.com/forum/#!msg/rhadoop/586gjz5kja8/yTP_mxIRHkMJ Ultimately, the Revolution guys need to solve this, I suppose. – Brian Dolan Jan 19 '15 at 14:18
  • Yep, agree Brian. Workaround should be fine in the meantime. – Manohar Swamynathan Jan 19 '15 at 16:49