When you create/write a file directly through rhdfs, the R object is serialized by default, which is why you see extra characters in the file. This does not happen when you copy a text file from the local filesystem onto HDFS using copyFromLocal.
Serialization is the process of converting structured objects into a byte stream. It is done mainly for two purposes:
1) Transmission over a network (inter-process communication).
2) Writing to persistent storage.
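To see where the extra characters come from, you can reproduce the effect with base R's serialize() on its own, with no Hadoop involved (a minimal sketch; rhdfs performs the equivalent step internally when writing an R object):

```r
# Serialize a plain string to a raw byte stream -- the same kind of
# transformation that produces the "special characters" in the HDFS file.
bytes <- serialize("Hi", NULL)

length(bytes)   # noticeably more than the 2 characters of "Hi"
head(bytes)     # begins with a binary serialization header, not the text

# Round-tripping through unserialize() recovers the original object
identical(unserialize(bytes), "Hi")  # TRUE
```

The extra bytes are the serialization header and type information, which is exactly what unserialize() strips off again when you read the file back.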
You can deserialize the object in R using the code below:
hfile <- hdfs.file("brian.txt", "r") # open the file on HDFS for reading
raw <- hdfs.read(hfile)              # returns a raw byte vector
obj <- unserialize(raw)              # deserialize back to the original R object
hdfs.close(hfile)
If you plan to create the file from R but will not read it back through R, the workaround to avoid the special characters is to write the content to a local file and then move that file to HDFS. Below is the R code:
# Set environment path and load library
Sys.setenv("HADOOP_CMD"="/usr/local/hadoop/bin/hadoop")
library(rhdfs)
hdfs.init() # Initialize
text <- "Hi, This is a sample text."
SaveToLocalPath <- "/home/manohar/Temp/outfile.txt"
writeLines(text, SaveToLocalPath) # write content to local file
hdfs.put(SaveToLocalPath, "/tmp") # Copy file to hdfs
file.remove(SaveToLocalPath) # Delete from local