0

New to streamsets. Following the documentation tutorial, was getting

FileNotFound: ... HADOOPFS_14 ... (permission denied)

error when trying to set the destination location as a local FS directory and preview the pipeline (basically saying either the file can't be accessed or does not exist), yet the permissions for the directory in question are drwxrwxr-x. 2 mapr mapr. Eventually found workaround by setting the destination folder permissions to be publicly writable ($chmod o+w /path/to/dir). Yet, the user that started the sdc service (while I was following the installation instructions) should have had write permissions on that directory (was root).

I set the sdc user env. vars. to use the name "mapr" (the owner of the directories I'm trying to access), so why did I get rejected? What is happening here when I set the env. vars. for sdc (because it does not seem to be doing anything)?

This is a snippet of what my /opt/streamsets-datacollector/libexec/sdcd-env.sh file looks like:

# user that will run the data collector, it must exist in the system
#
export SDC_USER=mapr

# group of the user that will run the data collector, it must exist in the system
#
export SDC_GROUP=mapr

So my question is, what determines the permissions for the sdc service (which I assume is what is being used to access FS locations by the streamsets web UI)? Any explaination or links to specific documentation would be appreciated. Thanks.

lampShadesDrifter
  • 3,925
  • 8
  • 40
  • 102

1 Answers1

1

Looking at the command ps -ef | grep sdc to examine who the system thinks the owner of the sdc process really is, found that it was listed as:

sdc    36438  36216  2 09:04 ?    00:01:28 /usr/bin/java -classpath /opt/streamsets-datacollector

So it seems that editing sdcd-env.sh did not have any effect. What did work was editing the /usr/lib/systemd/system/sdc.service file to look like (notice that have set user and group to be the user that owns the directories to be used in the streamsets pipeline):

[Unit]
Description=StreamSets Data Collector (SDC)

[Service]
User=mapr
Group=mapr
LimitNOFILE=32768
Environment=SDC_CONF=/etc/sdc
Environment=SDC_HOME=/opt/streamsets-datacollector
Environment=SDC_LOG=/var/log/sdc
Environment=SDC_DATA=/var/lib/sdc
ExecStart=/opt/streamsets-datacollector/bin/streamsets dc -verbose
TimeoutSec=60

Then restarting the sdc service (with systemctl start sdc, on centos 7) showed:

mapr    157013 156955 83 10:38 ?    00:01:08 /usr/bin/java -classpath /opt/streamsets-datacollector...

and was able to validate and run pipelines with origins and destinations on local FS that are owned by the user and group set in the sdc.service file.

* NOTE: the specific directories used in the initial post are hadoop-mapr directories mounted via NFS (mapr 6.0) (though the fact that they are NFS should mean that this solution should apply generally) hosted on nodes running centos 7.

lampShadesDrifter
  • 3,925
  • 8
  • 40
  • 102