
I am having difficulty getting Pentaho PDI to access Hadoop. From my research, Pentaho uses adapters called shims; I see these as connectors to Hadoop, much like JDBC drivers are for database connectivity in the Java world.

In the new version of PDI (v8.1), four shims are installed by default, and they all target specific distributions from the big data companies such as HortonWorks, MapR, and Cloudera.

From further research on Pentaho PDI big data support, it seems earlier versions had support for "Vanilla" installations of Apache Hadoop.

I just downloaded Apache Hadoop from the open source site, and installed it on Windows.

So my installation of Hadoop would be considered a "Vanilla" Hadoop installation.

When I tried things out in PDI, I used the HortonWorks shim, and when I tested the connection it reported that it did succeed in connecting to Hadoop, BUT it could not find the default directory or the root directory.

I have screenshots of the errors below:

[Screenshot: cluster test results showing the "User Home Directory Access" error]

[Screenshot: cluster test results showing the "Root Directory Access" error]

So, one can see that the errors are coming from directory access: 1) User Home Directory Access and 2) Root Directory Access.

I am using the HortonWorks shim, and I know it assumes some default directories (I have used the HortonWorks Hadoop virtual machine before).

(1) My question is: if I use the HortonWorks shim to connect to my "Vanilla" Hadoop installation, do I need to tweak some configuration file to set those default directories? (2) If I cannot use the HortonWorks shim, how do I install a "Vanilla" Hadoop shim?
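To clarify what I mean by default directories in question (1): I assume the shim expects an HDFS home directory for the user running Spoon, which the HortonWorks VM already has set up. On my vanilla install I am guessing I would have to create it by hand, roughly like this, where "palu" is just a placeholder for whatever Windows account launches Spoon:

REM Hypothetical setup on the vanilla single-node install (Windows command prompt).
REM "palu" is a placeholder for the account that actually runs Spoon.
hadoop fs -mkdir -p /user/palu
hadoop fs -chown palu /user/palu

REM Sanity checks matching the two failing tests in the screenshots
hadoop fs -ls /
hadoop fs -ls /user/palu

Is that the kind of tweak that is expected, or does the shim itself need configuration?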

I also found this related post from 2013 here on Stack Overflow:

Unable to connect to HDFS using PDI step

I am not sure how relevant that information is.

I hope someone who has experience with this can help out.

I forgot to add this additional information:

The contents of the core-site.xml file I have for Hadoop are:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

So that covers it.

Palu
  • Hi, I have found something here on Stackoverflow: https://stackoverflow.com/questions/25043374/unable-to-connect-to-hdfs-using-pdi-step?rq=1 – Palu Sep 12 '18 at 00:12
  • You do need to get the right shim for the distro. However, before using PDI, make sure your command-line tools are working - so can you do "hadoop fs -ls" etc.? And "yarn application -list" etc.? If they're not working, then there's no chance PDI will work! Once they're working, copy the conf into PDI, restart Spoon and have another go (a sketch of this is below these comments). – Codek Sep 20 '18 at 15:58
  • Hi, everything is working with Hadoop from the command line, so that is not an issue; I can do ls, mkdir, move files, etc. – Palu Sep 20 '18 at 16:00
  • In terms of shims, the default shims in PDI are all for the distros from the large companies; they no longer have the "Vanilla" type Hadoop shim, which they seemed to have a few years ago, based on videos I have seen on YouTube. – Palu Sep 20 '18 at 16:02
  • But as you can see from my screenshots, the HortonWorks shim does connect; it's just the permissions on the directories that seem to be the problem. – Palu Sep 20 '18 at 16:03
  • I am not certain which config file you are referring to that I would have to copy from Hadoop to PDI. – Palu Sep 20 '18 at 16:04
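To make Codek's suggestion concrete, this is roughly what "copy the conf into PDI" looks like in PDI 8.1. The shim folder name (hdp26 here) and the exact paths are assumptions and may differ in your install:

REM Assumed PDI 8.1 layout; adjust the shim folder name to the one you actually use.
cd data-integration\plugins\pentaho-big-data-plugin\hadoop-configurations\hdp26
copy %HADOOP_HOME%\etc\hadoop\core-site.xml .
copy %HADOOP_HOME%\etc\hadoop\hdfs-site.xml .
REM Then check that plugin.properties (one folder up) points at that shim:
REM   active.hadoop.configuration=hdp26
REM Finally restart Spoon so the shim picks up the new configuration.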

1 Answer


Often a failure to access the directories is related to the user.

When using Hadoop with Pentaho, the user who runs Pentaho needs to be the same user that exists on the Hadoop side (the one that owns the HDFS directories).

For example, if you have a user called jluciano on Hadoop, make sure a user with the same name exists on the system, and run the Pentaho process as that user; then the directory accesses will work :).

Give it a try and let me know if anything comes up.
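As a rough sketch of one way to line the users up on an unsecured single-node cluster: with simple (non-Kerberos) authentication, the Hadoop client takes its identity from the HADOOP_USER_NAME environment variable when it is set, so you can launch Spoon as the HDFS-side user without renaming the Windows account. The user name below is just the example from above; adapt it to your setup:

REM Sketch only, assuming simple (non-Kerberos) authentication.
REM "jluciano" is the example user from above; use the user that owns your HDFS directories.
set HADOOP_USER_NAME=jluciano
Spoon.bat

Alternatively, create an HDFS home directory owned by the account that actually runs Spoon, along the lines of the hadoop fs commands sketched in the question.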

Cristik