
I am using version 5.1 of Active Pivot but plan to upgrade to 5.2. I would like to read data in using the CsvSource and receive real-time updates.

Hal

1 Answer


Introduction

This answer explains a few things about how to read data from Hadoop into Active Pivot. This was tested with Active Pivot 5.1 and 5.2. In short, you have two ways to fill the gap:

  • Using a mounted HDFS, which makes your HDFS behave like a regular disk

  • Using the Hadoop Java API

Using a mounted HDFS

You can mount your HDFS easily with certain Hadoop distributions (for example, mounting HDFS with Cloudera CDH 5 is straightforward).

After doing so you will have a mount point on your Active Pivot server linked to your HDFS, and it will behave like an ordinary disk (at least for reading; writing has some limitations).

For instance, if you have CSV files on your HDFS, you will be able to use the Active Pivot Csv Source on them directly.
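
To illustrate, once HDFS is mounted, reading a file is plain local file I/O, and the Csv Source can be pointed at the mounted path like any other directory. The mount point /mnt/hdfs and the file path below are hypothetical examples:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class MountedHdfsRead
{
    public static void main(String[] args) throws IOException
    {
        // The file physically lives on HDFS, but through the mount point it
        // is read like any local file.
        Files.lines(Paths.get("/mnt/hdfs/user/quartetfs/data/file.csv"))
             .forEach(System.out::println);
    }
}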

Using Hadoop Java API

Another way is to use Hadoop Java API: http://hadoop.apache.org/docs/current/api/

A few of the main classes to use:

org.apache.hadoop.fs.FileSystem - Used for common operations with Hadoop.

org.apache.hadoop.conf.Configuration - Used to configure the FileSystem object.

org.apache.hadoop.hdfs.client.HdfsAdmin - Can be used to watch events (e.g. a new file added to HDFS).

Note: watching for events is available with Hadoop 2.6.0 and higher. For earlier Hadoop versions you could either build your own mechanism or use a mounted HDFS with an existing file watcher.

Dependencies

You will need a few Hadoop dependencies.

Beware: there can be conflicts between the Hadoop dependencies and the Active Pivot ones on JAXB.

In the following pom.xml snippet, the solution was to exclude the JAXB dependencies from the Hadoop dependencies.

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-auth</artifactId>
    <version>2.6.0</version>
</dependency>
<!-- These 2 dependencies have conflicts with ActivePivotCva on Jaxb -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.6.0</version>
    <exclusions>
        <exclusion>
            <groupId>com.sun.xml.bind</groupId>
            <artifactId>jaxb-impl</artifactId>
        </exclusion>
        <exclusion>
            <groupId>javax.xml.bind</groupId>
            <artifactId>jaxb-api</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.2.1</version>
    <exclusions>
        <exclusion>
            <groupId>com.sun.xml.bind</groupId>
            <artifactId>jaxb-impl</artifactId>
        </exclusion>
        <exclusion>
            <groupId>javax.xml.bind</groupId>
            <artifactId>jaxb-api</artifactId>
        </exclusion>
    </exclusions>
</dependency>

Properties

You will need to define at least 2 properties:

  • Hadoop address (Ex: hdfs://localhost:9000)

  • HDFS path to your files (Ex: /user/quartetfs/data/)

If your cluster is secured, you will also need to work out how to access it remotely in a secure way.
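
For instance, if the cluster uses Kerberos, a common approach is to authenticate with UserGroupInformation before opening the FileSystem. This is only a sketch and not Active Pivot specific; the principal and keytab path below are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;

Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://localhost:9000");
conf.set("hadoop.security.authentication", "kerberos");

// Authenticate once with the keytab; subsequent FileSystem calls run as this user
UserGroupInformation.setConfiguration(conf);
UserGroupInformation.loginUserFromKeytab("myuser@MY.REALM", "/path/to/myuser.keytab");

FileSystem hdfs = FileSystem.get(conf);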

Example of reading a file from Hadoop

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Configuring
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://localhost:9000");

FileSystem hdfs = FileSystem.get(conf);
Path filePath = new Path("/user/username/input/file.txt");

// Reading line by line; try-with-resources closes the stream when done
try (BufferedReader bfr = new BufferedReader(new InputStreamReader(hdfs.open(filePath))))
{
    String str;
    while ((str = bfr.readLine()) != null)
    {
        System.out.println(str);
    }
}

Hadoop Source

Once you are able to read from your HDFS, you can write your Hadoop source as you would for any other source.

For instance, you could create a HadoopSource class implementing ISource.

You could then start it in your SourceConfig, where you would retrieve your properties from your environment.

Watching for events (Ex: new files)

If you want to retrieve files as soon as they are stored on HDFS, you can create another class that watches for events.

An example is the following code, in which you would plug in your own methods to handle certain events (in this code: onCreation() and onAppend()).

// Imports needed (Hadoop 2.6.0+):
// import java.io.IOException;
// import org.apache.hadoop.hdfs.DFSInotifyEventInputStream;
// import org.apache.hadoop.hdfs.client.HdfsAdmin;
// import org.apache.hadoop.hdfs.inotify.Event;
// import org.apache.hadoop.hdfs.inotify.Event.AppendEvent;
// import org.apache.hadoop.hdfs.inotify.Event.CreateEvent;
// import org.apache.hadoop.hdfs.inotify.MissingEventsException;
// LOGGER is a java.util.logging.Logger field of the enclosing class.

protected HdfsAdmin admin;
protected String threadName;

public void run()
{
    DFSInotifyEventInputStream eventStream;

    try
    {
      eventStream = admin.getInotifyEventStream();
      LOGGER.info(" - Thread: " + this.threadName + " - Starting to catch events.");

      while (true)
      {
        try
        {
          // Blocks until the next HDFS event is available
          Event event = eventStream.take();

          // Possible event types: CREATE, APPEND, CLOSE, RENAME, METADATA, UNLINK
          switch (event.getEventType())
          {
          case CREATE:
            CreateEvent createEvent = (CreateEvent) event;
            onCreation(createEvent.getPath());
            break;

          case APPEND:
            AppendEvent appendEvent = (AppendEvent) event;
            onAppend(appendEvent.getPath());
            break;

          default:
            break;
          }

        } catch (InterruptedException e) {
          // Restore the interrupt flag and stop watching
          Thread.currentThread().interrupt();
          return;

        } catch (MissingEventsException e) {
          e.printStackTrace();
        }
      }
    } catch (IOException e1) {
      LOGGER.severe(" - Thread: " + this.threadName + " - Failed to start the event stream.");
      e1.printStackTrace();
    }
}
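
To give an idea of how such a watcher could be wired up, the HdfsAdmin can be built from the same configuration and the run() method started on its own thread. The class name HdfsFileWatcher and its constructor below are assumptions for illustration only:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.client.HdfsAdmin;

Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://localhost:9000");

// HdfsAdmin takes the NameNode URI and the Hadoop configuration
HdfsAdmin admin = new HdfsAdmin(URI.create("hdfs://localhost:9000"), conf);

// Hypothetical class wrapping the fields and run() method shown above
HdfsFileWatcher watcher = new HdfsFileWatcher(admin, "hdfs-events");
new Thread(watcher, "hdfs-events").start();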

What I did for my onCreation method (not shown) was to store newly created files in a concurrent queue, so that my HadoopSource could retrieve several files in parallel.
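
As a sketch of that idea (the field and method below are illustrative, not the exact code), onCreation() can simply push the new path onto a concurrent queue that the HadoopSource drains:

import java.util.concurrent.ConcurrentLinkedQueue;

// Shared with the HadoopSource, which polls it and loads the files in parallel
protected final ConcurrentLinkedQueue<String> newFiles = new ConcurrentLinkedQueue<>();

protected void onCreation(String path)
{
    // Only enqueue the path here; the HadoopSource decides when to read the file
    newFiles.offer(path);
}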

-

If I have not been clear enough on certain aspects or if you have questions feel free to ask.

AurelienC
  • Hi, a question related to watching for new files: do we receive a notification when a file is changed or created, or do we have to poll? Can you add more details please? Thanks – tuxmobil Jul 17 '15 at 01:42
  • Hi, it is a polling mechanism. Here is more information: http://johnjianfang.blogspot.co.uk/2015/03/hdfs-6634-inotify-in-hdfs.html?m=1 – AurelienC Jul 17 '15 at 11:14