
I have 2 text files stored in Hadoop that I want to use to create a Graph in Apache Spark GraphX:

  1. A text file with Vertex information, including a GUID of type String identifying each Vertex.
  2. A text file with Edge information, including two GUIDs of type String identifying the source and destination Vertices.

I import these files into HCatalog tables so that I can access them from Spark using a HiveContext.

My understanding is that:

To proceed with my project, I want to extend both tables with an additional column of type Long, derived from the GUID, to serve as the VertexId in GraphX. Pig doesn't offer a function comparable to Java's UUID.getMostSignificantBits() to convert a UUID/GUID to a Long.
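For reference, the Java call mentioned above can be sketched as follows. The class name GuidToLong and the sample GUIDs are made up for illustration. Note that this approach keeps only the upper 64 bits of the GUID, so two GUIDs that differ only in their lower half end up with the same Long.

```java
import java.util.UUID;

class GuidToLong {
    // Returns the upper 64 bits of the GUID as a long; the lower 64 bits are discarded.
    static long mostSignificantBits(String guid) {
        return UUID.fromString(guid).getMostSignificantBits();
    }

    public static void main(String[] args) {
        // Hypothetical sample GUIDs that share their first 16 hex digits.
        String a = "123e4567-e89b-12d3-a456-426614174000";
        String b = "123e4567-e89b-12d3-ffff-ffffffffffff";
        System.out.println(mostSignificantBits(a));
        // Same upper half, so both GUIDs map to the same Long.
        System.out.println(mostSignificantBits(a) == mostSignificantBits(b));
    }
}
```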

The "evaluation" section of the Piggybank UDFs includes a function called HashFNV. Although I am not a Java developer, I understand from the Java source code that the function takes an input of type String and returns a hash of type Long. Its output schema declares the result as a column of DataType.LONG, so the input table can be extended with a new Long column.
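To illustrate the general idea of an FNV-style hash from a String to a Long, here is a minimal sketch of the standard 64-bit FNV-1a algorithm. This is not the Piggybank implementation: the class name Fnv64Sketch is made up, and the FNV variant, constants and effective bit width used by HashFNV may differ, so it is worth checking the Piggybank source before relying on the size of its hash space.

```java
import java.nio.charset.StandardCharsets;

class Fnv64Sketch {
    // Standard 64-bit FNV offset basis and prime.
    static final long OFFSET_BASIS = 0xcbf29ce484222325L;
    static final long PRIME = 0x100000001b3L;

    // FNV-1a: for each byte, XOR it into the hash, then multiply by the prime.
    // Java long arithmetic wraps modulo 2^64, which is what the algorithm needs.
    static long fnv1a64(String s) {
        long hash = OFFSET_BASIS;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= PRIME;
        }
        return hash;
    }

    public static void main(String[] args) {
        // Hypothetical sample GUID; the result may be negative because Java longs are signed.
        System.out.println(fnv1a64("123e4567-e89b-12d3-a456-426614174000"));
    }
}
```

The returned value is a signed long, so roughly half of all inputs produce a negative id.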

Questions:

  1. Is running the HashFNV function from the Piggybank jar in Pig a usable and pragmatic way to generate VertexIds of type Long from an input table/file with GUID information?
  2. How do I call and use the HashFNV function within Pig after I have registered the Piggybank jar? Can you provide example code?

Assumptions:

  • A unique GUID will result in a unique hash of type Long using HashFNV.
  • I do understand that a GUID of 128 bits will not fit into a Long of 64 bits. However, the number of GUIDs in the input file will not exceed the 64-bit space.
Luc

1 Answer


The answer is the following Pig script:

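-- Make the Piggybank UDFs available to this script
REGISTER piggybank.jar;
-- The input is assumed to use PigStorage's default tab delimiter
A = LOAD '/user/hue/guidfile.txt'
    AS (guid:chararray, name:chararray, label:chararray);
-- Append the hash as a fourth field; outer parentheses would wrap all fields in a single tuple
B = FOREACH A GENERATE guid, name, label,
    org.apache.pig.piggybank.evaluation.string.HashFNV(guid) AS vertexid;
STORE B INTO '/user/hue/guidlongfile.txt';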

The result includes an additional field of type Long.

The name and label fields are included only to represent a typical Vertex table, which has name and label attributes in addition to the GUID field. They play no role in generating the hash.

It looks like I have found a way to generate VertexIds of type Long from a GUID of type String. I have noticed that others who want to experiment with Apache Spark GraphX using their own data run into the same issue.

If you want to reuse this solution, be aware that a 64-bit Long has a much smaller value space than a 128-bit GUID, so two different GUIDs can in principle hash to the same Long.
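One way to check for such clashes on a sample of the data is to group the GUIDs by their generated id and look for ids shared by more than one GUID. The following is only a sketch: CollisionCheck and findCollisions are hypothetical names, and the id function is passed in as a parameter so that it can be whatever function produced the Long column. The demo uses the upper 64 bits of the UUID simply because that makes a clash easy to provoke.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.function.ToLongFunction;

class CollisionCheck {
    // Groups GUIDs by their Long id and keeps only the ids shared by two or more distinct GUIDs.
    static Map<Long, List<String>> findCollisions(List<String> guids, ToLongFunction<String> toId) {
        Map<Long, List<String>> byId = new LinkedHashMap<>();
        for (String guid : guids) {
            List<String> bucket = byId.computeIfAbsent(toId.applyAsLong(guid), k -> new ArrayList<>());
            if (!bucket.contains(guid)) {
                bucket.add(guid); // ignore exact duplicates of the same GUID
            }
        }
        byId.values().removeIf(bucket -> bucket.size() < 2);
        return byId;
    }

    public static void main(String[] args) {
        // Hypothetical sample: the first two GUIDs share their upper 64 bits.
        List<String> guids = Arrays.asList(
                "123e4567-e89b-12d3-a456-426614174000",
                "123e4567-e89b-12d3-ffff-ffffffffffff",
                "00000000-0000-0001-0000-000000000000");
        ToLongFunction<String> upperBits = g -> UUID.fromString(g).getMostSignificantBits();
        System.out.println(findCollisions(guids, upperBits));
    }
}
```

An empty result means no two distinct GUIDs in the sample received the same id. It says nothing about GUIDs outside the sample.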

Luc