
I was going through the patent data examples in Hadoop in Action. Could you please explain in detail the data sets being used?

  1. The patent citation data set
    This data set contains two columns, citing and cited. Does the citing column refer to the ID of the owner who submitted the patent? Does the cited column refer to the patent ID that forms the key to the second data set?

  2. The patent description data set
    There are a number of fields in this data set. To map between the two data sets, is it the citing or the cited column of the first data set that has a corresponding key in the first column (patent) of the second data set?

vefthym

2 Answers


Let's clear up some terminology related to patents to begin with.

What is a citation?

Citations are documents that are linked together when one document mentions another as having related content.

Refer to this link to understand more about patents :)

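The "patent citation data set": this data set just lists patent citations, one pair of patent numbers per row.

It is like saying patent A uses patents B, C and D, with A in the CITING column and B, C and D each in the CITED column of its own row.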

    "CITING","CITED"
    3858241,956203
    3858241,1324234
    3858241,3398406
    3858241,3557384
    3858241,3634889
    3858242,1515701
    3858242,3319261
    3858242,3668705
    3858242,3707004

I copied this from the book. Here patent number 3858241 cites (uses/refers to) 5 other patents, and patent number 3858242 cites (uses/refers to) 4 other patents.
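To make that reading of the two columns concrete, here is a minimal Python sketch that groups the sample rows by the CITING column. The function name `cited_by_citing` and the `SAMPLE` constant are made up for illustration; this is not code from the book.

```python
import csv
from collections import defaultdict

# The sample rows quoted in this answer, header included.
SAMPLE = '''"CITING","CITED"
3858241,956203
3858241,1324234
3858241,3398406
3858241,3557384
3858241,3634889
3858242,1515701
3858242,3319261
3858242,3668705
3858242,3707004
'''


def cited_by_citing(lines):
    """Map each citing patent number to the list of patent numbers it cites."""
    reader = csv.reader(lines)
    next(reader)  # skip the "CITING","CITED" header row
    result = defaultdict(list)
    for citing, cited in reader:
        result[citing].append(cited)
    return dict(result)


if __name__ == "__main__":
    for citing, cited in cited_by_citing(SAMPLE.splitlines()).items():
        print(citing, "cites", len(cited), "patents:", ", ".join(cited))
```

Both keys and values here are patent numbers; no owner or inventor ID appears in this data set.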

The patent description data set is a bit like a master table: it just holds the data for each patent.
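Since both citation columns hold patent numbers, either one can be looked up against the patent column of the description data set. A rough sketch of that lookup follows, using toy rows invented for illustration (they are not real patent records), a hypothetical helper name `describe_citations`, and the assumption that the description rows have already been loaded into a dict keyed by patent number.

```python
def describe_citations(citations, descriptions):
    """Yield (citing_record, cited_record) for each citation pair.

    Either record is None when that patent number has no row in
    `descriptions`, so callers need to handle missing lookups.
    """
    for citing, cited in citations:
        yield descriptions.get(citing), descriptions.get(cited)


# Toy data, invented for illustration only.
DESCRIPTIONS = {
    "100": {"PATENT": "100", "GYEAR": "1975"},
    "200": {"PATENT": "200", "GYEAR": "1970"},
}
CITATIONS = [("100", "200"), ("100", "300")]  # "300" has no description row

if __name__ == "__main__":
    for citing_rec, cited_rec in describe_citations(CITATIONS, DESCRIPTIONS):
        print(citing_rec, "->", cited_rec)
```

Which column you join on depends on the question: join on the citing column to describe the patents that make citations, or on the cited column to describe the patents being referred to.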

Hopefully that clears up a few things for you.

Sudarshan

I guess there was a misunderstanding about the solution to the Top K records exercise in section 4.7 of the HiA book, which says: "Top K records—Change AttributeMax.py (or AttributeMax.php) to output the entire record rather than only the maximum value. Rewrite it such that the MapReduce job outputs the records with the top K values rather than only the maximum."

The input data set to be used is actually the apat63_99.txt file, and the exercise asks for the records with the top K values (CLAIMS) rather than only the maximum, whereas AttributeMax.py, as described in listing 4.6, only handles the maximum number of claims.
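One way to approach the exercise, as a sketch only and not the book's solution, is to keep a bounded min-heap of the K best records seen so far. The function name `top_k_records` is hypothetical, and `index` is whichever column holds CLAIMS in your copy of the file; the code does not assume a particular position.

```python
import heapq


def top_k_records(lines, index, k):
    """Return the k lines whose field at `index` is largest, largest first.

    Lines where that field is not a plain non-negative integer (the header
    row, or a missing value) are skipped.
    """
    if k <= 0:
        return []
    heap = []  # min-heap of (value, line) holding at most k entries
    for line in lines:
        line = line.strip()
        fields = line.split(",")
        if index < len(fields) and fields[index].isdigit():
            item = (int(fields[index]), line)
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                # Replace the current smallest of the kept records.
                heapq.heappushpop(heap, item)
    return [line for _, line in sorted(heap, reverse=True)]
```

In a Hadoop Streaming job the same function could plausibly serve both stages: each mapper emits the top K records of its own split, and a single reducer applies the function again to the mappers' combined output to get the global top K.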

Elkhan Dadashov