Please have a look at the following code:

wcmapper.php (mapper for hadoop streaming job)

#!/usr/bin/php
<?php
//sample mapper for hadoop streaming job
$word2count = array();

// input comes from STDIN (standard input)
while (($line = fgets(STDIN)) !== false) {
   // remove leading and trailing whitespace and lowercase
   $line = strtolower(trim($line));
   // split the line into words while removing any empty string
   $words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);
   // increase counters
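   foreach ($words as $word) {
       // initialize the counter the first time a word is seen
       if (!isset($word2count[$word])) {
           $word2count[$word] = 0;
       }
       $word2count[$word] += 1;
   }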
}

// write the results to STDOUT (standard output)

foreach ($word2count as $word => $count) {
   // tab-delimited
   echo "$word\t$count\n";
}

?>

wcreducer.php (reducer script for sample hadoop job)

#!/usr/bin/php
<?php
//reducer script for sample hadoop job
$word2count = array();

// input comes from STDIN
while (($line = fgets(STDIN)) !== false) {
    // remove leading and trailing whitespace
    $line = trim($line);
    // parse the input we got from mapper.php
    list($word, $count) = explode("\t", $line);
    // convert count (currently a string) to int
    $count = intval($count);
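    // sum counts, initializing the entry the first time a word is seen
    if ($count > 0) {
        if (!isset($word2count[$word])) {
            $word2count[$word] = 0;
        }
        $word2count[$word] += $count;
    }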
}

ksort($word2count);  // sort the words alphabetically

// write the results to STDOUT (standard output)
foreach ($word2count as $word => $count) {
    echo "$word\t$count\n";
}

?>

This code runs a word-count streaming job in PHP on the Common Crawl dataset.

Both scripts read the entire input. That is not what I need: I want to read only the first 100 lines and write them to a text file. I am a beginner with Hadoop, Common Crawl, and PHP. How can I do this?

Please help.

Dongle

2 Answers

Use a counter in the first loop and stop when it reaches 100. After that, add a dummy loop that just reads to the end of the input, then continue with your code (write the results to STDOUT). You could also write the results before the dummy loop instead. Sample code follows:

...
// input comes from STDIN (standard input)
for ($i = 1; $i <= 100; $i++) {
   // read the line from STDIN and stop early if the input
   // has fewer than 100 lines
   $line = fgets(STDIN);
   if ($line === false) {
       break;
   }
   // remove leading and trailing whitespace and lowercase
   $line = strtolower(trim($line));
   // split the line into words while removing any empty string
   $words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);
   // increase counters
   foreach ($words as $word) {
       if (!isset($word2count[$word])) {
           $word2count[$word] = 0;
       }
       $word2count[$word] += 1;
   }
}

// write the results to STDOUT (standard output)
foreach ($word2count as $word => $count) {
   // tab-delimited
   echo "$word\t$count\n";
}

// Dummy loop (to consume all the mapper input; it may work
// without this loop but I am not sure if this will confuse the
// Hadoop framework; you can try it without this loop and see)
while (($line = fgets(STDIN)) !== false) {
}
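
The same idea can also be sketched as a self-contained function, which makes it easier to try outside Hadoop. This is only a sketch under a few assumptions: the function name countWordsInFirstLines and its $limit parameter are made up for illustration and are not part of the original scripts, and draining the rest of the input is kept only as a precaution, since I am not sure whether Hadoop Streaming needs it.

```php
<?php
// Hypothetical helper (not part of the original scripts): count the words
// in at most $limit lines of $stream, then read and discard the rest.
function countWordsInFirstLines($stream, $limit = 100)
{
    $word2count = array();
    $linesRead = 0;

    // Stop after $limit lines, or earlier if the input runs out.
    while ($linesRead < $limit && ($line = fgets($stream)) !== false) {
        $linesRead++;
        // remove leading and trailing whitespace and lowercase
        $line = strtolower(trim($line));
        // split on runs of non-word characters, dropping empty strings
        $words = preg_split('/\W+/', $line, -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            // initialize the counter the first time a word is seen
            if (!isset($word2count[$word])) {
                $word2count[$word] = 0;
            }
            $word2count[$word] += 1;
        }
    }

    // Dummy loop: consume whatever input is left (precaution only).
    while (fgets($stream) !== false) {
        // intentionally empty
    }

    return $word2count;
}

// Possible use inside the mapper (assumption: input arrives on STDIN):
// foreach (countWordsInFirstLines(STDIN) as $word => $count) {
//     echo "$word\t$count\n";
// }
```

Putting the limit in the loop condition means no extra line is read once the limit is reached, and the check against false covers inputs shorter than the limit.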
cabad
  • wow, this worked. Can you please tell me how to take output into a text file as well? Please help. – Dongle Dec 31 '13 at 20:36
  • @Dongle For your second question, I am not sure what you are asking, since your code will write the output to a text file (in HDFS). Please post another question with more details on what you want to achieve, what you've tried, and why it doesn't work. – cabad Dec 31 '13 at 20:40
  • Really? But where is this text file? I can't find it. – Dongle Dec 31 '13 at 22:06
  • how can I write it as a .txt? – Dongle Dec 31 '13 at 22:11
  • @Dongle This is why I suggested you post another question. We need more info. to answer this question. How do you run your program? The "-output" flag tells Hadoop Streaming where to store the output. The output is already a text file, regardless of the extension you use. But you can do "-output /path/in/hdfs/out.txt" if you wish. – cabad Dec 31 '13 at 22:12

I'm not sure how you define "lines" but if you wanted words you could do something like this:

// $word2count is keyed by word, so take its first 100 entries
foreach (array_slice($word2count, 0, 100, true) as $word => $count) {
      echo "$word\t$count\n";
}
Jesse Clark