6

My map function produces:

Key\tValue

where Value = List(value1, value2, value3)

Then my reduce function produces:

Key\tCSV-Line

Ex.


2323232-2322 fdsfs,sdfs,dfsfs,0,0,0,2,fsda,3,23,3,s,

2323555-22222 dfasd,sdfas,adfs,0,0,2,0,fasafa,2,23,s


Ex. RawData: 232342|@3423@|34343|sfasdfasdF|433443|Sfasfdas|324343 x 1000

Anyway, I want to eliminate the keys at the beginning of each line so my client can do a straight import into MySQL. I have about 50 data files. My question is: once the map phase is done and the reducer starts, does the reducer need to print the key out with the value, or can I just print the value?


More information:

This code might shed some more light on the situation:

http://pastebin.ca/2410217

This is roughly what I plan to do, as in the sketch below.
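In rough outline, the reduce.py I have in mind looks like this (a simplified sketch, not the actual pastebin code; it assumes the mapper emits key\tvalue lines and that Streaming delivers them to the reducer sorted by key):

#!/usr/bin/env python
# reduce.py - simplified sketch (not the actual pastebin code).
# Hadoop Streaming hands the reducer lines sorted by key: key<TAB>value
# Collect every value for a key, then emit them as one CSV line WITHOUT the key.
import sys

def flush(key, fields):
    if key is not None:
        sys.stdout.write(','.join(fields) + '\n')  # value only: no key, no tab

current_key = None
fields = []

for line in sys.stdin:
    line = line.rstrip('\n')
    if not line:
        continue
    key, _, value = line.partition('\t')
    if key != current_key:
        flush(current_key, fields)
        current_key, fields = key, []
    fields.append(value)

flush(current_key, fields)  # don't forget the last key

Locally I test it with something like cat datafile | python map.py | sort | python reduce.py (the file names are placeholders).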

Jake Steele
  • 498
  • 3
  • 14
  • Could you please rephrase your question? Do you want to emit only the values and not the keys? I'm sorry, I didn't quite get it. – Tariq Jun 27 '13 at 01:23
  • Yes, that's exactly what I want haha, sorry for being so unclear. I just want to make sure that when I use multiple servers on multiple data files, emitting only the values and not the keys in reduce.py won't break the whole operation – Jake Steele Jun 27 '13 at 16:20

2 Answers

13

If you do not want to emit the key, set it to NullWritable in your code. For example:

public static class TokenCounterReducer extends
        Reducer<Text, IntWritable, NullWritable, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // Write NullWritable instead of the key so only the value is emitted.
        context.write(NullWritable.get(), new IntWritable(sum));
        // context.write(key, new IntWritable(sum));
    }
}
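With NullWritable as the output key, TextOutputFormat skips the tab separator as well, so each line in the output file contains only the value.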

Let me know if this is not what you need; I'll update the answer accordingly.

Tariq
  • 34,076
  • 8
  • 57
  • 79
  • Thanks for the response. I believe that is C# or Java by the looks of it; I am currently using Python. I will update my question with some code to make it more obvious :D – Jake Steele Jun 27 '13 at 15:58
  • Added some code to maybe help me get this resolved :) http://pastebin.ca/2410217 Maybe this explains a little better what I am doing, and I want to know if it works haha – Jake Steele Jun 28 '13 at 00:10
2

Your reducer can emit a line without a \t, or, in your case, just what you're calling the value. Unfortunately, Hadoop Streaming will interpret this as a key with a null value and automatically append a delimiter (\t by default) to the end of each line. You can change what this delimiter is, but when I played around with this I could not get it to stop appending one. I don't remember the exact details, but based on this question (Hadoop: key and value are tab separated in the output file. how to do it semicolon-separated?) I think the property is mapred.textoutputformat.separator. My solution was to strip the \t at the end of each line as I pulled the file back:

hadoop fs -cat hadoopfile | perl -pe 's/\t$//' > destfile
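If perl isn't available, an equivalent filter in Python works too (striptab.py is just a placeholder name):

# striptab.py - drop the single trailing tab Streaming appends, like s/\t$//
import sys

for line in sys.stdin:
    line = line.rstrip('\n')
    if line.endswith('\t'):
        line = line[:-1]
    sys.stdout.write(line + '\n')

hadoop fs -cat hadoopfile | python striptab.py > destfile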
John Pickard
  • 308
  • 1
  • 8
  • My output looks correct as a CSV file; however, you are saying that if my reducer outputs a value without a key, it will just make a null key and won't work anyway? I pasted my code, maybe you can take a look and see if it will work: http://pastebin.ca/2410217 – Jake Steele Jun 28 '13 at 00:08
  • See, I don't output any key, just the values (CSV separated), but you are saying that this will just cause EMR to say it's a null key and not work? – Jake Steele Jun 28 '13 at 00:10
  • EMR is Elastic MapReduce from Amazon? I haven't used that. We're running a "vanilla" Hadoop cluster which we submit jobs to and pull data down from. In that environment, if your reducer outputs a row that doesn't contain a delimiter, Hadoop adds a delimiter to the end of the row. I think it is interpreting the row as a key with a null value. As for running on EMR, I can't guess. Can you test it with a small dataset? On a side note, it looks like you're adding an extra ',' to the end of each row. – John Pickard Jun 29 '13 at 15:30