I am analyzing log files with various domain names using Cascading. Here is an example of the output report after it has been filtered:

www.google.nl 3

www.google.it 3

www.google.com.co 3

www.google.com.hk 3

www.google.co.jp 3

I would like to group all domains that contain "google" into one entry, so that the output report has a single line for all Google domains. Something like this:

www.google.com 15

or

google 15

Do you think this is possible? Any ideas?

ekad
cevallos.valtira

2 Answers

As long as you understand how to set up taps and tie them to your pipes, you can use a regex operation such as RegexMatcher to match ^www\\.google.* and place the result in a separate column, then use CountBy to come up with a count. Note that CountBy counts rows, so it fits raw log lines; if each row already carries a per-domain count, as in your report, the counts need to be summed instead.

You should be able to accomplish this specific task with two pipes: one to pull "google" out of your links and the other to count them.
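
The pattern itself can be checked outside Cascading. Below is a minimal sketch in plain Java using only java.util.regex, not Cascading's own classes; the class name GoogleHostMatcher and the sample hosts are assumptions made for illustration. The doubled backslash is Java string escaping for a literal dot.

```java
import java.util.List;
import java.util.regex.Pattern;

public class GoogleHostMatcher {
    // Same pattern as in the answer; "\\." in Java source is a literal dot in the regex.
    private static final Pattern GOOGLE_HOST = Pattern.compile("^www\\.google.*");

    // Returns true when the whole host string starts with "www.google".
    public static boolean isGoogleHost(String host) {
        return GOOGLE_HOST.matcher(host).matches();
    }

    public static void main(String[] args) {
        // Hypothetical sample hosts, two taken from the question's report.
        List<String> hosts = List.of("www.google.nl", "www.google.co.jp", "www.example.com");
        for (String host : hosts) {
            System.out.println(host + " -> " + isGoogleHost(host));
        }
    }
}
```

Be aware that this pattern also accepts hosts such as www.googleplex.com, so a stricter expression may be needed if that matters for the report.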

Hope this helps!

Engineiro

It is possible in Cascading. Suppose your field names are (url, count). Apply a function that adds one more field named "domain", containing the value google if the row's url contains the word google, and discard the url field. If you don't need any other domains, filter them out. You now have two fields (domain, count), where domain contains only the word google.
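
The mapping step described above can be sketched in plain Java, independent of Cascading. The class DomainLabeler and the method toDomain are hypothetical names, and the case-insensitive "contains" check is an assumption; in a real flow this logic would sit inside a custom Cascading function followed by a filter.

```java
import java.util.Locale;
import java.util.Optional;

public class DomainLabeler {
    // Hypothetical helper: returns "google" for any URL containing "google",
    // and an empty Optional for everything else (meaning the row is filtered out).
    public static Optional<String> toDomain(String url) {
        if (url != null && url.toLowerCase(Locale.ROOT).contains("google")) {
            return Optional.of("google");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(toDomain("www.google.it"));
        System.out.println(toDomain("www.example.com"));
    }
}
```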

Now use Cascading's AggregateBy and SumBy classes. The general form is:

SumBy any_name = new SumBy(field_name_to_sum, field_name_after_sum, dataType.class);

Pipe result = new AggregateBy("name", Pipe.pipes(sourcePipeName), name_of_groupBy_field, threshold, name_of_sumBy_instance);

Here threshold is an int that sets the size of the in-memory aggregation cache; it is not the number of SumBy instances.

In your case it becomes:

SumBy xyz = new SumBy(new Fields("count"), new Fields("combined_count"), Integer.class);

Pipe result = new AggregateBy("result", Pipe.pipes(sourcePipeName), new Fields("domain"), 1, xyz);

The result pipe now contains a single row (google, combined_count).

The code snippet above works like the following SQL query:

select domain, sum(count) from source group by domain;
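
For comparison, the same group-and-sum can be written in plain Java without Cascading. This is only an illustrative sketch: the class DomainSum, the method sumByDomain, and the use of Map.Entry for a (domain, count) row are assumptions, and the input mirrors the five rows from the question.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DomainSum {
    // Equivalent of: select domain, sum(count) from source group by domain
    // Each entry is one (domain, count) row.
    public static Map<String, Integer> sumByDomain(List<Map.Entry<String, Integer>> rows) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> row : rows) {
            // Add this row's count to the running total for its domain.
            totals.merge(row.getKey(), row.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Five google rows with a count of 3 each, as in the question's report.
        List<Map.Entry<String, Integer>> rows = List.of(
            Map.entry("google", 3), Map.entry("google", 3), Map.entry("google", 3),
            Map.entry("google", 3), Map.entry("google", 3));
        System.out.println(sumByDomain(rows)); // prints {google=15}
    }
}
```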