
I have JSON records like:

{
  "name":"someone",
  "job":"doctor",
  "etc":"etc"
}

In every JSON record there is a different value for "job", like doctor, pilot, driver, watchman, etc. I want to separate each record based on the "job" value and store it in a different location, like /home/doctor, /home/pilot, /home/driver, etc.

I have tried the SplitStream function to do this, but I have to specify those values to match the condition:

public class MyFlinkJob {   
    private static JsonParser jsonParser = new JsonParser();
    private static String key_1 = "doctor";
    private static String key_2 = "driver";
    private static String key_3 = "pilot";
    private static String key_default = "default";

    public static void main(String args[]) throws Exception {
        Properties prop = new Properties();
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", kafka);
        props.setProperty("group.id", "myjob");

        FlinkKafkaConsumer<String> myConsumer = new FlinkKafkaConsumer<>("topic", new SimpleStringSchema(), props);
        DataStream<String> record = env.addSource(myConsumer).rebalance();

        SplitStream<String> split = record.split(new OutputSelector<String>() {
            @Override
            public Iterable<String> select(String val) {
                JsonObject json = (JsonObject)jsonParser.parse(val);
                String jsonValue = CommonFields.getFieldValue(json, "job");
                List<String> output = new ArrayList<String>();

                if (key_1.equalsIgnoreCase(jsonValue)) {
                    output.add("doctor");
                } else if (key_2.equalsIgnoreCase(jsonValue)) {
                    output.add("driver");
                } else if (key_3.equalsIgnoreCase(jsonValue)) {
                    output.add("pilot");
                } else {
                    output.add("default");
                }
                return output;
            }});

        DataStream<String> doctor = split.select("doctor");
        DataStream<String> driver = split.select("driver");
        DataStream<String> pilot = split.select("pilot");
        DataStream<String> default1 = split.select("default");
        doctor.addSink(getBucketingSink(batchSize, prop, key_1));
        driver.addSink(getBucketingSink(batchSize, prop, key_2));
        pilot.addSink(getBucketingSink(batchSize, prop, key_3));
        default1.addSink(getBucketingSink(batchSize, prop, key_default));
        env.execute("myjob");
    }

public static BucketingSink<String> getBucketingSink(Long batchSize, Properties prop, String key) {
    BucketingSink<String> sink = new BucketingSink<String>("hdfs://*/home/"+key);
    Configuration conf = new Configuration();
    conf.set("hadoop.job.ugi", "hdfs");
    sink.setFSConfig(conf);
    sink.setBucketer(new DateTimeBucketer<String>(prop.getProperty("DateTimeBucketer")));
    return sink;
}
}

Suppose some other value comes in "job", like engineer or something else, which I have not specified in the class; then it goes to the default folder. Is there any way to split those JSON events automatically based on the value of "job", without specifying it, and create a path that contains the value's name, like /home/engineer?

Gaurav

1 Answer


You want to use the BucketingSink, which supports writing records out into separate buckets based on the value of a field. I'd probably add a map function that takes in the JSON string, parses it, and emits a Tuple2<String, String>, where the first element is the value of the job field in the JSON, and the second element is the full JSON string.
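A minimal sketch of such a map function, assuming a Flink 1.x release and the same Gson `JsonParser` already used in the question (the class name `JobTagger` is my own invention):

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

// Tags each raw JSON string with the value of its "job" field, so a
// downstream Bucketer can route records by that value.
public class JobTagger implements MapFunction<String, Tuple2<String, String>> {
    private static final JsonParser jsonParser = new JsonParser();

    @Override
    public Tuple2<String, String> map(String value) {
        JsonObject json = jsonParser.parse(value).getAsJsonObject();
        // Fall back to "default" when the field is absent, mirroring the
        // else-branch of the original SplitStream version.
        String job = json.has("job") ? json.get("job").getAsString() : "default";
        return Tuple2.of(job, value);
    }
}
```

You'd then apply it with something like `DataStream<Tuple2<String, String>> tagged = record.map(new JobTagger());`.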

kkrugler
  • Thanks for the reply @kkrugler. As you said, I have now created a function which emits the Tuple2, but I am confused about how to use this with the BucketingSink. – Gaurav Apr 30 '19 at 08:44
  • The JavaDocs for BucketingSink that I linked to in my answer show the general form for how to use it. For your particular case, you'd need a class that implements the Bucketer interface, and constructs the bucket path based on the `.f0` field in the element that its `getBucketPath()` method is passed. – kkrugler Apr 30 '19 at 20:31
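A sketch of such a `Bucketer`, assuming the Flink 1.x `BucketingSink` API from the question (the class name `JobFieldBucketer` is hypothetical):

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.connectors.fs.Clock;
import org.apache.flink.streaming.connectors.fs.bucketing.Bucketer;
import org.apache.hadoop.fs.Path;

// Routes each Tuple2 record into a sub-directory named after its f0 field
// (the "job" value), e.g. /home/doctor, /home/engineer, ... New job values
// get their own directory without being listed anywhere in the code.
public class JobFieldBucketer implements Bucketer<Tuple2<String, String>> {
    @Override
    public Path getBucketPath(Clock clock, Path basePath, Tuple2<String, String> element) {
        return new Path(basePath, element.f0);
    }
}
```

The sink would then be constructed once against the base path, with `sink.setBucketer(new JobFieldBucketer())` and `tagged.addSink(sink)`, instead of one sink per job value.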