I'm facing a problem parsing CSV in an Apache Beam pipeline project.

I used line.split(",") to get an array of strings, but some of my CSV fields contain conversation text that itself includes characters such as "," and "|".

Here are snippets of my code:

    public class ConvertBlockerToConversationOperation extends DoFn<String, PubsubMessage> {

        private final Logger log = LoggerFactory.getLogger(ParseCsv.class);

        @ProcessElement
        public void processElement(ProcessContext c) {
            String startConversationMessage = c.element();
            JsonObject conversation = ParseCsv.getObjectFromCsv(startConversationMessage);
            c.output(new PubsubMessage(conversation.toString().getBytes(), null));
        }
    }

I am using TextIO.read() to read the CSV from Google Cloud Storage:

    public class CsvToPubsub {

        public interface Options extends PipelineOptions {
            @Description("The file pattern to read records from (e.g. gs://bucket/file-*.csv)")
            @Required
            ValueProvider<String> getInputFilePattern();
            void setInputFilePattern(ValueProvider<String> value);

            @Description("The name of the topic which data should be published to. "
                    + "The name should be in the format of projects/<project-id>/topics/<topic-name>.")
            @Required
            ValueProvider<String> getOutputTopic();
            void setOutputTopic(ValueProvider<String> value);
        }

        public static void main(String[] args) {
            ConfigurationLoader configurationLoader = new ConfigurationLoader(args[0].substring(6));
            PipelineUtils pipelineUtils = new PipelineUtils();

            Options options = PipelineOptionsFactory
                    .fromArgs(args)
                    .withValidation()
                    .as(Options.class);

            run(options, configurationLoader, pipelineUtils);
        }

        public static PipelineResult run(Options options, ConfigurationLoader configurationLoader, PipelineUtils pipelineUtils) {

            Pipeline pipeline = Pipeline.create(options);
            pipeline
                    .apply("Read Text Data", TextIO.read().from(options.getInputFilePattern()))
                    .apply("Transform CSV to Conversation", ParDo.of(new ConvertBlockerToConversationOperation()))
                    .apply("Generate conversation command", ParDo.of(new GenerateConversationCommandOperation(pipelineUtils)))
                    .apply("Partition conversations", Partition.of(4, new PartitionConversationBySourceOperation()))
                    .apply("Publish conversations", new PublishConversationPartitionToPubSubOperation(configurationLoader, new ConvertConversationToStringOperation()));

            return pipeline.run();
        }
    }

Is there any CSV library that supports TextIO output?

Salim Ben Hassine
  • You can output CSV with TextIO. What's your problem? – vdolez Sep 17 '18 at 15:08
  • The CSV is an input, not an output, for me. The problem is that I have CSV lines like this one: "name","text, hello everyone",{CONVERSATION,TALK} and I need to get three fields: name; text, hello everyone; {CONVERSATION,TALK} – Salim Ben Hassine Sep 17 '18 at 19:28
  • Then you need to check how to parse a CSV whose values can contain the splitting character. This has little to do with Beam. – vdolez Sep 19 '18 at 05:17
  • TextIO is not suitable for reading multiline CSV input. Instead, you should use FileIO, as described here: https://stackoverflow.com/questions/47668101/how-to-skip-carriage-returns-in-csv-file-while-reading-from-cloud-storage-using – trunikov Mar 02 '19 at 07:23
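
The parsing need described in the comments (commas inside quoted fields and inside {...} groups must not split the line) can be sketched in plain Java, independent of Beam. This is only a minimal sketch under assumptions: the class name CsvLineSplitter and its split method are hypothetical, each record is assumed to fit on one line (as TextIO delivers it), a doubled quote inside a quoted field is assumed to be an escaped quote, and braces are assumed to be balanced. A stock CSV parser would probably still split the unquoted {CONVERSATION,TALK} field at its comma, which is why the brace handling is written by hand here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Beam or of the code in the question.
public class CsvLineSplitter {

    // Splits one line on commas that are outside double quotes and outside {...} groups.
    public static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        int braceDepth = 0;

        for (int i = 0; i < line.length(); i++) {
            char ch = line.charAt(i);
            if (inQuotes) {
                if (ch == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // assumed convention: "" is an escaped quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote, dropped from the field value
                    }
                } else {
                    current.append(ch);      // commas and braces inside quotes are literal
                }
            } else if (ch == '"') {
                inQuotes = true;             // opening quote, dropped from the field value
            } else if (ch == '{') {
                braceDepth++;
                current.append(ch);
            } else if (ch == '}') {
                if (braceDepth > 0) {
                    braceDepth--;
                }
                current.append(ch);
            } else if (ch == ',' && braceDepth == 0) {
                fields.add(current.toString()); // field separator
                current.setLength(0);
            } else {
                current.append(ch);
            }
        }
        fields.add(current.toString()); // last field
        return fields;
    }

    public static void main(String[] args) {
        String line = "\"name\",\"text, hello everyone\",{CONVERSATION,TALK}";
        for (String field : split(line)) {
            System.out.println(field);
        }
    }
}
```

Inside the DoFn, a helper like this could replace line.split(",") on c.element(). It does not address the multiline case raised in the last comment, where a record spans several lines and TextIO would hand it over in pieces.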
