I'm facing a problem parsing CSV in an Apache Beam pipeline project.
I used line.split(",") to get an array of strings, but some of my CSV fields contain conversation text that itself includes characters such as "," and "|", so the split breaks those fields apart.
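To make the requirement concrete, here is a minimal sketch of a quote-aware split. It assumes the fields that contain commas are wrapped in double quotes and that a literal quote inside a field is written as two quotes; I have not confirmed that my files always follow that convention. The class name CsvLineSplitter and the method splitCsvLine are hypothetical, not part of Beam or of my project.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Beam or of the project.
class CsvLineSplitter {

    // Splits one CSV line on commas that are outside double quotes.
    // Assumption: a doubled quote ("") inside a quoted field is a literal quote.
    static List<String> splitCsvLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char ch = line.charAt(i);
            if (inQuotes) {
                if (ch == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // escaped quote inside a quoted field
                        i++;                 // skip the second quote of the pair
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(ch);      // commas and pipes are kept as data here
                }
            } else if (ch == '"') {
                inQuotes = true;             // opening quote
            } else if (ch == ',') {
                fields.add(current.toString()); // field separator outside quotes
                current.setLength(0);
            } else {
                current.append(ch);
            }
        }
        fields.add(current.toString());      // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        String line = "42,\"Hello, how are you? | fine\",\"He said \"\"hi\"\"\"";
        for (String field : splitCsvLine(line)) {
            System.out.println(field);
        }
    }
}
```

Something like CsvLineSplitter.splitCsvLine(c.element()) could then stand in for line.split(",") inside the DoFn. Because TextIO.read() emits one element per line, a sketch like this would still not handle quoted fields that contain line breaks.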
Here are some snippets of my code:
public class ConvertBlockerToConversationOperation extends DoFn<String, PubsubMessage> {

    // Static so the logger is not serialized together with the DoFn instance.
    private static final Logger log = LoggerFactory.getLogger(ConvertBlockerToConversationOperation.class);

    @ProcessElement
    public void processElement(ProcessContext c) {
        // Each element is one line of the CSV file as read by TextIO.
        String startConversationMessage = c.element();
        JsonObject conversation = ParseCsv.getObjectFromCsv(startConversationMessage);
        c.output(new PubsubMessage(conversation.toString().getBytes(), null));
    }
}
I am using TextIO.read() to read the CSV files from Google Cloud Storage:
public class CsvToPubsub {

    public interface Options extends PipelineOptions {
        @Description("The file pattern to read records from (e.g. gs://bucket/file-*.csv)")
        @Required
        ValueProvider<String> getInputFilePattern();
        void setInputFilePattern(ValueProvider<String> value);

        @Description("The name of the topic which data should be published to. "
            + "The name should be in the format of projects/<project-id>/topics/<topic-name>.")
        @Required
        ValueProvider<String> getOutputTopic();
        void setOutputTopic(ValueProvider<String> value);
    }

    public static void main(String[] args) {
        ConfigurationLoader configurationLoader = new ConfigurationLoader(args[0].substring(6));
        PipelineUtils pipelineUtils = new PipelineUtils();
        Options options = PipelineOptionsFactory
            .fromArgs(args)
            .withValidation()
            .as(Options.class);
        run(options, configurationLoader, pipelineUtils);
    }

    public static PipelineResult run(Options options, ConfigurationLoader configurationLoader, PipelineUtils pipelineUtils) {
        Pipeline pipeline = Pipeline.create(options);
        pipeline
            .apply("Read Text Data", TextIO.read().from(options.getInputFilePattern()))
            .apply("Transform CSV to Conversation", ParDo.of(new ConvertBlockerToConversationOperation()))
            .apply("Generate conversation command", ParDo.of(new GenerateConversationCommandOperation(pipelineUtils)))
            .apply("Partition conversations", Partition.of(4, new PartitionConversationBySourceOperation()))
            .apply("Publish conversations", new PublishConversationPartitionToPubSubOperation(configurationLoader, new ConvertConversationToStringOperation()));
        return pipeline.run();
    }
}
Is there a CSV library that can parse the lines produced by TextIO, handling fields that contain "," and similar characters?