
I found a similar question on Stack Overflow. That approach worked fine with just a couple of columns, but I realised it is not practical for CSVs with a large number of columns.

I have a CSV with 75 columns. I decided to follow the approach from that question (same link as mentioned above). As described there, I added the UpdateRecord processor and configured a CSVReader and a CSVRecordSetWriter. Then, as instructed, I entered my Schema Text, which was pretty long because it required me to define every column. NiFi then flagged the CSVRecordSetWriter as invalid.

I realised that the writer became invalid once the schema included more than a certain number of column definitions.

Part of my schema looks like this:

{
   "type":"record",
   "name":"test2.csv",
   "namespace":"my.namespace",
   "fields":[
      {
         "name":"download",
         "type":"string"
      },
      {
         "name":"upload",
         "type":"string"
      }
      .
      .
      .
      .
      {
         "name":"operatorId",
         "type":"string"
      },
      {
         "name":"errorCode",
         "type":"string"
      }      
   ]
}

Also, my CSV contains a header row.

Objective: I need to map the data in the errorCode column to a new column named errorMean. I hope you can suggest a way to achieve this. Feel free to give a solution that skips writing out the Schema Text entirely.

Himsara Gallege
  • Hey Himsara, have you tried doing the same, but saving the schema to an AvroSchemaRegistry? I can't test without having access to your full schema. – Ian Neethling Nov 07 '19 at 06:59

1 Answer


"I found a similar question on Stack Overflow. That approach worked fine with just a couple of columns, but I realised it is not practical for CSVs with a large number of columns."

To avoid providing a very large schema, you can set the CSVReader's Schema Access Strategy to Infer Schema and the CSVRecordSetWriter's Schema Access Strategy to Inherit Record Schema. The schema is then inferred when the CSV is read, and the writer reuses that same schema to write the CSV.
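As a rough illustration of what this combination amounts to (this is plain Python, not NiFi code), the sketch below takes the field names from the header row instead of a hand-written schema, reuses them for the output, and fills only a new errorMean column. The lookup table, the sample rows, and the function name are made up for illustration; the real code-to-meaning mapping is not given in the question.

```python
import csv
import io

# Hypothetical lookup table: the real code-to-meaning mapping is not given in the question.
ERROR_MEANINGS = {"E01": "timeout", "E02": "connection refused"}


def add_error_mean(csv_text):
    """Copy every column unchanged and append an errorMean column derived from errorCode."""
    reader = csv.DictReader(io.StringIO(csv_text))  # field names come from the header row
    out_fields = list(reader.fieldnames) + ["errorMean"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=out_fields, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Only the new column is computed; existing values pass through as they are.
        row["errorMean"] = ERROR_MEANINGS.get(row["errorCode"], "unknown")
        writer.writerow(row)
    return out.getvalue()


# Hypothetical four-column sample; a real file would carry all of its columns.
sample = "download,upload,operatorId,errorCode\n10,2,op1,E01\n7,1,op2,E99\n"
result = add_error_mean(sample)
print(result)
```

The point of the sketch is that no column other than errorMean is ever assigned to, which is the behaviour the UpdateRecord mapping should reproduce.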


The rest of the mapping works the same as described in the answer you linked.
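If an explicit schema is still wanted, for example to store it in an AvroSchemaRegistry, it does not have to be typed by hand. A minimal sketch, assuming every column can be treated as a string and that the header names are already valid Avro field names (letters, digits, and underscores, not starting with a digit); the function name, record name, and sample data are hypothetical:

```python
import csv
import io
import json


def schema_from_header(csv_text, record_name="csv_record", namespace="my.namespace"):
    """Build an Avro record schema (as a dict) with one string field per header column."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return {
        "type": "record",
        "name": record_name,
        "namespace": namespace,
        "fields": [{"name": column.strip(), "type": "string"} for column in header],
    }


# Hypothetical three-column sample; a real file would list all of its columns.
sample = "download,upload,errorCode\n10,2,E01\n"
schema = schema_from_header(sample)
print(json.dumps(schema, indent=3))
```

The printed JSON can be pasted into the Schema Text property. Header names containing spaces or other special characters would need to be cleaned up first.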

DarkLeafyGreen
  • The 'Inherit Record Schema' schema access strategy simply tells any RecordSetWriter to use the schema that the RecordReader used, therefore it does not necessarily mean that it will get it from the CSV header. – Ian Neethling Nov 07 '19 at 06:47
  • @IanNeethling thanks Ian! I corrected that in my answer. – DarkLeafyGreen Nov 07 '19 at 07:33
  • @Upvote With that change the processor is no longer reported as invalid, but it replaces the values in every column. I only want to replace the values in the new column. – Himsara Gallege Nov 07 '19 at 08:11
  • @HimsaraGallege most likely your regex is wrong. Please provide the regex and an example input. – DarkLeafyGreen Nov 07 '19 at 08:34