0

When using CSV Input utility in Pentaho (V7), we use trim type both to achieve the below. But it doesn't work as expected.

Here are the test data and expected output vs actual output

|Incoming Data                      |Expected Output        |Actual Output          |
|<space>abc<space>                  |abc                    |abc                    |
|abc<space>                         |abc                    |abc                    |
|<space>abc                         |abc                    |abc                    |
|"<space>abc<space>"                |<space>abc<space>      |abc                    |
|"<space>abc<space>"<space>         |<space>abc<space>      |abc                    |
|<space>"<space>abc<space>"         |<space>abc<space>      |"<space>abc            |
|<space>"<space>abc<space>"<space>  |<space>abc<space>      |"<space>abc            |
|"abc"<space>                       |abc                    |abc                    |
|<space>"abc"                       |abc                    |"abc                   |
|<space>"abc"<space>                |abc                    |"abc                   |

Can someone please guide me on this?

Srini V
  • 11,045
  • 14
  • 66
  • 89

2 Answers2

2

If there's no technical reason for using CSV-Input, use Text-File-Input instead. TFI handles CSV input much better. If possible, you should talk to the CSV producer about data quality, though.

UPDATE: TFI 6.1.0.1-196 preview output

enter image description here

Not so bad, when we accept that trimming in Kettle always is done to the field value, i.e. you can't protect leading or trailing spaces from trimming as expected in testcases 4 and 5.

Community
  • 1
  • 1
marabu
  • 1,166
  • 7
  • 9
1

It looks like the CSV input doesn't deal correctly with badly formed CSV data (surprise!). Having extra spaces between the delimiter and enclosure characters apparently doesn't sit well with the step. The trim function looks inside the enclosure to trim spaces, not outside.

I've tested the Text File Input step, which should be the default choice for CSV files as marabu says. Unfortunately, it gives the same undesired results as in the question.

The solution is to remove the double quotes (circled in red) from the enclosure definition box in the CSV input step. The step will then correctly trim spaces outside of the strings, quoted or not. You then put the data through a "Replace in String" step to replace the " by nothing (red underlines).

Sample transformation with settings

Cyrus
  • 2,135
  • 2
  • 11
  • 14