Removing delimiters from a single column of a csv file without opening the file

Question

I have the following contents of the csv file

Here are the contents of the CSV file:

Date_Added|this_flag|Name|DOB|SSN|ID

May 1st, 2015|Y|Jingle|heimerscmidt|19901002|123456789|3

May 1st, 2015|N|Jingleheimerscmidt|19901002|123456789|3
May 5th, 2015|Y|Jon|19901001|012345678|1
May 1st, 2015|N|Jon|19901002|012345678|1
May 1st, 2015|Y|Jacob|19901001|234567890|2
May 5th, 2015|N|Jingleheimerscmidt|19901001|123456789|3
May 1st, 2015|Y|Jingleheimerscmidt|19901001|123456789|3

As you can see in the bold and italic content, there is a pipe operator in the content apart from the pipe operator separted columns.I want to remove that pipe operator from the text without opening the csv file. Is there way to solve this problem either by writing a code or any other approach

without opening the file? You mean without opening in a text editor to do it manually? The file has to be opened in order to read its contents. You can bring it in as a single VARCHAR value and cut up the string based on your own rules.... such as split every column into multiple rows and determine when too many rows exist then concatenate back the rows and pivot back to tabular format — Matt, Sep 27 '16 at 20:07
if you have control over the generation of the csv file you can add text qualifiers, e.g. quotation marks, so that the pipes inside a text won't be confused with delimiters by the ingesting application (or your own data cleansing code) — Jayvee, Sep 27 '16 at 20:16
No matter what, you must open the file to change it - either through code or manually. — Missy, Sep 27 '16 at 20:22
@Matt yes. without opening it in manually. On concatenation, yes I thought of using combination of instr and substr. Any suggestions on that? — kpmandani, Sep 27 '16 at 20:33
@Missy Can you elaborate more on how can I open it through a code and get the desired result — kpmandani, Sep 27 '16 at 20:37
Even if you can open the file without "opening" it, you have another problem, As you illustrate, some but not all the "records" may have an extra pipe symbol. How will you recognize them? Is it that every "record" must have exactly six pipe symbols, and if there are six, remove the third? Or what other rule(s)? — , Sep 27 '16 at 20:51
This is the file I have received it from someone else. So it is not in my control the generation of the csv file. When I received it, yes it came with each of them as pipe symbols — kpmandani, Sep 27 '16 at 21:11

Matt · Answer 1 · 2016-09-28T04:23:17.667

okay I know you tagged oracle so perhaps yourself or another Oracle guru can migrate this solution from sql-server. I know that oracle is capable of each of these operations.

Normally I would say you want a fast/fancy way of splitting a string but in this case you need to maintain ordinal position of the strings between delimiters. So I thought of a way you could do this.

1) First import CSV into a temp table as all 1 column. now this will be an issue if your CRLF is also found within the Name column.... but we will assume it isn't because you didn't specify it.

2) Build a row_number on that table to use as a fake primary key, and determine when there are more delimiters than there should be.

3) use a recursive cte to spilt the string into to rows and maintain an ordinal position of the substring in the original string by which to concatenate later.

4) Determine what rows to group by altering OrdinalPostion by MergePositions and generating a DENSE_RANK() based on it

5) Conditional Aggregation using the OrdinalGroup as the column number and then use a concatenation method to combine all OrdginalGroup 3 rows.

DECLARE @CSV as TABLE (LumpedColumns NVARCHAR(MAX))
INSERT INTO @CSV VALUES
('May 1st, 2015|Y|Jingle|he|imerscmidt|19901002|123456789|3')
,('May 1st, 2015|N|Jingleheimerscmidt|19901002|123456789|3')
,('May 5th, 2015|Y|Jon|19901001|012345678|1')
,('May 1st, 2015|N|Jon|19901002|012345678|1')
,('May 1st, 2015|Y|Jacob|19901001|234567890|2')
,('May 5th, 2015|N|Jingleheimerscmidt|19901001|123456789|3')
,('May 1st, 2015|Y|Jingleheimerscmidt|19901001|123456789|3')

;WITH cteFakePrimaryKey AS (
    SELECT
       LumpedColumns
       ,CASE WHEN LEN(LumpedColumns) - LEN(REPLACE(LumpedColumns,'|','')) > 5 THEN
          LEN(LumpedColumns) - LEN(REPLACE(LumpedColumns,'|','')) - 5 ELSE 0 END as MergeXPositions
       ,ROW_NUMBER() OVER (ORDER BY (SELECT 0)) as PK
    FROM
       @CSV
)


, cteRecursive AS (
    SELECT
       PK
       ,LumpedColumns
       ,MergeXPositions
       ,LEFT(LumpedColumns,CHARINDEX('|',LumpedColumns)-1) as ColValue
       ,RIGHT(LumpedColumns,LEN(LumpedColumns) - CHARINDEX('|',LumpedColumns)) as Remaining
       ,1 As OrdinalPosition
    FROM
       cteFakePrimaryKey

    UNION ALL

    SELECT
       PK
       ,LumpedColumns
       ,MergeXPositions
       ,LEFT(Remaining,CHARINDEX('|',Remaining)-1)
       ,RIGHT(Remaining,LEN(Remaining) - CHARINDEX('|',Remaining))
       ,OrdinalPosition + 1
    FROM
       cteRecursive
    WHERE
       Remaining IS NOT NULL AND CHARINDEX('|',Remaining) > 0

    UNION ALL

    SELECT 
       PK
       ,LumpedColumns
       ,MergeXPositions
       ,Remaining
       ,NULL
       ,OrdinalPosition + 1
    FROM
       cteRecursive
    WHERE Remaining IS NOT NULL AND CHARINDEX('|',Remaining) = 0
)

, cteOrdinalGroup AS (
    SELECT
       PK
       ,LumpedColumns
       ,ColValue
       ,OrdinalPosition
       ,DENSE_RANK() OVER (PARTITION BY PK ORDER BY
          CASE
             WHEN OrdinalPosition < 3 THEN OrdinalPosition
             WHEN OrdinalPosition > (3 + MergeXPositions) THEN OrdinalPosition
          ELSE 3 END ) as OrdinalGRoup

    FROM
       cteRecursive
)

SELECT
    PK
    ,LumpedColumns
    ,MAX(CASE WHEN OrdinalGRoup = 1 THEN ColValue END) as Date_Added
    ,MAX(CASE WHEN OrdinalGRoup = 2 THEN ColValue END) as this_flag
    ,STUFF(
        (SELECT '|' + ColValue
        FROM
            cteOrdinalGroup g2
        WHERE
             g1.PK = g2.PK
             AND g2.OrdinalGroup = 3
        ORDER BY
          g2.OrdinalPosition
        FOR XML PATH(''))
        ,1,1,'') as name
    ,MAX(CASE WHEN OrdinalGRoup = 4 THEN ColValue END) as DOB
    ,MAX(CASE WHEN OrdinalGRoup = 5 THEN ColValue END) as SSN
    ,MAX(CASE WHEN OrdinalGRoup = 6 THEN ColValue END) as ID
FROM
    cteOrdinalGroup g1
GROUP BY
    PK
    ,LumpedColumns
ORDER BY
    PK

score 0 · Answer 2 · answered Sep 28 '16 at 12:13

If that's just one case you need to handle and you don't want to modify the file and you need to do that in Informatica, you can change the Input Type session property for this Source Qualifier from File to Command and use sed to do the replacement, like:

cat $$FileName | sed -e 's/Jingle|heimerscmidt/Jingleheimerscmidt/g'

It's not a very nice solution hence it's not a generic one. But perhaps this will do or at least give you some ideas.

Removing delimiters from a single column of a csv file without opening the file

2 Answers2