0

I'm trying to get the line numbers from a file while reading and loading it to a table using Polybase in Azure Synapse. For example, say the file(csv) has 3 columns, I need to load the target table which has 4 (3 from from the file and a column to hold the row/line number from the file). Is there any polybase property I can use to get the line?

Sample File

Expected Output

I tried using row_number function while selecting from the external table, but this gives incorrect line numbers (jumbles up this the line numbers) when the data is large. I guess this is due to the Azure Synapse ( master/slave ) conceptually build.

CREATE EXTERNAL TABLE dbo.EXAMPLE_EXT (
   COL1 VARCHAR(5),
   COL2 VARCHAR(5),
   COL3 VARCHAR(5))
WITH (
    LOCATION='../',
    DATA_SOURCE=Example_DataSource,
    FILE_FORMAT=Example_csvFile
    );

INSERT INTO EXAMPLE_TARGET_TABLE (LINE_NUMBER,COL1,COL2,COL3)
SELECT ROW_NUMBER() OVER(ORDER BY NULL) AS LINE_NUMBER, COL1, COL2, COL3
FROM dbo.EXAMPLE_EXT

Output from above code

1 Answers1

0

Yes you can add line number using "create table as select". Below is an example scripts you can skip number 1 if you have external data source created.

 1.  
  CREATE EXTERNAL DATA SOURCE <data_source_name>
  WITH
    ( [ LOCATION = '<prefix>://<path>[:<port>]' ]
      [ [ , ] CREDENTIAL = <credential_name> ]
      [ [ , ] TYPE =HADOOP ]
     )
 [ ; ]

2. 
CREATE EXTERNAL TABLE dbo.EXAMPLE_EXT (
   COL1 VARCHAR(5),
   COL2 VARCHAR(5),
   COL3 VARCHAR(5))
 WITH (
    LOCATION='../',
    DATA_SOURCE=Example_DataSource,
    FILE_FORMAT=Example_csvFile
    );

3. 
CREATE TABLE EXAMPLE_TARGET_TABLE
WITH
(DISTRIBUTION = ROUND ROBIN
,CLUSTERED COLUMNSTORE INDEX
)

AS

SELECT ROW_NUMBER() OVER(ORDER BY NULL) AS LINE_NUMBER, 
*
FROM dbo.EXAMPLE_EXT;
Lenroy Yeung
  • 291
  • 3
  • 8