
I am trying to use U-SQL to remove duplicate, null, empty ('') and NaN cells in a specific column called "Function" of a CSV file. I also want to keep the Product column correctly aligned with the Function column after the blank rows are removed, so I need to drop the same rows from the Product column that I drop from the Function column. Where a Function value is duplicated, I only want to keep the very first occurrence. The Product column has no empty cells and all of its values are unique. Any help is greatly appreciated. I know this could be done more easily by hand, but I want to script it so the process is automated as the data in the Data Lake changes over time. I think I am somewhat close with the code I currently have. The actual data set is a very large file, and I am fairly certain there are at least four duplicate values in the Function column that aren't simply empty cells. I need to eliminate both the duplicate values and the empty cells in the Function column, because empty cells are being recognized as duplicates as well. I want to use the Function values as a primary key in the next step of my school project, which won't include the Product column.

DECLARE @inputfile string = "/input/Function.csv";
//DECLARE @OutputUserFile string = "/output/Test_Function/UniqueFunction.csv";

@RawData =
    EXTRACT Function string,
            Product string

    FROM @inputfile
    USING Extractors.Csv(encoding: Encoding.[ASCII]);

// Query from Function data
// Set ROW_NUMBER() of each row within the window partitioned by Function field
@RawDataDuplicates=
    SELECT ROW_NUMBER() OVER (PARTITION BY Function) AS RowNum, Function AS function
    FROM @RawData;

// ORDER BY Function to see duplicate rows next to one another
@RawDataDuplicates2=
    SELECT *
    FROM @RawDataDuplicates
    ORDER BY function
    OFFSET 0 ROWS;

// Write to File
//OUTPUT @RawDataDuplicates2
//TO "/output/Test_Function/FunctionOver-Dups.csv"
//USING Outputters.Csv();

// GROUP BY and count # of duplicates per Function
@groupBy = SELECT Function, COUNT(Function) AS FunctionCount
FROM @RawData
GROUP BY Function
ORDER BY Function
OFFSET 0 ROWS;

// Write to file 
//OUTPUT @groupBy
//TO "/output/Test_Function/FunctionGroupBy-Dups.csv"
//USING Outputters.Csv();

@RawDataDuplicates3 =
    SELECT * 
    FROM @RawDataDuplicates2
    WHERE RowNum == 1;

OUTPUT @RawDataDuplicates3
TO "/output/Test_Function/FunctionUniqueEmail.csv"
USING Outputters.Csv(outputHeader: true);    

//OUTPUT @RawData
//TO @OutputUserFile
//USING Outputters.Csv(outputHeader: true);

I have also commented out some code that I don't necessarily need. When I run the code as it is, I currently get this error: E_CSC_USER_REDUNDANTSTATEMENTINSCRIPT, Error Message: "This statement is dead code." It does not give a line number, but is it likely the "Function AS function" line?
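Or, if the compiler flags any rowset that is never consumed by an OUTPUT or another rowset, could it instead be the @groupBy block, since its OUTPUT statement is commented out? That is only a guess; here is a minimal sketch of the pattern I mean (the @Unused rowset is made up for illustration and is not in my script):

// A rowset that is assigned but never fed into an OUTPUT or a later rowset.
// I suspect this is the pattern E_CSC_USER_REDUNDANTSTATEMENTINSCRIPT complains about.
@Unused =
    SELECT Function
    FROM @RawData;
// No OUTPUT @Unused statement follows, so this assignment never contributes
// to the job's results and would be reported as dead code.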

Here is a sample file that is a small slice of the full spreadsheet and only includes data in the 2 relevant columns. The full spreadsheet has data in all columns. https://www.dropbox.com/s/auu2aco4b037xn7/Function.csv?dl=0

Here is a screenshot of the output I get when I follow wBob's advice and double-click the output file name: [screenshot]

Royale_w_cheese
1 Answer


You can apply a series of transformations to your data using string functions like .Length and ranking functions like ROW_NUMBER to remove the records you want, for example:

@input = 
    EXTRACT 
        CompanyID             string,
        division              string,
        store_location        string,
        International_Id      string,
        Function              string,
        office_location       string,
        address               string,
        Product               string,
        Revenue               string,
        sales_goal            string,
        Manager               string,
        Country               string

    FROM "/input/input142.csv"
    USING Extractors.Csv(skipFirstNRows : 1 );


// Remove rows where Function is empty
@working =
    SELECT *
    FROM @input
    WHERE Function.Length > 0;


// Rank the rows by Function and keep only the first row in each partition
@working =
    SELECT CompanyID,
           division,
           store_location,
           International_Id,
           Function,
           office_location,
           address,
           Product,
           Revenue,
           sales_goal,
           Manager,
           Country
    FROM
    (
        SELECT *,
               ROW_NUMBER() OVER(PARTITION BY Function ORDER BY Product) AS rn
        FROM @working
    ) AS x
    WHERE rn == 1;


@output = SELECT * FROM @working;

OUTPUT @output TO "/output/output.csv"
USING Outputters.Csv(quoting:false);

My results: [screenshot]
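One possible tweak, since the question also mentions null and NaN cells: the Function.Length > 0 filter above could be broadened. This is only a sketch and it assumes NaN arrives as the literal text "NaN" in the CSV (adjust the comparison if your file encodes it differently):

// Broader filter: drop rows where Function is null, empty, whitespace-only,
// or the literal text "NaN". U-SQL predicates accept C# expressions, and
// && short-circuits, so ToUpperInvariant is only called on non-null values.
@working =
    SELECT *
    FROM @input
    WHERE !string.IsNullOrWhiteSpace(Function)
          && Function.ToUpperInvariant() != "NAN";

The rest of the script (the ROW_NUMBER ranking and the OUTPUT) stays the same.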

wBob
  • If you run my code exactly as it is, changing only the file names, then there should be no errors. When there are errors, double-click them in the Errors window and it will normally take you straight to them in the script. Read the error message and error type (e.g. your redundant code error above); U-SQL is actually pretty good at telling you what's wrong. – wBob Jul 25 '18 at 06:52
  • Well, it's just an example; tailor it to fit your requirements. If you want quotes in your output, set it to `true`. – wBob Jul 25 '18 at 15:49
  • If the job completes successfully without errors, then you can just double-click the output file name, which will have appeared in your Outputs window. – wBob Jul 25 '18 at 18:47
  • I have posted a screenshot of the output I get when I follow your instructions. What am I doing wrong? Thanks for your patience. – Royale_w_cheese Jul 26 '18 at 14:32
  • I can't actually see an error message there. If you run the job, errors will appear in the Errors window or the Job summary window. – wBob Jul 30 '18 at 01:31
  • I am not getting any errors, but when I double-click the output file name in the Outputs window, the content I posted in the screenshot is what I get. – Royale_w_cheese Jul 30 '18 at 13:08