
I am struggling to understand why Dataprep is assigning mismatched values to numerical values that I am trying to import from a .csv file.

In my excel, all looks normal:

[screenshot: the values as they appear in Excel]

but in dataprep, this is the value I am getting:

[screenshot: the mismatched values shown in Dataprep]

This seems to happen for most numbers above 1,000.

How can I prevent this? I saved the file as a .csv.

WJA

1 Answer


This is most likely due to the original formatting in Excel (specifically the Accounting and "Comma Style" number formats do this, because it justifies the currency symbol or sign as part of the formatting). In these cases, you'll also notice that pasting from Excel includes a leading and trailing whitespace character.

Dataprep doesn't spend too much time thinking for you—in this case, they take the conservative angle of giving you the raw data and letting you decide whether you need to reformat it.

To confirm that Dataprep isn't misbehaving, you only need to open the CSV in a text editor—you'll most likely see those same quoted strings. This is also common when other systems generate CSVs with number formatting applied (forcing the values to be quoted strings in the CSV). Similarly, any text columns containing commas will typically be quoted (as required, since this is typically the delimiter and has special meaning).
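As an illustration (using Python's standard csv module rather than Dataprep, with made-up column and item names), here's a sketch of what such a file looks like once parsed: the thousands separator survives as part of a quoted string value, which is exactly what Dataprep then shows you.

```python
import csv
import io

# A CSV as Excel might export it with Accounting/Comma Style formatting:
# values above 1,000 contain a comma, so they must be quoted in the file.
raw = 'item,amount\nwidget,"1,250"\ngadget,"12,875"\n'

rows = list(csv.reader(io.StringIO(raw)))
print(rows[1])  # ['widget', '1,250'] -- the comma is part of the string, not a delimiter
```

Note that a compliant CSV parser strips the surrounding quotes but keeps the embedded comma, so the value arrives as the string `1,250` rather than the number 1250.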

Thankfully, this is an easy fix. You'll also have to replace the commas if you want it as a Decimal type.


Simple Replacements:

Interface:

  1. Format > Trim Leading & Trailing Quotes
  2. Format > Trim Leading & Trailing Whitespace
  3. Replace > Text or Pattern (replace ","; make sure you check the "Match all occurrences" box)

Resulting Wrangle script:

textformat col: col1 type: trimquotes
textformat col: col1 type: trimwhitespace
replacepatterns col: col1 with: '' on: ',' global: true
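Outside Dataprep, the same three steps can be sketched in plain Python (a rough analogue for illustration, not the Wrangle implementation; the function name is made up):

```python
def clean_number(value: str) -> float:
    """Mimic the three Wrangle steps: trim quotes, trim whitespace, drop commas."""
    value = value.strip().strip('"')      # trim outer whitespace, then surrounding quotes
    value = value.strip()                 # Accounting format can leave padding inside the quotes
    return float(value.replace(',', ''))  # remove all thousands separators, then convert

print(clean_number(' "1,250.75" '))  # 1250.75
```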

Regular Expression (1 step replacement):

replacepatterns col: col1 with: '' on: /[^0-9.]/ global: true
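The regex version has a direct Python analogue: keep only digits and the decimal point, dropping everything else in one pass (again just a sketch, with an invented function name):

```python
import re

def clean_number_regex(value: str) -> float:
    # Equivalent of replacepatterns on /[^0-9.]/: strip every character
    # that isn't a digit or a decimal point, then convert.
    return float(re.sub(r'[^0-9.]', '', value))

print(clean_number_regex(' "12,875.50" '))  # 12875.5
```

One caveat that applies to the Wrangle pattern as well: `[^0-9.]` also strips minus signs, so negative values would lose their sign; add `-` to the character class if your data can contain them.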

In a mixed team with folks who don't know regular expressions, the former is sometimes a bit clearer and less intimidating; otherwise, the regex is just a lot easier as a single step.

justbeez
  • Yeah this makes sense. Where do you replace this, in dataprep as a new rule? When I opened the .csv file, I noticed that I don't have the ',', but just whitespace sometimes. – WJA Apr 05 '19 at 07:20
  • 1
    @JohnAndrews Yes, you can place these as new steps in your recipe. A handy Dataprep trick is that you can hit the "New Step" button, and then paste the wranglescript (one line at a time) and it'll make all the same selections for you. This also works from the interface—on the menu for any step, you can click "copy to clipboard" and then do the same, which saves a lot of time if you need to reuse or change the order of steps. In this case, you'll just have to change the column name and then you should be good to go. If this all works out for you, be sure to mark the answer as accepted. – justbeez Apr 05 '19 at 15:26
  • Do you recommend this filtering at the beginning of the recipe? – WJA Apr 05 '19 at 15:35
  • 1
    @JohnAndrews I usually do it fairly early so I don't have to account for it in future steps. It's unlikely to affect performance much though unless you're dealing with very large datasets. Generally I try to use the following order for performance and reusability (especially since our team often copies old recipes as starting points): 1. Drop unneeded columns 2. Joins/Unions if needed 3. Filter out unneeded rows 4. Rename columns 5. Trim quotes/whitespace and replace missing with NULL (you can replace on `*` to do all columns at once) 6. Other data wrangling 7. Fix incorrect types – justbeez Apr 05 '19 at 16:03