parse_number from readr fails if the character string contains a period (.), although it handles other special characters well.

library(readr)

#works
parse_number("%ç*%&23")

#does not work
parse_number("art. 23")

Warning: 1 parsing failure.
row col expected actual
  1  -- a number      .

[1] NA
attr(,"problems")
# A tibble: 1 x 4
    row   col expected actual
  <int> <int> <chr>    <chr> 
1     1    NA a number .

Why is this happening?

Update:

The expected result would be 23.

captcoma
  • `parse_number("art.23")` yields `0.23` so the period is interpreted as being the start of a floating point number. `. 23` is an ill-formed literal. – John Coleman Apr 20 '20 at 17:37
  • The `.` clearly belongs to `art`, and `art.` is the abbreviation for `article`. I think abbreviations are common in character strings. – captcoma Apr 20 '20 at 18:12
  • I don't doubt that human beings would parse that period as part of an abbreviation, but `parse_number` seems to use a regular expression that regards the period as always being part of a number. Perhaps that function could be improved to handle cases such as this. – John Coleman Apr 20 '20 at 18:15

1 Answer

There is a space after the dot, which is causing the error. What is the expected number from this sequence (0.23 or 23)?

parse_number seems to look for the decimal and grouping marks defined by the locale; see the documentation: https://www.rdocumentation.org/packages/readr/versions/1.3.1/topics/parse_number

You can change the locale as follows (here grouping_mark is a dot followed by a space):

parse_number("art. 23",  locale=locale(grouping_mark=". ", decimal_mark=","))
Output: 23

or remove the space before the number:

parse_number(gsub(" ", "" , "art. 23")) 
Output: 0.23 

Edit: To handle dots both in abbreviations and as decimal marks, use the following:

library(stringr)

> as.numeric(str_extract("art. 23", "\\d+\\.*\\d*"))
[1] 23
> as.numeric(str_extract("%ç*%&23", "\\d+\\.*\\d*"))
[1] 23

The above uses a regular expression to pick out the first number pattern in each string.

  • \\d+ matches one or more digits
  • \\.* matches zero or more dots
  • \\d* matches zero or more digits after the dots
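Because `\\.*` allows any number of dots, a string such as "1..5" would be extracted whole and then fail to convert. A slightly stricter variant, sketched below in base R, accepts a dot only when it sits directly between digits. The helper name `extract_first_number` is hypothetical (it is not part of readr or stringr), and the sketch assumes `.` is the only decimal mark and that only the first number in each string is wanted.

```r
# Hypothetical helper: return the first number found in each string,
# treating a dot as a decimal mark only when digits follow it directly.
extract_first_number <- function(x) {
  # one or more digits, optionally followed by ".digits"
  pattern <- "[0-9]+(\\.[0-9]+)?"
  hits <- regexpr(pattern, x)

  # start with NA everywhere, then fill in the strings that matched
  out <- rep(NA_real_, length(x))
  found <- !is.na(hits) & hits != -1
  # regmatches() returns only the matched substrings, in order
  out[found] <- as.numeric(regmatches(x, hits))
  out
}

extract_first_number(c("art. 23", "art. 2.5", "%&23", "no number"))
# [1] 23.0  2.5 23.0   NA
```

The dot in "art. " is skipped because no digit precedes it, while the dot in "2.5" is kept because digits surround it.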

Note: I am no expert on regex, but there are plenty of other resources that can make you one.

Jamie_B
  • Thank you for your answer. In my case, `.` is used for abbreviations and also as decimal mark. So changing the locale will not help in my case. – captcoma Apr 20 '20 at 18:20
  • @captcoma, I've updated my answer. I would suggest reading up on regex/regular expressions to parse out the numbers you require from the strings – Jamie_B Apr 20 '20 at 21:13