I applied the GREL expression "value.split(/a/)" to some cells:

abcdef   -> [ "", "bcdef" ]
bcdefa   -> [ "bcdef" ]
badef    -> [ "b", "def" ]

I can't understand why the first cell gives me a "" element in the resulting array. Is it a bug?

Thanks!

Mathieu Saby

1 Answer

I don't know Java well enough to comment on the source code for this function, but according to one of the OpenRefine developers, this behavior is normal (edit: more details in Owen's comment, below). This is why there are other functions for splitting a string.

value.smartSplit(/a/), for example, gives a more consistent result when the separator is at the beginning or at the end of the string:

row value   value.smartSplit(/a/)
1.  abcdef  [ "", "bcdef" ]
2.  bcdefa  [ "bcdef", "" ]
3.  badef   [ "b", "def" ]

This is the same result as using partition() with the omitFragment option set to true:

row value   value.partition(/a/, true)
1.  abcdef  [ "", "bcdef" ]
2.  bcdefa  [ "bcdef", "" ]
3.  badef   [ "b", "def" ]
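
For anyone who wants to check the underlying Java behavior outside OpenRefine, here is a minimal sketch. It assumes, as Owen's comment says, that GREL's split() with a regex delegates to Java's String.split. The class name SplitDemo and the helper splitOn are made up for this illustration. A limit of 0 (the default) drops trailing empty strings but keeps a leading one; a negative limit keeps both.

```java
import java.util.Arrays;

public class SplitDemo {
    // Hypothetical helper: split on the regex "a" with the given limit.
    // limit = 0 is what String.split(regex) uses by default.
    static String[] splitOn(String input, int limit) {
        return input.split("a", limit);
    }

    public static void main(String[] args) {
        String[] inputs = {"abcdef", "bcdefa", "badef"};
        for (String s : inputs) {
            // Default limit: trailing empty strings are removed.
            String[] dropTrailing = splitOn(s, 0);
            // Negative limit: trailing empty strings are kept.
            String[] keepAll = splitOn(s, -1);
            System.out.println(s + " -> limit 0: " + Arrays.toString(dropTrailing)
                    + ", limit -1: " + Arrays.toString(keepAll));
        }
    }
}
```

With limit 0, "bcdefa" yields a single element, which matches what split() returned in the question; with limit -1 it yields a trailing empty string as well, like the smartSplit() output above.
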
Ettore Rizza
  • The OpenRefine 'split' function (with a reg exp) simply uses the Java String 'split' method. This method results in an array which "contains each substring of the input sequence that is terminated by another subsequence that matches this pattern" - that is to say, it always takes the pattern you've matched as the terminator of a sequence. When the pattern matches on the first character in the string, the substring that precedes it is empty, hence the first empty substring. More at https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#split(java.lang.CharSequence,%20int) – Owen Stephens Sep 25 '17 at 08:20
  • Thinking again about that... Is it normal to have ["","abc"] even if preserveAllTokens = false? – Mathieu Saby Nov 27 '17 at 15:59
  • @MathieuSaby With which function? smartSplit() has no preserveAllTokens option. – Ettore Rizza Nov 28 '17 at 13:55