4

Assume a one-line string of consecutive key-value pairs separated by spaces, where spaces may also appear within values (but not within keys), e.g.

key1=one two three key2=four key3=five six key4=seven eight nine ten

Correctly extracting the key-value pairs from the string above would produce the following mappings:

"key1", "one two three"
"key2", "four"
"key3", "five six"
"key4", "seven eight nine ten"

where "keyX" can be any sequence of characters, excluding space.

Trying something simple, like

([^=]+=[^=]+)+

or similar variations is not adequate.

Is there a regex to fully handle such extraction, without any further string processing?

PNS

4 Answers

15

Try with a lookahead:

(\b\w+)=(.*?(?=\s\w+=|$))

As a Java String:

"(\\b\\w+)=(.*?(?=\\s\\w+=|$))"

Test at regex101.com; Test at regexplanet (click on "Java")
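
To see the lookahead pattern in use, here is a minimal sketch that loops over the matches with Matcher.find() and collects them into a map. The class name KeyValueLookahead and the method parse are hypothetical names chosen for illustration. Note that \w+ only matches letters, digits and underscores, so keys containing other characters would need a broader character class.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
public class KeyValueLookahead {
    // Group 1: the key (word characters only).
    // Group 2: the value, matched lazily until the lookahead sees
    // either " nextKey=" or the end of the input.
    private static final Pattern PAIR =
            Pattern.compile("(\\b\\w+)=(.*?(?=\\s\\w+=|$))");

    public static Map<String, String> parse(String line) {
        // LinkedHashMap keeps the pairs in the order they appear.
        Map<String, String> pairs = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(line);
        while (m.find()) {
            pairs.put(m.group(1), m.group(2));
        }
        return pairs;
    }

    public static void main(String[] args) {
        String line = "key1=one two three key2=four key3=five six "
                + "key4=seven eight nine ten";
        parse(line).forEach((k, v) -> System.out.printf("\"%s\", \"%s\"%n", k, v));
    }
}
```

Because the value is matched lazily and the lookahead consumes nothing, each call to find() resumes just before the space that precedes the next key.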

Jonny 5
3

\1 contains the key and \2 the value:

(key\d+)=(.*?)(?= key\d+|$)

Escape \ with \\ in Java:

(key\\d+)=(.*?)(?= key\\d+|$)

Demo: https://regex101.com/r/dO8kM2/1

AMDcze
  • The "key" string was just a placeholder name. Keys can have any value, without space. It does not work, but +1 even if you don't modify it. Thanks. – PNS Jan 24 '15 at 23:00
  • I saw Jonny 5's answer, which was better and exactly what you wanted, so I decided not to edit my answer. ;) – AMDcze Jan 24 '15 at 23:10
1

Rather than a regular expression, I suggest parsing it with indexOf, something like:

import java.util.LinkedHashMap;
import java.util.Map;

public class IndexOfParser {
    public static void main(String[] args) {
        String in = "key1=one two three key2=four key3=five six "
                + "key4=seven eight nine ten";
        Map<String, String> kvp = new LinkedHashMap<>();
        int prev = 0;
        int start;
        // Assumes every key starts with the literal text "key".
        while ((start = in.indexOf("key", prev)) != -1) {
            // Find the next "=" sign.
            int eqlIndex = in.indexOf("=", start + 3);
            if (eqlIndex == -1) {
                // No "=" after this "key": nothing more to parse.
                break;
            }
            // Find the end... maybe the end of the String.
            int end = in.indexOf("key", eqlIndex + 1);
            if (end == -1) {
                // It's the end of the String.
                end = in.length();
            } else {
                // One less than the next "key"
                end--;
            }
            kvp.put(in.substring(start, eqlIndex),
                    in.substring(eqlIndex + 1, end).trim());
            prev = start + 3;
        }
        for (Map.Entry<String, String> entry : kvp.entrySet()) {
            System.out.printf("%s=\"%s\"%n", entry.getKey(), entry.getValue());
        }
    }
}

Output is

key1="one two three"
key2="four"
key3="five six"
key4="seven eight nine ten"
Elliott Frisch
1

Something like this also works, provided there are no runs of consecutive whitespace:

([^\\s=]+)=([^=]+(?=\\s|$))

Otherwise, you can always write this:

([^\\s=]+)=([^=]+\\b(?=\\s|$))

These patterns are a good solution as long as key names are not too long, since they rely on backtracking.
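
As a rough illustration of the difference between the two patterns above, the following sketch runs both against an input with doubled spaces. The sample string "k1=a  b  k2=c" and the names BacktrackingPatterns, PLAIN, BOUNDED and values are assumptions made up for this example. With the first pattern the captured value keeps one trailing space; the \b in the second pattern forces the value to end on a word character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
public class BacktrackingPatterns {
    // First pattern: the greedy [^=]+ backtracks until it is followed
    // by whitespace or the end of the input.
    static final Pattern PLAIN =
            Pattern.compile("([^\\s=]+)=([^=]+(?=\\s|$))");
    // Second pattern: same idea, but the value must also end at a word boundary.
    static final Pattern BOUNDED =
            Pattern.compile("([^\\s=]+)=([^=]+\\b(?=\\s|$))");

    // Collects the captured values (group 2) of every match in the line.
    public static List<String> values(Pattern pattern, String line) {
        List<String> out = new ArrayList<>();
        Matcher m = pattern.matcher(line);
        while (m.find()) {
            out.add(m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "k1=a  b  k2=c"; // two spaces between tokens
        for (String v : values(PLAIN, line)) {
            System.out.println("plain:   [" + v + "]");
        }
        for (String v : values(BOUNDED, line)) {
            System.out.println("bounded: [" + v + "]");
        }
    }
}
```

One consequence of the \b variant is that a value must end with a letter, digit or underscore, which may or may not suit the data at hand.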

You can also write this one, which needs at most one backtracking step:

([^\\s=]+)=(\\S+(?>\\s+[^=\\s]+)*(?!=))
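
A minimal sketch of how this last pattern might be used from Java follows. The class name AtomicGroupParser and the method parse are hypothetical. The atomic group (?>...) swallows one whitespace-separated token per iteration and never gives characters back, so when the negative lookahead (?!=) fails right after the next key, the engine only has to drop that single iteration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
public class AtomicGroupParser {
    // Group 1: the key (no whitespace, no "=").
    // Group 2: a first token, then further tokens consumed atomically;
    // (?!=) rejects the position directly after the next key.
    private static final Pattern PAIR =
            Pattern.compile("([^\\s=]+)=(\\S+(?>\\s+[^=\\s]+)*(?!=))");

    public static Map<String, String> parse(String line) {
        Map<String, String> pairs = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(line);
        while (m.find()) {
            pairs.put(m.group(1), m.group(2));
        }
        return pairs;
    }

    public static void main(String[] args) {
        String line = "key1=one two three key2=four key3=five six "
                + "key4=seven eight nine ten";
        parse(line).forEach((k, v) -> System.out.printf("%s=\"%s\"%n", k, v));
    }
}
```

Since each value ends on a non-whitespace token, runs of several spaces between pairs do not leak into the captured values.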
Casimir et Hippolyte
  • I tried all 3 patterns and none seems to work. Maybe they need to be modified slightly. +1 anyway and thanks. – PNS Jan 25 '15 at 03:50
  • @PNS: the three patterns are written to be used directly in your Java code. If you want to test them in an online regex tester (regex101.com or regexplanet), you need to replace the double backslashes with single backslashes. The three patterns work, and the last one is probably the most efficient (see the debugger in regex101). – Casimir et Hippolyte Jan 25 '15 at 12:38