I'm trying to create a very rudimentary parser that would take a multi-line string and convert that into an array containing objects. The string would be formatted like this:

title: This is a title
description: Shorter text in one line
image: https://www.example.com

title: This is another title : with colon
description: Longer text that potentially
could span over several new lines,
even three or more
image: https://www.example.com


title: This is another title, where the blank lines above are two
description: Another description
image: https://www.example.com

The goal is to turn this into an array where each section separated by one or more empty lines would be an object containing key/value pairs with the colon as the separator in between the key and value, and one new line as the separator in between individual key/value pairs. So the input above should result in the following output:

[
  {
    title: "This is a title",
    description: "Shorter text in one line",
    image: "https://www.example.com"
  },
  {
    title: "This is another title : with colon",
    description: "Longer text that potentially could span over several new lines, even three or more",
    image: "https://www.example.com"
  },
  {
    title: "This is another title, where the blank lines above are two",
    description: "Another description",
    image: "https://www.example.com"
  }
]

I've started with this CodePen, but as you can see, the code currently has a few problems that need to be solved before it's complete.

  1. If colons are used in the value, they shouldn't be split on. I somehow need to split on the first occurrence of a colon only and ignore any additional colons in the value. This currently results in the following:
// Input:
//     title: This is another title : with colon
//     image: https://www.example.com

{
  image: " https",
  title: " This is another title "
}
  2. Some values could span multiple lines. The line breaks inside such a value should be collapsed so the value becomes a single line, and they should not be treated as a separator that starts a new key/value pair. This currently results in the following:
// Input:
//     description: Longer text that potentially
//     could span over several new lines,
//     even three or more

{
  could span over several new lines,: undefined,
  description: " Longer text that potentially",
  even three or more: undefined
}

I would greatly appreciate any help with how to approach this given the code I have so far. Any suggestions on how to make the code more efficient are also very welcome.

tobiasg

2 Answers

As a partial answer, the code below handles multiple colons on one line:

var input = `title: This is a title
description: Shorter text in one line
image: https://www.example.com

title: This is another title : with colon
description: Longer text that potentially
could span over several new lines,
even three or more
image: https://www.example.com


title: This is another title, where the blank lines above are two
description: Another description
image: https://www.example.com`;

var finalArray = [];
var first = input.split(/\n\s*\n/);

console.log("Array with sections split:", first);

first.forEach(function (section) {
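  var result = section.split("\n").reduce(function (o, pair) {
    // Split on every colon, take the first piece as the key,
    // then re-join the rest so colons inside the value survive.
    var parts = pair.split(":");
    var key = parts.shift().trim();
    o[key] = parts.join(":").trim();
    return o;
  }, {});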
  console.log(result);
  finalArray.push(result);
});

console.log("Array of sections as objects:", finalArray);

This still doesn't handle multi-line values, but the issue is that in your schema there is no way to determine when a new line means the start of a new property and when it is just the continuation of a value. You already rule out using colon and comma separation so you've now got no way to solve your second issue.

I'd advise using a special character that you don't allow in the main text body to denote the end of a key-value pair and splitting based on that.
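If changing the format is not an option, one way around the ambiguity is to decide up front which keys are allowed. The following is only a minimal sketch under that assumption: `KNOWN_KEYS` and `parseSections` are hypothetical names, and any line that does not start with one of the known keys followed by a colon is treated as a continuation of the previous value. It splits on the first colon only, so colons inside values are kept.

```javascript
// Hypothetical helper: parses blocks of "key: value" lines.
// Assumption: only the keys listed in KNOWN_KEYS can start a new pair.
const KNOWN_KEYS = ["title", "description", "image"];

function parseSections(text) {
  return text
    .split(/\n\s*\n/) // one or more blank lines separate sections
    .filter((section) => section.trim() !== "")
    .map((section) => {
      const obj = {};
      let currentKey = null;
      for (const line of section.split("\n")) {
        const idx = line.indexOf(":"); // position of the first colon only
        const key = idx === -1 ? "" : line.slice(0, idx).trim();
        if (KNOWN_KEYS.includes(key)) {
          // A known key starts a new pair; everything after the first colon is the value.
          currentKey = key;
          obj[key] = line.slice(idx + 1).trim();
        } else if (currentKey !== null) {
          // Continuation line: append to the previous value with a single space.
          obj[currentKey] += " " + line.trim();
        }
      }
      return obj;
    });
}
```

The trade-off is that a continuation line which happens to begin with `title:`, `description:` or `image:` would still be read as a new pair, so this only narrows the ambiguity rather than removing it.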

Tom
  • Thank you! Yes, I actually just realised the problem that you're talking about. Not sure if this is a best practice, but I've seen similar formats where empty lines in a text are done with an indented dot ` .` and new lines are done with indentation only ` New line here.`. Maybe that's the way to go? – tobiasg Feb 04 '22 at 10:53
  • Seems like the colons in the values are removed from the string though. Is there any way to keep them? – tobiasg Feb 04 '22 at 11:05
  • Just updated the code where you use `join`. The colons that were used to split were being replaced by blank strings. They're now reinstated. Wrt splitting, using this method is always vulnerable to string injection in some way. Assuming you want to still stick with this general method, I'd advise using a very uncommon unicode character (or a series of them) to split with that you can almost guarantee won't be in the body. – Tom Feb 04 '22 at 11:08
  • Thank you! I'll try it out. And thank you for pointing out the vulnerability aspect. Good to have in mind. – tobiasg Feb 04 '22 at 12:02
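The indentation idea from the comments above could look roughly like this. It is a sketch, not a tested implementation of any existing format: `parseIndented` is a hypothetical name, and it assumes that a line starting with whitespace continues the previous value, that an indented line containing only `.` stands for an empty line, and that line breaks inside a value are kept as `\n`.

```javascript
// Hypothetical parser for an indentation-based variant of the format.
// Assumptions: a line starting with whitespace continues the previous value,
// and an indented line containing only "." stands for an empty line.
function parseIndented(text) {
  const sections = [];
  let current = null; // object currently being filled
  let lastKey = null; // key that continuation lines attach to

  for (const line of text.split("\n")) {
    if (line.trim() === "") {
      // Blank line: close the current section, if any.
      current = null;
      lastKey = null;
    } else if (/^\s/.test(line)) {
      // Continuation line: append to the previous value on a new line.
      if (current !== null && lastKey !== null) {
        const piece = line.trim();
        current[lastKey] += "\n" + (piece === "." ? "" : piece);
      }
    } else {
      const idx = line.indexOf(":"); // split on the first colon only
      if (idx === -1) continue; // not a "key: value" line; skipped in this sketch
      if (current === null) {
        current = {};
        sections.push(current);
      }
      lastKey = line.slice(0, idx).trim();
      current[lastKey] = line.slice(idx + 1).trim();
    }
  }
  return sections;
}
```

Because a new pair can only start in column zero, colons and key-like words inside continuation lines are no longer ambiguous.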

There is a very simple rule when you work with text: always keep regular expressions in mind.

Try this approach:

const data = `title: This is a title
description: Shorter text in one line
image: https://www.example.com

title: This is another title : with colon
description: Longer text that potentially
could span over several new lines,
even three or more
image: https://www.example.com


title: This is another title, where the blank lines above are two
description: Another description
image: https://www.example.com`;

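// Sections are separated by one or more blank lines.
const blocks = data.split(/\n\s*\n/);

// Each value is whatever sits between its own key and the next key,
// so this relies on the fixed order title, description, image.
const result = blocks.map((block) => {
  const title = block.match(/(?<=title:)([\S\s]*\n?)(?=description:)/gm).join(' ').trim();
  const description = block.match(/(?<=description:)([\S\s]*\n?)(?=image:)/gm).join(' ').replaceAll('\n', ' ').trim();
  const image = block.match(/(?<=image:)([\S\s]*\n?)(?=)/gm).join(' ').trim();

  return { title, description, image };
});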

console.log(result);
A1exandr Belan