The following is the string from which I need to extract the Markdown bullet points:

This is a paragraph with bulletpoints. * Bulletpoint1 * Bulletpoint2 * Bulletpoint3 and this is some another text.

I want to extract "* Bulletpoint1 * Bulletpoint2 * Bulletpoint3" from this string, one entry per bullet point. This is the code I use for the extraction:

private List<String> extractMarkdownListUsingRegex(String markdownName) {
    String paragraphText = this.paragraph.getText();
    List<String> markdown = new ArrayList<String>();
    String regex = regexMap.get(markdownName);
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(paragraphText);
    while(matcher.find()) {
        markdown.add(matcher.group());
    }
    return markdown;
}

The regex is as follows:

regexMap.put("bulletpoints", "/\\d\\.\\s+|[a-z]\\)\\s+|(\\s\\*\\s+|[A-Z]\\.\\s+)|[IVX]+\\.\\s+/g");

The above code is extracting

[*, *, *]

instead of

[* Bulletpoint1, * Bulletpoint2, * Bulletpoint3]

Can anyone please guide me where I am going wrong in this regard?

  • Remove `/` at the start and `/g` at the end. – Wiktor Stribiżew Aug 29 '23 at 08:57
  • Just `\*\s+\w+` can meet your requirement. Check [this demo](https://regex101.com/r/XL7nQe/1). And, [here](https://regex101.com/r/XL7nQe/1/codegen?language=java) you can get the generated Java code. – Arvind Kumar Avinash Aug 29 '23 at 09:04
  • @ArvindKumarAvinash your regex is not working for the string "This is a paragraph with bulletpoints. * Bulletpoint1 and some text * Bulletpoint2 * Bulletpoint3 and this is some another text." It extracts only "* Bulletpoint1 * Bulletpoint2 * Bulletpoint3", but I am looking for bullet points with longer text that contains multiple words. Any guidance would be highly appreciated. – Furqan Ahmed Aug 29 '23 at 10:16
  • Your current regex implies support for various bullet point types. In that case, you can probably use a pattern like `"(?s)(?:\\d\\.|[a-z]\\)|\\*|[A-Z]\\.|[IVX]+\\.)\\s++(?:(?!(?:\\d\\.|[a-z]\\)|\\*|[A-Z]\\.|[IVX]+\\.)\\s).)*"` – Wiktor Stribiżew Aug 29 '23 at 10:42
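
The pattern from the last comment can be tried out with a small sketch like the one below. The class name `MultiMarkerBulletDemo`, the `extract` helper, and the `trim()` call are assumptions added for illustration; only the regex itself comes from the comment, split here into a reusable `MARKER` fragment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original code.
class MultiMarkerBulletDemo {

    // One bullet marker: "1.", "a)", "*", "A." or a Roman numeral such as "IV."
    private static final String MARKER =
            "(?:\\d\\.|[a-z]\\)|\\*|[A-Z]\\.|[IVX]+\\.)";

    // Marker, whitespace, then any characters up to the next "marker + whitespace".
    // (?s) lets '.' also match line breaks, so a bullet may span several lines.
    private static final Pattern BULLET = Pattern.compile(
            "(?s)" + MARKER + "\\s++(?:(?!" + MARKER + "\\s).)*");

    static List<String> extract(String text) {
        List<String> bullets = new ArrayList<>();
        Matcher matcher = BULLET.matcher(text);
        while (matcher.find()) {
            // trim() is an addition here: it drops the whitespace before the next marker.
            bullets.add(matcher.group().trim());
        }
        return bullets;
    }

    public static void main(String[] args) {
        String text = "This is a paragraph with bulletpoints. "
                + "* Bulletpoint1 and some text * Bulletpoint2 "
                + "* Bulletpoint3 and this is some another text.";
        extract(text).forEach(System.out::println);
    }
}
```

Note that the last bullet runs to the end of the string, because nothing in the text marks where a bullet stops other than the next marker. The `[A-Z]\.` alternative will also treat an initial such as "J. " as a marker, so this sketch suits input where such text does not occur.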

1 Answer

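The regex pattern you've provided,

/\\d\\.\\s+|[a-z]\\)\\s+|(\\s\\*\\s+|[A-Z]\\.\\s+)|[IVX]+\\.\\s+/g

has two problems. First, the leading `/` and trailing `/g` are JavaScript regex-literal delimiters. In a Java string they are ordinary characters, so the first alternative requires a literal `/` before the digit and the last one requires a literal `/g` after the whitespace. Second, every alternative matches only the bullet marker and the whitespace around it, not the text that follows, which is why you get [*, *, *]. If you only need `*` bullets, a simpler pattern achieves your desired result: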

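private List<String> extractMarkdownListUsingRegex(String markdownName) {
    String paragraphText = this.paragraph.getText();
    List<String> markdown = new ArrayList<>();
    // Hard-coded here for clarity; in your code it would come from regexMap.get(markdownName).
    // An asterisk, one or more whitespace characters, then everything up to the next asterisk.
    String regex = "\\*\\s+[^*]+";
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(paragraphText);
    while (matcher.find()) {
        // group() returns the whole match, including any trailing whitespace.
        markdown.add(matcher.group());
    }
    return markdown;
}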

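This pattern matches an asterisk followed by one or more whitespace characters and then everything up to the next asterisk or the end of the string. Each match therefore keeps the whitespace before the next asterisk, and the last bullet point also includes any text that follows it, so for your sample string the third match is "* Bulletpoint3 and this is some another text.".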
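
To check the behaviour against the sample string from the question, a self-contained sketch such as the following can be used. The class name `StarBulletDemo`, the `findAll` helper, and the `trim()` call are assumptions added for illustration. The second pattern is the single-word variant suggested in the comments.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names below are not from the original code.
class StarBulletDemo {

    // Collects every match of the given regex in the text, with surrounding whitespace trimmed.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group().trim());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "This is a paragraph with bulletpoints. "
                + "* Bulletpoint1 * Bulletpoint2 * Bulletpoint3 and this is some another text.";

        // Asterisk, whitespace, then everything up to the next asterisk.
        System.out.println(findAll("\\*\\s+[^*]+", text));

        // Asterisk, whitespace, then a single word.
        System.out.println(findAll("\\*\\s+\\w+", text));
    }
}
```

The first call prints `[* Bulletpoint1, * Bulletpoint2, * Bulletpoint3 and this is some another text.]` and the second prints `[* Bulletpoint1, * Bulletpoint2, * Bulletpoint3]`. Which one fits depends on whether a bullet point may contain several words, since the input has no delimiter that marks the end of the last bullet.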

prabu naresh