I'm looking for a RegEx that returns either the first [n] words in a paragraph or, if the paragraph contains fewer than [n] words, the complete paragraph.

For example, assuming I need, at most, the first 7 words:

<p>one two <tag>three</tag> four five, six seven eight nine ten.</p><p>ignore</p>

I'd get:

one two <tag>three</tag> four five, six seven

And the same RegEx on a paragraph containing fewer than the requested number of words:

<p>one two <tag>three</tag> four five.</p><p>ignore</p>

Would simply return:

one two <tag>three</tag> four five.

My attempt at the problem resulted in the following RegEx:

^(?:\<p.*?\>)((?:\w+\b.*?){1,7}).*(?:\</p\>)

However, this returns just the first word - "one". It doesn't work. I think the .*? (after the \w+\b) is causing problems.

Where am I going wrong? Can anyone present a RegEx that will work?

FYI, I'm using .NET 3.5's RegEx engine (via C#).

Many thanks

Leigh Bowers

3 Answers

OK, complete re-edit to acknowledge the new "spec" :)

I'm pretty sure you can't do that with one regex. The best tool definitely is an HTML parser. The closest I can get with regexes is a two-step approach.

First, isolate each paragraph's contents with:

<p>(.*?)</p>

You need to set RegexOptions.Singleline if paragraphs can span multiple lines.
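As a rough illustration of this first step, the sketch below collects the inner content of each paragraph. The class and method names (ParagraphExtractor, Extract) are made up for this example, and it assumes plain <p> tags without attributes, exactly as the pattern above does.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ParagraphExtractor
{
    // Hypothetical helper: returns the inner content of every <p>...</p> pair.
    // Assumes plain <p> tags without attributes, as in the pattern above.
    public static List<string> Extract(string html)
    {
        var contents = new List<string>();

        // Singleline makes '.' match '\n' too, so a paragraph may span lines.
        foreach (Match m in Regex.Matches(html, "<p>(.*?)</p>", RegexOptions.Singleline))
        {
            contents.Add(m.Groups[1].Value);
        }

        return contents;
    }
}
```

Without RegexOptions.Singleline, a paragraph containing a line break would not be matched at all, because '.' stops at '\n' by default.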

Then, in a next step, iterate over your matches and apply the following regex once on each match's Groups[1].Value:

((?:(\S+\s+){1,6})\w+)

That will match up to the first seven items separated by spaces/tabs/newlines, ignoring any trailing punctuation or non-word characters after the last item. (It needs at least two items in the paragraph to match at all.)

BUT it will treat a tag separated by spaces as one of those items, i.e., in

One, two three <br/> four five six seven

it will only match up to "six". I guess that regex-wise, there's no way around that.
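A minimal sketch of this second step, assuming the paragraph content has already been isolated. WordLimiter and FirstWords are hypothetical names, the word limit is passed in instead of being hard-coded to 7, and returning the content unchanged when the pattern does not match (for example, a one-word paragraph) is a choice made for this example, not part of the regex itself.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordLimiter
{
    // Hypothetical helper: applies the word-limiting pattern to the inner
    // content of one paragraph and returns at most maxWords items.
    public static string FirstWords(string content, int maxWords)
    {
        // The {1,n-1} quantifier is only valid when n is at least 2.
        if (maxWords < 2)
            throw new ArgumentOutOfRangeException("maxWords", "This sketch needs a limit of at least 2.");

        // Same shape as the pattern above, with the upper bound computed.
        string pattern = @"((?:(\S+\s+){1," + (maxWords - 1) + @"})\w+)";
        Match match = Regex.Match(content, pattern);

        // Assumption: fall back to the whole content when nothing matches.
        return match.Success ? match.Groups[1].Value : content;
    }
}
```

For the question's shorter sample, FirstWords("one two <tag>three</tag> four five.", 7) returns the paragraph without its final period, because the pattern ends in \w+.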

Tim Pietzcker

I had the same problem and combined a few Stack Overflow answers into this class. It uses HtmlAgilityPack, which is a better tool for the job. Call:

 Text.Words(string html, int n)

to get the first n words.

using HtmlAgilityPack;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;


namespace UmbracoUtilities
{
    public class Text
    {
      /// <summary>
      /// Return the first n words in the html
      /// </summary>
      /// <param name="html"></param>
      /// <param name="n"></param>
      /// <returns></returns>
      public static string Words(string html, int n)
      {
        // Strip the tags first, then keep the first n words of the plain text
        return GetNWords(StripHtml(html), n);
      }


      /// <summary>
      /// Returns the first n words in text
      /// Assumes text is not a html string
      /// http://stackoverflow.com/questions/13368345/get-first-250-words-of-a-string
      /// </summary>
      /// <param name="text"></param>
      /// <param name="n"></param>
      /// <returns></returns>
      public static string GetNWords(string text, int n)
      {
        // Collapse runs of whitespace into single spaces and trim both ends
        //http://stackoverflow.com/questions/1279859/how-to-replace-multiple-white-spaces-with-one-white-space
        string cleanedString = System.Text.RegularExpressions.Regex.Replace(text, @"\s+", " ").Trim();

        // Drop empty entries so that exactly n words are taken, whether or not
        // the input started with whitespace
        IEnumerable<string> words = cleanedString
          .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
          .Take(n);

        return string.Join(" ", words);
      }


      /// <summary>
      /// Returns a string of html with tags removed
      /// </summary>
      /// <param name="html"></param>
      /// <returns></returns>
      public static string StripHtml(string html)
      {
        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(html);

        var root = document.DocumentNode;
        var stringBuilder = new StringBuilder();

        foreach (var node in root.DescendantsAndSelf())
        {
          if (!node.HasChildNodes)
          {
            string text = node.InnerText;
            if (!string.IsNullOrEmpty(text))
              stringBuilder.Append(" " + text.Trim());
          }
        }

        return stringBuilder.ToString();
      }



    }
}

Merry Christmas!

Petras
  1. Use an HTML parser to get the first paragraph, flattening its structure (i.e. remove decorating HTML tags inside the paragraph).
  2. Search for the position of the nth whitespace character.
  3. Take the substring from 0 to that position.
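Steps 2 and 3 might look roughly like the sketch below, which assumes step 1 has already produced the flattened paragraph text. WordTruncator and TakeWords are hypothetical names. As a small variation on "the nth whitespace character", it only counts whitespace that ends a word, so leading or repeated whitespace does not throw the count off.

```csharp
public static class WordTruncator
{
    // Hypothetical helper: returns text up to (not including) the whitespace
    // that follows the nth word, or the whole text if it has fewer words.
    public static string TakeWords(string text, int n)
    {
        if (n <= 0) return string.Empty;

        int words = 0;
        bool inWord = false;

        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsWhiteSpace(text[i]))
            {
                // Whitespace right after a word marks the end of that word.
                if (inWord && ++words == n)
                    return text.Substring(0, i);
                inWord = false;
            }
            else
            {
                inWord = true;
            }
        }

        // Fewer than n words: return the complete text.
        return text;
    }
}
```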

edit: I removed the regex proposal for step 2 and 3, since it was wrong (thanks to the commenter). Also, the HTML structure needs to be flattened.

Svante
  • Inside a character class, \b matches a backspace character. Also, the problem definition seems to have been changed since you posted this; \w and \W aren't going to cut it. – Alan Moore May 07 '09 at 15:14