How to extract text from this type of HTML source?

Question

I have HTML source containing about 1000 microblog posts (one tweet per line). Most of the tweets look like the one below. I loaded them into a Delphi memo and tried to strip the HTML markup with the Pos and Delete functions, but failed.

<div id='tweetText'> RT <a onmousedown="return touch(this.href,0)" href="http://twitter.com/HighfashionUK">@HighfashionUK</a> RT: Surprise goody bag up 4 grabs, Ok. <a onmousedown="return touch(this.href,0)" href="http://plixi.com/p/57846587">http://plixi.com/p/57846587</a> when we get 150</div>

I want to strip the HTML markup and keep only:

RT: Surprise goody bag up 4 grabs, Ok. http://plixi.com/p/57846587 when we get 150

How can I extract such text in Delphi?

Thank you very much in advance.

Update:

Cosmin Prund is right. I mistakenly skipped a part. What I want is:

RT @HighfashionUK  RT: Surprise goody bag up 4 grabs, Ok. http://plixi.com/p/57846587 when we get 150

Cosmin Prund is great.

What exactly do you want to extract? You seem to only want text (ie: ignore all tags), but you skipped the inner-text of the first anchor tag (@HighfashionUK). Was that intentional or a mistake? — Cosmin Prund, Apr 19 '11 at 14:00
Before closing for `dupe`: It's not a dupe if the OP wants to remove all HTML markup and only keep text. You don't need to parse HTML in order to do that. — Cosmin Prund, Apr 19 '11 at 14:13

score 8 · Accepted Answer · answered Apr 19 '11 at 14:10

Since all HTML markup is between < and >, a routine to strip markup can be trivially written like this. Hopefully this is what you want because, as you see in my comment, there's an issue with @HighfashionUK: your example skipped it, and I don't know why.

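function StripHtmlMarkup(const source: string): string;
var
  i, count: Integer;
  InTag: Boolean;
  P: PChar;
begin
  // The result can never be longer than the source, so allocate once up front.
  SetLength(Result, Length(source));
  P := PChar(Result);
  InTag := False;
  count := 0;
  for i := 1 to Length(source) do
    if InTag then
      begin
        // Inside a tag: skip everything up to and including the closing '>'.
        if source[i] = '>' then InTag := False;
      end
    else
      if source[i] = '<' then InTag := True
      else
        begin
          // Outside a tag: copy the character into the output buffer.
          P[count] := source[i];
          Inc(count);
        end;
  // Shrink the buffer to the number of characters actually copied.
  SetLength(Result, count);
end;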
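
As a rough usage sketch, the routine can be applied to every line of a TStrings, which is the type of a memo's Lines property. StripAllLines and the shortened sample tweet are hypothetical names and data made up for illustration. The routine is repeated only so the program compiles on its own. Trim is added because the sample leaves a blank before the first word.

```delphi
program StripDemo;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// The routine from the answer, repeated so this sketch compiles on its own.
function StripHtmlMarkup(const source: string): string;
var
  i, count: Integer;
  InTag: Boolean;
  P: PChar;
begin
  SetLength(Result, Length(source));
  P := PChar(Result);
  InTag := False;
  count := 0;
  for i := 1 to Length(source) do
    if InTag then
    begin
      if source[i] = '>' then InTag := False;
    end
    else if source[i] = '<' then
      InTag := True
    else
    begin
      P[count] := source[i];
      Inc(count);
    end;
  SetLength(Result, count);
end;

// Hypothetical helper: strips the markup from every line in place.
// Trim removes the blank left over before the first word.
procedure StripAllLines(Lines: TStrings);
var
  i: Integer;
begin
  for i := 0 to Lines.Count - 1 do
    Lines[i] := Trim(StripHtmlMarkup(Lines[i]));
end;

var
  Tweets: TStringList;
begin
  Tweets := TStringList.Create;
  try
    // Shortened, made-up variant of the tweet from the question.
    Tweets.Add('<div id=''tweetText''> RT <a href="http://twitter.com/HighfashionUK">' +
      '@HighfashionUK</a> RT: Surprise goody bag up 4 grabs, Ok.</div>');
    StripAllLines(Tweets);
    WriteLn(Tweets[0]); // RT @HighfashionUK RT: Surprise goody bag up 4 grabs, Ok.
  finally
    Tweets.Free;
  end;
end.
```

In a form, the same helper could be called as StripAllLines(Memo1.Lines), assuming the memo is named Memo1. HTML entities such as &amp;amp; are not decoded by this approach.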

Cosmin Prund

Thank you so much. See my update. But why do you say "You don't need to parse HTML in order to do that"? Are there other methods to do that? – Warren Apr 19 '11 at 15:25
Are there HTML parsers (libraries/components) for Delphi? – Warren Apr 19 '11 at 15:56
@Warren, an HTML parser would normally read *and understand* all HTML markup, producing a document tree, or a DOM (Document Object Model). You can traverse the tree and extract all text, but once you have a DOM you can also do smarter things, like extracting the "href" of anchors, or ignoring certain tags. My method doesn't have any HTML-specific knowledge: all it knows is that tags start with `<` and end with `>`, and that it's supposed to ignore everything between the markers. It's fast and effective, but less powerful than a full DOM parser. – Cosmin Prund Apr 21 '11 at 13:12
@Warren, I'm sure there are Delphi HTML parsers, and I'm also sure you can use non-Delphi-specific parsers (example: you can load your HTML into a TWebBrowser and then use its DOM), but I can't recommend any because I never used one. I did parse my share of HTML, but my "parsers" were hand-made and targeted to a specific web page. – Cosmin Prund Apr 21 '11 at 13:16
Besides removing text inside <..>, you will need to remove text found between certain tags: , – Guy Gordon May 14 '12 at 17:10
@GuyGordon, this is not a general-purpose parsing routine for html! There are no HTML comments and no code sections in the html snippet the OP needs to parse. – Cosmin Prund May 23 '12 at 08:11
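
For input that does contain HTML comments or script sections, which the question's snippet does not, one possible pre-pass is sketched below. This is an assumption-laden illustration and not part of the accepted answer. RemoveSpans is a hypothetical helper that deletes everything from an opening marker through its closing marker, and its output could then be handed to a tag-stripping routine. The markers must be passed in lower case, matching is ASCII case-insensitive, and a marker like '<script' would also match any longer tag name that starts the same way.

```delphi
program RemoveSpansDemo;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, StrUtils;

// Hypothetical helper: removes every span that begins with OpenMark and
// ends with CloseMark (both given in lower case), markers included.
function RemoveSpans(const S, OpenMark, CloseMark: string): string;
var
  Lower: string;
  StartPos, EndPos: Integer;
begin
  Result := S;
  while True do
  begin
    // LowerCase only changes ASCII letters, so positions stay aligned.
    Lower := LowerCase(Result);
    StartPos := Pos(OpenMark, Lower);
    if StartPos = 0 then
      Break;
    EndPos := PosEx(CloseMark, Lower, StartPos + Length(OpenMark));
    if EndPos = 0 then
    begin
      // Unterminated span: drop everything from the opening marker onward.
      SetLength(Result, StartPos - 1);
      Break;
    end;
    Delete(Result, StartPos, EndPos - StartPos + Length(CloseMark));
  end;
end;

var
  Html: string;
begin
  // The '>' inside the comment and the '<' inside the script would confuse
  // a stripper that only tracks '<' and '>'.
  Html := 'a<!-- x > y -->b<script>if (1 < 2) {}</script>c';
  Html := RemoveSpans(Html, '<!--', '-->');
  Html := RemoveSpans(Html, '<script', '</script>');
  Html := RemoveSpans(Html, '<style', '</style>');
  WriteLn(Html); // abc
end.
```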
