I have HTML source containing about 1000 microblog posts (one tweet per line). Most of the tweets look like the one below. I loaded them into a Delphi memo and tried to strip the HTML markup with the Pos and Delete functions, but failed.

<div id='tweetText'> RT <a onmousedown="return touch(this.href,0)" href="http://twitter.com/HighfashionUK">@HighfashionUK</a> RT: Surprise goody bag up 4 grabs, Ok. <a onmousedown="return touch(this.href,0)" href="http://plixi.com/p/57846587">http://plixi.com/p/57846587</a> when we get 150</div>

I want to strip the HTML markup and keep only:

RT: Surprise goody bag up 4 grabs, Ok. http://plixi.com/p/57846587 when we get 150 

How can I extract such text in Delphi?

Thank you very much in advance.

Update:

Cosmin Prund is right. I mistakenly skipped a part. What I want is:

RT @HighfashionUK  RT: Surprise goody bag up 4 grabs, Ok. http://plixi.com/p/57846587 when we get 150 

Cosmin Prund is great.

Warren
  • Are you sure you can't just use the Twitter API? – Rob Kennedy Apr 19 '11 at 13:56
  • What exactly do you want to extract? You seem to only want text (ie: ignore all tags), but you skipped the inner-text of the first anchor tag (@HighfashionUK). Was that intentional or a mistake? – Cosmin Prund Apr 19 '11 at 14:00
  • Before closing for `dupe`: It's not a dupe if the OP wants to remove all HTML markup and only keep text. You don't need to parse HTML in order to do that. – Cosmin Prund Apr 19 '11 at 14:13

1 Answer

Since all HTML markup is between < and >, a routine to strip markup can be trivially written like this. Hopefully this is what you want because, as you can see in my comment, there's an issue with @HighfashionUK: your example skipped it, and I don't know why.

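function StripHtmlMarkup(const source: string): string;
var
  i, count: Integer;
  InTag: Boolean;
  P: PChar;
begin
  // The result can never be longer than the input, so allocate once up front.
  SetLength(Result, Length(source));
  P := PChar(Result);
  InTag := False;
  count := 0;
  for i := 1 to Length(source) do
    if InTag then
      begin
        // Inside a tag: skip everything up to and including the closing '>'.
        if source[i] = '>' then InTag := False;
      end
    else
      if source[i] = '<' then InTag := True
      else
        begin
          // Outside a tag: copy the character to the output buffer.
          P[count] := source[i];
          Inc(count);
        end;
  // Shrink the result to the number of characters actually copied.
  SetLength(Result, count);
end;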
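
As a rough illustration of how the routine might be applied to one tweet per line, here is a minimal console sketch. The program name `StripDemo`, the `Sample` constant, and the use of a `TStringList` in place of the memo are assumptions made for the example; the routine itself is repeated only so that the program compiles on its own. `Trim` is added because the markup in the sample leaves a leading space behind.

```pascal
program StripDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Same routine as above, repeated so this sketch is self-contained.
function StripHtmlMarkup(const source: string): string;
var
  i, count: Integer;
  InTag: Boolean;
  P: PChar;
begin
  SetLength(Result, Length(source));
  P := PChar(Result);
  InTag := False;
  count := 0;
  for i := 1 to Length(source) do
    if InTag then
      begin
        if source[i] = '>' then InTag := False;
      end
    else
      if source[i] = '<' then InTag := True
      else
        begin
          P[count] := source[i];
          Inc(count);
        end;
  SetLength(Result, count);
end;

const
  // The sample tweet from the question, split only to keep lines short.
  Sample =
    '<div id=''tweetText''> RT <a onmousedown="return touch(this.href,0)" ' +
    'href="http://twitter.com/HighfashionUK">@HighfashionUK</a> RT: Surprise ' +
    'goody bag up 4 grabs, Ok. <a onmousedown="return touch(this.href,0)" ' +
    'href="http://plixi.com/p/57846587">http://plixi.com/p/57846587</a> ' +
    'when we get 150</div>';

var
  Lines: TStringList; // stands in for Memo1.Lines (an assumption)
  i: Integer;
begin
  Lines := TStringList.Create;
  try
    Lines.Add(Sample);
    // Strip the markup line by line; Trim drops the leftover leading space.
    for i := 0 to Lines.Count - 1 do
      Lines[i] := Trim(StripHtmlMarkup(Lines[i]));
    for i := 0 to Lines.Count - 1 do
      WriteLn(Lines[i]);
    // prints: RT @HighfashionUK RT: Surprise goody bag up 4 grabs, Ok. http://plixi.com/p/57846587 when we get 150
  finally
    Lines.Free;
  end;
end.
```

Since a memo's `Lines` property is a `TStrings`, the same loop should work on it directly; wrapping the loop in `Lines.BeginUpdate` / `Lines.EndUpdate` would likely avoid repainting the control 1000 times.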
Cosmin Prund
  • Thank you so much. See my update. But why do you say "You don't need to parse HTML in order to do that"? Are there other methods to do that? – Warren Apr 19 '11 at 15:25
  • Are there HTML parsers (libraries/components) for Delphi? – Warren Apr 19 '11 at 15:56
  • @Warren, an HTML parser would normally read *and understand* all HTML markup, producing a document tree, or DOM (Document Object Model). You can traverse the tree and extract all text, but once you have a DOM you can also do smarter things, like extracting the "href" of anchors, or ignoring certain tags. My method doesn't have any HTML-specific knowledge; all it knows is that tags start with `<` and end with `>`, and that it's supposed to ignore everything between the markers. It's fast and effective, but less powerful than a full DOM parser. – Cosmin Prund Apr 21 '11 at 13:12
  • @Warren, I'm sure there are Delphi HTML parsers, and I'm also sure you can use non-Delphi-specific parsers (for example, you can load your HTML into a TWebBrowser and then use its DOM), but I can't recommend any because I have never used one. I did parse my share of HTML, but my "parsers" were hand-made and targeted to a specific web page. – Cosmin Prund Apr 21 '11 at 13:16
  • Besides removing text inside <..>, you will need to remove text found between certain tags: , – Guy Gordon May 14 '12 at 17:10
  • @GuyGordon, this is not a general-purpose parsing routine for HTML! There are no HTML comments and no code sections in the HTML snippet the OP needs to parse. – Cosmin Prund May 23 '12 at 08:11
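
For input that does contain code sections, one possible preprocessing step is to drop whole elements, content included, before stripping the remaining tags. The sketch below is only an illustration under assumptions: the helper name `RemoveElement` is hypothetical, and `script` and `style` are assumed to be the kind of elements meant in the comments above. It matches tag names by prefix and case-insensitively, so it is not a real HTML parser either.

```pascal
program RemoveElementDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils, StrUtils;

// Hypothetical helper: removes every <TagName ...> ... </TagName> block,
// including the text between the opening and the closing tag.
function RemoveElement(const S, TagName: string): string;
var
  Lower, OpenTag, CloseTag: string;
  StartPos, EndPos, CopyFrom: Integer;
begin
  Result := '';
  // LowerCase only changes ASCII letters, so positions in Lower match S.
  Lower := LowerCase(S);
  OpenTag := '<' + LowerCase(TagName);
  CloseTag := '</' + LowerCase(TagName) + '>';
  CopyFrom := 1;
  StartPos := PosEx(OpenTag, Lower, CopyFrom);
  while StartPos > 0 do
  begin
    EndPos := PosEx(CloseTag, Lower, StartPos);
    if EndPos = 0 then
      Break; // unterminated element: keep the remainder untouched
    // Keep the text before the element, then jump past its closing tag.
    Result := Result + Copy(S, CopyFrom, StartPos - CopyFrom);
    CopyFrom := EndPos + Length(CloseTag);
    StartPos := PosEx(OpenTag, Lower, CopyFrom);
  end;
  Result := Result + Copy(S, CopyFrom, MaxInt);
end;

var
  Html: string;
begin
  Html := 'a<script>var x = 1 > 0;</script>b<STYLE>p{}</STYLE>c';
  Html := RemoveElement(Html, 'script');
  Html := RemoveElement(Html, 'style');
  WriteLn(Html); // prints: abc
end.
```

The output of `RemoveElement` could then be passed to `StripHtmlMarkup` to remove the tags that are left. For the tweet snippet in the question this step is unnecessary, as noted above.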