
I have this code:

string firstTag = "Forums2008/forumPage.aspx?forumId=";
string endTag = "</a>";
index = forums.IndexOf(firstTag, index1);

if (index == -1)
   continue;

var secondIndex = forums.IndexOf(endTag, index);

result = forums.Substring(index + firstTag.Length + 12, secondIndex - (index + firstTag.Length - 50));

The string I want to extract from looks like this, for example:

<a href="/Forums2008/forumPage.aspx?forumId=317" title="הנקה">הנקה</a>

What I want to get is only the word after the title attribute: הנקה. The second problem is that when I extract it, instead of Hebrew I see gibberish like this: ������

Sam Axe
  • Use the [HTML Agility Pack](http://htmlagilitypack.codeplex.com/); it makes parsing HTML and extracting things like the text in link titles much easier. – Scott Chamberlain Feb 28 '15 at 04:18
  • "And the second problem is that ..." You should probably post that as a separate question, and make it clear where you are seeing the gibberish, in Visual Studio debugger output or a console window or what. – RenniePet Feb 28 '15 at 05:15

2 Answers

1

One powerful way to do this is to use a regular expression instead of searching for a start position and taking a substring. Try this code and you'll see that it extracts the anchor tag's title:

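    var input = "<a href=\"/Forums2008/forumPage.aspx?forumId=317\" title=\"הנקה\">הנקה</a>";

    // Capture everything between the quotes of the title attribute.
    // In a verbatim string (@"..."), a doubled quote "" stands for one literal quote.
    var expression = new System.Text.RegularExpressions.Regex(@"title=""([^""]+)""");

    var match = expression.Match(input);

    if (match.Success) {
        // Groups[1] is the first capture group: the title text.
        Console.WriteLine(match.Groups[1].Value);
    }
    else {
        Console.WriteLine("not found");
    }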
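
The loop in the question suggests the page holds many such links. In that case the same idea can be applied to every forum link at once with Regex.Matches. The following is only a minimal sketch: ForumTitleExtractor and ExtractTitles are hypothetical names, and the pattern assumes each link has the exact attribute order shown in the question (href first, then title).

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ForumTitleExtractor
{
    // Assumed link shape: href="...forumPage.aspx?forumId=<digits>" title="<text>"
    // The capture group holds the text between the quotes of the title attribute.
    private static readonly Regex TitlePattern =
        new Regex(@"forumPage\.aspx\?forumId=\d+""\s+title=""([^""]+)""");

    // Returns the title of every forum link found in the given HTML, in document order.
    public static List<string> ExtractTitles(string html)
    {
        var titles = new List<string>();
        foreach (Match match in TitlePattern.Matches(html))
        {
            titles.Add(match.Groups[1].Value);
        }
        return titles;
    }
}
```

Calling ForumTitleExtractor.ExtractTitles(forums) would replace the manual IndexOf loop. Links whose attributes appear in a different order are skipped by this pattern, which is one reason an HTML parser is the sturdier choice for real pages.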

And for the curious, here is a version in JavaScript:

    var input = '<a href="/Forums2008/forumPage.aspx?forumId=317" title="הנקה">הנקה</a>';

    var expression = new RegExp('title=\"([^\"]+)\"');

    var results = expression.exec(input);

    if (results) {
        document.write(results[1]);
    }
    else {
        document.write("not found");
    }
sfuqua
  • Strange, sfuqua: I'm getting an error on the line var expression = new RegExp('title=\"([^\"]+)\"'); The errors are "Error 3 Argument 1: cannot convert from 'char' to 'string' D:\C-Sharp\Tapuz Images\WindowsFormsApplication1\WindowsFormsApplication1\Form1.cs 181 44 WindowsFormsApplication1" and "too many characters in character literal". – Daniel Shabos Feb 28 '15 at 18:06
  • Wow, I totally missed something important. I was thinking in JavaScript and replied in JavaScript instead of C#! I will fix that. – sfuqua Mar 01 '15 at 03:30
0

Okay, here is a solution using String.Substring(), String.Split(), and String.IndexOf():

    // Sample input. The inner quotes are escaped only because this is a C# string literal.
    String str = "<a href=\"/Forums2008/forumPage.aspx?forumId=317\" title=\"הנקה\">הנקה</a>";

    int splitStart = str.IndexOf("title=");  // where the slice starts
    int splitEnd = str.LastIndexOf("</a>");  // where the slice ends

    /* The slice we want is:  title="הנקה">הנקה
     *  (shown here without escape sequences)
     */

    String extracted = str.Substring(splitStart, splitEnd - splitStart); // extract the slice

    String[] splitted = extracted.Split('"'); // split on the double quote; index 1 holds the title value

    Console.WriteLine(splitted[1]);  // may print ???? in a console; set a breakpoint here to inspect the array

One caveat: I had to escape the quotes in the string literal above. You can ignore that, since you are simply passing in the string you scanned.

This actually works, but you cannot see the result with the provided Console.WriteLine(splitted[1]); call.

If you set a breakpoint and inspect the split array, however, you can see that the text is extracted. You can confirm it with the following screenshot:

[Screenshot: debugger view of the extracted text]
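
The same IndexOf/Substring idea can also be written without Split, by looking for the opening title=" marker and then the next closing quote. This is a minimal sketch only: TitleSlicer and ExtractTitle are hypothetical names, and it assumes the title value is wrapped in double quotes as in the sample link.

```csharp
using System;

public static class TitleSlicer
{
    // Returns the value of the first title="..." attribute, or null if none is found.
    public static string ExtractTitle(string anchorHtml)
    {
        const string marker = "title=\"";

        // Locate the start of the attribute value.
        int start = anchorHtml.IndexOf(marker, StringComparison.Ordinal);
        if (start == -1)
            return null;
        start += marker.Length;

        // The value ends at the next double quote after the marker.
        int end = anchorHtml.IndexOf('"', start);
        if (end == -1)
            return null;

        return anchorHtml.Substring(start, end - start);
    }
}
```

Because the length passed to Substring is computed from the two quote positions, no hand-tuned offsets such as + 12 or - 50 are needed.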

Kavindu Dodanduwa
  • Strange, KcDoD: I tried your solution and what I get is ������ ������� ���������, gibberish instead of the Hebrew words. When I open the page I see the Hebrew and English, but when parsing I see this gibberish instead of Hebrew. I tried to work it out with the solution here: http://stackoverflow.com/questions/7236550/c-sharp-encoding-converting-latin-to-hebrew but it didn't work. What else can I do? At the top of the page I'm trying to parse I see charset=windows-1255 – Daniel Shabos Feb 28 '15 at 17:27
  • This is the page I'm trying to parse the text from; I can't figure out why I see this gibberish: view-source:http://www.tapuz.co.il/forums/forumsListNew.asp – Daniel Shabos Feb 28 '15 at 17:27
  • @Daniel Shabos Have you tried to debug? I mean, set a breakpoint and check what the variables contain in the middle of the conversion. – Kavindu Dodanduwa Mar 01 '15 at 02:02
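
On the gibberish described in the comments above: the page declares charset=windows-1255, so one likely cause is that its bytes are being decoded with a different encoding. The sketch below shows decoding raw bytes as windows-1255. It is an assumption-laden illustration, not a confirmed fix: HebrewPageDecoder and Decode are hypothetical names, and the RegisterProvider line is only required on .NET Core / .NET 5+, where code-page encodings are opt-in. On the classic .NET Framework that line can be dropped (CodePagesEncodingProvider may not be available there without an extra package).

```csharp
using System.Text;

public static class HebrewPageDecoder
{
    // Hypothetical helper: turns raw windows-1255 bytes into a .NET string.
    public static string Decode(byte[] rawBytes)
    {
        // Needed on .NET Core / .NET 5+ so that code page 1255 can be resolved.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Decode with the charset the page itself declares.
        Encoding hebrew = Encoding.GetEncoding("windows-1255");
        return hebrew.GetString(rawBytes);
    }
}
```

The raw bytes could come, for example, from WebClient.DownloadData, so that no other encoding is applied before this step.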