
I'm developing a web crawler in C# (.NET) that works like this:

Step 1: Visit the main page of the site (let's call this page Main.aspx).

Step 2: Use HttpWebRequest to get the form page (let's call this page Form.aspx).

Step 3: Post the form to another page and get the results (let's call this page Results.aspx).

It's pretty straightforward as far as web crawling goes.

The current problem is that I can't access Form.aspx unless a number of cookies are set first. All of these cookies are generated by JavaScript on Main.aspx.

Whenever I try to request Form.aspx directly, I get redirected to the main page. The code that generates the cookies is more than 20 KB and is absolutely messy. It also uses a lot of "document." references, which rules out simply running it through Jint or Javascript.NET.
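To make the flow and the failure mode concrete, here is a minimal sketch of the three steps sharing one CookieContainer, with a check for the redirect back to the main page. The `example.com` URLs, the `query=test` form body, and the `CrawlerSketch` helper names are all placeholders invented for illustration, not the real site or code. It uses the classic HttpWebRequest API mentioned above, which newer .NET versions mark as obsolete.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class CrawlerSketch
{
    // Every request shares the same cookie jar, so cookies set by one
    // response are sent with the next request.
    public static HttpWebRequest CreateRequest(string url, CookieContainer cookies)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.CookieContainer = cookies;
        return request;
    }

    // Turns a request into a form POST with a url-encoded body.
    public static HttpWebRequest AsFormPost(HttpWebRequest request, string formBody)
    {
        byte[] payload = Encoding.UTF8.GetBytes(formBody);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.ContentLength = payload.Length;
        using (Stream body = request.GetRequestStream())
        {
            body.Write(payload, 0, payload.Length);
        }
        return request;
    }

    // Reads the response body and reports the URL we actually ended up on
    // after any automatic redirects.
    public static string Fetch(HttpWebRequest request, out Uri finalUri)
    {
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            finalUri = response.ResponseUri;
            return reader.ReadToEnd();
        }
    }

    // True when the final URL points at the given page, e.g. "Main.aspx".
    public static bool LandedOn(Uri finalUri, string pageName)
    {
        return finalUri.AbsolutePath.EndsWith("/" + pageName, StringComparison.OrdinalIgnoreCase);
    }

    // Placeholder URLs and form data; replace with the real ones.
    public static string Run()
    {
        var jar = new CookieContainer();
        Uri landed;

        Fetch(CreateRequest("http://example.com/Main.aspx", jar), out landed);   // Step 1
        Fetch(CreateRequest("http://example.com/Form.aspx", jar), out landed);   // Step 2
        if (LandedOn(landed, "Main.aspx"))
        {
            // Only server-set cookies are in the jar; the JavaScript-generated
            // ones are missing, so the site bounces us back to the main page.
            throw new InvalidOperationException("Redirected to Main.aspx: cookies missing.");
        }

        var post = AsFormPost(CreateRequest("http://example.com/Results.aspx", jar), "query=test");
        return Fetch(post, out landed);                                           // Step 3
    }
}
```

With plain HTTP requests, step 2 ends in the `LandedOn` branch, because nothing ever executes the script on Main.aspx that creates the cookies.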

After a lot of research I found that a headless browser is what I'm looking for. I tried several of them, but they all seem to add a lot of complication. I already have a class library project containing all my web crawlers, and I just want to add another DLL to make this work. Any suggestions?

I've tried to be as clear as possible. If anything is unclear, please ask in the comments before downvoting.

John Saunders
  • Could you show how you are trying to access the forms? Perhaps you are not setting or creating an instance of the forms you are trying to consume. – MethodMan Sep 16 '14 at 20:54
  • @Andrey, hello again. This question regresses on [your last one](http://stackoverflow.com/questions/25875672/i-want-to-have-a-html-page-javascript-interpreted) *(deleted, 10K+ users only, sorry)*, since there you at least provided enough information to understand the goal you want to achieve and the obstacles in your way (although you edited all that information out before deleting your question, I wonder why). I've said this before, and I'll say it again: **Regardless of your goals and the means you deem necessary to achieve them, questions like this are too broad for Stack Overflow**. – Frédéric Hamidi Sep 16 '14 at 21:22
  • @FrédéricHamidi Hello again. I now have a much clearer picture of this whole issue. In my research I found a lot of questions here on Stack Overflow, and I have now reached a good solution. I will just test it in a non-UI project and post the results here. You may be a power user here, but that does not make you the judge of whether a question is too hard for Stack Overflow. This site is really good, and I have always found good answers here. Even the answer I got here makes good sense; it just lacks a "how-to". I will try to post one, though. – Andrey Pereira Sep 16 '14 at 21:31
  • So why post this question then? – Frédéric Hamidi Sep 16 '14 at 21:32
  • I did not know the answer until I put all the pieces together with aikeru's answer. Just wait and let me post the walkthrough. – Andrey Pereira Sep 16 '14 at 21:33

1 Answer


Use a .NET binding for PhantomJS, which is a headless WebKit browser. You might also consider a full-blown automation framework like Selenium, which is made for testing.

What you are asking for is not simple, though. You are asking for a lot of abstraction so that your app can stay as simple as it is now.

If you don't mind a "head-ful" browser, you could also use the Windows Forms "WebBrowser" control or remote-control Internet Explorer through COM.
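As a rough sketch of the Selenium route, the idea is to let the headless browser load the main page so its JavaScript runs, then copy the resulting cookies into a CookieContainer that the existing HttpWebRequest code can reuse. This assumes an older Selenium.WebDriver NuGet package (2.x/3.x) that still ships `PhantomJSDriver`, plus the PhantomJS executable available to the driver. `HeadlessCookieSketch` and `CollectCookies` are hypothetical names, and I have not run this against the site in question.

```csharp
using System;
using System.Net;
using OpenQA.Selenium;
using OpenQA.Selenium.PhantomJS;

public static class HeadlessCookieSketch
{
    // Loads the page in headless PhantomJS and returns the cookies the
    // browser holds afterwards, including JavaScript-generated ones.
    public static CookieContainer CollectCookies(string mainPageUrl)
    {
        var jar = new CookieContainer();

        using (IWebDriver driver = new PhantomJSDriver())
        {
            // Scripts on the page run as part of the navigation.
            driver.Navigate().GoToUrl(mainPageUrl);

            foreach (OpenQA.Selenium.Cookie c in driver.Manage().Cookies.AllCookies)
            {
                // CookieContainer.Add(Cookie) requires a domain, so skip
                // cookies that do not report one.
                if (string.IsNullOrEmpty(c.Domain))
                {
                    continue;
                }
                jar.Add(new System.Net.Cookie(c.Name, c.Value, c.Path, c.Domain));
            }
        }

        return jar;
    }
}
```

The returned container can then be assigned to `HttpWebRequest.CookieContainer` for the Form.aspx and Results.aspx requests. Two caveats: if the page sets its cookies some time after load, a wait would be needed before reading them, and `System.Net.Cookie` rejects some values (for example ones containing a comma or semicolon), so those may need encoding.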

aikeru
  • Why the downvote? This matches the OP's question - a headless, NON UI browser for .NET. – aikeru Sep 16 '14 at 20:57
  • Thanks aikeru, it's good to know there are specialists here. Your answer was really good. I will just add a "how-to" to help people who are as totally lost as I was 8 hours ago. – Andrey Pereira Sep 16 '14 at 21:34
  • Sure. If you want to know how to use PhantomJS with .NET, I think that is a different question. Otherwise, I would try to answer it here. I wish I knew why people downvote the question 2x now! :( – aikeru Sep 16 '14 at 23:21