
Is it possible to parse data from a web HTML page using Windows batch?

Let's say I have a web page: www.domain.com/data/page/1. Page source HTML:

...
<div><a href="/post/view/664654"> ....
....

In this case I would need to get /post/view/664654 from the web page.

My idea is to loop through www.domain.com/data/page/1 ... # (up to some given number) and extract all the /post/view links. Then I would have a list of links, and from each of those links I would extract the href values (either images or videos).

So far I have only been successful in downloading an image or video when I know the exact link, using wget. But I don't know how (if it's possible at all) to parse HTML data.
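
For the page loop itself, here's a minimal batch sketch, assuming wget is on the PATH and that the pages really are numbered sequentially (the upper bound of 10 is just a placeholder):

@echo off & setlocal
rem // fetch pages 1 through 10 and save each one locally (10 is a placeholder)
for /L %%N in (1,1,10) do (
    wget -q -O "page%%N.html" "http://www.domain.com/data/page/%%N"
)

Pulling the /post/view links out of the saved pages is the part I'm stuck on.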

Edit:

<body>
<nav>
    <section>links I don't need</section>
</nav>
<article>
    <section>links I need</section>
</article>

– CrazySabbath

2 Answers


It's better to parse structured markup as a hierarchical object rather than scraping it as flat text. That way you aren't so dependent on the formatting of the data you're parsing (whether it's minified, whether the spacing has changed, and so on).

The batch language isn't terribly well-suited to parsing markup languages like HTML, XML, JSON, etc. In such cases, it can be extremely helpful to use a hybrid script and borrow JScript or PowerShell methods to scrape the data you need. Here's an example demonstrating a batch + JScript hybrid script. Save it with a .bat extension and give it a run.

@if (@CodeSection == @Batch) @then
@echo off & setlocal

set "url=http://www.domain.com/data/page/1"

for /f "delims=" %%I in ('cscript /nologo /e:JScript "%~f0" "%url%"') do (
    rem // do something useful with %%I
    echo Link found: %%I
)

goto :EOF
@end // end batch / begin JScript hybrid code

// returns a DOM root object for the fetched page
function fetch(url) {
    var XHR = WSH.CreateObject("Microsoft.XMLHTTP"),
        DOM = WSH.CreateObject('htmlfile');

    XHR.open("GET", url, true);
    XHR.setRequestHeader('User-Agent', 'XMLHTTP/1.0');
    XHR.send('');

    // wait for the async request to complete
    while (XHR.readyState != 4) {WSH.Sleep(25)};

    // force the htmlfile COM object into IE9 compatibility mode so its
    // DOM parser isn't stuck in an ancient default document mode
    DOM.write('<meta http-equiv="x-ua-compatible" content="IE=9" />');
    DOM.write(XHR.responseText);
    return DOM;
}

var DOM = fetch(WSH.Arguments(0)),
    links = DOM.getElementsByTagName('a');

// echo every anchor whose href contains /post/view/
for (var i = 0; i < links.length; i++)
    if (links[i].href && /\/post\/view\//i.test(links[i].href))
        WSH.Echo(links[i].href);
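
Each matching link is echoed on its own line, so the for /f loop in the batch half of the script picks them up one at a time; from there you could feed each %%I back through a second fetch to pull out the image or video links you're after.
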
– rojo
  • Unfortunately it's not working as expected. The web page has ~30 links, like: `href="/post/view/1234#search=SearchString"`. The script extracts only 6, and all of them are wrong, for example: `/post/view/141143#c63445`. – CrazySabbath Apr 06 '16 at 17:43
  • Maybe the content of the page is different based on whether you're logged in or not? I didn't code cookie management or login session handling. – rojo Apr 06 '16 at 17:45
  • No difference between being logged in and not. – CrazySabbath Apr 06 '16 at 17:46
  • What if you take out the regex test for `\/post\/view` and just do `if (links[i].href) WSH.Echo(links[i].href)`? Or maybe it's a user agent thing, and the web server is degrading to a mobile view because the user agent is unrecognized? Try changing the user agent to a Firefox user agent string. – rojo Apr 06 '16 at 17:49
  • It would seem it does not extract links from the `<article>` tag. No idea why. See my edit; the script extracts links I don't need (outside `<article>`). – CrazySabbath Apr 06 '16 at 17:54
  • Maybe it's because the html engine of `htmlfile` is defaulting to something really old. I edited my code to force the `htmlfile` COM object into IE9 compatibility. Try the edit and see what happens. If it still doesn't work, try changing `"IE=9"` to `"IE=11"` and see what happens. – rojo Apr 06 '16 at 18:00

If you just need to get /post/view/664654, you can use the grep command, e.g.

grep -o '/post/view/[^"]\+' *.html

For parsing more complex HTML, you can use HTML-XML-utils or pup.
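
For example, with pup (assuming the page has already been saved locally; page1.html is just a placeholder name) you can select only the anchors inside <article>, skipping the <nav> links from the edit above, and print their href attributes:

pup 'article a attr{href}' < page1.html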

– kenorb