Each answer or comment on a Stack Overflow question thread has a unique URL. How can we use that URL with Invoke-WebRequest (or another tool) to capture just the contents of that answer or comment as mini-Markdown, and then extract some useful information from it?

Some answers contain complete scripts, and I would sometimes like to automate retrieving them into .ps1 files on various systems. For example, given the URL https://superuser.com/questions/176624/linux-top-command-for-windows-powershell/1426271#1426271 , I would like to grab just the PowerShell code portion and write it to a file called mytop.ps1.

YorSubs
  • 3
    It looks like the Markdown source is _not_ part of the rendered HTML source code. If you're looking just for the code, you can try to scrape `<code>` elements. – mklement0 Nov 10 '22 at 23:18
  • I have updated the URL to the correct one (the previous one only showed the 'revisions' page). Ah, so the Markdown is not part of the source. That's unfortunate, as it would have made it easier to capture information. – YorSubs Nov 11 '22 at 06:06
  • 4
    Stack Exchange has an API, so you probably want to use that instead of scraping the HTML page. E.g. [answers-by-ids](https://api.stackexchange.com/docs/answers-by-ids) will give you the markdown for an answer (you have to set the filter to include it). – jkiiski Nov 11 '22 at 07:29

1 Answer

You can use the Stack Exchange REST API to pull the answer, in particular the answers-by-ids method.

It still doesn't give you the Markdown (at least not with the withbody filter used here), but drilling down to the answer's body in the JSON response is easier than parsing the full page source. Actually, I think getting HTML for the answer body is even better than Markdown, because you consistently get <code> elements instead of having to handle all the different ways code can be formatted in Markdown (e.g. code fences and indentation).

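```powershell
# Fetch the answer by ID; the 'withbody' filter adds the HTML body to the response
$answer = Invoke-RestMethod 'https://api.stackexchange.com/2.3/answers/1426271?site=superuser&filter=withbody'

# Extract the contents of every <code> element and decode HTML entities such as &lt; and &amp;
$codes = [RegEx]::Matches( $answer.items.body, '(?s)<code>(.*?)</code>' ).ForEach{ [System.Net.WebUtility]::HtmlDecode( $_.Groups[1].Value ) }

# This gives you the PowerShell script for this particular answer only!
$codes[6]
```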
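
The answer ID and site are hard-coded in the request above. As a rough sketch, they could instead be derived from the answer URL given in the question. `Get-AnswerApiUri` is a hypothetical helper name, not part of any module. The sketch assumes that the answer ID is the URL fragment (as in the example link) and that the API's `site` parameter accepts the full host name, so it is worth checking both against the API documentation.

```powershell
# Hypothetical helper: builds the API request URI from an answer URL.
# Assumes the URL fragment (e.g. '#1426271') is the answer ID.
function Get-AnswerApiUri {
    param( [Parameter(Mandatory)] [uri] $AnswerUrl )

    $id = $AnswerUrl.Fragment.TrimStart('#')
    if( $id -notmatch '^\d+$' ) {
        throw "Could not find an answer ID in '$AnswerUrl'"
    }

    # Assumption: 'site' accepts the full domain name as well as the short form
    'https://api.stackexchange.com/2.3/answers/{0}?site={1}&filter=withbody' -f $id, $AnswerUrl.Host
}

$uri = Get-AnswerApiUri 'https://superuser.com/questions/176624/linux-top-command-for-windows-powershell/1426271#1426271'
# $answer = Invoke-RestMethod $uri
```

The resulting `$uri` can then be passed to Invoke-RestMethod in place of the literal URL.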

As there can be multiple <code> elements, you might want to use heuristics to determine which one contains the PowerShell script, e.g. sort by length and check whether the code spans multiple lines.
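
A minimal sketch of such a heuristic: keep only the snippets that span more than one line, then take the longest. `Select-LikelyScript` is a hypothetical helper name and the sample strings are stand-in data; "longest multi-line snippet" is only a guess at what the script is, so the result should be reviewed before it is run.

```powershell
# Hypothetical helper: returns the longest multi-line snippet, or nothing if there is none.
function Select-LikelyScript {
    param( [string[]] $Snippet )

    # Keep non-empty snippets with more than one line, then pick the longest
    $Snippet |
        Where-Object { $_ -and (@($_.Trim() -split '\r?\n').Count -gt 1) } |
        Sort-Object Length -Descending |
        Select-Object -First 1
}

# Stand-in data; in practice, pass the $codes array extracted from the answer body
$sample = 'top',
          "Get-Process |`n    Sort-Object CPU -Descending |`n    Select-Object -First 10",
          "Write-Host 1`nWrite-Host 2"
$best = Select-LikelyScript $sample

# Save the result under the file name from the question
if( $best ) { Set-Content -Path mytop.ps1 -Value $best }
```

Inline mentions of commands such as `top` are single-line and are filtered out, which is the main point of the multi-line check.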

zett42