0

I have the following raw content in a file, and I just want to print the list of all the URLs. I have written part of a script: it reads the file's content and loops over it with a ForEach (line in lines), but I do not know how to extract just the URL from each line. Any thoughts?

Line 18942:         "url": "http://harvardpolitics.com/tag/brussels/",
Line 18994:         "url": "http://203.36.101.164/4f64555b4217b47b7c64b3fec19e389b/1502455203/Telstra/Foxtel-Vod/fxmultismvod5256/store2/ON307529/ON307529_hss.ism/QualityLevels(791000)/Fragments(video=9900000000)"
Line 19044:         "url": "https://www.gucci.com/int/en/ca/women/handbags/womens-shoulder-bags-c-women-handbags-shoulder-bags?filter=%3ANewest%3Acolors%3AGold%7Ccb9822",
Line 19096:         "url": "https://bagalio.cz/batohy-10l?cat=3p%3D1urceni%3D2582p%3D1kapsa_ntb_velikost%3D2179p%3D1manufacturer%3D1302p%3D1color%3D84p=1kapsa_ntb_velikost=2192",
Line 19148:         "url": "http://www.csillagjovo.gportal.hu/gindex.php?pg=31670155",
Line 19200:         "url": "http://www.copiersupplystore.com/hp/color-laserjet-4700dn/j7934a-j7934ar",
user1911509
  • 1
    Where do the line numbers come from - are they in the file or are they your addition to it? It looks like part of a JSON file - If so, use `ConvertFrom-Json`. – TessellatingHeckler Aug 11 '17 at 23:54
  • Absolutely true, they come from an API response, which is a JSON blob. I filtered it in Notepad++ on "url" and got a list of around 400 URLs. I tried to parse them, but nothing was working. I will try `ConvertFrom-Json` and see if it works. – user1911509 Aug 12 '17 at 20:14
  • 1
    `Invoke-RestMethod` will implicitly convert API responses from JSON into PowerShell objects, btw, instead of `Invoke-WebRequest` – TessellatingHeckler Aug 12 '17 at 22:24
  • `Invoke-RestMethod` worked; it came in handy and is a better solution than `Invoke-WebRequest`. I appreciate your help. – user1911509 Aug 13 '17 at 14:58
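
A minimal sketch of the `ConvertFrom-Json` approach suggested in the comments above. The real structure of the API response is not shown in the question, so the `items` wrapper and the two-entry sample below are assumptions made purely for illustration; only the `url` property name comes from the question.

```powershell
# Hypothetical sample: the real response layout is unknown, "items" is an assumed name.
$json = @'
{
  "items": [
    { "url": "http://harvardpolitics.com/tag/brussels/" },
    { "url": "http://www.csillagjovo.gportal.hu/gindex.php?pg=31670155" }
  ]
}
'@

# Parse the JSON text into PowerShell objects.
$data = $json | ConvertFrom-Json

# Member enumeration (PowerShell 3.0+) collects the url property of every item.
$urls = $data.items.url
$urls
```

For a response saved to disk, `Get-Content <file> -Raw | ConvertFrom-Json` reads the whole file as one string before parsing. The property path to the URLs would have to match the actual response.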

3 Answers

2

One way is to use the Substring method; another is to use a regex. Both versions are shown in the loop below.

$Text = Get-Content D:\Test\test.txt
foreach ($Line in $Text) {
    # SubString Version
    $FirstIndex = $Line.IndexOf('http')
    $URLLength = ($Line.LastIndexOf('"') - $FirstIndex)
    $Line.Substring($FirstIndex, $URLLength)

    # Regex Version: match http(s)/ftp(s)/sftp URLs up to the next whitespace or comma
    $Regex = '(http[s]?|[s]?ftp[s]?)(:\/\/)([^\s,]+)'
    ([regex]::Matches($Line, $Regex)).Value.TrimEnd('"')
}
Olaf Reitz
2

Try this out to just get the urls:

$content = Get-Content <file-with-output> # or other way of getting the data

$urls = $content | ForEach-Object { ($_ -replace ".+?(?=http.+)","").Trim('",')}

Edit: Added $urls to capture the result.

KarlGdawg
  • 3
    Just throwing in `$_ -replace '^.*(http[^"]+).*$', '$1'` as a simpler regex approach (no lookaround, no trim) – TessellatingHeckler Aug 11 '17 at 23:57
  • 1
    My regex is a bit weak, thank you for showing me a better way. – KarlGdawg Aug 12 '17 at 00:06
  • I tried the regex, but it outputs only one URL, the one from line 19200. Could it be something to do with how the data was copied into the file? As I mentioned above, it is the response from an API as a JSON blob; I filtered it in Notepad++ on "url" and a list of around 400 URLs showed up, but nothing I tried for parsing them worked. I will also try `ConvertFrom-Json`. – user1911509 Aug 12 '17 at 20:24
  • Thank you all. I used `ConvertFrom-Json` and everything worked fine. All the above solutions worked well for parsing the URLs and writing them to a file. I appreciate your help in resolving this. – user1911509 Aug 13 '17 at 14:56
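
As a rough illustration of the `-replace` pattern from the comment above, here it is applied to two of the sample lines from the question. The `$lines` array is a stand-in for the file content, not part of any answer.

```powershell
# Two sample lines copied from the question, including the "Line NNNN:" prefix.
$lines = @(
    'Line 18942:         "url": "http://harvardpolitics.com/tag/brussels/",'
    'Line 19200:         "url": "http://www.copiersupplystore.com/hp/color-laserjet-4700dn/j7934a-j7934ar",'
)

# Keep only the captured group: everything from "http" up to the next double quote.
$urls = $lines | ForEach-Object { $_ -replace '^.*(http[^"]+).*$', '$1' }
$urls
```

Because the leading `.*` is greedy, the capture starts at the last `http` on the line, so a URL that itself contains another `http` (for example in a redirect parameter) would be cut short. A line with no match is passed through unchanged by `-replace`.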
2
$Urls = Get-Content file.txt | ForEach-Object { $_.Split('"')[3] }
TessellatingHeckler