1

I am on Windows and I am using the "Git for Windows" tools in batch files. My extracted code from the HTML site looks like this:

<a xmlns="http://www.w3.org/2000/svg" class="ZLl54 Dysyo" href="./g/git-for-windows/c/jgZ6P7bo7Fo"><div class="t17a0d"><span class="o1DPKc">[ANNOUNCE] Git for Windows 2.41.0</span></div><div class="WzoK">Dear Git users, I hereby announce that Git for Windows 2.41.0 is available from: https://</div></a>

and I want to extract /g/git-for-windows/c/jgZ6P7bo7Fo with sed or awk. The first part is always the same /g/git-for-windows/c/ but the ending of the url part differs.

What I did: sed 's/^.*\("./g/".*"><div\").*$/\1/' text.txt | tee text2.txt but it doesn't work.

What I want: I want to extract the uppermost (always the latest) link to a new release of "Git for Windows" from the website https://groups.google.com/g/git-for-windows. The description shows Announce. Here are my steps:

xidel https://groups.google.com/g/git-for-windows --printed-node-format html -e "//'Links:',//a" | tee text.txt

to get the website as text. Then I used cat text.txt | grep -F "announce" | head -1 | tee text1.txt. The result is the extracted code I posted above.

My questions: How do I use sed or awk correctly to extract the link /g/git-for-windows/c/jgZ6P7bo7Fo from the code? Or how can I use xidel in a better way to get results that are easier to extract in the text file?

Thank you for your help.

Magoo
areich1976
  • Unless I'm not understanding you, if you were trying to retrieve the download URL for Git-For-Windows, you could do that very simply in PowerShell. Example for 64-bit version:```Invoke-RestMethod -Method Get -URI 'https://api.github.com/repos/git-for-windows/git/releases/latest' | ForEach-Object Assets | Where-Object Name -Like '*64-bit.exe' | Select-Object -ExpandProperty Browser_Download_URL```. For the 32-bit version you'd just change the `-Like` string to `'*32-bit.exe'`. You could save that to a variable then download it using ```Invoke-WebRequest -URI …``` too. – Compo Jun 17 '23 at 12:59
  • Obviously, if you wanted to do this from a batch file, to stay on topic, then you'd save it to a variable like this: ```@For /F "Delims=" %%G In ('%SystemRoot%\System32\WindowsPowerShell\v1.0\powershell.exe -NoProfile -Command "Invoke-RestMethod -Method Get -URI 'https://api.github.com/repos/git-for-windows/git/releases/latest' | ForEach-Object Assets | Where-Object Name -Like '*64-bit.exe' | Select-Object -ExpandProperty Browser_Download_URL" 2^>NUL') Do @Set "URL=%%G"```. Then use `%URL%` elsewhere in your code as required. – Compo Jun 17 '23 at 13:07
  • Good evening. Thank you for your answer. I don't want to download any assets from the GitHub repos; I just wanted a way to get the link out of the code I posted. For downloading the latest or newest (pre-releases included) assets from GitHub etc., I use the Python script lastversion (https://github.com/dvershinin/lastversion). It also features a version-comparison switch, -gt. When I posted the question, my extracted HTML code was at first rendered as HTML. Magoo edited it so that the code is shown now ;-) – areich1976 Jun 17 '23 at 18:01

4 Answers

2

You do not need to call so many tools.

Everything can be selected with XPath alone:

  xidel https://groups.google.com/g/git-for-windows -e "//a[contains(., 'ANNOUNCE')]/@href"
Reino
BeniBela
  • Your answer is really perfect, too. I just tried it out. Yesterday I took my first steps with xidel and I didn't find an exercise for that. Maybe you know a good online resource for learning more advanced usage? My goal is to get the HTML parsing done by xidel as the first solution. Please consider your answer upvoted, thank you. As I wrote to Magoo, I cannot upvote; I have too little reputation as a beginner here ;-) – areich1976 Jun 17 '23 at 17:01
  • For the most helpful answer I have chosen Magoo, because a batch solution at the point I stumbled was my first question and a xidel all-in-all solution the second. But both of you have somehow earned the "most helpful solution"- Thx a lot! – areich1976 Jun 17 '23 at 17:06
  • I think `-e "//a[contains(div/span,'ANNOUNCE')]/@href"` would be better, as it prevents duplicates. – Reino Jun 17 '23 at 21:42
  • @areich1976 I'd say, have a look at https://github.com/benibela/xidel/issues/67. – Reino Jun 17 '23 at 21:44
  • @Reino Thanks for the really good link for xidel and for your addition on its use. – areich1976 Jun 18 '23 at 09:26
  • @Reino But perhaps that is too much as an introduction – BeniBela Jun 18 '23 at 11:59
  • For all readers wanting to use the full xidel solution: `xidel https://groups.google.com/g/git-for-windows -e "//a[contains(div/span,'ANNOUNCE')]/@href" | head -1 - | cut -c2-` Put it into a `FOR /F` loop (escape the pipes) to set it as a variable, or `echo/tee` it into a text file. The `head` command picks the newest result and `cut -c2-` removes the leading dot. – areich1976 Jun 18 '23 at 12:02
  • @areich1976 I just searched around, and found a basic XPath game: https://topswagcode.com/xpath/# – BeniBela Jun 18 '23 at 12:03
  • @BeniBela Just try it out. Thank you for the link ;-) – areich1976 Jun 18 '23 at 12:14
  • @areich1976 If you simply want the first result, then there's no need for other tools (kinda strange to see Bash tools in this Windows command-line btw). Xidel can of course do this too: `-e "(//a[contains(div/span,'ANNOUNCE')])[1]/substring(@href,2)"`. – Reino Jun 18 '23 at 12:35
  • @Reino Thank you. xidel is really a great tool. I am only a hobby coder who learned on DOS in the late '80s, so I still use Windows cmd programming. The Bash tools are shorter in pipes and really easy to use, and they have a wide range of functions. Sometimes there are formatting issues. For my simple needs they run perfectly. ;-) – areich1976 Jun 18 '23 at 12:58
0

This would work:

curl https://...  | grep -E -o ">\[ANNOUNCE.{0,800}" | grep ">\[ANNOUNCE.*href" | sed 's/<\/span.*href="\.\([^"]*\).*/ \1/'
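
If the single matching line is already saved to a file, as in the question, a sed-only variant may be enough. This is a minimal sketch, assuming GNU sed from Git for Windows, a POSIX shell such as Git Bash (cmd.exe quoting differs), and that the line contains exactly one href attribute. The file name sample.html and the variable link are hypothetical.

```shell
# Sample line from the question, written to a hypothetical file sample.html.
cat > sample.html <<'EOF'
<a xmlns="http://www.w3.org/2000/svg" class="ZLl54 Dysyo" href="./g/git-for-windows/c/jgZ6P7bo7Fo"><div class="t17a0d"><span class="o1DPKc">[ANNOUNCE] Git for Windows 2.41.0</span></div><div class="WzoK">Dear Git users, I hereby announce that Git for Windows 2.41.0 is available from: https://</div></a>
EOF

# -n plus the p flag prints only lines where the substitution matched.
# href="\. consumes the attribute name and the leading dot;
# \([^"]*\) captures everything up to the closing double quote.
link=$(sed -n 's/.*href="\.\([^"]*\)".*/\1/p' sample.html | head -n 1)
echo "$link"
```

The key difference from the attempt in the question is `[^"]*`, which stops at the first closing quote instead of letting `.*` run on to a later one.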
Renat
  • I get a "path is not valid" error. Can you please add some explanation to your solution? Thank you for your input and help. – areich1976 Jun 17 '23 at 17:08
  • Please try replacing `https://...` with a valid URL. I shortened the command line to be more concise. – Renat Jun 17 '23 at 17:14
  • Code works! I just had to escape the `<` as `^<` within the sed command. On Windows use `curl https://groups.google.com/g/git-for-windows | grep -E -o ">\[ANNOUNCE.{0,800}" | grep ">\[ANNOUNCE.*href" | sed 's/^<\/span.*href="\.\([^"]*\).*/ \1/'` A bit late now, but thank you @Renat for the grep-sed variant. I am still trying to understand the {0,800} in the grep part. – areich1976 Jun 22 '23 at 10:40
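
On the `{0,800}` in the grep part: in an extended regular expression, `.{0,800}` is an interval quantifier meaning "between 0 and 800 of any character". Combined with `-o`, grep prints the literal `>[ANNOUNCE` plus at most 800 characters that follow it on the same line. The sketch below shows the same idea with a limit of 5 on a made-up input, assuming GNU grep; the variables sample and match are hypothetical.

```shell
# Hypothetical short input; the real page source is far longer.
sample='xx>[ANNOUNCE] abcdefghij'

# -o prints only the matched part of the line;
# .{0,5} allows at most five further characters after the literal text.
match=$(printf '%s\n' "$sample" | grep -E -o '>\[ANNOUNCE.{0,5}')
echo "$match"
```

This prints `>[ANNOUNCE] abc`: the literal prefix followed by the five characters `] abc`.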
0
@ECHO OFF
SETLOCAL
rem The following setting for the file is a name
rem that I use for testing and deliberately includes spaces to make sure
rem that the process works using such names. These will need to be changed to suit your situation.

SET "sourcedir=u:\your files"
SET "filename1=%sourcedir%\q76495893.txt"

SET "extracted="
FOR /f "usebackqdelims=" %%e IN ("%filename1%") DO (
 FOR %%o IN (%%e) DO (
  IF DEFINED extracted FOR /f "delims=<>" %%y IN ("%%o") DO SET "extracted=%%~y"&GOTO gotit
  IF "%%~o"=="href" SET "extracted=x"
 )
)
ECHO NOT found
GOTO :eof

:gotit
SET "extracted=%extracted:~1%"
ECHO extracted=%extracted%

GOTO :EOF

Since you tagged the post "batch", the above is a pure batch approach.

Read each line of the file into %%e. Use standard list-processing of %%e to set %%o to each token in turn (tokens are separated by spaces, and = also acts as a separator, so href and its quoted value arrive as two tokens). When the href token is found, set extracted for use as a flag. When the next token arrives, tokenise it on the redirectors (< and >) to grab the quoted string, assign that, minus the quotes, to extracted, and we're done.

Well, almost. The first character still has to be removed, as you want the string minus the leading dot.
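
Since the question also asked about awk, the same flag-then-next-token idea can be sketched there. This is only an illustration, assuming a POSIX shell such as Git Bash and any POSIX awk; the variables line and link are hypothetical, and the sample line is cut off after the first div tag for readability.

```shell
# Sample line from the question, shortened after the first <div>.
line='<a xmlns="http://www.w3.org/2000/svg" class="ZLl54 Dysyo" href="./g/git-for-windows/c/jgZ6P7bo7Fo"><div class="t17a0d">'

# Split the line on double quotes. The field ending in href= acts as the flag,
# and the field after it is the quoted value; substr(..., 2) drops the leading dot.
link=$(printf '%s\n' "$line" | awk -F'"' '{ for (i = 1; i < NF; i++) if ($i ~ /href=$/) { print substr($(i + 1), 2); exit } }')
echo "$link"
```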

Magoo
  • Great, your solution works like a charm and has the exact output. Thank you @Magoo for your fast answer. To be honest, firstly I needed some time to understand your formula ;-) But I learn by doing. PS: The system says I cannot upvote because of my low reputation, but feel it upvoted. – areich1976 Jun 17 '23 at 16:53
0

Based upon you already having the shown string as the content of a file named text1.txt, then a batch file could retrieve the required substring like this:

@Set /P "URL=" 0<"text1.txt"
@For /F Tokens^=2^ Delims^=^" %%G In ("%URL:*href=%") Do @Set "URL=%%~G"
@Echo %URL:~1%

How it works:

  1. Save the first line of text1.txt as the content of a variable named URL.
  2. Expand that variable, replacing everything up to and including the first instance of the string href with nothing, (="./g/git-for-windows/c/jgZ6P7bo7Fo"><div class="t17a0d"><span class="o1DPKc">[ANNOUNCE] Git for Windows 2.41.0</span></div><div class="WzoK">Dear Git users, I hereby announce that Git for Windows 2.41.0 is available from: https://</div></a> ). Then delimit it by doublequotes, asking for the second token, (= being the first). This results in the full URL only, (./g/git-for-windows/c/jgZ6P7bo7Fo), overwriting the initial variable value.
  3. Expand the resulting variable skipping the first character, %URL:~1%.
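
For readers more used to a POSIX shell (for example Git Bash), the three steps above can be mirrored one to one, which may help to see what each batch expansion does. This is a sketch only; the variables line, rest and url are hypothetical, and the sample line is cut off after the first div tag.

```shell
# Step 1: the line as it would be read from text1.txt (shortened here).
line='<a xmlns="http://www.w3.org/2000/svg" class="ZLl54 Dysyo" href="./g/git-for-windows/c/jgZ6P7bo7Fo"><div class="t17a0d">'

# Step 2a: drop everything up to and including the first "href"
# (the counterpart of %URL:*href=%).
rest=${line#*href}

# Step 2b: split on double quotes and take the second token
# (the counterpart of Tokens=2 Delims=").
url=$(printf '%s\n' "$rest" | cut -d'"' -f2)

# Step 3: skip the first character, the leading dot (the counterpart of %URL:~1%).
url=${url#?}
echo "$url"
```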

If you just wanted the ending part, then:

@Set /P "URL=" 0<"text1.txt"
@For /F Tokens^=2^ Delims^=^" %%G In ("%URL:*href=%") Do @Echo %%~nxG
Compo
  • Thanks @Compo, it also works perfectly. :-) It was my fault: I had not pushd'd to the correct dir within cmd before testing. A beginner's mistake on my part. – areich1976 Jun 18 '23 at 09:59