Filter output of `man` command with PHP for headers, when the text is in a strange/unknown encoding

Question

In "Output of `man` does not match an apparently-identical string literal", I solved the problem by running hexdump() on the string, copying the hexdump output into my code as a string literal, and comparing against that. But what if I'd like to match every line that contains only capital letters? (I am trying to extract all the headers of the man page: not just "NAME", but also "SYNOPSIS", "DESCRIPTION", and so on.)

The following attempts do not work; neither one picks out the headings I want:

$matches = array();
$ans = preg_match("/^[A-Z]+$/u", $text, $matches);
// then filter out all the lines where $ans = 1;

//OR:
$matches = array();
$ans = preg_match("/\s/u", $text, $matches);
// then filter out all the lines where $ans = 0, since headers do not have whitespace;

How do I do this? Should I first convert the string in each line to ASCII and/or UTF-8, and then try to match? I tried that, and it didn't work either:

$text = iconv(mb_detect_encoding($text, mb_detect_order(), true), "ASCII", $text);
// and then use the filtering code given above

What should I do?

(Also, what encoding could these strings be possibly in? And why is the man output in such an encoding?)

There's a lot of useful stuff about the formatting of man pages and regular expressions here: https://stackoverflow.com/questions/56722611/grep-not-matching-certain-parts-of-man-page — Andy Preston, Mar 16 '23 at 11:58
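
One possible explanation, offered as an assumption rather than something confirmed for this system: some `man` implementations render bold and underlined text with overstrike sequences (a character, a backspace byte 0x08, then the character again), so a header can arrive as `N\x08NA\x08AM\x08ME\x08E` instead of `NAME`. If that is what the hexdump shows, a rough sketch of the filtering might look like the following. `strip_overstrike()` and `extract_headers()` are hypothetical helper names, and the sample text is made up for illustration.

```php
<?php
// Hypothetical helper: remove overstrike sequences such as "N\x08N" (bold)
// or "_\x08N" (underline) by dropping every byte that is immediately
// followed by a backspace. Assumes the overstruck characters are ASCII.
function strip_overstrike(string $text): string
{
    return preg_replace('/[^\x08]\x08/s', '', $text) ?? $text;
}

// Hypothetical helper: return every line made up only of capital letters
// and spaces (for example "NAME" or "SEE ALSO").
function extract_headers(string $text): array
{
    $clean = strip_overstrike($text);
    // The "m" flag makes ^ and $ match at the start and end of each line,
    // not just at the start and end of the whole string.
    preg_match_all('/^[A-Z][A-Z ]*$/m', $clean, $matches);
    return $matches[0];
}

// Made-up sample imitating overstruck man output (an assumption, not real output).
$sample = "N\x08NA\x08AM\x08ME\x08E\n"
        . "       ls - list directory contents\n"
        . "\n"
        . "S\x08SE\x08EE\x08E A\x08AL\x08LS\x08SO\x08O\n"
        . "       dir(1)\n";

print_r(extract_headers($sample)); // lists NAME and SEE ALSO
```

If the real output turns out not to contain backspaces, the hexdump should show which other bytes need to be stripped before matching.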
