This is the relevant part of my XS code, which should convert a Perl string from UTF-8 to code points (unsigned 32-bit integers):

UV *
text2UV (SV *sv, STRLEN *lenp)
{
  STRLEN len;
  // char *str = SvPV(foo_sv, strlen);
// char *s =       SvPV (sv, len); // This original version warns
  U8 *s    = (U8 *)SvPV (sv, len); // This casts without warning
  UV *r = (UV *)SvPVX (sv_2mortal (NEWSV (0, (len + 1) * sizeof (UV))));
  UV *p = r;

  if (SvUTF8 (sv))
    {
       STRLEN clen;
       while (len)
         {
         // UV  utf8_to_uvchr_buf(const U8 *s, const U8 *send, STRLEN *retlen)
           *p++ = utf8n_to_uvchr (s, len, &clen, 0);

           if (clen == (STRLEN) -1) // STRLEN is unsigned, so "clen < 0" can never be true
             croak ("illegal unicode character in string");

           s += clen;
           len -= clen;
         }
    }
  else
    while (len--)
      *p++ = *(unsigned char *)s++;

  *lenp = p - r;
  return r;
}

It throws this warning:

~/github/perl/Text-Levenshtein-BVXS$ make
cp BVXS.pm blib/lib/Text/Levenshtein/BVXS.pm
Running Mkbootstrap for BVXS ()
chmod 644 "BVXS.bs"
"/Users/helmut/perl5/perlbrew/perls/perl-5.32.0/bin/perl" -MExtUtils::Command::MM -e 'cp_nonempty' -- BVXS.bs blib/arch/auto/Text/Levenshtein/BVXS/BVXS.bs 644
"/Users/helmut/perl5/perlbrew/perls/perl-5.32.0/bin/perl" "/Users/helmut/perl5/perlbrew/perls/perl-5.32.0/lib/5.32.0/ExtUtils/xsubpp"  -typemap '/Users/helmut/perl5/perlbrew/perls/perl-5.32.0/lib/5.32.0/ExtUtils/typemap'  BVXS.xs > BVXS.xsc
mv BVXS.xsc BVXS.c
cc -c  -I. -fno-common -DPERL_DARWIN -mmacosx-version-min=10.14 -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include -I/opt/local/include -DPERL_USE_SAFE_PUTENV -O3   -DVERSION=\"0.01\" -DXS_VERSION=\"0.01\"  "-I/Users/helmut/perl5/perlbrew/perls/perl-5.32.0/lib/5.32.0/darwin-2level/CORE"   BVXS.c
BVXS.xs:26:35: warning: passing 'char *' to parameter of type 'const U8 *' (aka 'const unsigned char *') converts between pointers to integer types with different sign [-Wpointer-sign]
           *p++ = utf8n_to_uvchr (s, len, &clen, 0);
                                  ^
/Users/helmut/perl5/perlbrew/perls/perl-5.32.0/lib/5.32.0/darwin-2level/CORE/utf8.h:74:54: note: expanded from macro 'utf8n_to_uvchr'
                                utf8n_to_uvchr_error(s, len, lenp, flags, 0)
                                                     ^
/Users/helmut/perl5/perlbrew/perls/perl-5.32.0/lib/5.32.0/darwin-2level/CORE/utf8.h:76:45: note: expanded from macro 'utf8n_to_uvchr_error'
                        utf8n_to_uvchr_msgs(s, len, lenp, flags, errors, 0)
                                            ^
/Users/helmut/perl5/perlbrew/perls/perl-5.32.0/lib/5.32.0/darwin-2level/CORE/inline.h:1781:36: note: passing argument to parameter 's' here
Perl_utf8n_to_uvchr_msgs(const U8 *s,
                                   ^
1 warning generated.
rm -f blib/arch/auto/Text/Levenshtein/BVXS/BVXS.bundle
cc  -mmacosx-version-min=10.14 -bundle -undefined dynamic_lookup -L/usr/local/lib -L/opt/local/lib -fstack-protector-strong  BVXS.o  -o blib/arch/auto/Text/Levenshtein/BVXS/BVXS.bundle  \
          \

It works and passes my tests. But if I want to release it on CPAN, the distribution should not throw warnings.

Decoding it with my own C code would be a workaround (and faster).

To me it looks like a bug in the XS macros, and/or the example in the documentation is wrong.
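
The warning can be reproduced outside of XS, which may help to see where it comes from. The following is a minimal sketch in plain C; `sum_bytes` and `sum_text` are hypothetical names, with `sum_bytes` standing in for an API that, like `utf8n_to_uvchr`, takes `const unsigned char *`, and `sum_text` standing in for a caller that holds the `char *` returned by `SvPV`.

```c
#include <stddef.h>

/* Hypothetical stand-in for an API that expects unsigned bytes,
   as utf8n_to_uvchr() does with its "const U8 *s" parameter. */
static unsigned sum_bytes(const unsigned char *s, size_t len)
{
    unsigned total = 0;
    for (size_t i = 0; i < len; i++)
        total += s[i];          /* each byte is read as 0..255 */
    return total;
}

/* Hypothetical caller that holds a plain char buffer,
   which is what SvPV() hands back. */
static unsigned sum_text(char *str, size_t len)
{
    /* Passing str directly would draw -Wpointer-sign from compilers
       such as clang; the explicit cast states the reinterpretation. */
    return sum_bytes((const unsigned char *)str, len);
}
```

The cast does not change any bytes; it only tells the compiler that the buffer is to be read as unsigned values.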

  • Re "*(and faster)*", Why would your C algorithm be faster than the C algorithm in Perl that does the same thing? – ikegami Feb 14 '22 at 18:04
  • Re "*the example in the documentation are wrong.*" To what example are you referring? There's no example in the docs for `utf8_to_uvchr_buf` or `utf8_to_uvchr`. – ikegami Feb 14 '22 at 18:05
  • Why are you using a function whose docs start with "**`DEPRECATED!`** It is planned to remove `utf8_to_uvchr` from a future release of Perl. Do not use it for new code; remove it from existing code." You even mention the correct function in the preceding comment. – ikegami Feb 14 '22 at 18:07
  • Re "*For me it looks like a bug in the XS macros*", No, you are passing a `char *` to a macro/function documented to expect a `const U8 *` (aka `const unsigned char *`). `char` may be a signed type. This warning is therefore expected from your code, not a bug. – ikegami Feb 14 '22 at 18:08
  • I can't think of a situation where a cast to `U8 *` wouldn't do the right thing. No idea why it expects `U8` rather than `char`. – ikegami Feb 14 '22 at 18:12
  • @ikegami Thanks for your comments. This was exactly my question: Why do they return `char *` (signed) if they expect `U8 *` in all the related, consuming macros? I answered my question myself and edited the code with comments. It now works without warnings. – Helmut Wollmersdorfer Feb 14 '22 at 19:47
  • Pure C would be faster, because I can omit checks. IMHO a library routine should not check UTF-8 again. Besides I can work on UTF-8 tokens without decoding, which is 3.3 times faster. – Helmut Wollmersdorfer Feb 14 '22 at 20:01
  • `utf8_to_uvchr_buf` IS pure C. Doesn't even work with scalars. – ikegami Feb 14 '22 at 21:29
  • Re "*I can work on UTF-8 tokens without decoding, which is 3.3 times faster.*", The whole point of your function is decoding. So this is a very confusing comment to make. How does it relate to your question or anything else? – ikegami Feb 14 '22 at 21:30
  • Re "*IMHO a library routine should not check UTF-8 again.*", Perl doesn't use UTF-8; it uses non-standard utf8. And a whole lot of people build corrupt scalars or at least use code that can (e.g. `open my $fh, "<:utf8", ...` instead of `open my $fh, "<:encoding(UTF-8)", ...`) – ikegami Feb 14 '22 at 21:32
  • Re my last comment, see [`:encoding(UTF-8)` vs `:encoding(utf8)` vs `:utf8`](https://stackoverflow.com/a/49040165/589924) – ikegami Feb 28 '22 at 20:59
  • @ikegami Sure, I know the difference between `UTF-8` (1-4 bytes) and `utf8` (1-6 bytes). At the moment, in [Text-Levenshtein-BVXS](https://github.com/wollmers/Text-Levenshtein-BVXS), I convert only 1-4 bytes in pure C. Without checks this is fast. – Helmut Wollmersdorfer Mar 24 '22 at 08:39
  • You completely misunderstood. I said people create corrupt scalars. That has nothing to do with the differences between utf8 and UTF-8. I never even mentioned the differences between utf8 and UTF-8?!?! I also never said anything about skipping the checks being slower?!?! Why did you come back a month later with this bonkers comment? – ikegami Mar 24 '22 at 14:39
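
For readers wondering what the unchecked "pure C" decoding of 1-4 byte sequences mentioned in the comments might look like, here is a minimal sketch. It is not the code from Text-Levenshtein-BVXS; the function name `utf8_decode_unchecked` is hypothetical, and the sketch assumes the input is already well-formed UTF-8.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode well-formed UTF-8 (1-4 bytes per character) into code points.
   No validation is performed: malformed or truncated input yields
   meaningless values and may read past the end of the buffer.
   `out` must have room for at least `len` entries.
   Returns the number of code points written. */
static size_t utf8_decode_unchecked(const unsigned char *s, size_t len,
                                    uint32_t *out)
{
    const unsigned char *end = s + len;
    uint32_t *p = out;

    while (s < end) {
        uint32_t c = *s++;
        if (c >= 0xF0) {            /* 4-byte sequence: 11110xxx */
            c = (c & 0x07) << 18;
            c |= (uint32_t)(*s++ & 0x3F) << 12;
            c |= (uint32_t)(*s++ & 0x3F) << 6;
            c |= (uint32_t)(*s++ & 0x3F);
        } else if (c >= 0xE0) {     /* 3-byte sequence: 1110xxxx */
            c = (c & 0x0F) << 12;
            c |= (uint32_t)(*s++ & 0x3F) << 6;
            c |= (uint32_t)(*s++ & 0x3F);
        } else if (c >= 0xC0) {     /* 2-byte sequence: 110xxxxx */
            c = (c & 0x1F) << 6;
            c |= (uint32_t)(*s++ & 0x3F);
        }                           /* otherwise a single ASCII byte */
        *p++ = c;
    }
    return (size_t)(p - out);
}
```

Skipping the checks is what makes this short, and it is also why it is only safe on input whose well-formedness is guaranteed elsewhere, which is the point of contention in the comments above.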

1 Answer

The interplay of U8 and char in the API is a bit weird. You might ask #p5p to see why it works that way.

Failing that, though, would some plain typecasting suppress the warnings? Is this in a public repository somewhere?

Aside: SvPV is evil. Its prevalence in XS modules causes quite a lot of pain. Avoid it when possible. See: https://dev.to/fgasper/perl-s-svpv-menace-5515

Update: This looks to be a case where it’s necessary to break the abstraction. Alas.
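
Breaking the abstraction here means branching on the storage format, as the question's code does with `SvUTF8`. The sketch below illustrates that two-path idea in plain C. The function name `buffer_to_codepoints` and the `is_utf8` parameter are hypothetical; `is_utf8` stands in for what `SvUTF8 (sv)` reports, and the UTF-8 branch assumes well-formed 1-4 byte sequences.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert a string buffer to code points.
   is_utf8 == 0: one byte per character (code points 0..255).
   is_utf8 != 0: UTF-8, assumed well-formed, 1-4 bytes per character.
   `out` must have room for at least `len` entries.
   Returns the number of code points written. */
static size_t buffer_to_codepoints(const unsigned char *s, size_t len,
                                   int is_utf8, uint32_t *out)
{
    const unsigned char *end = s + len;
    size_t n = 0;

    if (!is_utf8) {
        /* Fast path: every byte already is a code point. */
        while (s < end)
            out[n++] = *s++;
        return n;
    }

    while (s < end) {
        uint32_t c = *s++;
        int extra = 0;              /* continuation bytes still to read */

        if (c >= 0xF0)      { c &= 0x07; extra = 3; }
        else if (c >= 0xE0) { c &= 0x0F; extra = 2; }
        else if (c >= 0xC0) { c &= 0x1F; extra = 1; }

        while (extra-- > 0 && s < end)
            c = (c << 6) | (uint32_t)(*s++ & 0x3F);

        out[n++] = c;
    }
    return n;
}
```

With this shape, the byte `0xE9` with `is_utf8` off and the bytes `0xC3 0xA9` with `is_utf8` on both produce the single code point `0xE9`, so the result does not depend on how the string happens to be stored.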

  • Also, this post is completely wrong. The OP is using `SvPV` correctly. This is extremely rare, so your mistake is understandable. But it's a mistake nonetheless. The article suggests you should use `SvPVbyte` (inapplicable here) or `SvPVutf8` (applicable here), but there's a third option: Using `SvPV` plus two paths based on `SvUTF8`. It's double the code, but it's more efficient (for some inputs). The OP, obviously trying to optimize speed, opted for this approach. – ikegami Feb 28 '22 at 20:54
  • @ikegami: It is not possible to use SvPV correctly any more than it’s possible to use bytes.pm correctly. Both of these violate Perl’s string abstraction and are fundamentally wrong. That’s not to say that avoiding SvPV fixes all ills, of course. (I updated my comment to indicate its being off-topic.) – PeregrineYankee Mar 01 '22 at 14:33
  • You are mistaken, as I've already explained. The OP's code uses SvPV without "violating Perl’s string abstraction". The commonly-used term for such code is code that "suffers from The Unicode Bug", and the OP's code doesn't. Unlike DBI, `open`, and other code suffering from The Unicode Bug, the OP's code produces exactly the same result for `utf8::upgrade( $_ = chr(0xE9) )` and for `utf8::downgrade( $_ = chr(0xE9) )`. Again, the key is that the OP uses SvPV+SvUTF8, not just SvPV. More specifically, SvPV+SvUTF8 is being used in a way that's equivalent to SvUTF8 ? SvPVutf8 : SvPVbyte. – ikegami Mar 01 '22 at 16:27
  • Strictly speaking, SvPV _cannot_ be used without violating the abstraction. Neither can SvUTF8*, for that matter. The point of the abstraction is that the string stores code points in an undefined encoding, and things that aren’t Perl internals—whether that’s Perl code or C API callers—shouldn’t worry about how Perl stores the code points. To preserve the abstraction the caller should call SvPVutf8 and then parse the UTF-8 itself, but of course that would entail a performance hit for downgraded strings, so in this case SvUTF8 is appropriate, lamentably. – PeregrineYankee Mar 01 '22 at 20:16
  • Re "*Strictly speaking, SvPV cannot be used without violating the abstraction.*", That's completely false. Re "*shouldn’t worry about how Perl stores the code point*", nonsense. That's the entire point of XS. It's used to convert to and from scalars in order to interface with non-Perl code. This includes supporting the different string formats (UTF8=0, UTF8=1) and the different number formats (IV, UV, NV). By definition, it can't be an abstraction violation. And we've already covered it's not The Unicode Bug, which the linked document is all about. – ikegami Mar 01 '22 at 23:12
  • (The complete lack of mention of the standard `char*` typemap in the linked doc is quite weird.) – ikegami Mar 01 '22 at 23:12
  • ikegami: The point of XS is merely to interface with C code. Perl’s C API is just that: an API. It is not Perl internals. This API does indeed offer controls to poke at SV internals, alas, but it’s more ideal not to do that, IMO, just as a matter of robustness. In this case, the caller might as well just call SvPVutf8. There’ll be some performance hit if Perl stores the string downgraded (i.e., “Latin-1”/non-UTF8-flagged), but since this is Unicode-aware code that seems relatively unlikely. – PeregrineYankee Mar 04 '22 at 14:19
  • If the linked doc you reference is the “SvPV Menace” one, it does indeed discuss `char*` in the default typemap. See “How did this come to be?” – PeregrineYankee Mar 04 '22 at 14:20
  • Let me rephrase. It doesn't mention that using `char*` is just as bad. It said it searched other distros for `SvPV`, but doesn't mention anything about searching for `char*`, etc. – ikegami Mar 04 '22 at 17:58
  • Re "*The point of XS is merely to interface with C code*", That is what I said. // Re "*It is not Perl internals.*", I never said it was // Re "*This API does indeed offer controls to poke at SV internals*", Building a scalar is not "poking at internals". Getting values from a scalar is not "poking at internals". Not in a general sense, and not in an official sense. This is a public API. And as I already said, they're essential parts of interfacing with C code. – ikegami Mar 04 '22 at 18:04
  • Re "*that seems relatively unlikely.*", 1) That's not true. A lot of strings are downgraded, even in Unicode-aware apps, and 2) it's not relevant. You weren't arguing that SvPV was unnecessary for this program; you are arguing that it's wrong. – ikegami Mar 04 '22 at 18:04
  • In my experience, downgraded strings that are meant as Unicode are rare, for the simple reason that these strings have arrived into Perl as UTF-8, and so to decode them Perl simply sets the UTF8 flag. That doesn’t make it _wrong_ to have Unicode downgraded, but it’s rather unusual. Poking at SvUTF8 from the C API, while supported, is like reading utf8::is_utf8 from Perl. It’s a bad idea. This thread highlights a case where it’s advantageous for performance; I personally would regard that more as a lacuna in the API than as a desirable state of affairs. – PeregrineYankee Mar 09 '22 at 00:58
  • Re "*In my experience, downgraded strings that are meant as Unicode are rare,*", Quite the opposite! They're *extremely* common. Take for example `use utf8; my $x = "abc";`. That's why you can't use the UTF8 flag to infer semantics. /// Re "*Poking at SvUTF8 from the C API, while supported, is like reading utf8::is_utf8 from Perl. It’s a bad idea*", Quite the contrary. It's necessary. Using SvPV without that creates an instance of the Unicode Bug. That's really bad. The answer under which this comment is found links to an article about it. – ikegami Mar 09 '22 at 03:06
  • Note that even the article to which you linked disagrees with you: "For SvPV to be meaningful it has to be used in tandem with SvUTF8, a macro that tells you which form the PV is" – ikegami Mar 09 '22 at 03:10
  • @PeregrineYankee My question in the OP was about the warning, not how I get a string from Perl. Like @ikegami writes, I use `SvUTF8` to distinguish between bytes and `utf8`. Bytes can be Latin-1 (or anything). `utf8` can be `utf8` or `UTF-8`, which I decode to `uint32_t * codepoints[]` in C. – Helmut Wollmersdorfer Mar 24 '22 at 14:51
  • It can still be "bytes or latin-1 or anything" with SvUTF8 set. SvUTF8 doesn't indicate anything about the value of the string. Your code does the right thing, but your explanation is completely wrong. The code converts each character of a string into a uint32_t, and it does so correctly regardless of what the character means or how it's stored. – ikegami Mar 24 '22 at 15:34
  • @ikegami Thanks, but now I am more confused. Does this mean the Perl API has no reliable way to get utf8 or byte strings? What's the right way to distinguish them in C/XS? According to `perlapi` and [perlguts](https://perldoc.perl.org/perlguts#How-do-I-pass-a-Perl-string-to-a-C-library%3F) it's `SvUTF8()`. Or not? – Helmut Wollmersdorfer Mar 24 '22 at 16:02
  • No, it means that Perl has no way to infer semantics. Does the string contain decoded text? text encoded using UTF-8? text encoded using latin-1? Packed temperature readings? Perl has no way to know. Just like it can't infer if a number represents a sum, a code point or temperature reading based on whether it's stored in an IV, UV or NV. It's all just characters (numbers) to it. Certainly, most operators and subs have requirements on the value, but Perl has no way to know if the value conforms or not. – ikegami Mar 24 '22 at 16:57
  • @ikegami: Re rarity of downgraded strings as Unicode: By “downgraded” I meant storing 128-255 as Latin-1 internally. That *is* rare, for the reason I gave. Checking the UTF8 flag from C is *NOT* necessary; you can use SvPVbyte/SvPVutf8 to avoid it. – PeregrineYankee Mar 25 '22 at 15:06
  • @HelmutWollmersdorfer: Ideally you’d just use SvPVutf8 and parse all strings as UTF-8. There would be a performance hit for downgraded code points in the 128-255 range, but as I wrote earlier, that’s rare. perlapi & perlguts in blead (soon to be 5.36) have modifications that discourage use of SvPV for reasons such as I’ve given in this thread. – PeregrineYankee Mar 25 '22 at 15:08
  • Re "*that’s rare.*", Except it's not. It's extremely extremely common. Even when using `use utf8;`. I already pointed this out. For example, `use utf8; my $x = "abc";` and `no utf8; my $x = "abc";` produce exactly the same var. I mean both externally and internally. Neither have `SVf_UTF8` set. – ikegami Mar 25 '22 at 15:17
  • Also, rarity is also only half of the equation. You also need the size of the cost differences between the two situations. The bigger the difference in costs, the less rarity matters. – ikegami Mar 25 '22 at 15:21
  • @PeregrineYankee "downgraded" to Latin-1 (bytes) isn't rare. There is e.g. still code (also on CPAN) with source files in a legacy 8-bit character encoding. As my C-code has no character semantics I can treat this case as `uint8_t`. – Helmut Wollmersdorfer Mar 25 '22 at 16:08
  • @ikegami: Again, I’m talking about the 128-255 range. "abc" isn’t in that range. And yes, for performance reasons it’s justified—here—to look at the UTF8 flag manually since Perl’s API provides no way to fetch the code point at a given character offset in the string. As I wrote above, this is an unfortunate lacuna in the API. – PeregrineYankee Mar 25 '22 at 19:09
  • Re "*Again, I’m talking about the 128-255 range.*", Aside from this being the first time you mention it, it's absolutely not true. We're talking about using SvPVutf8 unconditionally instead of using SvUTF8. ("*Ideally you’d just use SvPVutf8 and parse all strings as UTF-8.*") That has nothing to do with "the 128-255 range", whatever that means. And that would affect the strings in my examples. – ikegami Mar 25 '22 at 19:12
  • @HelmutWollmersdorfer: I don’t doubt that CPAN stores Latin-1 source code that stores characters 128-255. Can you cite 3 examples, though (without looking for them specifically)? In any modern application your Unicode input is going to start out UTF-8, and Perl is going to leave it as UTF-8 (internally) when it decodes. Generally speaking, XS modules that need Unicode won’t downgrade, either. – PeregrineYankee Mar 25 '22 at 19:17
  • Re "*In any modern application your Unicode input is going to start out UTF-8,*", For the third time, that's absolutely false. It's extremely common to have downgraded strings in a Unicode-aware program. – ikegami Mar 25 '22 at 19:19
  • @ikegami 128-255 refers to code points, of course. Those are the non-ASCII code points that Latin-1 can express but for which UTF-8 uses 2 bytes; this range is, thus, the crux of “the Unicode bug”. Using SvPVutf8 when you want to parse a Perl string as Unicode will always lead you to the right place, and more simply than parsing 2 different internal encodings. – PeregrineYankee Mar 25 '22 at 19:20
  • What in the world makes you think I didn't know all that?! – ikegami Mar 25 '22 at 19:21
  • @ikegami: The examples you provided are cases where upgraded/downgraded makes no difference insofar as the internal storage. 128-255 is the critical range. – PeregrineYankee Mar 25 '22 at 19:21
  • Not so. There's most definitely a performance difference for working in UTF-8 even if the string contains only ASCII-ranged chars. – ikegami Mar 25 '22 at 19:22
  • @ikegami: “It's extremely common to have downgraded strings in a Unicode-aware program.” If you mean all-ASCII strings, sure. If you mean 128-255, beg to differ. Would love to learn otherwise. – PeregrineYankee Mar 25 '22 at 19:24
  • I said what I meant. I didn't mention the value of any of the characters in the string because they are irrelevant. – ikegami Mar 25 '22 at 19:24
  • ah, I see you edited your earlier comment after I replied to it and are now pretending like it was like that all along, so let me clarify "There would be a performance hit for downgraded code points in the 128-255 range" is incorrect. It should be "There would be a performance hit for downgraded code points in the 0-255 range". And this is extremely common. Now reread the following conversation with that in mind. – ikegami Mar 25 '22 at 19:32
  • If the string is all ASCII but the UTF8 flag is off, Perl can just give the PV as SvPVutf8’s result. If the string has 128-255 but is downgraded, though, Perl has to upgrade the string in order to implement SvPVutf8. (I think Perl still sets UTF8 on in the all-ASCII case, though.) I guess if the cost of scanning the string in the all-ASCII case is prohibitive, then that changes the calculus. In a code review I’d ask for some profiling, though. – PeregrineYankee Mar 25 '22 at 19:41
  • Oh, right, there's that extra cost too. I didn't even mention the conversion cost. I had only mentioned the costs of treating the string as UTF-8 (inability to index, having to check each char for how many bytes it has, etc). This is on top of that. – ikegami Mar 26 '22 at 06:16
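
The conversion cost discussed at the end of this thread can be made concrete with a small sketch. This is plain C written for illustration, not Perl's own upgrade routine, and `latin1_to_utf8` is a hypothetical name. It shows why a downgraded string containing code points 128-255 has to be rewritten before it can be handed out as UTF-8, while ASCII bytes pass through unchanged.

```c
#include <stddef.h>

/* Encode one-byte-per-character data (code points 0..255) as UTF-8.
   `out` must have room for 2 * len bytes.
   Returns the number of bytes written. */
static size_t latin1_to_utf8(const unsigned char *s, size_t len,
                             unsigned char *out)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = s[i];
        if (c < 0x80) {
            out[n++] = c;                                   /* ASCII: unchanged */
        } else {
            out[n++] = (unsigned char)(0xC0 | (c >> 6));    /* lead byte */
            out[n++] = (unsigned char)(0x80 | (c & 0x3F));  /* continuation byte */
        }
    }
    return n;
}
```

Even for all-ASCII input, where the output bytes are identical, the loop still has to look at every byte, and the consumer then pays the per-character cost of parsing UTF-8 on top of that.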