
SBLineEntry is a proxy object in the LLDB Python interface. SBLineEntry.GetColumn() returns a column position within a line, but I am not sure what that position actually counts.

In the C++ source, it resolves to the LineEntry.column value, but that also doesn't say what unit the column is measured in.

At first I assumed it was a UTF-8 code unit offset, but when I measured it, it looked like a UTF-16 code unit offset instead. I still couldn't find any definition for this value.

What is this value?

  • Raw byte offset in source code file?
  • UTF-8 code unit offset?
  • UTF-16 code unit offset?
  • Something else?
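To make the distinction concrete, here is a small Python sketch (it doesn't use LLDB at all; the sample line and identifier are hypothetical) showing how the candidate offsets diverge as soon as a line contains a non-ASCII character:

```python
# Hypothetical source line containing "é", which is 1 Unicode scalar,
# 1 UTF-16 code unit, but 2 UTF-8 bytes.
line = 'let café = 1;'
ident = 'café'

# Unicode scalar (character) index where "café" ends:
scalars = line.index(ident) + len(ident)                    # 8

# Byte offset in a UTF-8 encoded file up to that point:
utf8_units = len(line[:scalars].encode('utf-8'))            # 9

# UTF-16 code unit offset (same as the scalar count here,
# since "é" fits in one UTF-16 unit):
utf16_units = len(line[:scalars].encode('utf-16-le')) // 2  # 8

print(scalars, utf8_units, utf16_units)
```

So for the same position, a tool counting UTF-8 code units and one counting UTF-16 code units would report different columns, which is why the answer matters.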
eonil
  • Just a note. There's a small discussion about this: https://zulip-archive.rust-lang.org/187780tcompilerwgllvm/51206DWARFdebuglinecolumnandutf8.html – eonil Dec 30 '19 at 06:53
  • Issue related to this discussion https://github.com/rust-lang/rust/issues/67360 – eonil Dec 30 '19 at 06:54

2 Answers


That's a good question! If the debug information is DWARF (which it is on every platform except Windows), lldb returns the DW_LNS_set_column value from the DWARF line table as the number from SBLineEntry::GetColumn(). The DWARF 5 specification doesn't say what this integer is counting -- it says only,

The DW_LNS_set_column opcode takes a single unsigned LEB128 operand and stores it in the column register of the state machine.

You're probably seeing that clang puts the UTF-16 code unit offset in the DWARF, but the standard doesn't require that. This would be a reasonable clarification request to file with the DWARF standards committee, http://dwarfstd.org

Jason Molenda
  • So it's up to the compilers? In my case I'm inspecting Rust programs, so I expected UTF-8, but since rustc uses LLVM to emit debug info, it could be another encoding such as UTF-16. This looks like it needs a new question. – eonil Dec 30 '19 at 06:37
  • It doesn't seem to be UTF-16. `rustc` is using "character count" (which is unclear to me) on this. https://github.com/rust-lang/rust/issues/67360 – eonil Dec 30 '19 at 06:56
  • I think the right place to fix this would be in the DWARF standard - right now there's no correct answer about how that value is counted. You can't blame a producer (compiler) for putting whatever it puts there right now, and the consumer (debugger) is just relaying the value from the debug information. For this value to be usable, the UI needs to know what it is counting (as you've pointed out); I think the debug info standard is the right place to specify this. – Jason Molenda Dec 30 '19 at 20:45

For the case of Rust programs, I think it's a Unicode scalar value offset.

In the linked issue they keep using the word "char", and in Rust, char means a Unicode scalar value.
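This also explains why the value can look like a UTF-16 code unit offset: scalar counts and UTF-16 code unit counts agree for every character in the Basic Multilingual Plane, and only diverge for astral characters, which need a surrogate pair in UTF-16. A small Python sketch (the sample line is hypothetical):

```python
# "😀" (U+1F600) is 1 Unicode scalar value but 2 UTF-16 code units
# (a surrogate pair), so the two counts diverge past it.
line = 'let x = "😀"; y'
prefix = line[:line.index('y')]                     # everything before "y"

scalars = len(prefix)                               # counts 😀 as 1 -> 13
utf16_units = len(prefix.encode('utf-16-le')) // 2  # counts 😀 as 2 -> 14

print(scalars, utf16_units)
```

So unless the source contains characters outside the BMP, a "char count" and a UTF-16 code unit count are indistinguishable by measurement.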

eonil