
I'm implementing an algorithm (SpookyHash) that treats arbitrary data as 64-bit integers, by casting the pointer to (ulong*). (This is inherent to how SpookyHash works; rewriting it to avoid doing so is not a viable solution.)

This means that it could end up reading 64-bit values that are not aligned on 8-byte boundaries.

On some CPUs, this works fine. On some, it would be very slow. On yet others, it would cause errors (either exceptions or incorrect results).

I therefore have code to detect unaligned reads, and copy chunks of data to 8-byte aligned buffers when necessary, before working on them.

However, my own machine has an Intel x86-64. This tolerates unaligned reads well enough that it gives much faster performance if I just ignore the issue of alignment, as does x86. It also allows for memcpy-like and memzero-like methods to deal in 64-byte chunks for another boost. These two performance improvements are considerable, more than enough of a boost to make such an optimisation far from premature.

So. I've an optimisation that is well worth making on some chips (and for that matter, probably the two chips most likely to have this code run on them), but would be fatal or give worse performance on others. Clearly the ideal is to detect which case I am dealing with.

Some further requirements:

  1. This is intended to be a cross-platform library for all systems that support .NET or Mono. Therefore anything specific to a given OS (e.g. P/Invoking to an OS call) is not appropriate, unless it can safely degrade in the face of the call not being available.

  2. False negatives (identifying a chip as unsafe for the optimisation when it is in fact safe) are tolerable, false positives are not.

  3. Expensive operations are fine, as long as they can be done once, and then the result cached.

  4. The library already uses unsafe code, so there's no need to avoid that.
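Requirement 3 could be met with a lazily initialised static. This is only a minimal sketch: `AlignmentProbe`, `DetectOnce` and `DetectionRuns` are hypothetical names, and the detection body is a placeholder rather than real detection logic.

```csharp
using System;

public static class AlignmentProbe
{
    // Counts how often the (possibly expensive) detection actually runs.
    public static int DetectionRuns;

    // Lazy<bool> runs its factory at most once, even with concurrent callers.
    private static readonly Lazy<bool> Cached = new Lazy<bool>(DetectOnce);

    public static bool AllowUnalignedRead
    {
        get { return Cached.Value; }
    }

    // Placeholder (assumption): a real implementation would consult a white-list here.
    private static bool DetectOnce()
    {
        DetectionRuns++;
        return false; // Conservative default: a false negative is tolerable.
    }
}
```

Reading `AlignmentProbe.AllowUnalignedRead` repeatedly only pays the detection cost on the first access.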

So far I have two approaches:

The first is to initialise my flag with:

private static bool AttemptDetectAllowUnalignedRead()
{
  switch(Environment.GetEnvironmentVariable("PROCESSOR_ARCHITECTURE"))
  {
    case "x86": case "AMD64": // Known to tolerate unaligned-reads well.
      return true;
  }
  return false; // Not known to tolerate unaligned-reads well.
}

The other relies on the fact that the buffer I copy into to avoid unaligned reads is created using stackalloc. On x86 (including AMD64 in 32-bit mode), stackallocing a 64-bit type may sometimes return a pointer that is 4-byte aligned but not 8-byte aligned. When that happens, I can tell that the alignment workaround isn't needed, and never attempt it again:

if(!AllowUnalignedRead && length != 0 && (((long)message) & 7) != 0) // Need to avoid unaligned reads.
{
    ulong* buf = stackalloc ulong[2 * NumVars]; // buffer to copy into.
    if((7 & (long)buf) != 0) // Not 8-byte aligned, so clearly this was unnecessary.
    {
        AllowUnalignedRead = true;
        Thread.MemoryBarrier(); //volatile write

This latter though will only work on 32-bit execution (even if unaligned 64-bit reads are tolerated, no good implementation of stackalloc would force them on a 64-bit processor). It also could potentially give a false positive in that the processor might insist on 4-byte alignment, which would have the same issue.

Any ideas for improvements, or better yet, an approach that, unlike the two above, gives no false negatives?

Jon Hanna
  • Definitely, DEFINITELY go with the first one. A proper whitelist is the only way to go here (what if ARMv9 emulates but emulates inefficiently, et cetera). The only change I would make is to put the whitelist in app.config so you could verify new architectures and enable optimizations without rebuilding/redeploying. – Stu Jan 06 '14 at 18:24
  • (Actually, it's worse -- with the second one you're not just hitching your wagon to current CPU implementations, but to Mono's implementation as well. What if they, in a future version, all of a sudden decide to be "helpful" with alignment issues?) – Stu Jan 06 '14 at 18:27
  • Okay, one more, regardless of which path you choose: if you cache your result between executions, to prevent weird situations on VMs, do not just cache the flag, also cache what arch you got the result on. Then on startup check if you're still running on the same arch. – Stu Jan 06 '14 at 18:29
  • @Stu, the first is working well, the second has the logical flaw I noted in an edit that it doesn't mean the processor isn't okay with 4-byte align but will fail if not on the 4-byte. I don't like the idea of caching any more permanently than a static variable in memory though; as it's a library, I shouldn't require file-writing permissions (even in some sandbox) if I don't do any other file-related tasks, IMO. – Jon Hanna Jan 06 '14 at 20:09
  • @Stu, the one big risk with the first, is that an x86 or AMD64 comes out that doesn't tolerate such alignment issues. That happened with the PowerPC chips, IIRC, they became stricter on alignment than they once were. – Jon Hanna Jan 06 '14 at 20:10
  • Good point, but there's no way no how that x86/x64 could do that without breaking, well, the fricking world. Perhaps loading a little DLL through reflection that on Windows uses the proper way of doing it for the OS you are on? (for Windows, IIRC that would be calling GetNativeSystemInfo and checking dwAllocationGranularity). Having a load-OS-specific-stuff-from-satellite-DLL-with-common-interface thingy might not be bad to have in your library for other things anyway. – Stu Jan 06 '14 at 20:31
  • Last comment should be "...through reflection that uses the...". I think it was moderately clear, but I want to make sure. – Stu Jan 06 '14 at 20:33
  • @Stu, in a way that's how Bob Jenkins' original code does it, in that since it's C++ source you can change compilation as needed. It's not quite the .NET way though. Take a look at https://bitbucket.org/JonHanna/spookilysharp if you feel like experimenting with the issue yourself. – Jon Hanna Jan 06 '14 at 20:39
  • I wouldn't mind helping with setting up the Reflection route if you like, but I'm Wintel only, so I could only do the Wintel implementation. And of course it's the .NET way. Just the seedy back alleys of .NET way. – Stu Jan 06 '14 at 21:32
  • @Stu. What are you thinking of looking for through reflection? – Jon Hanna Jan 06 '14 at 21:38
  • 1) Determine platform (Mono, .NET, Win, Linux) 2) Load a DLL through reflection specific to that platform -- all these DLLs implement IGetPlatformSpecificCrud 3) Call into that DLL. -- this way every platform can do The Right Thing. I could do the scaffolding and the Windows version (with GetNative...()), but I only run Windows. (Downloading Mono now, curious what they've done over the past year) – Stu Jan 06 '14 at 22:40
  • @Stu, I can't think how I'd do this even in unmanaged code, so I'd certainly find the details of your plans there to be interesting. I'm wary of putting anything in the download that isn't managed code, but I'd be interested in learning the plan whether I go with it or not. – Jon Hanna Jan 06 '14 at 23:20
  • This is 100% managed. I'll see when I have time (setting up my Git and VS environment right now). Always fun to show a 35K+ pointer something new. Well, at least it's fun for me :-) – Stu Jan 06 '14 at 23:38
  • The risk is not that some future x86 processor begins faulting on unaligned access, but that some future x86 processor has a significantly larger performance hit for unaligned access. – Ben Voigt Jan 06 '14 at 23:43
  • @BenVoigt The risk really, is either. – Jon Hanna Jan 07 '14 at 00:09
  • @Jon: A processor which faulted on unaligned access would not be compatible with existing software and could not be considered the same architecture or same instruction set. – Ben Voigt Jan 07 '14 at 01:08
  • Thank you Ben, for far better expressing my motivation for a whitelist; it's not just "who doesn't miserably crash", it's "who benefits". And this is not necessarily just CPU alignment; in certain scenarios, the associativity of the L2 cache could very well be a player. (Cache line size, mem bus width, quiet AVX optimizations in your JIT, doop-dee-doop, we could do this all day) – Stu Jan 07 '14 at 01:38
  • Another random aside, Jon... there might be benefits in using straight Win API bitblt type functions (that the reflected-in platform-specific DLL would host). See http://stackoverflow.com/questions/8951775/win32-api-functions-vs-their-crt-counterparts-e-g-copymemory-vs-memcpy -- of course, only benching would give the answer. Are you still good with me trying to implement the platform-specific DLL? – Stu Jan 07 '14 at 01:44
  • @BenVoigt The PowerPC architecture had just such a change happen during its history. The worse thing is that it would just die dead without a catchable exception (if it weren't for that possibility, the obvious answer to the above would be to spawn off a thread that just tried and saw what happened, and how quickly). – Jon Hanna Jan 07 '14 at 03:06
  • @Stu. Most of the blts are relatively small, and for those ranges calling into APIs cost more than they bring (and I did indeed test it here). As for trying your own implementation, the license on the code clearly allows you to, if nothing else ;) But yes, I'm keen to see how it goes for you. – Jon Hanna Jan 07 '14 at 03:09
  • On Windows, can you still leverage Mono? If so, perhaps the Mono.Simd extensions may give you a slightly different approach - [check it out.](http://docs.go-mono.com/?link=T%3aMono.Simd.SimdRuntime) This is more in the spectrum of accelerations, rather than a direct solution. – J Trana Jan 07 '14 at 03:11
  • @JTrana you can if running on the Mono runtime for Windows, but not with .NET. Could possibly pull in through assembly loading, and then fallback on failure, I suppose. That said, SIMD having 16-byte alignments would mean the bltting couldn't be done in precisely the case that involves the most bltting! – Jon Hanna Jan 07 '14 at 03:14
  • @Jon, have you seen this? It has been done before. Okay, let me be more specific -- we just implemented this on 2500x2500 matrices for our simulations at work and it's a 1 hour -> 12 second type of improvement (on modern Intel CPUs): http://mathnetnumerics.codeplex.com/wikipage?title=Native%20Providers&referringTitle=Documentation – Stu Jan 07 '14 at 03:38
  • @Stu. That's cool. I think though my little library just doesn't offer enough for anyone to bother with any installation that goes beyond "Search NuGet. Click Add. Click Accept". If I do something more awesome, I can get to complicate their installs ;) – Jon Hanna Jan 07 '14 at 03:42
  • Hmmm. Ok, I've got an idea - how about this? Is there a way that you could use Marshal.SizeOf to check for size discrepancies with a specially formed struct and LayoutKind.Sequential? So let's say the struct was int32, int64. If the processor needed alignment, I would think it would pad the first one to 8 bytes and you would get 16. Otherwise if the processor didn't, it would only give 12... Thoughts? – J Trana Jan 07 '14 at 04:43
  • @JTrana no, a normal layout would have the padding you'd expect on a chip that tolerates unaligned reads, and forcing an unaligned read on a chip that doesn't could just kill the process (or even the OS, though I don't think any modern chips are quite as grumpy as that). – Jon Hanna Jan 07 '14 at 04:53
  • Hmm, maybe I'm a bit confused then. I guess I thought you wanted to more or less detect the CPU's natural alignment, usually 4 or 8. Wouldn't a Marshal.SizeOf based on the compiler's choice of layout (but importantly Sequential, NOT Auto) do essentially that? The point is that you *are* detecting the padding, indicating that the compiler thinks you should be aligning. Specifically, doesn't the compiler usually make the choice for optimal access vs smallest memory? So even if the processor tolerates unaligned reads, the compiler would usually pick an optimal alignment, no? – J Trana Jan 07 '14 at 05:17
  • @Jtrana, no with this case I've an algorithm that naturally involves doing unaligned reads. Some chips are so tolerant of unaligned reads that the work taken to avoid it costs more than just ignoring the issue, whereas other chips aren't. So it's a matter of trying to know if ignoring the issue is (1) safe (2) faster, in a given case. – Jon Hanna Jan 07 '14 at 10:30
  • Derp, never mind -- GetNativeSystemInfo.AllocationGranularity is... allocation granularity, not alignment. Nothing to see here, move along. – Stu Jan 07 '14 at 15:39
  • @Stu, that and the tricky bit is (close to what J Trana brought up); knowing the alignment isn't the question, but rather knowing whether or not I can break it and not suffer. I get your thinking though, and might experiment with checking wProcessorArchitecture as more reliable than the environment variable. Likewise I'm going to look at hitting `uname()` on `Mono.Posix.Native` when available. – Jon Hanna Jan 07 '14 at 16:44
  • Okidoki. I did get the infrastructure for the platform-specific DLL up, let me know if you're interested. – Stu Jan 07 '14 at 22:48
  • @Jon: I don't believe there's a real risk for the x86 ISA doing what PowerPC did, however. Intel and AMD have already confronted the reality of needing to force only aligned access at the microcode level, and they've chosen to convert unaligned access to aligned in the instruction-decode circuitry without exposing a trap to the user. These are very mature processor designs with huge amounts of existing software, they aren't likely to make such a change, especially at the cost of backward compatibility. Not only do they emulate the unaligned access in hardware, they've done it cheaply. – Ben Voigt Jan 07 '14 at 23:02
  • Still, I think this library is intended to run on any architecture -- ARM comes to mind that can die horribly or perform horribly with things like this. – Stu Jan 07 '14 at 23:29
  • Well, thanks to all of you, but especially @Stu. I've answered my own question with an approach that does well enough to serve, and I owe it to this comment thread. – Jon Hanna Jan 08 '14 at 10:16

1 Answer


Well, here is my own final-for-now answer. While I'm answering my own question here, I owe a lot to the comments.

Ben Voigt and J Trana's comments made me realise something. While my specific question is a boolean one, the general question is not:

Pretty much all modern processors have a performance hit for unaligned reads, it's just that with some that hit is so slight as to be insignificant compared to the cost of avoiding it.

As such, the real question isn't "which processors allow unaligned reads cheaply enough?" but rather "which processors allow unaligned reads cheaply enough for my current situation?" A fully consistent and reliable detection method isn't just impossible; asked without reference to a particular case, the question is meaningless.

So white-listing the cases known to be good enough for the code at hand is the only way to go.
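As an illustration of the white-list idea only: on runtimes newer than this discussion, `RuntimeInformation.ProcessArchitecture` (in `System.Runtime.InteropServices`) reports the architecture directly. The sketch below assumes such a runtime is available, which was not the case for the code in this answer; `ArchitectureWhiteList` and its method names are hypothetical.

```csharp
using System.Runtime.InteropServices;

public static class ArchitectureWhiteList
{
    // Pure function, so the white-list itself is easy to test and extend.
    public static bool ToleratesUnalignedReads(Architecture arch)
    {
        switch (arch)
        {
            case Architecture.X86:
            case Architecture.X64: // Known to tolerate unaligned reads well.
                return true;
            default:
                return false; // Unknown or untested: treat as unsafe.
        }
    }

    // Assumption: the runtime exposes RuntimeInformation (newer .NET / Mono only).
    public static bool CurrentProcessToleratesUnalignedReads()
    {
        return ToleratesUnalignedReads(RuntimeInformation.ProcessArchitecture);
    }
}
```

Anything not explicitly listed falls through to `false`, so a new architecture costs performance at worst, never correctness.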

It's Stu, though, whom I have to thank for getting my results with Mono on *nix up to the level I already had with .NET and Mono on Windows. The discussion in the comments above brought my train of thought to a relatively simple, but reasonably effective, approach (and if Stu posts an answer with "I think you should base your approach on having platform-specific code run safely", I'll accept it, because that was the crux of one of his suggestions, and the key to what I've done).

As before I first try checking an environment variable that will generally be set in Windows, and not set on any other OS.

If that fails, I try to run uname -p and parse the results. That can fail for a variety of reasons (not running on *nix, not having sufficient permissions, running on one of the forms of *nix that has a uname command but no -p flag). With any exception, I just eat the exception, and then try uname -m, which is more widely available, but has a greater variety of labels for the same chips.

And if that fails, I just eat any exception again, and consider it a case of my white-list not having been satisfied: I can get false negatives which will mean sub-optimal performance, but not false positives resulting in error. I can also add to the white-list easily enough if I learn a given family of chips is similarly better off with the code-branch that doesn't try to avoid unaligned reads.

The current code looks like:

[SuppressMessage("Microsoft.Design", "CA1031:DoNotCatchGeneralExceptionTypes",
  Justification = "Many exceptions possible, all of them survivable.")]
[ExcludeFromCodeCoverage]
private static bool AttemptDetectAllowUnalignedRead()
{
  switch(Environment.GetEnvironmentVariable("PROCESSOR_ARCHITECTURE"))
  {
    case "x86":
    case "AMD64": // Known to tolerate unaligned-reads well.
      return true;
  }
  // Analysis disable EmptyGeneralCatchClause
  try
  {
    return FindAlignSafetyFromUname();
  }
  catch
  {
    return false;
  }
}
[SecuritySafeCritical]
[SuppressMessage("Microsoft.Design", "CA1031:DoNotCatchGeneralExceptionTypes",
  Justification = "Many exceptions possible, all of them survivable.")]
[ExcludeFromCodeCoverage]
private static bool FindAlignSafetyFromUname()
{
  var startInfo = new ProcessStartInfo("uname", "-p");
  startInfo.CreateNoWindow = true;
  startInfo.ErrorDialog = false;
  startInfo.LoadUserProfile = false;
  startInfo.RedirectStandardOutput = true;
  startInfo.UseShellExecute = false;
  try
  {
    var proc = new Process();
    proc.StartInfo = startInfo;
    proc.Start();
    using(var output = proc.StandardOutput)
    {
      string line = output.ReadLine();
      if(line != null)
      {
        string trimmed = line.Trim();
        if(trimmed.Length != 0)
          switch(trimmed)
          {
            case "amd64":
            case "i386":
            case "x86_64":
            case "x64":
              return true; // Known to tolerate unaligned-reads well.
          }
      }
    }
  }
  catch
  {
    // We don't care why we failed, as there are many possible reasons, and they all amount
    // to our not having an answer. Just eat the exception.
  }
  startInfo.Arguments = "-m";
  try
  {
    var proc = new Process();
    proc.StartInfo = startInfo;
    proc.Start();
    using(var output = proc.StandardOutput)
    {
      string line = output.ReadLine();
      if(line != null)
      {
        string trimmed = line.Trim();
        if(trimmed.Length != 0)
          switch(trimmed)
          {
            case "amd64":
            case "i386":
            case "i686":
            case "i686-64":
            case "i86pc":
            case "x86_64":
            case "x64":
              return true; // Known to tolerate unaligned-reads well.
            default:
              if(trimmed.Contains("i686") || trimmed.Contains("i386"))
                return true;
              return false;
          }
      }
    }
  }
  catch
  {
    // Again, just eat the exception.
  }
  // Analysis restore EmptyGeneralCatchClause
  return false;
}
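The two near-identical try blocks above could be folded into one helper. What follows is a hedged sketch rather than the code actually in the library: `UnameProbe`, `TryUname` and `IsWhiteListed` are hypothetical names, and unlike the original it applies the broader `-m` label list to the `-p` output as well.

```csharp
using System;
using System.Diagnostics;

public static class UnameProbe
{
    // Returns the first output line of "uname <flag>", trimmed, or null on any failure.
    public static string TryUname(string flag)
    {
        try
        {
            var startInfo = new ProcessStartInfo("uname", flag)
            {
                CreateNoWindow = true,
                RedirectStandardOutput = true,
                UseShellExecute = false
            };
            using (var proc = Process.Start(startInfo))
            {
                if (proc == null) return null;
                string line = proc.StandardOutput.ReadLine();
                proc.WaitForExit();
                return line == null ? null : line.Trim();
            }
        }
        catch (Exception)
        {
            return null; // No answer is treated the same as "not white-listed".
        }
    }

    // White-list check on a label; the substring match mirrors the -m fallback above.
    public static bool IsWhiteListed(string label)
    {
        if (string.IsNullOrEmpty(label)) return false;
        switch (label)
        {
            case "amd64":
            case "x86_64":
            case "x64":
            case "i86pc":
                return true; // Known to tolerate unaligned reads well.
        }
        return label.Contains("i686") || label.Contains("i386");
    }

    public static bool FindAlignSafetyFromUname()
    {
        return IsWhiteListed(TryUname("-p")) || IsWhiteListed(TryUname("-m"));
    }
}
```

Because `TryUname` swallows every failure and returns null, a missing `uname` binary or an unsupported flag simply yields a false negative.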
Jon Hanna
  • Dirtier, but much more succinct than my approach. I'd stick with this. – Stu Jan 08 '14 at 19:06
  • @Stu, yeah I think the fact that trying to run a process I'm not confident is there goes against my defensive instincts is why I didn't think of it earlier; it does seem a bit dirty. Your approach triggered the train of thought that got me there though, so a big thank you for that. – Jon Hanna Jan 09 '14 at 10:04
  • @Stu, incidentally, I also found a nice way to make this not a concern when it comes to the memcpy and memset implementations, though I haven't added it to the code base. Basically it comes down to using `cpblk` and `initblk` from an assembly written in IL, and letting them deal with it appropriately, except for on x86, because `cpblk` sucks on x86, but x86 is nicely one of the two cases I know don't need alignment considerations. – Jon Hanna Jan 10 '14 at 11:57