Is there a way to do a complete memory test on an Android device's RAM?

I'm developing a driver, but at random times I get certain physical addresses with the wrong value, causing the driver to go into the wrong state. I'm trying to read from RAM when I hit the problem. I think certain portions of RAM on my device are corrupted.

Peter Mortensen
Vector
  • Are you talking about actual RAM or a memory-mapped device? MMDs just "look like" normal RAM, but you're accessing some internal register of the device. In both cases, you should check that your cache settings for the memory region are correct and that you're flushing/invalidating the cache correctly if it is used by both the device and the CPU. – Nico Erfurth Jul 24 '12 at 22:23
  • If you really think that you've got corrupted memory you could check if your bootloader is providing any memory testing. – Nico Erfurth Jul 24 '12 at 22:30
  • I'm checking the actual RAM. I have a ring buffer (circular linked list) which I'm checking for data. Unfortunately, the bootloader is not providing any memory testing. – Vector Jul 24 '12 at 22:48
  • Who is filling the ring buffer? If it is a different process, a device or the kernel, you might have cache problems. You should check that out first; faulty RAM is highly unlikely. – Nico Erfurth Jul 24 '12 at 22:59
  • The hardware fills the ring buffer and generates an interrupt. The driver then has to fetch the data. It could be you are right. So I'm currently looking for potential cache problems. – Vector Jul 24 '12 at 23:09

2 Answers

"Complete" is an ambiguous word. It may mean testing at different temperatures and voltages, and across a range of devices with different component tolerances. As you cite MemTest86, I think I understand. Most projects I have seen are C-based and cannot test everything.

Here is one running under Linux - mturquette/memtest

There are algorithms documented such as walking bits, etc. A lot depends on your RAM type. I guess you have some type of SDRAM. There are many different cycles with SDRAM. There are single beat reads/write, bank-to-bank transfer, terminated bursts, etc.
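As a rough illustration of the walking-bits idea, here is a minimal C sketch. It is not taken from any of the projects mentioned; the function names `walking_ones` and `walking_address` are made up for this example. It assumes the region is mapped uncached (or the test runs bare metal with the DCache off), otherwise the reads are likely served from cache rather than from the SDRAM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Walking-ones test of the data lines: write each single-bit pattern to
 * one word and read it back. Returns 0 on success, otherwise the first
 * pattern that did not read back correctly. */
uint32_t walking_ones(volatile uint32_t *cell)
{
    assert(cell != NULL);
    for (uint32_t pattern = 1; pattern != 0; pattern <<= 1) {
        *cell = pattern;
        if (*cell != pattern)
            return pattern;
    }
    return 0;
}

/* Walking-ones test of the address lines: touch only word offsets that
 * are powers of two, so each step toggles a single address bit.
 * Returns 0 on success, otherwise the (nonzero) offset being tested
 * when a mismatch was seen. */
size_t walking_address(volatile uint32_t *base, size_t n_words)
{
    const uint32_t pattern = 0xAAAAAAAAu;
    const uint32_t anti    = 0x55555555u;
    size_t off, test;

    assert(base != NULL && n_words >= 2);

    /* Fill every power-of-two offset with the base pattern. */
    for (off = 1; off < n_words; off <<= 1)
        base[off] = pattern;

    /* A write to offset 0 must not disturb any other offset. */
    base[0] = anti;
    for (off = 1; off < n_words; off <<= 1)
        if (base[off] != pattern)
            return off;
    base[0] = pattern;

    /* A write to one offset must not alias onto offset 0 or the others. */
    for (test = 1; test < n_words; test <<= 1) {
        base[test] = anti;
        if (base[0] != pattern)
            return test;
        for (off = 1; off < n_words; off <<= 1)
            if (off != test && base[off] != pattern)
                return test;
        base[test] = pattern;
    }
    return 0;
}
```

Both functions only catch gross wiring faults (stuck or shorted lines). They do not exercise the different SDRAM cycle types, so treat them as a quick first check rather than a stress test.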

Personally, we had a system where 5% of the boards would show problems when doing an SSH transfer over Ethernet (DMA). The SSH involves encryption which is CPU/memory-intensive and the DMA engine often does different SDRAM cycles than the CPU (with cache).

Here are some requirements:

  1. Non-SDRAM memory for code to reside.
  2. Bare metal framework (no cache, interrupts, DMA, etc.)
  3. Turn off the DCache.
  4. Turn on the ICache for the code.

Another limiting factor is the time to run. A complete SDRAM test could take years to run on a single board. I have found that a pseudo-random address/data test works well. Just take a number that is relatively prime to the size of the SDRAM and use it as the increment. The simplest case is 1. You might wish to find other increments that constantly change rows, banks and device size (bank size - 1, for example); however, prime numbers will work better, as a different number of bits changes each time. With the cache off, you can use char, short, int, and long long pointers to test some different burst lengths. These tests will be slow.
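To make the increment idea concrete, here is a small C sketch of one write pass followed by one read pass. It is my own illustration, not the elided macro body; `stride_test`, `gcd`, `data_incr` and `addr_incr` are hypothetical names. The point it shows is that stepping the address by an increment that is relatively prime to the region size visits every word exactly once before repeating. As before, it assumes an uncached mapping or a bare-metal setup with the DCache off.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Greatest common divisor (Euclid). */
static size_t gcd(size_t a, size_t b)
{
    while (b != 0) {
        size_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Write then verify n_words 32-bit words at base. The data grows by
 * data_incr on every step and the word index advances by addr_incr
 * modulo n_words. Because addr_incr is relatively prime to n_words,
 * each word is written exactly once per pass.
 * Returns the number of mismatching words. */
size_t stride_test(volatile uint32_t *base, size_t n_words,
                   uint32_t data_incr, size_t addr_incr)
{
    size_t step, idx, i, errors = 0;
    uint32_t value;

    assert(base != NULL && n_words > 0);
    assert(gcd(addr_incr, n_words) == 1);

    step = addr_incr % n_words;   /* keeps idx + step from overflowing */

    /* Write pass. */
    value = 0;
    idx = 0;
    for (i = 0; i < n_words; i++) {
        base[idx] = value;
        value += data_incr;       /* unsigned arithmetic wraps mod 2^32 */
        idx = (idx + step) % n_words;
    }

    /* Read pass: replay the same address and data sequence. */
    value = 0;
    idx = 0;
    for (i = 0; i < n_words; i++) {
        if (base[idx] != value)
            errors++;
        value += data_incr;
        idx = (idx + step) % n_words;
    }
    return errors;
}
```

For example, `stride_test(ram, n_words, 3999971u, 9999991u)` would use the same two primes that appear in the 32-bit random test further down. With a power-of-two region size, any odd increment qualifies.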

You will need to use ldm/stm pairs to simulate a full SDRAM burst. Full bursts are more common with the cache on, so with the cache off you should simulate them with ldm/stm. This is also one of the fastest tests.

    typedef unsigned char      b8;
    typedef unsigned short     b16;
    typedef unsigned long      b32;
    typedef unsigned long long b64;

    /* Use a macro to speed code.  The compiler will use constants for
     * _incr and _wrap instead of registers which cause spilling.  A
     * macro centralizes the memory test logic.
     */
    #define MEMTEST(name,type,_incr,_wrap) ...

    /* Sequential tests. */
    MEMTEST(do_mem_seq8,   b8, 97, 1)
    MEMTEST(do_mem_seq16, b16, 50839, 1)
    MEMTEST(do_mem_seq32, b32, 3999971, 1)
    MEMTEST(do_mem_seq64, b64, 3999971, 1)

    /* Random tests. These tests try to randomize both the data and the
     * address access.
     */

    /* 97/0x61 prime for char and 9999991/0x989677 prime for 64MB. */
    MEMTEST(do_mem_rnd8,b8,97,9999991)
    /* 50839/0xC697 large prime for 64k and 9999991/0x989677 prime for 64MB. */
    MEMTEST(do_mem_rnd16,b16,50839,9999991)
    /* 3999971/0x3D08E3 prime and 9999991/0x989677 prime for 64MB. */
    MEMTEST(do_mem_rnd32,b32,3999971,9999991)
    /* 3999971/0x3D08E3 prime and 9999991/0x989677 prime for 64MB. */
    MEMTEST(do_mem_rnd64,b64,3999971,9999991)

incr is the data increment and wrap is the address increment. The algorithm for the burst will be the same. Here is some GCC inline assembler:

    register ulong t1 asm ("r0")  = 0;                              \
    register ulong t2 asm ("r4")  = t1 + incr;                      \
    register ulong t3 asm ("r6")  = t2 + incr;                      \
    register ulong t4 asm ("r8")  = t3 + incr;                      \
        /* Run an entire burst line. */                             \
        __asm__ (" stmia  %[ptr], {%0,%1,%2,%3}\r\n" : :            \
                 "r" (t1), "r" (t2), "r" (t3), "r" (t4),            \
                 [ptr]"r" (start + (addr<<2)) :                     \
                 "memory" );                                        \
        /* Read four 32 bits values. */                             \
        __asm__ (" ldmia   %[ptr], {%0, %1, %2, %3}\r\n" :          \
                 "=r" (t1), "=r" (t2), "=r" (t3), "=r" (t4) :       \
                 [ptr]"r" (start + (addr<<2)) );                    \

These tests are simple and should fit in the code cache which will maximize stress on the RAM. Our main issue was the DQS delay which is critical for DDR-SDRAM and can be temperature and voltage dependent and will vary with PCB layout and materials.

Cachebench can be used if you are optimizing the memory controller registers with the SDRAM chips. It may also be useful for testing.

See also: Unix Stack Exchange (same question). I used these C-based test suites under Linux, but they didn't expose any issues in our case. The memtest86 algorithms may not be as stressful (for PCB glitches) as what I describe above, although test 7 or the burnBX test is close. I think memtest86 is aimed at finding DRAM chip issues as opposed to board design issues.

Another issue is transients/crosstalk with the SDRAM chips. If your device driver is for a high-current or high-frequency device, the SDRAM interface can possibly pick up crosstalk, or get a double clock due to supply variations. So a RAM test may show no issues, and the SDRAM error only happens when a particular portion of hardware is used. Also be careful that the Android device doesn't use dynamic clocking and change the SDRAM frequency. Signals may cross a resonance as the clock changes.

Peter Mortensen
artless noise
  • Thanks for the detailed answer. My problem was a bug with DMA hardware which was found only by hardware verification. We were in very early stages of hardware development so everything worked out well. But I will keep all this info handy. It will be very useful. Thanks! – Vector Mar 25 '13 at 22:17
  • Or you could, you know, replace the RAM and see if the problem goes away. Then put it back and see if it returns. Then you know your RAM is faulty, and it won’t take years, or even hours. – Evi1M4chine Jan 28 '16 at 19:57
  • @Evi1M4chine Most embedded ARM platforms don't have removable RAM (DIMMs or whatever) as it increases cost and board size. They are **hard** soldered uBGA chips which are difficult to remove. Also, the issue may not happen across all operating ranges; i.e., heat, wall power, and the software running may all expose the problem (even if it was socketed). Some people writing embedded ARM software may have 1k-1M+ devices and it is not always all of them that fail. A good test can help to find problematic devices for further inspection. – artless noise Jan 29 '16 at 14:07
  • This information is very nice but does there exist *any* accessible memory test suite that doesn't need a host operating system running to work? I'm very accustomed to running batteries of memtest86 tests to improve my confidence in the DIMMs I use in my workstations and servers, but there seems to be no corresponding tool for ARM architecture? – Steven Lu Feb 14 '18 at 23:00

Das U-Boot is perhaps the most widely used boot loader on ARM boards, and it includes some memory test features.

Interestingly, its README suggests an alternative approach that might be more portable and/or more effective:

The best known test case to stress a system like that is to boot Linux with root file system mounted over NFS, and then build some larger software package natively (say, compile a Linux kernel on the system) - this will cause enough context switches, network traffic (and thus DMA transfers from the network controller), varying RAM use, etc. to trigger any weak spots in this area.

While you're building the Linux kernel, you might be interested in the CONFIG_MEMTEST=y option, which causes the built-in memory test to be built. This used to be for x86 architecture only, but I believe recent versions support it on other architectures as well, perhaps even ARM.

The memtester tool is already built and available in some Linux distributions, for various architectures, including ARM.

The kernel-memtest project might interest you as well.

Bear in mind that no tool can test the memory that it's running from (so a program in a running OS will have significant blind spots) and basic read/write tests won't reveal every type of defect or other error. Set your expectations accordingly, and if you have reason to suspect bad memory, consider trying several different test tools.

Peter Mortensen
ʇsәɹoɈ