Stop MMC queue from fetching new requests when communication with card times out

Question

We are using a custom board running version 3.2 of the kernel, but we've come across some issues when testing the robustness of the MMC layer.

First of all, our MMC slot does not have a card detection pin. The issue in question consists of the following:

Load the module (omap_hsmmc). The card is detected on power-up, and is mounted appropriately.
Read something from the SD card (i.e. cat foo.txt)
While the file is being read, remove the card.
After many failed attempts, the system hangs.

Now, I've tracked the issue to the following code section, in drivers/mmc/card/queue.c:

static int mmc_queue_thread(void *d)                                                                                    
{                                                                                                                       
   struct mmc_queue *mq = d;                                                                                            
   struct request_queue *q = mq->queue;                                                                                 

   current->flags |= PF_MEMALLOC;                                                                                       

   down(&mq->thread_sem);                                                                                               
   do {                                                                                                                 
      struct request *req = NULL;                                                                                       
      struct mmc_queue_req *tmp;                                                                                        

      spin_lock_irq(q->queue_lock);                                                                                     
      set_current_state(TASK_INTERRUPTIBLE);                                                                            
      req = blk_fetch_request(q);                                                                                       
      mq->mqrq_cur->req = req;                                                                                          
      spin_unlock_irq(q->queue_lock);                                                                                   

      if (req || mq->mqrq_prev->req) {                                                                                  
         set_current_state(TASK_RUNNING);                                                                               
         mq->issue_fn(mq, req);                                                                                         
      } else {                                                                                                          
         if (kthread_should_stop()) {                                                                                   
            set_current_state(TASK_RUNNING);                                                                            
            break;                                                                                                      
         }                                                                                                              
         up(&mq->thread_sem);                                                                                           
         schedule();                                                                                                    
         down(&mq->thread_sem);                                                                                         
      }                                                                                                                 

      /* Current request becomes previous request and vice versa. */                                                    
      mq->mqrq_prev->brq.mrq.data = NULL;                                                                               
      mq->mqrq_prev->req = NULL;                                                                                        
      tmp = mq->mqrq_prev;                                                                                              
      mq->mqrq_prev = mq->mqrq_cur;                                                                                     
      mq->mqrq_cur = tmp;                                                                                               
   } while (1);                                                                                                         
   up(&mq->thread_sem);                                                                                                 

   return 0;                                                                                                            
}

Investigating this code, I've found that it hangs inside the mq->issue_fn(mq, req) call. This function prepares and issues the appropriate commands to fufill the request passed into it, and it is aware of any errors that happens when it can't communicate with the card. The error is handled in an awkward (in my opinion) manner, and does not 'bubble up' to the mmc_queue_thread. However, I've checked my version's code against that of the most recent kernel release (4.9), and I did not see any difference in the treatment of such errors, apart from a better separation of each error case (the treatment is very similar).

I suspect that the issue is caused by the inability ofthe upper layers to stop making new requests to read from the MMC card.

What I've Tried:

Rewrote the code so that the error is passed up so that I can do ret = mq->issue_fn(mq, req).
Being able to identify the specific errors, I tried to cancel the thread in different ways: calling kthread_stop, calling mmc_queue_suspend, __blk_end_request, and others. With these, the most I've been able to accomplish was keep the thread in a "harmless" state, where it still existed, but did not consume any resources. However, the user space program that triggered the call does not return, locking up in an uninterruptible state.

My questions:

Is the best way to solve this by stopping the upper layers from making new requests? Or should the thread itself be 'killed off'?
In a case such as mine, should it be assumed that because there is no card detection pin, the card should not be removed?

Update: I've found out that you can tell the driver to use polling to detect card insertion/removal. It can be done by adding the MMC_CAP_NEEDS_POLL flag to the mmc's .caps on driver initialization (on newer kernels, you can use the broken-cd property on your DT). However, the problem still happens after this modification.

What is the board that you use? And more specifically, what is the SoC? Also, which exactly kernel do you use (link to sources)? — Sam Protsenko, Jan 31 '17 at 10:04
@Sam Protsenko, we use a custom board with AM335x. The kernel version is the [3.2 version from Texas Instruments](https://git.ti.com/ti-linux-kernel/ti-linux-kernel/trees/v3.2). — Guilherme Costa, Jan 31 '17 at 12:04
@andriy, since creating this question, I've found out that you can indicate to the kernel that it should try to detect the card by polling. This can be done or'ing the `MMC_CAP_NEEDS_POLL` into the mmc's `.caps`. This way the kernel detects that the card has been removed, but the problem I descrined persists. Thank you for your feedback. =] — Guilherme Costa, Jan 31 '17 at 12:09
@GuilhermeCosta When talking about OMAP devices (and you're using OMAP MMC driver), it's always a good idea to check pending patches (or patches merged into another branches) on review.omapzoom.org. For starters, check if next patches are merged in your branch or not (was found by "mmc detect" request): [patch 1](http://review.omapzoom.org/#/c/30291/), [patch 2](http://review.omapzoom.org/#/c/11754/), [patch 3](http://review.omapzoom.org/#/c/10568/), [patch 4](http://review.omapzoom.org/#/c/9463/), [patch 5](http://review.omapzoom.org/#/c/6571/). — Sam Protsenko, Jan 31 '17 at 12:29
@GuilhermeCosta Also look in latest mainline kernel, probably some new stuff was merged for `omap-hsmmc` driver. Also you can check pending patches on kernel mailing lists. When working on outdated branches, you can often find solutions on mentioned sources. So I think it's worth to check them first, before trying to fix the problem by your own. — Sam Protsenko, Jan 31 '17 at 12:32
@GuilhermeCosta As @0andriy mentioned, you can specify `ti,non-removable` option in your Device Tree file (see [Documentation/devicetree/bindings/mmc/ti-omap-hsmmc.txt](http://lxr.free-electrons.com/source/Documentation/devicetree/bindings/mmc/ti-omap-hsmmc.txt#L22)). Isn't it an option? — Sam Protsenko, Jan 31 '17 at 12:41
@SamProtsenko, unfortunately it is not. I already compared my branch with the mainline kernel, and found nothing. I will look once more though, as I focused more on the block.c file, and not queue.c. I will look through the resources you pointed me to, thanks. — Guilherme Costa, Jan 31 '17 at 12:45
@GuilhermeCosta That issue is most likely should be fixed inside of hardware driver (`omap_hsmmc.c`), and you probably shouldn't mess with core code, like `queue.c` and `block.c`, as it's usually working correctly and well debugged. Look at `git log -S'insertion' -- drivers/mmc/host/omap_hsmmc.c` output (in mainline kernel). — Sam Protsenko, Jan 31 '17 at 13:00
@GuilhermeCosta Start from reviewing next patches: [#1](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=7efab4f35740c63502e438886cf1e4aa3f3b800f), [#2](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=e03de74516ec434aea77cfcf276df9c87fc7285a), [#3](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=11227d12397c6dd5c578e27210aa14e3fa44f11c) and go from there. — Sam Protsenko, Jan 31 '17 at 13:01

score 1 · Accepted Answer · edited Feb 08 '17 at 12:47

Figured out the solution!

The issue was, as I suspected, that the block requests kept being issued even after the card was removed. The error was fixed in commit a8ad82cc1b22d04916d9cdb1dc75052e80ac803c from Texas' kernel repository:

commit a8ad82cc1b22d04916d9cdb1dc75052e80ac803c
Author: Sujit Reddy Thumma <sthumma@codeaurora.org>
Date:   Thu Dec 8 14:05:50 2011 +0530

    mmc: card: Kill block requests if card is removed

    Kill block requests when the host realizes that the card is
    removed from the slot and is sure that subsequent requests
    are bound to fail. Do this silently so that the block
    layer doesn't output unnecessary error messages.

    Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
    Acked-by: Adrian Hunter <adrian.hunter@intel.com>
    Signed-off-by: Chris Ball <cjb@laptop.org>

The key in the commit is that a return BLKPREP_KILL was added to the mmc_prep_request(), besides some minor tweaks that added some 'granularity' to the error treatment, adding a specific code for the cases where the card was removed.

FYI, that patch was upstreamed in k3.3, [here](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=a8ad82cc1b22d04916d9cdb1dc75052e80ac803c). — Sam Protsenko, Feb 08 '17 at 12:51

Stop MMC queue from fetching new requests when communication with card times out

1 Answers1