1

I'm developing some Android devices that are constantly performing fairly intensive tasks.

I've noticed a strange issue (quite rare, generally after 2-3 weeks of running continuously) where a device ceases to function and all communication with it dies. Since I don't have any access to these devices, I can only assume that the OS has killed all running processes (there are several processes on it that communicate with several different backend servers, and they all disconnect simultaneously).

I'm currently getting around this by implementing a firmware watchdog (by compiling it from source), but I am trying to figure out what is causing the devices to die in the first place.

Is there some Android functionality that kills all processes and requires a reboot to fix? What can I do to avoid this happening? Are there any logs that I can view which show when this occurs?

Onik
A_toaster
  • Please tell me you are not running some kind of botnet or malicious app on android :S – leonardkraemer Apr 05 '18 at 22:26
  • ahaha what makes you say that? I'm running computer vision algorithms and I'm utilising snapdragons as they allow on-processor computing rather than having to upload TB's of video – A_toaster Apr 06 '18 at 00:16
  • What is your usecase? – leonardkraemer Apr 06 '18 at 00:21
  • collecting age/gender info from customers in a retail chain – A_toaster Apr 06 '18 at 00:21
  • By computer vision? That sounds like an invasion of privacy to me. Otherwise your users would restart your service. – leonardkraemer Apr 06 '18 at 00:23
  • Perhaps, but that seems outside the scope of this question :P – A_toaster Apr 06 '18 at 00:47
  • Your question sounds like a request for a zero day exploit of android. Shut-down must not happen without user interaction, the battery running out or mandatory system update. Kernel panic is the other, unintended way. You are asking for that. – leonardkraemer Apr 06 '18 at 00:58
  • Ahh okay, so it must be kernel panic since the board we're using (dragonboard 410) has no battery (I think it's set to 50% in firmware). If I recall correctly there is a compilation flag that is set to reboot the Android device under that circumstance. If what you're saying is true then I can just set that flag and be good to go. – A_toaster Apr 06 '18 at 01:29
  • If that solved your problem I'm very happy. – leonardkraemer Apr 06 '18 at 01:33
  • It is not about my liking, but the potential for harm. Security is a big issue. But with what you are describing you should be able to get `root` privileges on the devices and reboot them any time you like. – leonardkraemer Apr 06 '18 at 01:43
  • You don't need to kill all processes, it's enough that some critical service exits (e.g. crashes) 4 times in 4 minutes, the device will reboot into recovery mode (see https://android.googlesource.com/platform/system/core/+/master/init/README.md) – Alex Cohn Apr 07 '18 at 09:20
  • @AlexCohn Ah that's very interesting, thank you. I think my problem either lies in critical service failure or kernel panic. I have managed to test the watchdog - it works when I purposely crash Android with a kernel panic (by writing 'c' to sysrq-trigger). I wonder how I can possibly test the watchdog against this critical service failure. Am wondering if the firmware watchdog still operates when a device boots into recovery mode – A_toaster Apr 08 '18 at 22:21

2 Answers

2

I don't know what changes you have made to AOSP, but Android does have several mechanisms that can make the system reboot.

In init.rc, if a service is marked as `critical` and it crashes more than 4 times within 4 minutes, the system reboots into recovery mode.
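To make the counting rule concrete, here is a minimal sketch in plain Java. It is only a model of the behaviour described above, not the actual init source: the class name `CriticalCrashTracker` and its method are made up for illustration, and the window is assumed to start at the first crash.

```java
import java.time.Duration;

/** Illustrative model of the "critical service" crash rule; not the real init code. */
final class CriticalCrashTracker {
    // Values from the rule described above: more than 4 crashes within 4 minutes.
    static final int CRASH_THRESHOLD = 4;
    static final long WINDOW_MILLIS = Duration.ofMinutes(4).toMillis();

    private long windowStartMillis;
    private int crashesInWindow;

    /**
     * Records one crash at the given time and reports whether the rule
     * would now send the device into recovery.
     */
    boolean recordCrash(long nowMillis) {
        if (crashesInWindow > 0 && nowMillis - windowStartMillis <= WINDOW_MILLIS) {
            // Still inside the window opened by an earlier crash.
            crashesInWindow++;
        } else {
            // First crash, or the previous window expired: open a new window.
            windowStartMillis = nowMillis;
            crashesInWindow = 1;
        }
        return crashesInWindow > CRASH_THRESHOLD;
    }
}
```

Under this model a service that crashes once every five minutes never trips the rule, while five crashes inside one four-minute window do. That matches the symptom of a device that runs fine for weeks and then suddenly drops out.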

In the framework, if a core system service crashes, the system restarts the whole Android runtime, but not the kernel.

Temperature: there are two kinds of temperature-triggered reboot. One is CPU heat, but this has nothing to do with Android; it is a CPU feature. The other is battery temperature: if the battery's temperature is higher than expected, healthd (an Android daemon that watches the battery state) notifies the framework, and the framework reboots the device.

If the communication logic is written in an Android app, I suggest marking the app as persistent. This keeps the app in memory permanently, and if the app crashes, the system restarts it. This may not solve your problem, but it lets the communication job resume.

I don't think it is hard to figure out what's going on; the logcat output usually contains the details.
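Since logcat lives in memory, one option is to have something on the device keep writing it to storage so the log is still there after the device dies. A rough sketch in Java, assuming the process is allowed to read the system log (on a stock device a normal app only sees its own entries) and that the output path you pass in is writable. The class name `LogcatCapture` is made up; the `-v`, `-f`, `-r` and `-n` options are standard logcat flags.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that runs logcat with on-disk log rotation. */
final class LogcatCapture {
    /**
     * Builds the logcat command line.
     * -v threadtime: timestamped output format
     * -f: write to this file instead of stdout
     * -r: rotate the file after this many KiB
     * -n: number of rotated files to keep
     */
    static List<String> buildCommand(String outputFile, int rotateKiB, int maxFiles) {
        return Arrays.asList(
                "logcat",
                "-v", "threadtime",
                "-f", outputFile,
                "-r", Integer.toString(rotateKiB),
                "-n", Integer.toString(maxFiles));
    }

    /** Starts logcat in the background; the caller owns the returned process. */
    static Process start(String outputFile, int rotateKiB, int maxFiles) throws IOException {
        return new ProcessBuilder(buildCommand(outputFile, rotateKiB, maxFiles))
                .redirectErrorStream(true)
                .start();
    }
}
```

Calling `LogcatCapture.start(path, 1024, 8)` with a path of your choosing would keep roughly the last 8 MiB of log on disk. The last lines before a hard hang can still be lost if they were not flushed in time.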

  • Doesn't logcat get cleared on reboot? Eg if I receive a device that has died in the field, will I be able to use logcat to determine what has caused it to die? – A_toaster Apr 09 '18 at 08:02
  • Yes, logcat is lost after a reboot, so the usual approach is to save it. I'm not sure, but Google seems (in AOSP 8) to save the log for you in /data/misc/logd. Again, I'm not sure; you can also write code to save the log yourself. With the log, I think it will be easier to debug. –  Apr 09 '18 at 08:21
  • For an app to be persistent, it must first be a system app. – utzcoz Apr 10 '18 at 11:24
1

One possible explanation for your scenario is that the CPU overheats. In that case, not only will the device shut down spontaneously, it also cannot reboot immediately.

You may find temperature warnings in the system log, but you can also monitor the temperature in your own software and throttle CPU-intensive tasks to keep the device from overheating.
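A minimal sketch of such a check in Java. It assumes the temperature is exposed as a sysfs file such as `/sys/class/thermal/thermal_zone0/temp` holding an integer in millidegrees Celsius; which zone belongs to the CPU and which unit is used vary by board and kernel, so check this on your device. The class name `ThermalGuard` and the threshold are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper that decides when heavy work should back off. */
final class ThermalGuard {
    private final Path sensorFile;
    private final double throttleAboveCelsius;

    ThermalGuard(Path sensorFile, double throttleAboveCelsius) {
        this.sensorFile = sensorFile;
        this.throttleAboveCelsius = throttleAboveCelsius;
    }

    /** Reads the sensor file, assuming an integer in millidegrees Celsius. */
    double readCelsius() throws IOException {
        String raw = new String(Files.readAllBytes(sensorFile), StandardCharsets.UTF_8).trim();
        return Long.parseLong(raw) / 1000.0;
    }

    /** True when the current reading is at or above the configured limit. */
    boolean shouldThrottle() throws IOException {
        return readCelsius() >= throttleAboveCelsius;
    }
}
```

The processing loop could call `shouldThrottle()` before each batch of frames and skip frames or sleep while it returns true, instead of relying only on the SoC's own throttling.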

Alex Cohn
  • 56,089
  • 9
  • 113
  • 307
  • Good point, I do have a constant temperature reading. Most devices sit around the 50-60 deg mark. One device was installed near a fridge exhaust and sits at 74 deg. As far as I know, the snapdragon throttles itself when it approaches critical temperature (around 80 deg). There is also a heatsink and fan installed directly onto the shielding of the CPU, so while this is a valid concern, I don't think this is the actual issue – A_toaster Apr 07 '18 at 06:37