Re: [nslu2-linux] Re: Catching SIGSEGV

  • David Given
    Message 1 of 5 , May 18, 2012
      On 18/05/12 23:09, clerew5 wrote:
      [...]
      > Ah! I had thought that all signals would be passed to all threads (which is indeed the case for signals arising from outside).

      This is definitely getting out of my comfort zone (signals and threads
      mix like oil and cats), but I was under the impression that outside
      signals are sent to a single thread *at random* that has the signal
      unblocked? So you control which thread you want to receive the signals
      by blocking them from everywhere else.
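
      A minimal sketch of that idea, assuming POSIX threads: block the signal before any threads are created (new threads inherit the mask), then let one dedicated thread collect it with sigwait(). The names sig_waiter, wait_for_signal and demo_dedicated_signal_thread are made up for illustration, and SIGUSR1 is only a stand-in for whatever signal arrives from outside.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>
#include <unistd.h>

/* Hypothetical bookkeeping for the sketch: which signals to wait for,
   and which one actually arrived. */
struct sig_waiter {
    sigset_t set;
    int received;
};

/* The dedicated thread: sigwait() consumes a pending signal from the set. */
static void *wait_for_signal(void *arg) {
    struct sig_waiter *w = arg;
    int signum = 0;
    if (sigwait(&w->set, &signum) == 0)
        w->received = signum;
    return NULL;
}

/* Returns the signal number the dedicated thread saw, or -1 on error. */
int demo_dedicated_signal_thread(void) {
    struct sig_waiter w;
    pthread_t tid;

    w.received = 0;
    sigemptyset(&w.set);
    sigaddset(&w.set, SIGUSR1);

    /* Block before creating threads: new threads inherit the mask. */
    if (pthread_sigmask(SIG_BLOCK, &w.set, NULL) != 0)
        return -1;
    if (pthread_create(&tid, NULL, wait_for_signal, &w) != 0)
        return -1;

    /* Process-directed signal, standing in for one sent from outside. */
    if (kill(getpid(), SIGUSR1) != 0)
        return -1;
    pthread_join(tid, NULL);
    return w.received;
}
```

      Build with -pthread. Note that this only covers signals sent to the process as a whole; a SIGSEGV caused by a thread's own bad access is delivered to that thread, whatever the other threads have blocked.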

      [...]
      > void errHandler(int signum, siginfo_t *info, void *ptr) {
      > int count = backtrace( tracePtrs, 100 );
      > backtrace_symbols_fd(tracePtrs, count, 2);
      [...]
      > But the stack it prints bears no resemblance to what I get from 'bt' in gdb. It seems to start from errHandler, but after that it bears no resemblance to anything recognizable; and it is not just because it arises within a handler, because I have manually invoked it from elsewhere in the program, and it still does not work.

      I've tried the test program in the backtrace man page on my armhf box,
      and it doesn't work. I'm afraid that it's possible that backtrace simply
      doesn't work on ARM.

      > My plan is to embed the whole program (which has to run 24/7) within a shell script which observes the failures, records what it can in a suitable file, and then restarts the program. But it is important that things should be cleaned up before the program is removed...

      You may be treading on thin ice here. Depending on what causes the seg
      fault, it's quite possible that the system will be in a bad state ---
      for example, if you call fwrite(), and the buffer is unreadable, then
      it's entirely likely that the signal will be thrown while it's in the
      middle of modifying the stdio state. Which means that trying to use
      stdio again will hang, crash, etc. The magic keyword to search for is
      'async signal safe'.

      (Related: there is a very limited list of operations that you can safely
      do inside a signal handler. Calling exit() is not one of them! See here:
      https://www.securecoding.cert.org/confluence/display/seccode/SIG30-C.+Call+only+asynchronous-safe+functions+within+signal+handlers)

      This means that once your program's crashed, you may not be able to
      safely clean up afterwards.
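
      For what it's worth, here is a rough sketch of a handler that sticks to async-signal-safe calls: write() instead of stdio, and _exit() instead of exit(). The names crash_handler and install_crash_handler and the exit code convention (128 plus the signal number) are my own choices for illustration, not anything taken from the program under discussion.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Only async-signal-safe calls in here: write() and _exit(). */
static void crash_handler(int signum, siginfo_t *info, void *ctx) {
    static const char msg[] = "fatal signal caught, exiting\n";
    ssize_t n;

    (void)info;
    (void)ctx;
    n = write(STDERR_FILENO, msg, sizeof msg - 1);
    (void)n;                 /* nothing useful to do if this fails */
    _exit(128 + signum);     /* skips atexit handlers and stdio flushing */
}

/* Installs the handler for SIGSEGV; returns 0 on success, -1 on error. */
int install_crash_handler(void) {
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = crash_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}
```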

      What sort of cleanup do you need to do? Recording the program's state,
      or freeing resources? If the latter, is there any way you can persuade a
      different process to do the cleanup for you? That way, you can just let
      your program crash without needing to do any actual work from your
      signal handler.
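
      A sketch of what I mean by a different process, assuming a plain fork()/waitpid() supervisor: the parent stays healthy, so it can use stdio or anything else when it notices the child was killed. supervise, crashing_worker and healthy_worker are hypothetical names for this example only.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs worker() in a child process. Returns 0 if the child exited
   normally, the signal number if a signal killed it, -1 on error. */
int supervise(int (*worker)(void)) {
    int status = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(worker());             /* child: do the real work */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (WIFSIGNALED(status)) {
        /* The cleanup would go here, in the parent, where the
           process state is intact and ordinary library calls are fine. */
        return WTERMSIG(status);
    }
    return 0;
}

/* Stand-in for a worker that crashes. */
int crashing_worker(void) {
    signal(SIGSEGV, SIG_DFL);        /* make sure the default action applies */
    raise(SIGSEGV);
    return 0;
}

/* Stand-in for a worker that finishes cleanly. */
int healthy_worker(void) {
    return 0;
}
```

      Here supervise(crashing_worker) reports SIGSEGV and supervise(healthy_worker) reports 0; a real supervisor would loop and restart the worker after cleaning up.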

      --
      ┌─── dg@cowlark.com ───── http://www.cowlark.com ─────

      │ "Never attribute to malice what can be adequately explained by
      │ stupidity." --- Nick Diamos (Hanlon's Razor)
    • clerew5
      Message 2 of 5 , May 21, 2012
        --- In nslu2-linux@yahoogroups.com, David Given <dg@...> wrote:
        >
        > On 18/05/12 23:09, clerew5 wrote:
        > [...]
        > > Ah! I had thought that all signals would be passed to all threads (which is indeed the case for signals arising from outside).
        >
        > This is definitely getting out of my comfort zone (signals and threads
        > mix like oil and cats), but I was under the impression that outside
        > signals are sent to a single thread *at random* that has the signal
        > unblocked? So you control which thread you want to receive the signals
        > by blocking them from everywhere else.

        Yes, that is what I thought, and the thread in question was the only one allowed to see the signals (the prime purpose of that thread is to catch people who suddenly press buttons). But it seems that SIGSEGV (unless generated externally) is only sent to the thread it happened in, or maybe its parents.

        > > But the stack it prints bears no resemblance to what I get from 'bt' in gdb. It seems to start from errHandler, but after that it bears no resemblance to anything recognizable; and it is not just because it arises within a handler, because I have manually invoked it from elsewhere in the program, and it still does not work.
        >
        > I've tried the test program in the backtrace man page on my armhf box,
        > and it doesn't work. I'm afraid that it's possible that backtrace simply
        > doesn't work on ARM.

        Actually, it is doing better than I thought. Here is some actual output:

        ./heat(pthread_create+0x710)[0x9da0]
        /lib/libc.so.6(__default_rt_sa_restorer_v2+0x0)[0x4010bcf0]
        /lib/libc.so.6(_IO_printf+0x34)[0x40120dfc]
        ./heat[0x14a18]
        ./heat[0x17274]
        ./heat(pthread_create+0x680)[0x9d10]

        The addresses within [...] are indeed the addresses of code being obeyed all down the stack. But [0x9da0] is NOT within pthread_create (nm shows it is actually within my handler). Likewise [0x9d10] is within 'main', and [0x14a18] and [0x17274] are at identifiable places which gave me the clue as to where the fault was happening (though there appears to be one stack frame which does not appear at all).

        BUT I then assumed that the claimed routines shown within /lib/libc.so.6 would be equally bogus, whereas research using /proc/<pid>/maps and nm showed that they were in fact correct, and the fault was indeed in printf (I had mistyped '%d' as '%n').

        So there is definitely a bug in backtrace which is causing wrong identification within code compiled by myself, but which operates fine within the system libraries.
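
        One possible explanation, offered as an assumption and not something established in this thread: the glibc backtrace man page says symbol names may be unavailable unless the program is linked with -rdynamic, and that names of static functions are never exposed. Without that option the lookup may simply fall back to the nearest symbol it can see in the executable, which would account for the pthread_create labels. A minimal sketch of the call sequence, with the hypothetical wrapper name dump_backtrace:

```c
#include <execinfo.h>

#define MAX_FRAMES 100

/* Writes the current call stack to fd and returns the number of frames.
   Assumed build line for readable names in your own code:
       gcc -g -rdynamic prog.c -o prog
   Static functions will still show up as bare addresses. */
int dump_backtrace(int fd) {
    void *frames[MAX_FRAMES];
    int count = backtrace(frames, MAX_FRAMES);

    /* backtrace_symbols_fd() writes straight to the descriptor
       without calling malloc(). */
    backtrace_symbols_fd(frames, count, fd);
    return count;
}
```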
        >
        > > My plan is to embed the whole program (which has to run 24/7) within a shell script which observes the failures, records what it can in a suitable file, and then restarts the program. But it is important that things should be cleaned up before the program is removed...
        >
        > You may be treading on thin ice here. Depending on what causes the seg
        > fault, it's quite possible that the system will be in a bad state ---
        > for example, if you call fwrite(), and the buffer is unreadable, then
        > it's entirely likely that the signal will be thrown while it's in the
        > middle of modifying the stdio state. Which means that trying to use
        > stdio again will hang, crash, etc. The magic keyword to search for is
        > 'async signal safe'.

        Yes, you have to be aware of what code you are going to obey, and what exit() is likely to do, and there is a risk that these might trigger the same fault again. But in my case the system is controlling a heating system and turning real boilers on and off, and it is absolutely essential that, upon a crash, the boilers are most definitely left OFF. Also, there are a few variables that ought to be preserved in permanent storage.
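
        A sketch of how that cleanup could be kept to async-signal-safe calls, assuming the boiler control and the state file are both reachable through file descriptors opened at startup. Everything here is hypothetical: the saved_state layout, the single "0" byte meaning off, and the function names. The point is only that the handler path needs nothing beyond write().

```c
#include <unistd.h>

/* Hypothetical set of variables worth preserving across a restart. */
struct saved_state {
    int target_temp;
    int mode;
};

/* write() may transfer fewer bytes than asked; loop until done. */
static int write_all(int fd, const void *buf, size_t len) {
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Safe to call from a signal handler: it uses only write() on
   descriptors that were opened earlier. Returns 0 on success. */
int emergency_shutdown(int boiler_fd, int state_fd,
                       const struct saved_state *s) {
    int rc = write_all(boiler_fd, "0", 1);   /* boiler off comes first */

    if (write_all(state_fd, s, sizeof *s) != 0)
        rc = -1;
    return rc;
}
```

        Turning the boiler off is done first so that a failure while saving state cannot leave it on.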

        But so far it all seems to be working fine. The system crashed, restarted itself within one second, and crashed again shortly after - until eventually the temperatures had risen to a point where the offending piece of code did not need to be called anymore.