
Catching SIGSEGV

  • clerew5
    Message 1 of 5 , May 17 9:59 AM
      I am trying to trap a SIGSEGV (also SIGBUS, SIGILL and SIGFPE, but I have not even managed to generate those yet :-( ).

      I can easily get the program to print "Segmentation Violation" on stderr, but I cannot find any way to catch the signal before that message appears.

      I have written
      sigemptyset(&set);
      sigaddset(&set, SIGUSR1);
      sigaddset(&set, SIGUSR2);
      sigaddset(&set, SIGTERM);
      sigaddset(&set, SIGINT);
      sigaddset(&set, SIGSEGV);
      sigaddset(&set, SIGBUS);
      sigaddset(&set, SIGILL);
      sigaddset(&set, SIGFPE);
      sigaddset(&set, SIGSYS);
      and I have created a pthread which contains
      sigwait(&set, &signal);
      printf("Display signal %d\n", signal);
      switch (signal) {
        case SIGSEGV:
        case SIGBUS:
        case SIGILL:
        case SIGFPE:
        case SIGSYS:
          fatal("Heat failed: %s", strerror(errno));
        case SIGTERM:
        case SIGINT:
          warn("Heat terminated");
          exit(0);
        default:
          fatal("Unanticipated signal %d", signal);
      }
      but the SIGSEGV is never caught (though SIGTERM is caught fine, as is an explicit "kill -s SEGV").

      How can I fix this? 'warn' and 'fatal' do what you would expect, and 'fatal' calls 'exit(-1)'.

      Note also that ARM processors tend not to raise SIGFPE even if you do the most terrible divisions by zero. Apparently there is now a Linux patch to fix this.
    • David Given
      Message 2 of 5 , May 17 3:12 PM
        On 17/05/12 17:59, clerew5 wrote:
        [...]
        > sigemptyset(&set);
        > sigaddset(&set, SIGUSR1);
        > sigaddset(&set, SIGUSR2);
        > sigaddset(&set, SIGTERM);
        > sigaddset(&set, SIGINT);
        > sigaddset(&set, SIGSEGV);
        > sigaddset(&set, SIGBUS);
        > sigaddset(&set, SIGILL);
        > sigaddset(&set, SIGFPE);
        > sigaddset(&set, SIGSYS);
        > and I have created a pthread which contains
        > sigwait(&set, &signal);

        I'm not quite sure what you're trying to do here --- AIUI sigwait()
        blocks until a signal *for the blocked thread* is received, so no code
        that can generate the signal will be run... (unless you send it manually
        from another thread).

        If you're trying to catch a signal thrown by code in a different thread,
        then I don't think that will work.

        The simplest approach is to just register a SIGSEGV signal handler in
        the thread that's going to be doing the work. Then, when the signal is
        thrown, your handler will run. If you want to do the work in a different
        thread, then you'll need some sort of IPC between threads (but beware
        that you can only run a small subset of functions safely from inside a
        signal handler).
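A minimal sketch of the handler approach described above, assuming a Linux/glibc target. The names `on_fault`, `safe_write` and `install_fault_handlers` are hypothetical, not from the original program. The handler restricts itself to `write()` and `_exit()`, which are on the POSIX async-signal-safe list, and prints the faulting address from `si_addr` by formatting the hex digits by hand. A likely reason the `sigwait()` version never sees the fault: POSIX leaves the behaviour undefined when SIGSEGV, SIGBUS, SIGILL or SIGFPE is generated by a real fault while blocked, and Linux applies the default action to the faulting thread's process in that case.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* write() wrapper that ignores the result; nothing useful can be done
 * about a failed write from inside a fault handler. */
static void safe_write(const char *buf, size_t len)
{
    ssize_t r = write(STDERR_FILENO, buf, len);
    (void)r;
}

/* Runs on the thread that faulted. Only async-signal-safe calls
 * (write and _exit) are used; no stdio, no malloc, no exit(). */
static void on_fault(int signum, siginfo_t *info, void *context)
{
    static const char prefix[] = "fatal signal, fault address 0x";
    static const char hex[] = "0123456789abcdef";
    uintptr_t addr = (uintptr_t)info->si_addr;
    char digits[2 * sizeof addr + 1];
    size_t n = 2 * sizeof addr;
    size_t i;

    (void)signum;
    (void)context;

    /* Format the address as fixed-width hex, least significant digit last. */
    for (i = 0; i < n; i++) {
        digits[n - 1 - i] = hex[addr & 0xf];
        addr >>= 4;
    }
    digits[n] = '\n';

    safe_write(prefix, sizeof prefix - 1);
    safe_write(digits, n + 1);
    _exit(1);
}

/* Install on_fault for the synchronous fault signals.
 * Returns 0 on success, -1 if any sigaction() call fails. */
int install_fault_handlers(void)
{
    static const int signals[] = { SIGSEGV, SIGBUS, SIGILL, SIGFPE };
    struct sigaction sa;
    size_t i;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);

    for (i = 0; i < sizeof signals / sizeof signals[0]; i++)
        if (sigaction(signals[i], &sa, NULL) != 0)
            return -1;
    return 0;
}
```

Calling `install_fault_handlers()` once from `main()` before starting any threads should be enough, since signal dispositions are shared by every thread in the process.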

        A more complicated and slower but much more flexible approach is to use
        ptrace to monitor the running thread; when a signal is thrown, it will
        be suspended and your monitor will be sent a message. It all depends
        what you want to do.

        (Also, check out libsigsegv: http://libsigsegv.sourceforge.net)

        --
        ┌─── dg@cowlark.com ───── http://www.cowlark.com ─────

        │ "Never attribute to malice what can be adequately explained by
        │ stupidity." --- Nick Diamos (Hanlon's Razor)
      • clerew5
        Message 3 of 5 , May 18 3:09 PM
          --- In nslu2-linux@yahoogroups.com, David Given <dg@...> wrote:

          > I'm not quite sure what you're trying to do here --- AIUI sigwait()
          > blocks until a signal *for the blocked thread* is received, so no code
          > that can generate the signal will be run... (unless you send it manually
          > from another thread).

          Ah! I had thought that all signals would be passed to all threads (which is indeed the case for signals arising from outside).
          >
          > If you're trying to catch a signal thrown by code in a different thread,
          > then I don't think that will work.
          >
          > The simplest approach is to just register a SIGSEGV signal handler in
          > the thread that's going to be doing the work. Then, when the signal is
          > thrown, your handler will run. If you want to do the work in a different
          > thread, then you'll need some sort of IPC between threads (but beware
          > that you can only run a small subset of functions safely from inside a
          > signal handler).

          Yes, I have now set up a sigaction-type handler in main(), and it seems to be inherited by processes created subsequently to that. It sort of works, but I need to know where the SIGSEGV came from, in order to know where to look for the bug. Here is my current handler:

          void errHandler(int signum, siginfo_t *info, void *ptr) {
              int count = backtrace( tracePtrs, 100 );
              backtrace_symbols_fd(tracePtrs, count, 2);
              fatal("Heat failed with %s: si_code=%d",
                    strsignal(signum), info->si_code);
          }

          errAction.sa_sigaction = errHandler;
          errAction.sa_flags = SA_SIGINFO;
          sigaction(SIGSEGV, &errAction, NULL);
          sigaction(SIGBUS, &errAction, NULL);
          sigaction(SIGILL, &errAction, NULL);
          sigaction(SIGFPE, &errAction, NULL);
          sigaction(SIGSYS, &errAction, NULL);


          The call to 'fatal' seems to work fine (given that it has to deal with all those signals, that is as much information as I can extract from info).

          But, for tracing the bug, I need to know where the signal came from. Slugos 5.3 doesn't do core dumps (for good reason), so I have tried to generate a backtrace (#include <execinfo.h> - note that tracePtrs is declared globally). But the stack it prints bears no resemblance to what I get from 'bt' in gdb. It seems to start from errHandler, but after that it bears no resemblance to anything recognizable; and it is not just because it arises within a handler, because I have manually invoked it from elsewhere in the program, and it still does not work.

          So does anybody have any experience of using 'backtrace' in Slugos?

          My plan is to embed the whole program (which has to run 24/7) within a shell script which observes the failures, records what it can in a suitable file, and then restarts the program. But it is important that things should be cleaned up before the program is removed, which is why the call of 'fatal' is used - it then calls exit, which in turn calls all my 'atexit' functions, and calls the cleanup functions on those classes which have them.

          And indeed it caught several SIGSEGVs this morning (but they appear to occur randomly when no one is looking, and obviously I need to find the cause).
        • David Given
          Message 4 of 5 , May 18 4:20 PM
            On 18/05/12 23:09, clerew5 wrote:
            [...]
            > Ah! I had thought that all signals would be passed to all threads (which is indeed the case for signals arising from outside).

            This is definitely getting out of my comfort zone (signals and threads
            mix like oil and cats), but I was under the impression that outside
            signals are sent to a single thread *at random* that had the signal
            unblocked? So you control which thread you want to receive the signals
            by blocking them from everywhere else.
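The "block everywhere, wait in one thread" pattern can be sketched as follows. This is an illustration under assumptions, not the original program: `start_signal_watcher` and `watcher` are hypothetical names, and the signal set is limited to asynchronous signals (SIGTERM, SIGINT, SIGUSR1). Fault signals such as SIGSEGV are deliberately left out, because they are delivered to the thread that caused them and should not be blocked.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>
#include <stddef.h>

/* Signals that the watcher thread collects on behalf of the process. */
static sigset_t async_set;

/* Thread body: sleep in sigwait() until one signal from async_set is
 * pending, then store its number in the int that arg points to. */
static void *watcher(void *arg)
{
    int *received = arg;
    int signo = 0;

    if (sigwait(&async_set, &signo) == 0)
        *received = signo;
    return NULL;
}

/* Block the asynchronous signals in the calling thread, then start the
 * watcher. Threads created afterwards inherit the blocked mask, so only
 * sigwait() in the watcher ever consumes these signals.
 * Returns 0 on success, -1 on failure. */
int start_signal_watcher(pthread_t *tid, int *received)
{
    sigemptyset(&async_set);
    sigaddset(&async_set, SIGTERM);
    sigaddset(&async_set, SIGINT);
    sigaddset(&async_set, SIGUSR1);

    if (pthread_sigmask(SIG_BLOCK, &async_set, NULL) != 0)
        return -1;
    if (pthread_create(tid, NULL, watcher, received) != 0)
        return -1;
    return 0;
}
```

The call to `start_signal_watcher()` needs to happen before any other thread is created; a thread started earlier would still have the signals unblocked and could receive them instead.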

            [...]
            > void errHandler(int signum, siginfo_t *info, void *ptr) {
            > int count = backtrace( tracePtrs, 100 );
            > backtrace_symbols_fd(tracePtrs, count, 2);
            [...]
            > But the stack it prints bears no resemblance to what I get from 'bt' in gdb. It seems to start from errHandler, but after that it bears no resemblance to anything recognizable; and it is not just because it arises within a handler, because I have manually invoked it from elsewhere in the program, and it still does not work.

            I've tried the test program in the backtrace man page on my armhf box,
            and it doesn't work. I'm afraid that it's possible that backtrace simply
            doesn't work on ARM.

            > My plan is to embed the whole program (which has to run 24/7) within a shell script which observes the failures, records what it can in a suitable file, and then restarts the program. But it is important that things should be cleaned up before the program is removed...

            You may be treading on thin ice here. Depending on what causes the seg
            fault, it's quite possible that the system will be in a bad state ---
            for example, if you call fwrite(), and the buffer is unreadable, then
            it's entirely likely that the signal will be thrown while it's in the
            middle of modifying the stdio state. Which means that trying to use
            stdio again will hang, crash, etc. The magic keyword to search for is
            'async signal safe'.

            (Related: there is a very limited list of operations that you can safely
            do inside a signal handler. Calling exit() is not one of them! See here:
            https://www.securecoding.cert.org/confluence/display/seccode/SIG30-C.+Call+only+asynchronous-safe+functions+within+signal+handlers)

            This means that once your program's crashed, you may not be able to
            safely clean up afterwards.

            What sort of cleanup do you need to do? Recording the program's state,
            or freeing resources? If the latter, is there any way you can persuade a
            different process to do the cleanup for you? That way, you can just let
            your program crash without needing to do any actual work from your
            signal handler.
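One way to let a different process do the cleanup is a small supervisor that forks the worker and inspects how it ended. The sketch below is an assumption-laden illustration: `run_supervised`, `demo_crash` and `demo_cleanup` are hypothetical names, and the real cleanup (for example switching outputs off) would replace the stub. The point is that the cleanup runs in the parent, whose memory and stdio state were never touched by the crash.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run worker() in a child process, wait for it, then run cleanup() in
 * the parent. Returns the terminating signal number if the child was
 * killed by a signal, 0 if it exited on its own, -1 on fork/waitpid error. */
int run_supervised(void (*worker)(void), void (*cleanup)(void))
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        worker();
        _exit(0);           /* child: skip the parent's atexit handlers */
    }

    /* Retry if the wait itself is interrupted by a signal. */
    while (waitpid(pid, &status, 0) < 0)
        if (errno != EINTR)
            return -1;

    cleanup();              /* runs in the healthy parent process */
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

/* Demonstration stubs (hypothetical): a worker that dies from SIGSEGV
 * and a cleanup routine that only records that it ran. */
static int cleanup_ran;

static void demo_cleanup(void)
{
    cleanup_ran = 1;
}

static void demo_crash(void)
{
    raise(SIGSEGV);         /* default action terminates the child */
}
```

Calling `run_supervised(demo_crash, demo_cleanup)` in a loop would give the restart-on-failure behaviour, with the return value available for logging which signal ended the worker.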

          • clerew5
            Message 5 of 5 , May 21 2:43 AM
              --- In nslu2-linux@yahoogroups.com, David Given <dg@...> wrote:
              >
              > On 18/05/12 23:09, clerew5 wrote:
              > [...]
              > > Ah! I had thought that all signals would be passed to all threads (which is indeed the case for signals arising from outside).
              >
              > This is definitely getting out of my comfort zone (signals and threads
              > mix like oil and cats), but I was under the impression that outside
              > signals are sent to a single thread *at random* that had the signal
              > unblocked? So you control which thread you want to receive the signals
              > by blocking them from everywhere else.

              Yes, that is what I thought, and the thread in question was the only one allowed to see the signals (the prime purpose of that thread is to catch people who suddenly press buttons). But it seems that SIGSEGV (unless generated externally) is only sent to the thread it happened in, or maybe its parents.

              > > But the stack it prints bears no resemblance to what I get from 'bt' in gdb. It seems to start from errHandler, but after that it bears no resemblance to anything recognizable; and it is not just because it arises within a handler, because I have manually invoked it from elsewhere in the program, and it still does not work.
              >
              > I've tried the test program in the backtrace man page on my armhf box,
              > and it doesn't work. I'm afraid that it's possible that backtrace simply
              > doesn't work on ARM.

              Actually, it is doing better than I thought. Here is some actual output:

              ./heat(pthread_create+0x710)[0x9da0]
              /lib/libc.so.6(__default_rt_sa_restorer_v2+0x0)[0x4010bcf0]
              /lib/libc.so.6(_IO_printf+0x34)[0x40120dfc]
              ./heat[0x14a18]
              ./heat[0x17274]
              ./heat(pthread_create+0x680)[0x9d10]

              The addresses within [...] are indeed the addresses of code being obeyed all down the stack. But [0x9da0] is NOT within pthread_create (nm shows it is actually within my handler). Likewise [0x9d10] is within 'main', and [0x14a18] and [0x17274] are at identifiable places which gave me the clue as to where the fault was happening (though there appears to be one stack frame which does not appear at all).

              BUT I then assumed that the claimed routines shown within /lib/libc.so.6 would be equally bogus, whereas research using /proc/<pid>/maps and nm showed that they were in fact correct, and the fault was indeed in printf (I had mistyped '%d' as '%n').

              So there is definitely a bug in backtrace which causes wrong identification within code compiled by myself, but which operates fine within the system libraries.
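A possible explanation for the misleading names, offered as an assumption rather than a diagnosis: the glibc documentation for `backtrace_symbols_fd()` notes that symbol names may be unavailable unless the program is linked with `-rdynamic`, and that `static` functions are never exposed. Without exported symbols, an address in the main executable may be reported relative to the nearest symbol that is visible, which could be an entry such as `pthread_create`. The sketch below wraps the two calls in a hypothetical helper, `dump_backtrace`; the build line in the comment uses an assumed source file name, and the `-funwind-tables` flag is a commonly suggested option for ARM that may or may not be needed on a given toolchain.

```c
/* Assumed build line (GCC with GNU ld; file name is hypothetical):
 *   gcc -g -rdynamic -funwind-tables heat.c -o heat
 * -rdynamic exports the executable's own symbols so they can be named. */
#include <execinfo.h>
#include <unistd.h>

#define MAX_FRAMES 100

/* Kept static so no stack space or heap allocation is needed at the
 * moment the trace is taken. */
static void *frames[MAX_FRAMES];

/* Write the current call stack to fd, one frame per line.
 * Returns the number of frames captured. */
int dump_backtrace(int fd)
{
    int count = backtrace(frames, MAX_FRAMES);

    /* Unlike backtrace_symbols(), this variant does not call malloc(). */
    backtrace_symbols_fd(frames, count, fd);
    return count;
}
```

Raw addresses printed for frames without a name can still be resolved afterwards with `nm` or `addr2line` against the unstripped binary, which matches the manual lookup described above.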
              >
              > > My plan is to embed the whole program (which has to run 24/7) within a shell script which observes the failures, records what it can in a suitable file, and then restarts the program. But it is important that things should be cleaned up before the program is removed...
              >
              > You may be treading on thin ice here. Depending on what causes the seg
              > fault, it's quite possible that the system will be in a bad state ---
              > for example, if you call fwrite(), and the buffer is unreadable, then
              > it's entirely likely that the signal will be thrown while it's in the
              > middle of modifying the stdio state. Which means that trying to use
              > stdio again will hang, crash, etc. The magic keyword to search for is
              > 'async signal safe'.

              Yes, you have to be aware of what code you are going to obey, and what exit() is likely to do, and there is a risk that these might trigger the same fault again. But in my case the system is controlling a heating system and turning real boilers on and off, and it is absolutely essential that, upon a crash, the boilers are most definitely left OFF. Also, there are a few variables that ought to be preserved in permanent storage.

              But so far it all seems to be working fine. The system crashed, restarted itself within one second, and crashed again shortly after - until eventually the temperatures had risen to a point where the offending piece of code did not need to be called anymore.