Re: smtp threads not spawning?
- On Mon, May 13, 2013 at 12:57:06PM -0600, Curtis wrote:
> We are seeing an intermittent issue in our Postfix logs where we seeJust to make the language less jarring, Postfix is not "multi-threaded",
> all outbound threads (smtp) stop delivering email or logging
> anything while the active queue continues to grow.
these are "processes" not "threads". Of course each process is its own
single thread, but this is not the usual way to talk about the situation.
> This indicates to me that all active smtp threads are hanging,How long is the time between the last outbound delivery before everything
> since nothing from the smtp threads are recorded in the logs at all.
freezes and the first delivery once the freeze stops. The length of
the delay may reveal its origin. Report both log messages.
> During this time, inbound email is coming in fine and smtpd continuesIdle smtp(8) processes that are not given any work exit after 100s,
> to log activity, while the smtp threads slowly die one by one, over
> the course of several minutes.
each process otherwise waits its turn to accept new delivery requests
by acquiring an exclusive lock on the /var/spool/postfix/pid/unix.smtp
file and then blocking to accept a queue manager request.
If your kernel unix-domain socket code has a bug where a connection
request from the queue manager is "lost" (fails to be delivered to
the process holding the lock that is making the accept(2) system call,
then you'll observe the freeze you're reporting until the queue manager
gives up and tries another connection.
That timeout is 18000s or 5hours. Another kernel bug I've seen on some
systems loses data sent beween the queue manager and trivial rewrite,
this freezes the queue manager, and mail delivery stops. A watchdog
timer causes a queue manager exit after 1000s. Then delivery resumes.
> Once all smtp threads finally die, theDoes the queue manager process number change when this happens?
> number of smtp threads instantly jumps to the max of 110 and outbound
> email delivery (and logging) continues.
Likely your queue manager restarted. Any warnings, errors, panics,
fatal errors etc., logged by the queue manager?
> We are going to try to catch it when it's actually happening so thatWhen you say "active" queue, you are perhaps speaking loosely.
> we can run an strace on one of these hung smtp processes... but I'm
> curious if these symptoms are something others have seen before and
> could share some insights as to the possible cause. What I'm most
> curious about is why would Postfix wait for all existing smtp
> threads to die before spawning new threads to handle a rapidly
> growing active queue?
More likely the queue manager is frozen and all the mail is piling in
"incoming". In which case logging will show lots of mail coming into
"active" (logged by qmgr) in parallel with the start of deliveries.
> >> So, if qmgr is still running, then my question remains the same... sinceOh, please.
> >> the active queue is growing what are possible reasons why new smtp
> >> threads would not be spawning until every last active thread gives up on
> >> this non-responsive mail server?
> > See the first example in my first reply: all mail is sent to the
> > deferred queue.
> > $ grep 'status=deferred' /the/maillog/file
> Sorry, I need to figure out how to simplify my question as the extra
> details I've provided keep getting you focused on the wrong thing (I
> understand that you're probably just skimming my emails... I'm impressed
> that you have time to answer at all).
> Yes, at the time of each incident, there are a few threads thatThe queue manager WILL NOT READ THE QUEUE when the maximum
> eventually time out and throw a few emails into the deferred queue. That
> does not concern me. What concerns me is that while Postfix is waiting
> for these few threads to time out, the active queue is completely
> ignored and is growing rapidly.
number of processes is delivering mail, or when the maximum
concurrency per destination is reached.
So while those processes are timing out talking to a broken device,
the queue manager WILL NOT READ THE QUEUE.
Is this clear now?
- On Mon, May 13, 2013 at 02:46:04PM -0600, Curtis wrote:
> Ok, we have confirmed that the postfix/smtp threads are not justIf every smtp delivery agent is timing out sending mail to the same
> hanging... after several minutes of logging nothing, each thread
> exits with a log entries that looks like this (real host names/IPs
> masked with ---):
> May 9 13:36:50 --- postfix/smtp: 3b3cyK07Bzz41vV6:
> conversation with ---.---.com[---.---.---.---] timed out while
> sending message body
> May 9 13:36:51 --- postfix/smtp: 3b3cyK07Bzz41vV6: enabling
> PIX workarounds: disable_esmtp delay_dotcrlf for
> May 9 13:46:53 --- postfix/smtp: 3b3cyK07Bzz41vV6:
> to=<---@...>, relay=---.---.com[---.---.---.---]:25,
> delay=349260, delays=348054/0.01/604/602, dsn=4.4.2, status=deferred
> (conversation with ---.---.com[---.---.---.---] timed out while
> sending message body)
> ...with it being the same mail host that every active thread is
> trying to deliver to and eventually times out on.
destination, then part of the delay is due to all processes being busy
waiting for deliveries to that destination to complete. Have you set
a high destination concurrency limit for this transport? It is time to
share your main.cf (postconf -n) and master.cf settings.
If all the smtp(8) processes exit at essentially the same time, and
previously all were busy, you could see the behaviour you describe when
all mail goes to this same destination. For that you need to fix the
pix firewall, it is broken.
> So, if qmgr is still running, then my question remains the same...Postfix has process limits, destination concurrency limits, ... If
> since the active queue is growing what are possible reasons why new
> smtp threads would not be spawning until every last active thread
> gives up on this non-responsive mail server?
enough gunk ties up all available resources the system will appear
If only the output is blocked, you'll indeed new mail entering the active
queue (qmgr from=... logging) while no output deliveries take place.
Do you see this or not? Time to share some log samples showing typical
activity before, during and after the "freeze"
On 5/13/2013 5:29 PM, Wietse Venema wrote:
>> Yes, at the time of each incident, there are a few threads that
>> eventually time out and throw a few emails into the deferred queue. That
>> does not concern me. What concerns me is that while Postfix is waiting
>> for these few threads to time out, the active queue is completely
>> ignored and is growing rapidly.
> The queue manager WILL NOT READ THE QUEUE when the maximum
> number of processes is delivering mail,
Understood. During these incidents, we are no where close to the
maximum number of processes.
> or when the maximum
> concurrency per destination is reached.
> So while those processes are timing out talking to a broken device,
> the queue manager WILL NOT READ THE QUEUE.
In the cases where this has happened, the processes that are talking to
the broken device/destination are only utilizing a handful of smtp
processes, during which time, the active queue is building up a huge
backlog with messages that *do* have deliverable destinations.
After this small handful of processes finally time out, that is when the
flood gates open, and Postfix again starts delivering successfully
again... it is then that it starts using 110 processes in an effort to
catch up. I am trying to determine what is causing Postfix to wait
until every last process times out before it starts spawning new threads.
I'm out of time today. Hopefully I'll have time in the morning to
respond to the list of questions from Viktor Dukhovni... perhaps in the
process of answering his questions I'll stumble into what's causing my
Thanks for your efforts... they are appreciated.
> Is this clear now?