We have configured two perdition servers (as front ends to 4 dovecot servers), using eDirectory as an LDAP backend, with anonymous queries. The perdition servers are load balanced using round robin (I also used least conns for a while) via a 6513 3BXL switch, using the embedded load balancer in the IOS, and not the dedicated blade. The hosts are two HP DL140 machines.

The clients connect with either imap/imaps/pop3/pop3s, and we connect to the backend servers with either imap or pop3. The conf is below. On the LDAP side, everything is properly indexed.

connection_logging
connect_relog 0
F mail
g nobody
imap_capability IMAP4rev1 SASL-IR SORT THREAD=REFERENCES MULTIAPPEND UNSELECT LITERAL+ IDLE CHILDREN NAMESPACE LOGIN-REFERRALS STARTTLS
map_library /usr/lib64/libperditiondb_ldap.so
map_library_opt "ldap://ldapserver:389/o=someorg?cn,nSCPAmailHost?sub?(& (uid=%25s)(objectClass=nSCPMailRecipient)(! (nSCPAmailMessageStore=inactive*)))"
server_resp_line
outgoing_server  imapold.somedomain
S all
timeout 0
u nobody
ssl_ca_accept_self_signed
ssl_cert_file /etc/perdition/perdition.crt.pem
ssl_cert_accept_self_signed
ssl_cert_accept_expired
ssl_cert_accept_not_yet_valid
ssl_key_file /etc/perdition/perdition.key.pem
ssl_no_cert_verify
ssl_no_cn_verify

Every 10-15 minutes on average, one of the perdition client processes (a fork from one of the 4 listeners - imap/imaps/pop3/pop3s) enters a loop (easily seen with both strace and ltrace), and while in that loop, the CPU the process is running on is at 100% usage.

For now, I've written a small health check monitor that checks for these runaway processes, and kills them.

While I cannot run perdition in full debug mode to check what is happening (due to the load of connections we get here), I can share the details I have, from both the logs and ltrace/strace.
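
As an aside on the health check monitor mentioned above: a monitor of that kind could look roughly like the sketch below. It is an illustration only, not the script actually running on these hosts. It assumes Linux /proc, assumes the processes are named perdition* (for example perdition.imaps), and the threshold and interval values are made up for the example. It samples the CPU ticks of each matching process twice and flags those that used close to a full CPU in between.

```python
"""Sketch of a watchdog for spinning perdition children (Linux /proc only)."""
import os
import signal
import time

PROC_PREFIX = "perdition"  # assumed process name prefix
CPU_THRESHOLD = 0.90       # assumed: fraction of one CPU counted as runaway
INTERVAL = 5.0             # assumed sampling interval, in seconds


def parse_stat(text):
    """Return (comm, utime + stime in clock ticks) from /proc/<pid>/stat."""
    # comm sits in parentheses and may itself contain spaces or ')'.
    start = text.index("(")
    end = text.rindex(")")
    comm = text[start + 1:end]
    fields = text[end + 2:].split()
    # fields[0] is the state (field 3), so utime (14) and stime (15)
    # are at indexes 11 and 12.
    return comm, int(fields[11]) + int(fields[12])


def sample():
    """Map pid -> CPU ticks for every process matching PROC_PREFIX."""
    ticks = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as fh:
                comm, used = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process went away while we were looking at it
        if comm.startswith(PROC_PREFIX):
            ticks[int(entry)] = used
    return ticks


def find_runaways(before, after, interval, hz, threshold=CPU_THRESHOLD):
    """Return pids whose CPU use between two samples reached the threshold."""
    runaways = []
    for pid, used in after.items():
        if pid not in before:
            continue  # started during the interval, judge it next round
        fraction = (used - before[pid]) / (hz * interval)
        if fraction >= threshold:
            runaways.append(pid)
    return sorted(runaways)


def watch():
    """Loop forever, sending SIGTERM to processes flagged as runaway."""
    hz = os.sysconf("SC_CLK_TCK")
    while True:
        before = sample()
        time.sleep(INTERVAL)
        after = sample()
        for pid in find_runaways(before, after, INTERVAL, hz):
            try:
                os.kill(pid, signal.SIGTERM)
            except ProcessLookupError:
                pass  # already gone
```

Calling watch() as root (or as the user the perdition processes run as) starts the loop. A single sample can also catch a session that is legitimately busy, so a more careful version would probably require several consecutive hits before killing anything.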

The logs show Re-Authentication failure for each of these sessions....

The ltrace of the looping process looks like this:

select(1024, 0x7fffed4aa940, 0, 0x7fffed4aa9c0, 0x7fffed4aa8a0) = 0
time(NULL) = 1231363480
vanessa_list_get_element(0x83a9ef0, 0x7fffed4aa644, 0x7fffed4aa690, 5, 0x7fffed4aa8a0) = 0x8397aa0
SSL_pending(0x8397250, 0x7fffed4aa644, 0x8397aa0, 5, 0x7fffed4aa8a0) = 0
select(1024, 0x7fffed4aa940, 0, 0x7fffed4aa9c0, 0x7fffed4aa8a0) = 0
time(NULL) = 1231363480
vanessa_list_get_element(0x83a9ef0, 0x7fffed4aa644, 0x7fffed4aa690, 5, 0x7fffed4aa8a0) = 0x8397aa0
SSL_pending(0x8397250, 0x7fffed4aa644, 0x8397aa0, 5, 0x7fffed4aa8a0) = 0
select(1024, 0x7fffed4aa940, 0, 0x7fffed4aa9c0, 0x7fffed4aa8a0) = 0
time(NULL) = 1231363480
vanessa_list_get_element(0x83a9ef0, 0x7fffed4aa644, 0x7fffed4aa690, 5, 0x7fffed4aa8a0) = 0x8397aa0
SSL_pending(0x8397250, 0x7fffed4aa644, 0x8397aa0, 5, 0x7fffed4aa8a0) = 0
select(1024, 0x7fffed4aa940, 0, 0x7fffed4aa9c0, 0x7fffed4aa8a0) = 0
time(NULL) = 1231363480

The strace looks like this:

select(1024, [5], NULL, [5], {0, 0}) = 0 (Timeout)
select(1024, [5], NULL, [5], {0, 0}) = 0 (Timeout)
......

Do any of you have an idea about what may be wrong?

best.

--Ariel
--
Ariel Biener, CISO
Tel-Aviv University CIT div.
e-mail: ariel@aristo.tau.ac.il
phone: 03-6406086
PGP key: http://www.tau.ac.il/~ariel/pgp.html