Postmortems/Winter 2017 Mail incident
A user's password was compromised by a spammer, who then used the account to send large volumes of spam. This degraded our mail server's performance and damaged our domain and IP reputation. Queued email was deleted, including some legitimate mail. Messages sent by UGCS users were rejected or sent to spam.
January 27ish The user's password is compromised by a spammer. Our mail server is immediately hijacked to send large quantities of spam as an authenticated user with a legitimate From: address.
January 27ish Automated limiters on receiving mail servers at AOL and Yahoo reject our mail. Google merely blackholes it. This causes thousands of messages to build up in the queue.
January 27-February 13 The spammer continues to send spam through the server. The automated rate limits are only temporary, and after each one expires, the mail server retries sending and gets banned again.
February 13 I look at the monitoring page and see the huge queue. I investigate, find that a user account was compromised and is sending spam, and disable the account. Since nearly all deferred messages are spam, I delete the entire queue.
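The investigation step above comes down to finding which sender dominates the queue. A minimal sketch of that tally follows, assuming the envelope sender addresses have already been extracted from the queue listing by some other means; the `top_senders` helper and the sample addresses are hypothetical and not part of our tooling.

```python
from collections import Counter


def top_senders(envelope_senders, n=5):
    """Return the n most frequent sender addresses with their message counts.

    Addresses are stripped and lowercased so trivial variations of the
    same address are counted together (an assumption made for this sketch).
    """
    counts = Counter(addr.strip().lower() for addr in envelope_senders)
    return counts.most_common(n)


if __name__ == "__main__":
    # Hypothetical sample data standing in for a real queue listing.
    queued = [
        "alice@example.org",
        "mallory@example.org",
        "mallory@example.org",
        "Mallory@example.org ",
    ]
    print(top_senders(queued, n=1))  # [('mallory@example.org', 3)]
```

A single account responsible for nearly all queued mail is a strong hint that the account is compromised, which is what the queue showed here.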
- Our monitoring system did not have any automated alerts.
- We were not configured to receive feedback from any downstream mail providers
- We completely lack up/down monitoring and health checks
- We do not enforce any strength requirements on user passwords
- Authenticated mail must still pass RBL checks, but the offending server was not on any of the blacklists we use.
- Fixes for some of these contributing factors had been planned earlier but were never completed.
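For context on the RBL check mentioned in the list above, this is a rough sketch of how a DNS-based blacklist lookup generally works: the IPv4 octets are reversed, the list's zone is appended, and the resulting name is resolved. The zone `dnsbl.example.org` is a placeholder rather than one of the lists we actually query, and both function names are made up for illustration.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name used to look up an IPv4 address in a blacklist zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError if ip is not valid IPv4
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone):
    """Return True if the lookup name resolves, i.e. the address appears listed.

    A failed lookup is treated as "not listed". Note that a resolver
    outage also raises socket.gaierror, so this sketch cannot tell
    "not listed" apart from "lookup failed".
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False
    return True


if __name__ == "__main__":
    # 192.0.2.1 is a documentation address; the zone is a placeholder.
    print(dnsbl_query_name("192.0.2.1", "dnsbl.example.org"))
    # 1.2.0.192.dnsbl.example.org
```

A check like this only catches clients that some list already knows about, which is why it did not help in this incident.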
The offending account was disabled, and deferred messages were all deleted.
Mail sent by UGCS users during the incident was significantly delayed or not delivered to Yahoo and AOL addresses (and possibly others). Messages sent to Gmail were delivered but almost always sent to spam. Our IP and domain reputation was significantly damaged, so delivering mail will be harder until it recovers.
- Implement DKIM signing and configure feedback loops with Google and Yahoo
- Add end-to-end mail health checks covering multiple providers
- Scan existing passwords for weak ones. Implement password strength checks in the shell (and possibly other password reset avenues)
- Investigate using more DNS RBL providers
- Implement a real issue tracker
- Add automated alerts to Munin
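The automated alert in the last item could start as a simple threshold on mail queue size. The sketch below shows only the threshold logic. The `classify_queue` name and the 200 and 1000 defaults are assumptions chosen for illustration, not measured values, and would need tuning against our normal queue size.

```python
def classify_queue(size, warning=200, critical=1000):
    """Map a mail queue size to an alert level: "ok", "warning" or "critical".

    The default thresholds are illustrative assumptions only.
    """
    if size < 0:
        raise ValueError("queue size cannot be negative")
    if warning > critical:
        raise ValueError("warning threshold must not exceed critical threshold")
    if size >= critical:
        return "critical"
    if size >= warning:
        return "warning"
    return "ok"


if __name__ == "__main__":
    for size in (12, 350, 4000):
        print(size, classify_queue(size))
    # 12 ok
    # 350 warning
    # 4000 critical
```

With thousands of deferred messages in the queue, even a loose threshold like this would have fired early in the incident instead of the queue going unnoticed until someone opened the monitoring page.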