Postmortems/Winter 2017 Mail incident

Description

A spammer compromised a user's password and used the account to send large volumes of spam. This degraded our mail server's performance and damaged our domain and IP reputation. Messages sent by UGCS users were rejected or delivered to spam folders, and the queued mail deleted during cleanup included some legitimate messages.

Timeline

January 27ish A spammer compromises the user's password. Our mail server is immediately used to send large quantities of spam as an authenticated user with a legitimate From: address.

January 27ish Automated rate limiters on the receiving mail servers at AOL and Yahoo start rejecting our mail. Google merely blackholes it. The rejected messages are deferred for retry, so thousands of them build up in the queue.

January 27-February 13 The spammer continues to send spam through the server. The rate limits are only temporary: after each one expires, the mail server retries delivery and gets banned again.

February 13 I look at the monitoring page and see the huge queue. I investigate, find that a compromised user account is sending spam, and disable it. Since nearly all deferred messages are spam, I delete the entire queue.

Contributing Factor(s)

  • Our monitoring system did not have any automated alerts.
  • We were not configured to receive feedback from any downstream mail providers.
  • We completely lack up/down monitoring and health checks.
  • We do not enforce any strength requirements on user passwords.
  • Authenticated mail must still pass RBL checks, but the offending server was not on any of the blacklists we use.
  • Fixes for some of these factors had been planned earlier but were never carried out.
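
For reference, the RBL check mentioned above is a DNS lookup on the connecting IP address: the octets are reversed and queried under the blacklist's zone, and an A record in the answer means the address is listed. A minimal Python sketch, assuming IPv4 only; `dnsbl.example.org` is a placeholder zone and both function names are hypothetical, not part of our mail configuration:

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Return the DNS name to query: reversed IPv4 octets followed by the zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str) -> bool:
    """True if the blacklist zone returns an A record for this address."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        # No record (NXDOMAIN) means "not listed". A resolver failure also
        # lands here, so this sketch cannot tell the two cases apart.
        return False


# Placeholder address and zone, for illustration only.
print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
# prints 99.2.0.192.dnsbl.example.org
```

A lookup like this only helps when the sending address is already listed somewhere, which is why it did not stop this incident.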

Stabilization Steps

The offending account was disabled, and all deferred messages were deleted.

Impact

Mail sent by UGCS users during the incident was significantly delayed or not delivered to Yahoo and AOL addresses (and possibly others). Messages sent to Gmail were delivered but almost always sent to spam. Our IP and domain reputation was significantly damaged, so getting our mail delivered will be significantly harder until it recovers.

Corrective Actions

  • Implement DKIM signing and configure feedback loops with Google and Yahoo
  • Add end-to-end mail delivery health checks covering multiple providers
  • Scan existing passwords for weak ones. Implement password strength checks in the shell (and possibly other password reset avenues)
  • Investigate using more DNS RBL providers
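
The password strength check could look roughly like the sketch below. Every rule in it is an assumption chosen for illustration (the 12-character minimum, the three-character-class requirement, and the tiny `COMMON_PASSWORDS` sample); none of them is an agreed UGCS policy, and `password_problems` is a hypothetical name.

```python
import string
from typing import List

# Hypothetical sample; a real check would load a much larger wordlist.
COMMON_PASSWORDS = {"password", "123456", "qwerty", "letmein"}

CHARACTER_CLASSES = (
    string.ascii_lowercase,
    string.ascii_uppercase,
    string.digits,
    string.punctuation,
)


def password_problems(password: str, username: str, min_length: int = 12) -> List[str]:
    """Return a list of reasons to reject the password (empty if acceptable)."""
    problems = []
    if len(password) < min_length:
        problems.append(f"shorter than {min_length} characters")
    if password.lower() in COMMON_PASSWORDS:
        problems.append("appears in the common-password list")
    if username and username.lower() in password.lower():
        problems.append("contains the username")
    # Count how many of the four character classes the password draws from.
    classes_used = sum(
        any(ch in group for ch in password) for group in CHARACTER_CLASSES
    )
    if classes_used < 3:
        problems.append("uses fewer than 3 character classes")
    return problems
```

Returning the list of reasons, rather than a yes/no answer, lets the shell tell the user what to change. This only covers new passwords at change time; if existing passwords are stored hashed, scanning them for weak ones would instead mean testing candidate passwords from a wordlist against the stored hashes.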

Done

  • Implement a real issue tracker
  • Add automated alerts to Munin
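
Munin alerts are driven by per-field warning and critical thresholds that a plugin prints when it is run with the `config` argument. A minimal plugin sketch for the deferred queue follows, under loud assumptions: the thresholds, the `QUEUE_DIR` variable, and the default spool path are placeholders, and counting files in a directory may not match how our mail server actually stores deferred mail.

```python
import os
import sys

WARNING_THRESHOLD = 200    # hypothetical value
CRITICAL_THRESHOLD = 1000  # hypothetical value
DEFAULT_QUEUE_DIR = "/var/spool/example-mta/deferred"  # placeholder path


def config_lines():
    """Graph description and alert thresholds, printed for `config`."""
    return [
        "graph_title Deferred mail queue",
        "graph_vlabel messages",
        "graph_category mail",
        "deferred.label deferred messages",
        f"deferred.warning {WARNING_THRESHOLD}",
        f"deferred.critical {CRITICAL_THRESHOLD}",
    ]


def count_deferred(queue_dir):
    """Count files under queue_dir, or return None if it does not exist."""
    if not os.path.isdir(queue_dir):
        return None
    return sum(len(files) for _, _, files in os.walk(queue_dir))


def value_lines(count):
    """Current reading; Munin treats 'U' as an unknown value."""
    return [f"deferred.value {'U' if count is None else count}"]


def main(argv):
    if len(argv) > 1 and argv[1] == "config":
        lines = config_lines()
    else:
        queue_dir = os.environ.get("QUEUE_DIR", DEFAULT_QUEUE_DIR)
        lines = value_lines(count_deferred(queue_dir))
    print("\n".join(lines))


if __name__ == "__main__":
    main(sys.argv)
```

Reporting `U` when the directory is missing avoids showing a reassuring zero for a queue that is not being measured. Notifications additionally require a contact to be configured in munin.conf.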