Sysadmin Documentation


This describes the basic software setup for all servers on the UGCS5 system. All servers will have this same basic configuration unless otherwise noted.

Current core servers:

zinc (fileserver)

iron (ldap server)

cadmium (email server)

gold (web server)

Non-core UGCS servers:

tin (monitoring server)

aluminum (app server)

chromium (sites server)

Additional servers: haru and mako (GPU servers)


burning fires:

  • properly reject overquota mail
  • dkim sign messages and sign up for feedback loop
  • clean up postfix config
  • ldaps cert
  • fix the letsencrypt crons
  • backups!

short term:

  • monitoring alerts!!!!
  • ELK for logs
  • Owncloud (on web server)
  • inconsistent disk io across raid on gold
  • clean rebuild tin because it's shat up
  • figure out what's wrong with aufs on the new kernel ???
  • make separate PHP-FPM pools for the vhosts
  • munin mdadm
  • munin mysql
  • radius server (port to iron)
  • mathematica + matlab
  • node on the shells
  • find victims
  • maybe move dovecot indexes to another filesystem (low pri)
  • gogs emails??? (fork and pull gogs here)
  • Outgoing email encryption

long term:

  • Virtual private cloud
  • expand the fileserver????
  • replace shitass poweredges
  • $$$$
  • find victims
  • messageboard???
  • space
  • move sites to separate box

completed tasks:

  • cadmium mailqueue
  • update MW on main site
  • update packages on all the servers (10/16)
  • self service password
  • mailman stuff, finish postfix
  • integrate cuda machines
  • letsencrypt certificate updating
  • phabricator/jira/whatever
  • shell02
  • sql signup page
  • set up LE again

Won't do

  • squirrelmail or roundcube on webserver
  • local mail reading on shellservers (?)




Where to put shit


  • no external package sources!!
  • users can't run stuff on core servers
  • keep the ldap simple
  • fix all errors even if shit still works
  • write them docs
  • be able to do everything manually
  • no chef/puppet/whatever bullshit
  • no encryption except data in transit to the outside world
  • no hsts ever
  • apps run by apache don't come from packages (we've been burned by owncloud and mediawiki)
  • everything else comes from a package, especially if it has a daemon
  • fix the shell netboot

Incident Postmortems

Winter 2017 Mail Incident

Partition Setup

There are two main hard drive schemes, depending on the number of drives. In both cases, the system is designed to survive the loss of at least one hard drive, and can boot off of any of the disks.

Two Hard Drives

Both drives are partitioned identically: a bootable 2GB partition used for RAID, a swap partition sized to match total RAM, and a partition covering the rest of the disk, also used for RAID. It is important that both boot partitions are marked bootable.

We create two mdadm devices, and order is important. Assuming the two disks are the only SATA devices installed, the first (md0) is made from /dev/sda1 and /dev/sdb1 (the two boot partitions). The second (md1) is made from /dev/sda3 and /dev/sdb3 (the two rest-of-drive partitions).

Since the boot information needs to be available before LVM starts, we format md0 as ext4 and mount it on /boot. We then format md1 as a physical device for LVM.
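
Assuming the disks enumerate as /dev/sda and /dev/sdb as described above, the array creation and formatting looks roughly like this (a sketch, to be run as root during install):

```shell
# Boot mirror (RAID 1) from the two 2GB boot partitions
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
# Main mirror from the two rest-of-disk partitions
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda3 /dev/sdb3

# /boot must be readable before LVM starts, so md0 gets plain ext4
mkfs.ext4 /dev/md0
mount /dev/md0 /boot

# md1 becomes the LVM physical volume
pvcreate /dev/md1
```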

The LVM settings vary somewhat by server role, but all follow a standard. First, create a volume group named after the server (gold, zinc, etc.), using /dev/md1 as the physical volume. Then create a 10GB logical volume named "root", a second 10GB volume named "tmp", and finally one named "var". The var logical volume varies in size; give it 10GB or more depending on the amount of data the server will hold.

Finally, the root, tmp, and var logical volumes should be formatted as ext4 and mounted in the appropriate places.
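
For a server named gold, for example, the layout above comes out to roughly the following (the 20G var size is just an illustration; size it per role):

```shell
# Volume group named after the server, backed by the RAID device
vgcreate gold /dev/md1

# Standard logical volumes
lvcreate -L 10G -n root gold
lvcreate -L 10G -n tmp  gold
lvcreate -L 20G -n var  gold   # 10GB or more, depending on data

# Format and mount in the appropriate places
mkfs.ext4 /dev/gold/root
mkfs.ext4 /dev/gold/tmp
mkfs.ext4 /dev/gold/var
```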

More Hard Drives

For the file and mail servers, the procedure is the same, with the exception that md1 is a RAID 5 or 6 device constructed out of the third partition of every drive. Furthermore, we create a fourth logical volume to hold data or mail after the operating system is installed.

Operating System

All systems run the latest version of Debian 8 "Jessie" x64. A single standard local user is created. The install tasks "standard system utilities" and "ssh server" are selected. During the installation, GRUB should be installed to the boot record of every drive, so the system can boot from any of them.

The current preferred package mirror is used, with the contrib and non-free sections enabled. Don't forget to run apt-get update after changing the sources.
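
A jessie sources.list with contrib and non-free enabled looks like the following (the mirror hostname here is a stand-in for whatever the current preferred mirror is):

```
deb http://deb.debian.org/debian jessie main contrib non-free
deb-src http://deb.debian.org/debian jessie main contrib non-free
deb http://security.debian.org/ jessie/updates main contrib non-free
```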


The following packages are required on every server:

  • sssd-ldap
  • ldap-utils
  • sudo
  • denyhosts (or equivalent, not implemented)

Additionally, the following tools should be installed for easier administration:

  • vim
  • bash-completion
  • ipmitool
  • htop
  • python-ldap

These tools are not strictly necessary for the operation of the UGCS core servers, but they are very useful. Install them whenever a new server is set up, for consistency; if an existing server is missing one of them, install it.
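
Everything above can be pulled in at once (run as root):

```shell
apt-get install sssd-ldap ldap-utils sudo \
    vim bash-completion ipmitool htop python-ldap
```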

Network Setup

Required configuration files:

  • /etc/network/interfaces
  • /etc/resolv.conf

UGCS servers are currently set up to claim static IPs out of the DHCP pool (don't tell IMSS). Make sure to change "allow-hotplug" to "auto" in /etc/network/interfaces.
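
A minimal /etc/network/interfaces for this setup; the interface name and all addresses below are placeholders, not real UGCS values:

```
auto eth0
iface eth0 inet static
    address 192.0.2.10      # the address claimed from the DHCP pool (placeholder)
    netmask 255.255.255.0
    gateway 192.0.2.1       # placeholder
```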

DNS resolution is configured in the resolv.conf file, not in /etc/network/interfaces. Servers should be configured to use the local UGCS nameserver on the fileserver, rather than the caltech nameserver.
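
The corresponding resolv.conf, pointed at the UGCS nameserver on the fileserver (the address is a placeholder for zinc's actual IP):

```
# /etc/resolv.conf -- use the local UGCS nameserver on zinc, not Caltech's
nameserver 192.0.2.2
```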

Authentication and Authorization

Required configuration files:

  • /etc/sssd/sssd.conf (pull from /ugcs)
  • /etc/ssh/sshd_config
  • /etc/ldap/ldap.conf (pull from /ugcs)
  • /etc/sudoers
  • /etc/nsswitch.conf

All servers on the UGCS cluster have a local administrator account, "bofh", that can be used to log in in the case of an LDAP failure. This account is granted sudo capability in the sudoers file.

The sysadmins group must also be added to the sudoers file.
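
The relevant sudoers entries look roughly like this (a sketch; edit with visudo):

```
# Local emergency admin account
bofh        ALL=(ALL:ALL) ALL
# Everyone in the sysadmins LDAP group
%sysadmins  ALL=(ALL:ALL) ALL
```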

Authentication and authorization are provided by the LDAP server via the sssd-ldap package. sssd runs as a daemon configured through /etc/sssd/sssd.conf, and /etc/nsswitch.conf is changed to get name service information from sssd.

On UGCS core servers, login is restricted to sysadmins. This is done using the AllowGroups directive in sshd_config, which only allows members of the sysadmins and "bofh" groups to log in.
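
The corresponding sshd_config directive:

```
# Only sysadmins (and the local emergency account) may log in on core servers
AllowGroups sysadmins bofh
```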

SSSD automatically adds its PAM module configuration in the correct places. After changing any PAM configs, run pam-auth-update to regenerate the actual PAM files.

NFS and other cluster stuff

Required configuration files:

  • /etc/fstab

All servers mount the two NFS shares from zinc (home and ugcs) on /mnt/home and /mnt/ugcs, from which /home and /ugcs are symlinked.
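
The corresponding fstab entries look roughly like this (the export paths on zinc are assumptions, not confirmed values):

```
# NFS shares from the fileserver
zinc:/home  /mnt/home  nfs  defaults  0  0
zinc:/ugcs  /mnt/ugcs  nfs  defaults  0  0
```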

Other machines may need local data of their own. Local volumes should be mounted at /mnt/data/xxx, symlinked to /srv/xxx if needed, and other machines should mount that over NFS as /mnt/xxx. This convention is kind of silly, but that's what it is.


Each server is monitored by Munin for graphing and stuff. We install munin-node and allow tin to connect by editing munin-node.conf.
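
In /etc/munin/munin-node.conf, that means adding an allow line for tin (the address below is a placeholder for tin's actual IP):

```
# Allow the monitoring server (tin) to poll this node
allow ^192\.0\.2\.5$
```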


Each server can send outgoing mail for various purposes. All servers are configured to send mail using the mail server as a relay.

We use the preinstalled exim4 out of laziness, despite the fact that it's a shitty MTA. We select "mail sent by smarthost; no local mail", set no "other destinations", and set the mail server as our outgoing smarthost by its role name (not cadmium, because it's convention to refer to the machines by role). Other than that, leave everything blank.

We then edit /etc/aliases to forward all machine mail to the sysadmins root list, so that anything that happens gets sent to the admins.
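
The /etc/aliases change amounts to one line; the list address below is a placeholder for the actual sysadmins root list:

```
# Send all local machine mail to the admins
root: sysadmins@example.org
```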

We also need to edit the config on the mail server so it will relay for the new server's IP.
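
Since the mail server runs postfix, this likely means adding the machine's address to mynetworks in /etc/postfix/main.cf (addresses below are placeholders, and the exact parameter layout is an assumption):

```
# /etc/postfix/main.cf on the mail server
mynetworks = 127.0.0.0/8 192.0.2.0/24
```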



User Creation