Post-Mortem: Partial Outage on 07/19/16
Yesterday, on Monday 07/18/16, we upgraded our infrastructure by replacing the existing load balancer for our API with a new one running on more powerful hardware. The transition itself went smoothly, and traffic was shifted from the old load balancer to the new one within a few minutes. From a high-level architecture view, the load balancer is the first machine that receives requests from all our clients. After some pre-processing, it distributes the requests to a cluster of application servers, which do the actual work.
Due to a misconfiguration of the logging subsystem on the new load balancer, logs created during normal operation were not automatically cleaned up (or rotated) as required. This caused the disk to fill up steadily. Eventually the disk was completely full, and the load balancer could no longer serve all requests. Our external monitoring reported the first partial outage at 11:11pm, and the outage continued through the night until 06:52am. During that time, our service was partially down and severely impacted. We provide an external status page for our service here.
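To illustrate the failure mode described above, here is a minimal sketch of a disk-usage check that would flag a steadily filling disk before it is completely full. It is not our actual monitoring setup: the function names and the 80% and 95% thresholds are assumptions chosen for the example.

```python
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Return the percentage of used space on the filesystem containing `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return 100.0 * usage.used / usage.total


def classify_disk_usage(percent_used: float,
                        warn_at: float = 80.0,
                        critical_at: float = 95.0) -> str:
    """Map a usage percentage to a status; the thresholds are illustrative assumptions."""
    if percent_used >= critical_at:
        return "critical"
    if percent_used >= warn_at:
        return "warning"
    return "ok"


if __name__ == "__main__":
    percent = disk_usage_percent("/")
    print(f"{percent:.1f}% used: {classify_disk_usage(percent)}")
```

A check like this only helps if a "warning" or "critical" result actually reaches the engineer on duty, so it complements working log rotation and alerting rather than replacing them.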
We always have one engineer on pager duty so that we can respond to service interruptions as quickly as possible. For a reason we have not yet identified, the monitoring registered the partial outage but did not trigger the pager to alert the responsible engineer. When the interruptions started at 11:11pm local time, the responsible engineer was already asleep and was not woken by the pager. In the morning, the problem was immediately identified and fixed.
We have designed Boxcryptor so that the clients do not require a connection to our servers most of the time. For regular users, a connection is a hard requirement only for the following actions: creating a new account, logging in to an existing account, modifying permissions, and managing groups. Users who are already logged in are not affected by the availability of our service, so the vast majority of our users may not even have noticed the interruption.
Nevertheless, I sincerely apologize to all users who were affected and, for example, could not sign in to their accounts. We will investigate why the pager alerting did not work as required, implement the changes necessary to fix it, and re-evaluate the logging configuration on all of our servers.
Co-Founder & CTO
PS: Every user has the option to export the keys stored on our servers to create a local backup file. Should our service be interrupted for a longer period or even be shut down completely, this key file can be used to continue using Boxcryptor and to access all your encrypted files, independently of our existence.