Information regarding yesterday's network outage
As many of you will know, we experienced a major outage yesterday across all our servers. This was due to a border firewall failure at our network provider. They had experts working on the issue, but a sequence of unexpected events further complicated it. These are detailed below. As with the majority of hosting providers, we are dependent on our upstream network providers, and we can assure you we will be working with them to review this incident.
We apologise for the inconvenience caused to you. If you have any comments or feedback, please let us know and we will take them all on board. We are confident that the cause is being addressed with improvements to the network, but we will still be reviewing this incident.
For those who want more details, this is a summary of the information from our providers explaining the sequence of events:
"During the early afternoon of 2nd November one of our border firewalls failed at our Redditch site. We have had a few smaller firewall issues recently and as a result have been rebuilding a replacement fail over pair to take over from the current setup. We have spare equipment and so were able to attend site very quickly. We quickly restored a backup to a replacement unit. This went well and the new firewall came online, but was not very stable as it looked like it was taking a hit from an attack.
At this point we decided to bring forward the implementation of the new set of firewalls. We were hoping that we could export the settings to the new equipment and be back online. The import went well and the firewall rebooted quite happily and came online.
It was at this point we realised that only 300 rules had been imported and that the unit was now suffering from a hard limit of 300 rules, even though it can support 42,000. After a further hour we were able to get more of the ruleset in place, and by around 10pm 80% or more of customers had full service.
Without the ability to import more than 300 rules from the backup we faced a large problem. At 23:30 we started rebuilding the config from copies we held for the old firewalls. We had already added several thousand rules to the new setup, but we still needed a further 24 man-hours to complete it. Two of us worked in shifts through the night to the point we are at now, 7am.
We have the current firewall stable, and its configuration seems to be closer to 95% correct (if you spot an issue, please do let us know). We are about 80% of the way through the new configuration and hope to be able to put this in place later today or this evening. We feel at this time it is better to work with the unit we have running than to make live changes during working hours.
I know many people are going to be asking why or how this could happen. We have known for a few months that the system we had was starting to become too small for our needs, and we had started making plans for an upgrade. These were scheduled for next week, as shown by the number of hours we have already put into the rebuild.
We held a backup unit in case of failure, but this was not able to cope with the "attack" traffic we were seeing, which was quite possibly the original cause of the firewall failure. The new unit completely failing when only a few months old is very odd and something we had not prepared well enough for.
We believe that by tomorrow morning we will have the new units fully in place and running.
We will be reviewing our processes to see how we have handled this situation. We understand that some of you will feel we have let you down. We are committed to creating and maintaining a stable network. We believe we have taken the right steps to address the problem that we faced, but in doing so we caused inconvenience, and for that I apologise."