Gamma Horizon issues – what we saw, what we did and the impact on customers

From James Rochester, Business Development Executive, Gamma Horizon

Once again Gamma would like to apologise for the service issues we saw yesterday. What follows is a brief statement on what we saw, what we did, the impact on customers, and what we have done since and are doing.

We have previously seen good stability across this platform. The only other outage we have had on Horizon this year was on 21st March and impacted the supplementary services (Client, Integrator etc.); prior to that, the last outage was in November 2017 (a BT issue impacting 831 users), and before that, March 2016.  The platform has been running at 99.99% availability since 2014, and at 100% availability in 2018 prior to yesterday.

All our efforts are focused on bringing our service back to, and beyond, the levels you expect over the long term, and on helping our channel partners wherever we can to restore customer faith.   A recording of the conference call open to all customers, held by CEO Andrew Taylor and COO Andy Morris this morning, is available here:

https://globalmeet.webcasts.com/viewer/event.jsp?ei=1221646&tp_key=528a14fa94

Password: 141118

What Happened

  • At approximately 09:30 yesterday morning our monitoring showed a large proportion of Horizon devices failing to register to the platform via one of our eight application clusters, which are hosted across four main data centres.
  • The root cause was a bug on the platform, discovered during the early hours of the morning.  Its significance, in terms of impact to services, was not understood until peak hours.
  • A patch to resolve the bug was tested and deployed into the live environment (in line with the recommendation from our software vendor, Broadsoft), effective around 10:00.
  • The patch successfully resolved the bug on the application server; however, the knock-on impact was that our SBCs became overloaded by registration retries from the devices trying to connect.  This is because Horizon handsets have a resilient configuration which tries a second and third SBC if the first does not respond. This further increases the number of retries across the SBCs and creates a bow wave of activity (a simple illustration of this effect follows this list).
  • As the SBCs become overloaded they automatically enter a protective state, which limits the load they accept so that they can continue to operate rather than being overwhelmed.
  • The resulting technical impact was slow responses to signalling messages and timeouts, which triggered further reattempts by Horizon devices.
  • We also had reports of the Horizon web portal, used to deploy diverts, being slow to respond or unreachable, and of diverts failing to become active in a timely manner.  We saw about three times the number of concurrent users/active sessions on the Horizon portal that we would expect on a busy weekday. As a result, the Horizon portal failed to deal with all requests in a timely manner, meaning users saw page timeouts and were presented with generic server error messages (“error 500”).
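
For illustration, the short Python sketch below shows why the handset failover behaviour multiplies registration load. The device count and the number of SBCs tried per cycle are made-up figures; the real fan-out also depends on retry timers and on the SBCs' protective throttling described above.

    # Minimal sketch (illustrative figures only) of how handset failover
    # multiplies registration load: a device that fails to register on its
    # first SBC retries on a second and third, so each failed cycle generates
    # several REGISTER attempts instead of one.

    DEVICES = 100_000          # hypothetical number of affected handsets
    SBCS_TRIED_PER_CYCLE = 3   # first SBC, then second and third on failure

    def register_attempts(devices: int, registrations_failing: bool) -> int:
        """Total REGISTER messages hitting the SBC estate in one retry cycle."""
        if not registrations_failing:
            return devices                      # steady state: one attempt per device
        return devices * SBCS_TRIED_PER_CYCLE   # failover fans each attempt out

    print("normal cycle: ", register_attempts(DEVICES, False))  # 100000
    print("failure cycle:", register_attempts(DEVICES, True))   # 300000 - the bow wave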

 

What We Did

  • Much of the day was spent understanding the options and taking a series of steps to reduce the overall load across the SBCs and allow recovery. This involved working with our vendors Oracle and Siphon on how we might limit call traffic and device registrations to provide that headroom.
  • We implemented a number of measures, e.g. failing over our SBCs, restarting them in sequence, and changing the configuration on the Horizon handsets themselves so that they register with a single SBC rather than round-robin across all four.
  • At 16:50 we started applying the configuration to our SBCs that would resolve the incident fully.  We tackled this sequentially, one SBC at a time: applying the configuration change, carrying out a restart to invoke it, and then monitoring recovery for around 15-20 minutes (a sketch of this rolling process follows this list).
  • Our service management notifications were issued approximately every 30 minutes. They were as open and honest as we could make them, being careful to base them on what was known rather than what could be.   We recognise this caused some frustration and have resolved to provide more detailed information in the future, perhaps including more on suspected or likely causes.
  • In an attempt to remedy the Horizon portal performance issues, the Horizon development team applied a configuration change to increase the processing capability available to each of our Horizon web server applications, effectively granting them access to additional server resources to process web page requests (an illustrative example of this kind of change follows this list).
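
For those interested in the mechanics of the 16:50 change, the sketch below illustrates the rolling pattern described above: one SBC at a time, apply the configuration, restart to invoke it, then monitor for a soak period before moving on. The function names and SBC names are hypothetical placeholders, not our actual tooling.

    import time

    # Hypothetical placeholders standing in for real orchestration and monitoring hooks.
    def apply_config(sbc: str) -> None:
        print(f"applying corrective configuration to {sbc}")

    def restart(sbc: str) -> None:
        print(f"restarting {sbc} to invoke the change")

    def registrations_stable(sbc: str) -> bool:
        print(f"checking registration and traffic metrics on {sbc}")
        return True

    SOAK_SECONDS = 20 * 60                        # 15-20 minute monitoring window per SBC
    SBCS = ["sbc-1", "sbc-2", "sbc-3", "sbc-4"]   # illustrative names only

    def rolling_change(sbcs: list[str]) -> None:
        """Apply the change to one SBC at a time so only a fraction of
        capacity is out of service at any moment."""
        for sbc in sbcs:
            apply_config(sbc)
            restart(sbc)
            time.sleep(SOAK_SECONDS)              # soak: watch recovery before moving on
            if not registrations_stable(sbc):
                raise RuntimeError(f"{sbc} did not stabilise; halting the rollout")

    if __name__ == "__main__":
        rolling_change(SBCS)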
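
The portal-side change can be pictured along the lines of the configuration sketch below: a web application server is given more worker processes and threads, and a longer timeout, so it can absorb roughly three times the usual number of concurrent sessions. The stack shown (gunicorn, a Python WSGI server) and the numbers are assumptions for illustration only; this statement does not describe the Horizon portal's actual software stack.

    # gunicorn.conf.py - illustrative only; the Horizon portal's real stack and
    # settings are not described here. Raising workers/threads is one way of
    # "granting access to additional server resources", provided the host has
    # the CPU and memory to back the extra processes.
    import multiprocessing

    workers = multiprocessing.cpu_count() * 2 + 1   # more worker processes per web server
    threads = 4                                     # concurrent requests handled per worker
    timeout = 60                                    # allow slower pages to complete rather than error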

The Impact

  • From around 09:30 until 17:48 yesterday, customers will have had a mixed experience, including:
    • being able to make and receive calls in between the device re-registering,
    • not being able to make or receive any calls,
    • poor call quality when able to make calls, then experiencing issues with the media.
  • We saw around 65% of normal traffic levels across the Horizon platform fairly consistently throughout the day.  Some of that traffic was successful calls, and some was down to the media problems themselves; we will continue to investigate.
  • From 16:50 to 17:48 we saw evidence across our device management and traffic instrumentation indicating a fully restored and stable service; this recovery would have occurred in “chunks”, in line with each SBC restart.

Recovery Actions Completed and Ongoing

  • We have normalised all changes made during the incident to aid recovery across both handsets and SBCs.
  • We have applied the patch across all Application Clusters and Network Servers to mitigate a repeat of the issue.
  • Platform and SBC logs are being analysed by our vendors to ensure the issue is fully understood.
  • We have seen stability throughout Thursday 15th November.
  • We are investigating the portal issues and the capacity needed to deal with situations such as yesterday's.