Cable / Telecom News

Rogers outage traced to software hiccup


TORONTO – Bob Berner had a long, sleepless night Wednesday. As previously reported, many Rogers Wireless customers were left late that evening with phones that could get online but could not make calls or send texts.

A software malfunction at one of Rogers’ three mobile switching centres, or MSCs (the machines that handle and route all of its wireless customers’ calls), led to the hours-long outage. “All sorts of things happened in about the space of a minute and so the trick in determining the root cause was figuring out what happened sequentially and what happened in parallel,” Berner, Rogers’ chief technical officer, told Cartt.ca in an interview.

Like most carriers, Rogers runs side-by-side signalling and traffic-carrying networks and it has three robust MSCs (one each in Vancouver, Toronto and Montreal) to handle all of its voice traffic. A software glitch in one of them (Berner did not say which one) caused it to cycle through a restart. That shouldn’t be an issue, he explained, because two MSCs should be able to handle the customer load.

“In the process of reloading, it went to assign all the traffic it normally carries to another MSC, which is what it is supposed to do,” Berner said of the malfunction. The result was an unprecedented signalling surge sent to the other two MSCs. “It overflowed the capabilities of the other MSCs as it was trying to re-parent all the customers to… the software in the two MSCs which (traffic) was reparenting to, was not reacting as well as it should have to this sudden spike in loads, but we had never seen this before.”

Rogers quickly turned to its network partner, Ericsson, which came up with a software fix. “The vendor produced a software correction which had the switches, under a peak or spike signalling load condition, not go into cyclical restarts,” Berner explained.

The fix worked and customers were immediately back to talking and texting. Rogers customers will receive a day’s rebate for the inconvenience. Berner believes the problem is fixed and won’t happen again. “The reason it won’t is that the software correction that was put in essentially creates an environment where the overload is absorbed and doesn’t cause the machine to give up and restart.”
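
For readers who want to see the failure pattern Berner describes, here is a rough sketch in Python. It is not Ericsson's correction or Rogers' configuration: the function names, subscriber counts and per-tick capacity are all made up for illustration. It only contrasts a switch that restarts when a re-parenting surge exceeds what it can handle at once with one that queues the excess and works through it.

```python
def reparent_surge(subscribers, survivors):
    """Split a restarting switch's subscribers evenly across the surviving switches.

    Hypothetical helper: real re-parenting logic is not described in the article.
    """
    base, extra = divmod(subscribers, survivors)
    return [base + (1 if i < extra else 0) for i in range(survivors)]


def handle_surge(requests, capacity_per_tick, absorb):
    """Return (restarted, ticks_needed) for one switch facing a signalling surge.

    absorb=False models the pre-fix behaviour: a surge above capacity makes
    the switch give up and restart.
    absorb=True models the post-fix behaviour: excess requests wait in a queue
    and are processed at capacity_per_tick until the backlog is cleared.
    All numbers are illustrative assumptions.
    """
    if not absorb:
        if requests > capacity_per_tick:
            return True, 0   # overloaded: the switch restarts, nothing is served
        return False, 1      # surge fits within a single tick
    ticks = -(-requests // capacity_per_tick)  # ceiling division
    return False, ticks


# One switch carrying 900 (made-up) subscribers restarts; two switches remain.
surge = reparent_surge(900, 2)
before_fix = [handle_surge(load, 200, absorb=False) for load in surge]
after_fix = [handle_surge(load, 200, absorb=True) for load in surge]
```

In this toy model both surviving switches restart before the fix, which mirrors the cyclical restarts Berner mentions, while after the fix each one takes a few ticks to clear the backlog and stays up.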

– Greg O’Brien