Channel: iTWire - Entertainment

Census 2016: no provision for 'graceful failover' of site


The head of performance at a prominent Australian software testing company has expressed surprise that a more graceful failover of the census site was not planned, given the size of the task at hand.

Joel Deutscher of Planit, a 19-year-old Sydney-based company, told iTWire that he could not overstate the importance of operational testing when launching a site such as the census website.

"Performance testing aims to simulate the load that will be placed on a system by real users, though performance testing is only a small part of the operational readiness testing that should have occurred as part of the delivery," he said.

"This includes graceful failover, something that commercial sites have been using for a long time. If you have ever tried to purchase tickets to a concert, you will notice you are often placed in a virtual queue on the website. It's unclear why something like this was not implemented given the risks involved."

The census website failed miserably on Tuesday night and the fallout is continuing. iTWire has plenty of coverage here.

Deutscher was willing to answer questions at length and an edited version of the Q and A with him is below.

iTWire: While it would not have been possible to estimate the _actual_ load, any software tester should have had some idea about the approximate load. Revolution IT was paid to do testing. What did they do wrong?

Joel Deutscher: I don't think we know enough at this stage to really pinpoint what went wrong, though we can likely narrow it down to a few possible options. The first is that the requirements for building and testing the eCensus website were too low. It's been pointed out multiple times that these levels do not seem high enough to support the expected number of submissions, though we all hope that if anyone has a good understanding of statistics, it's the ABS. It is of course possible that this was pointed out by Revolution IT during their engagement, though as a consultancy, they can only advise their clients.

Another alternative is that the testing was insufficient due to a misunderstanding of the requirements. Some news articles have quoted one million users on the site at the same time, while others are quoting one million form submissions per hour. While it may seem like a minor difference in wording, it could make a massive difference to the effectiveness of the test. It is much easier for a system to handle people quickly logging in, entering their details and logging off, and this is an approach often taken in performance testing to reduce the tool licenses required. In a system such as the ABS's eCensus, the website is required to keep the state of each active user.
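To illustrate why the two wordings differ, here is a minimal sketch using Little's Law (average concurrent sessions = arrival rate x average time on the site). The one million submissions per hour figure is the one quoted in the reports; the session lengths are purely illustrative assumptions, not ABS data, and the function name is hypothetical.

```python
def concurrent_sessions(submissions_per_hour: float, avg_session_minutes: float) -> float:
    """Estimate average concurrent sessions with Little's Law: L = rate * time."""
    return submissions_per_hour * avg_session_minutes / 60.0


if __name__ == "__main__":
    rate = 1_000_000  # submissions per hour, as quoted in news reports
    # Session lengths below are assumptions chosen only for illustration.
    for minutes in (2, 15, 30):
        sessions = concurrent_sessions(rate, minutes)
        print(f"{minutes:>2} min per session -> about {sessions:,.0f} concurrent sessions")
```

Under these assumptions, a test that scripts very short sessions holds far fewer sessions open at once than one that models users taking their time over the form, even though both deliver the same number of submissions per hour.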

So the real question is, was this raised as a risk and was it ignored or was it a poorly planned or executed test? It's up to the ABS to provide full disclosure here on exactly what happened, the numbers tested and the volume during the outage. It would be even better if the ABS released the performance test plan and results.

Do you buy into the claim of DDoS attacks when there is no proof anywhere that one took place?

The motivation for a DDoS attack is certainly there; in terms of visibility, it's a high-profile target so it can't be ruled out, although there doesn't seem to be any clear evidence or any group claiming credit for doing so. Assuming that a DDoS attack did happen, this should definitely have been considered, and there are methods to reduce the impact of such an attack which were either not in place or not effective.

A more logical explanation seems to be that more people than expected logged on at a single instance and that blocked the website from any further access. How does that explanation stack up?

The most logical explanation is that the website was simply overloaded.

How would you have gone about testing the set-up at the ABS to ensure that things were foolproof to the extent that one can make them foolproof?

Performance testing aims to simulate the load that will be placed on a system by real users, though performance testing is only a small part of the operational readiness testing that should have occurred as part of the delivery. This includes graceful failover; this is something that commercial sites have been using for a long time. If you have ever tried to purchase tickets to a concert, you will notice you are often placed in a virtual queue on the website. It's unclear why something like this was not implemented given the risks involved.
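As a rough sketch of the virtual queue idea described above, the class below admits users up to a fixed capacity and places everyone else in a first-come, first-served queue. The class name, method names and capacity value are hypothetical; this is an in-memory illustration of the technique, not how any ticketing site or the eCensus was built.

```python
from collections import deque


class WaitingRoom:
    """Admit at most `capacity` users at once; queue the rest in arrival order."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.active = set()   # users currently allowed on the site
        self.queue = deque()  # users waiting, oldest first

    def arrive(self, user_id):
        """Return 0 if the user is admitted, else their 1-based queue position."""
        if user_id in self.active:
            return 0
        if user_id in self.queue:
            return self.queue.index(user_id) + 1
        # Only admit directly when nobody is already waiting ahead.
        if len(self.active) < self.capacity and not self.queue:
            self.active.add(user_id)
            return 0
        self.queue.append(user_id)
        return len(self.queue)

    def leave(self, user_id):
        """Remove a user (finished or gave up) and admit waiting users in order."""
        self.active.discard(user_id)
        if user_id in self.queue:
            self.queue.remove(user_id)
        while self.queue and len(self.active) < self.capacity:
            self.active.add(self.queue.popleft())


if __name__ == "__main__":
    room = WaitingRoom(capacity=2)
    for user in ("a", "b", "c", "d"):
        print(user, room.arrive(user))
    room.leave("a")
    print("c", room.arrive("c"), "d", room.arrive("d"))
```

The point of the design is that excess visitors see a queue position instead of an error page, so the site stays within the load it was tested for.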

What kind of software would you have used?

The testing tool in this case is less important than the planning. While commercial tools provide an excellent platform for performance testing, their cost can become prohibitive as licensing is based on the number of concurrent users. This can result in less accurate performance tests, for example when the number of concurrent sessions on the site is reduced to save money.

What kind of redundancy would you have looked for?

Given the high-profile security risks here, I can see why the ABS avoided a number of commercial services that provide redundancy. What I would have expected to see, though, is a more graceful failover of the site. I can't overstate the importance of operational testing when launching a site such as this.

Where does the blame lie: IBM? ABS? Revolution IT?

Unfortunately, there is no simple answer to this, and I don't think we know enough yet to point the finger at a particular organisation. 

How would you go about convincing the public about the technological integrity of a system such as this, in the face of what has happened? How would you have done it before the census was held?

Designing a system to handle this type of traffic is difficult, and there are lots of things that can go wrong. I don't think that this has helped the confidence level of the public. I certainly would have highlighted that the census could be completed any time up until mid-September, and played down the reports of fines for not completing it. The fear associated with not completing the census on "census night" certainly contributed to the issue.

Any other comments?

The best thing the ABS can do now is publish their performance test plan and results to help convince the public they did everything possible to ensure success. This will be the only way we will really know what happened.

