Software Engineering Case Studies

The Maroochy Water Breach, Australia (2000)
A cybersecurity incident that led to the spillage of sewage in a region of Australia. Study the sequence of events leading to the failure of a sewage system and the subsequent discovery that the faults were caused by malicious attacks. The case study discusses an incident in which a malicious insider reprogrammed a sewage system's controllers to discharge raw sewage. His motivation was revenge: he had left the company that developed the system after disagreements and was then refused a job with the local Council.
YouTube Video – Maroochy Water Breach – Ian Sommerville
(https://www.youtube.com/watch?v=C_PRhTXp6VQ)

Published on 26 Jun 2013
A video that describes a cyber attack on critical infrastructure (a sewage system) in Australia. This was an insider attack that focused on changing the configuration of the controlling SCADA system.
Presentation Slides – Maroochy Water Breach – Ian Sommerville
(http://www.slideshare.net/sommerville-videos/maroochy-water-breach)

Transcript

  1. Cybersecurity Case Study: Maroochy water breach (Maroochy SCADA attack, 2013) http://www.slideshare.net/sommervi/cs5032-case-study-maroochy-water-breach
  2. Maroochy Shire. Image credit: http://www.hinterlandtourism.com.au/attractions/the-maroochy-river/
  3. Maroochy shire sewage system • SCADA-controlled system with 142 pumping stations over 1157 sq km, installed in 1999 • In 2000, the area sewage system had 47 unexpected faults causing extensive sewage spillage
  4. SCADA setup: a typical SCADA-controlled sewage system. This is not the system that was attacked
  5. SCADA sewage control • Special-purpose control computer at each station to control valves and alarms • Each system communicates with and is controlled by a central control centre • Communications between pumping stations and the control centre are by radio, rather than a wired network
  6. What happened: more than 1 million litres of untreated sewage released into waterways and local parks
  7. Technical problems • Sewage pumps not operating when they should have been • Alarms failed to report problems to the control centre • Communication difficulties between the control centre and pumping stations
  8. Insider attack • Vitek Boden worked for Hunter Watertech (the system suppliers) with responsibility for the Maroochy system installation • He left in 1999 after disagreements with the company • He tried to get a job with the local Council but was refused
  9. Revenge! • Boden was angry and decided to take revenge on both his previous employer and the Council by launching attacks on the SCADA control systems – He hoped that Hunter Watertech would be blamed for the failures • Insiders don’t have to work inside an organisation!
  10. What happened? Image credit: http://www.pimaweb.org/conference/april2003/pdfs/MythsAndFactsBehindCyberSecurity.pdf
  11. How it happened • Boden stole a SCADA configuration program from his employers when he left and installed it on his own laptop • He also stole radio equipment and a control computer that could be used to impersonate a genuine machine at a pumping station • Insecure radio links were used to communicate with pumping stations and change their configurations
  12. Incident timeline • Initially, the incidents were thought to have been caused by bugs in a newly installed system • However, analysis of communications suggested that the problems were being caused by deliberate interventions • Problems were always caused by a specific station id
  13. Actions taken • The system was configured so that that id was not used, so messages from it had to be malicious • Boden, as a disgruntled insider, fell under suspicion and was put under surveillance • Boden’s car was stopped after an incident, and stolen hardware and a radio system were discovered
  14. Causes of the problems • The installed SCADA system was completely insecure – No security requirements in the contract with the customer • Procedures at Hunter Watertech were inadequate to stop Boden stealing hardware and software • Insecure radio links were used for communications
  15. Causes of the problems • Lack of monitoring and logging made detection more difficult • No staff training to recognise cyber attacks • No incident response plan in place at Maroochy Council
  16. Aftermath • On October 31, 2001, Vitek Boden was convicted of: – 26 counts of willfully using a computer to cause damage – 1 count of causing serious environmental harm • Jailed for 2 years
  17. (Closing slide)
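The detection step described on slides 12–13 — noticing that the faults always came from one station id — amounts to a simple piece of log analysis. The following is a minimal Python sketch of that idea, not the actual Maroochy tooling; the station ids and event names are invented for the example.

```python
from collections import Counter

# Hypothetical fault log: each entry records which station id a fault
# event appeared to come from. All ids and event names are invented.
events = [
    {"station_id": 4,  "event": "pump_off"},
    {"station_id": 14, "event": "alarm_suppressed"},
    {"station_id": 14, "event": "pump_off"},
    {"station_id": 14, "event": "config_change"},
    {"station_id": 7,  "event": "pump_off"},
]

# Count fault events per reporting station id.
by_station = Counter(e["station_id"] for e in events)
suspect_id, count = by_station.most_common(1)[0]
print(f"Most faults originate from station id {suspect_id} ({count} events)")
# -> Most faults originate from station id 14 (3 events)
```

Once a single id dominates the fault log, the countermeasure on slide 13 follows naturally: decommission that id, so that any further traffic claiming to come from it is known to be malicious.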

More Resources:

Lessons Learned from the Maroochy Water Breach
(http://www.ifip.org/wcc2008/site/IFIPSampleChapter.pdf)
Malicious Control System Cyber Security Attack Case Study – Maroochy Water Services, Australia
(http://csrc.nist.gov/groups/SMA/fisma/ics/documents/Maroochy-Water-Services-Case-Study_report.pdf)

Hacker jailed for revenge sewage attacks – Job rejection caused a bit of a stink
(http://www.theregister.co.uk/2001/10/31/hacker_jailed_for_revenge_sewage/)

Classic Hacker Case: Maroochy
(http://www.isssource.com/classic-hacker-case-maroochy-shire/)


The Stuxnet Worm
A case study of cyberwarfare in which a computer worm was used to attack the SCADA control systems of a uranium enrichment plant in Iran. The Stuxnet worm is an example of a cyberattack where malware was introduced into the control systems of a nuclear facility with the aim of destroying equipment and slowing down Iran’s nuclear development programme.
YouTube Video – Stuxnet Worm Case Study – Ian Sommerville

Published on 18 Oct 2013
Discusses a cyberwarfare case study – the Stuxnet worm, which was used to attack Iran’s uranium processing facilities.

Presentation Slides – Stuxnet Worm – Ian Sommerville
(http://www.slideshare.net/sommerville-videos/stuxnet-worm)

 Download Ian Sommerville’s Slides (PPT 2,072KB)

Transcript

  1. Cybersecurity Case Study: STUXNET worm (Stuxnet SCADA attack, 2013)
  2. (Image slide)
  3. Cyber-warfare • The STUXNET worm is computer malware that is specifically designed to target industrial control systems for equipment made by Siemens • These systems are used in Iran for uranium enrichment – enriched uranium is required to make a nuclear bomb • The aim of the worm was to damage or destroy the controlled equipment
  4. What is a worm? • Malware that can infect a computer-based system and autonomously spread to other systems without user intervention • Unlike a virus, there is no need for a carrier or any explicit user actions to spread the worm
  5. The target of the worm (image slide)
  6. The STUXNET worm • Worm designed to affect SCADA systems and PLC controllers for uranium enrichment centrifuges • Very specific targeting – only aimed at Siemens controllers for this type of equipment • It can spread to, but does not damage, other control systems
  7. (Image slide)
  8. Worm actions • Takes over operation of the centrifuge from the SCADA controller • Sends control signals to the PLCs managing the equipment • Causes the spin speed of the centrifuges to vary wildly, very quickly, causing extreme vibrations and consequent damage • Blocks signals and alarms from local PLCs to the control centre
  9. Stuxnet penetration • Initially targets Windows systems used to configure the SCADA system • Uses four different vulnerabilities to affect systems – Three of these were previously unknown – So if it encounters systems where some vulnerabilities have been fixed, it still has the potential to infect them – Spread can’t be stopped by fixing a single vulnerability
  10. Stuxnet technology • Spreads to Siemens’ WinCC/PCS 7 SCADA control software and takes over configuration of the system • Uses a vulnerability in the print system to spread from one machine to another • Uses peer-to-peer transfer – there is no need for systems to be connected to the Internet
  11. The myth of the air gap • Centrifuge control systems were not connected to the internet • Initial infection thought to be through infected USB drives taken into the plant by unwitting system operators – Beware of freebies!
  12. Damage caused • It is thought that between 900 and 1000 centrifuges were destroyed by the actions of Stuxnet • This is about 10% of the total, so if the intention was to destroy all centrifuges then it was not successful • Significant slowdown in the nuclear enrichment programme because of (a) damage and (b) enrichment shutdown while the worms were cleared from equipment
  13. Unproven speculations • Because of the complexity of the worm, the number of vulnerabilities that are exploited, the access to expensive centrifuges and the very specific targeting, it has been suggested that this is an instance of cyberwar by nation states against Iran
  14. (Image slide)
  15. Unproven speculations • Because Stuxnet did not only affect computers in nuclear facilities but spread beyond them by transfers of infected PCs, a mistake may have been made in its development • There was no intention for the worm to spread beyond Iran • Other countries with serious infections include India, Indonesia and Azerbaijan
  16. Unproven speculations • The Stuxnet worm is a multipurpose worm and there is a range of versions with different functionality in the wild • These use the same vulnerabilities to infect systems but they behave in different ways
  17. One, called Duqu, has significantly affected computers, especially in Iran. This does not damage equipment but logs keystrokes and sends confidential information to outside servers.
  18. Summary
  • The Stuxnet worm is an early instance of cyberwarfare where SCADA controllers were targeted
  • Intended to disrupt Iran’s uranium enrichment capability by varying rotation speeds to damage centrifuges
  • Used a range of vulnerabilities to infect systems

Presentation Slides – Case study Stuxnet worm – Ian Sommerville
(http://www.slideshare.net/sommervi/cs-5032-2013-case-study-stuxnet-worm)

Transcript

  1. Cybersecurity Case Study: STUXNET worm
  2. Cyber-warfare • The STUXNET worm is computer malware that is specifically designed to target industrial controllers made by Siemens • These controllers are used in Iran in uranium enrichment equipment • Thought to be an instance of cyber-warfare
  3. The STUXNET worm • Worm designed to affect SCADA systems and PLC controllers • Identified in 2010 • Very specific targeting – Siemens controllers controlling specific processes and equipment • Spreads to, but does not damage, other systems
  4. Worm actions • Takes over operation of the centrifuge from the controller • Blocks signals and alarms to the control centre • Causes the spin speed of the centrifuges to vary wildly, causing them to damage themselves
  5. Stuxnet technology • Uses a number of different vulnerabilities to affect systems • Initially targets Windows systems used to configure the SCADA system • Initial infection thought to be through infected USB drives taken into the plant by unwitting operators • Spreads by peer-to-peer transfer – no need for an Internet connection • Spreads to Siemens WinCC/PCS 7 SCADA control software and takes over configuration of the system
  6. Damage caused • It is thought that between 900 and 1000 centrifuges were destroyed by the actions of Stuxnet • This is about 10% of the total, so if the intention was to destroy all centrifuges then it was not successful • Significant slowdown in the nuclear enrichment programme because of (a) damage and (b) more significantly, enrichment shutdown while the worms were cleared from equipment
  7. Unproven speculations • Because of the complexity of the worm, the number of vulnerabilities exploited and the very specific targeting, it has been suggested that this is an instance of cyberwar against Iran • It has been suggested that the developers of the worm were the secret services of the USA and Israel
  8. Unproven speculations • Because Stuxnet did not only affect computers in nuclear facilities but spread beyond them by transfers of infected PCs, a mistake may have been made in its development • There was no intention for the worm to spread beyond Iran • Other countries with serious infections include India, Indonesia and Azerbaijan
  9. Unproven speculations • The Stuxnet worm is a multipurpose worm and there is a range of versions with different functionality in the wild • One, called Duqu, has significantly affected computers, especially in Iran. This does not damage equipment but logs keystrokes and sends confidential information to outside servers.
  10. Aftermath • We don’t know what will happen next • Possible further cyber attacks on Iran’s nuclear infrastructure • Possible retaliatory cyber-actions from Iran against the US and Israel • Escalation of cyber-warfare

More Resources:

Kushner, D. (2013) The Real Story of Stuxnet – How Kaspersky Lab tracked down the malware that stymied Iran’s nuclear-fuel enrichment program
(http://spectrum.ieee.org/telecom/security/the-real-story-of-stuxnet)
This article is an accessible description of the Stuxnet worm that attacked nuclear processing facilities in Iran.

Zetter, K. (2014) An Unprecedented Look at Stuxnet, the World’s First Digital Weapon
(https://www.wired.com/2014/11/countdown-to-zero-day-stuxnet/)

Airbus 330/340 flight control system – software and hardware redundancy
Explains how redundancy and diversity are used in the flight control system of Airbus aircraft to ensure reliability and availability.

YouTube – Airbus FCS – software and hardware redundancy – Ian Sommerville
(https://www.youtube.com/watch?v=EOexjozpBdI)

Published on 15 Jan 2014
Explains the organisation of the safety-critical flight control system on the Airbus 330/340 and how redundancy and diversity are used in that system.
Presentation Slides – Airbus Flight Control System – Ian Sommerville
(http://www.slideshare.net/sommerville-videos/airbus-fcs)

Transcript

  1. The Airbus flight control system – Ian Sommerville (Airbus Flight Control System, 2013)
  2. The organisation of the Airbus A330/340 flight control system to provide software and hardware reliability
  3. (Image slide)
  4. Airbus flight control systems • Airbus were the first commercial aircraft manufacturer to use ‘fly by wire’ flight control systems
  5. “Fly by wire” control • Older aircraft control systems rely on mechanical and hydraulic links between the aircraft’s controls and the flight surfaces on the wings and tail • The cockpit controls and flight surfaces are directly connected – pilot actions are transmitted directly to the control system
  6. (Image slide)
  7. In fly-by-wire systems, the cockpit controls generate electronic signals that are interpreted by a computer system and then converted into outputs that drive the hydraulic system connected to the flight surfaces.
  8. Advantages of ‘fly-by-wire’ • Weight reduction – By reducing the mechanical linkages and hydraulic fluids, a significant amount of weight (and hence fuel) is saved
  9. Pilot workload reduction – The fly-by-wire system provides a more usable interface and takes over some computations that previously would have had to be carried out by the pilots
  10. Airframe safety – By mediating the control commands, the system can ensure that the pilot cannot put the aircraft into a state that stresses the airframe or stalls the aircraft
  11. Fault tolerance • Fly-by-wire systems must be fault tolerant as there is no ‘fail-safe’ state when the aircraft is in operation • In the Airbus, this is achieved by replicating sensors, computers and actuators and providing ‘graceful degradation’ in the event of a system failure • In a degraded state, essential facilities remain available, allowing the pilot to fly and land the plane
  12. The Airbus FCS has quintuple redundancy, i.e. it has 5 flight control computers but only 1 computer is needed to fly the plane • Therefore, the system can lose 4 computers and the plane will still be flyable
  13. Airbus FCS organisation • Three primary flight control computers • These are the main flight control systems and are responsible for calculations concerned with aircraft control and with sending signals to the flight surfaces and aircraft engines
  14. Two secondary flight control computers, which are backup systems for the primary flight control computers • Control switches automatically to these systems if the primary computers are unavailable
  15. Hardware diversity • The primary and secondary flight control computers use different processors • The primary and secondary flight control computers are designed and supplied by different companies • The processor chips for the different computers are supplied by different manufacturers • All of this reduces the probability of common errors in the hardware causing system failure
  16. Software diversity • The primary and secondary computers run different software • The secondary computer software is a simpler version of the primary control software • The software for each system has been developed by different teams
  17. Self-monitoring architecture (diagram)
  18. Self-monitoring architectures • Multi-channel architectures where the system monitors its own operations and takes action if inconsistencies are detected • The same computation is carried out on each channel and the results are compared. If the results are identical and are produced at the same time, then it is assumed that the system is operating correctly • If the results are different, then a failure is assumed and a failure exception is raised
  19. Self-monitoring systems • Hardware in each channel has to be diverse so that common-mode hardware failure will not lead to each channel producing the same results • Software in each channel must also be diverse, otherwise the same software error would affect each channel • If high availability is required, you may use several self-checking systems in parallel – This is the approach used in the Airbus family of aircraft
  20. (Image slide)
  21. Channel diversity • The software for the different channels in each computer has been developed by different teams using different programming languages
  22. Primary/secondary diversity • The software for the primary and secondary flight control computers has been developed by different teams • For the secondary computers, the channels are programmed by different teams using different languages • Therefore, 4 different versions of the flight control software have been developed
  23. Dynamic reconfiguration • The FCS reconfigures itself dynamically to cope with a loss of system resources if 2 FCS computers fail • Dynamic reconfiguration involves switching to a simpler mode of software control, with less to go wrong • Three operational modes are supported – Normal – control plus reduction of workload – Alternate – minimal computer-mediated control – Direct – no computer mediation of pilot commands
  24. Control diversity • The linkages between the flight control computers and the flight surfaces are arranged so that each surface is controlled by multiple independent actuators • Each actuator is controlled by a different computer, so loss of a single actuator or computer will not mean loss of control of that surface • The hydraulic system is 3-way replicated and the replicated systems take different routes through the plane
  25. Summary • Airbus FCS designed around hardware and software redundancy and diversity • Quintuple redundancy with 5 control computers – only one required to fly the aircraft • Each computer based on a self-monitoring architecture with 2 channels • Multiple different versions of the software developed and executed simultaneously
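The two-channel self-monitoring scheme described on slides 18–19 can be sketched in a few lines: run the same computation on two diverse channels, compare the results, and raise a failure exception on any disagreement. This is a minimal Python sketch under the simplified assumptions of the slides, not Airbus’s actual implementation; the control-law functions are toy examples.

```python
class ChannelDisagreement(Exception):
    """Raised when the two channels of a self-monitoring computer disagree."""

def self_monitoring_compute(channel_a, channel_b, inputs):
    """Run the same computation on both (diverse) channels and compare."""
    result_a = channel_a(inputs)
    result_b = channel_b(inputs)
    if result_a != result_b:
        # A disagreement signals a fault: this computer raises a failure
        # exception and control can pass to another flight control computer.
        raise ChannelDisagreement(f"channel results differ: {result_a!r} != {result_b!r}")
    return result_a

# Two diversely implemented versions of the same (toy) control law:
def elevator_cmd_a(pitch_error):
    return round(0.5 * pitch_error, 3)

def elevator_cmd_b(pitch_error):
    return round(pitch_error / 2.0, 3)

print(self_monitoring_compute(elevator_cmd_a, elevator_cmd_b, 4.0))  # -> 2.0
```

Note that the comparison only detects a fault; it cannot tell which channel is wrong. That is why the architecture takes the whole computer offline rather than trying to keep the "good" channel running.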
The Ariane 5 Launcher Explosion
Explains how the software failure in a reused subsystem led to the failure of the Ariane launch vehicle on its first flight.
YouTube Video – Ariane Launch Failure – Explaining the causes of the Ariane 5 launch explosion – Ian Sommerville
(https://www.youtube.com/watch?v=W3YJeoYgozw)
Published on 15 Jan 2014
Explains why a software failure on the first launch of the Ariane 5 rocket was responsible for the failure and complete destruction of the rocket and its payload.

YouTube Video – Ariane 5 Rocket First Launch Failure – A longer video of the take-off and the explosion after 37 seconds
(https://www.youtube.com/watch?v=gp_D8r-2hwk)

Uploaded on 21 Sep 2010
People have uploaded shorter copies, but here’s a longer copy of the Ariane 5 rocket’s ill-fated first launch, which ended in an explosion back in 1996. Although Ariane 5 is now a quite reliable rocket, this failure was caused by a software bug. As it was an unmanned flight there were no victims, but it was an expensive blunder, destroying the scientific “Cluster” spacecraft, which luckily received the funding to be re-built for a later mission. In this video, you see the engine starting, the countdown announcer, most of the launch, the rocket’s failure, and people watching debris falling from the sky.
Presentation Slides – Ariane 5 launcher failure – Ian Sommerville
Transcript
  1. The Ariane 5 Launcher Failure – Ian Sommerville (Ariane launcher failure, Case study, 2013)
  2. June 4th 1996: total failure of the Ariane 5 launcher on its maiden flight
  3. Ariane 5 • A European rocket designed to launch commercial payloads (e.g. communications satellites) into Earth orbit • Successor to the successful Ariane 4 launchers
  4. (Image slide)
  5. Ariane 5 can carry a heavier payload than Ariane 4 • Now the standard launch vehicle for the European Space Agency
  6. (Image slide)
  7. Launcher failure • First test launch of Ariane 5 in June 1996 • Approximately 37 seconds after a successful lift-off, the Ariane 5 launcher lost control
  8. (Image slide)
  9. Incorrect control signals were sent to the engines and these swivelled so that unsustainable stresses were imposed on the rocket • The vehicle started to break up because of the stresses imposed and self-destructed
  10. The problem • The attitude and trajectory of the rocket are measured by a computer-based inertial reference system (IRS) • The IRS transmits commands to the engines to maintain attitude (the angle to the vertical) and direction
  11. The system failure was a direct result of a software failure • However, it was symptomatic of a more general systems validation failure
  12. (Diagram: sensors feed IRS 1 and IRS 2, which send instructions to the engine control system)
  13. The IRS had both a primary and a backup computer • The backup computer was included to cope with hardware failure, but both the primary and the backup system ran the same software
  14. The IRS software in both the primary and the backup computer shut itself down 37 seconds after take-off • Diagnostic data about the shutdown was sent to the engine control system • This system did not expect such data and interpreted it as real data • The values were such that the system swivelled the rocket engines to an extreme position
  15. Software failure • The software failure occurred when an attempt to convert a 64-bit floating point number representing the horizontal velocity to a signed 16-bit integer caused the number to overflow (become too big)
  16. (Diagram: a signed 16-bit integer – 1 sign bit and 15 value bits, maximum value 32767)
  17. There was no exception handler associated with the conversion, so the system exception management facilities were invoked. These shut down the software controlling the IRS • Redundant but not diverse software • The backup software was a copy and behaved in exactly the same way, i.e. the number overflowed and the system was shut down
  18. Avoidable failure? • The software that failed was reused from the Ariane 4 launch vehicle. The computation that resulted in the overflow was not used by Ariane 5 • The calculations had been transferred to a ground-based system in Ariane 5
  19. Implementation decisions • Decisions were made – Not to remove the facility, as this could introduce new faults – Not to test for overflow exceptions, because the processor was heavily loaded – For dependability reasons, it was thought desirable to have some spare processor capacity
  20. Why not Ariane 4? • The physical characteristics of Ariane 4 (a smaller vehicle) are such that it has a lower initial acceleration and build-up of horizontal velocity than Ariane 5 • The value of the variable on Ariane 4 could never reach a level that caused overflow during the launch period • This calculation had been carried out during the development of Ariane 4 and it was therefore decided that no overflow check was required
  21. Validation failure • As the facility that failed was not required for Ariane 5, there was no requirement associated with it • As there was no associated requirement, there were no tests of that part of the software and hence no possibility of discovering the problem
  22. Simulator-based testing • During system testing, simulators of the inertial reference system computers were used • These did not generate the error as there was no requirement for the unused code to be included in the simulator
  23. Review failures? • The inertial reference system software was not reviewed because it had been used in a previous version • The review failed to expose the problem or to recognise that the test coverage would not reveal the problem • The review failed to appreciate the consequences of system shutdown during a launch
  24. (Image slide)
  25. Lessons learned • Don’t run software in critical systems unless it is actually needed • As well as testing for what the system should do, you may also have to test for what the system should not do • Do not have a default exception-handling response of system shutdown in systems that have no fail-safe state
  26. Lessons learned • In critical computations, always return best-effort values even if the absolutely correct values cannot be computed • Wherever possible, use real equipment and not simulations • Improve the review process to include external participants and review all assumptions made in the code
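The core failure mode — an unguarded conversion of a 64-bit float to a signed 16-bit integer — and the "best effort" alternative suggested in the lessons learned can both be illustrated in a short sketch. The flight software was written in Ada; this Python sketch only illustrates the arithmetic, and the function names are invented for the example.

```python
INT16_MIN, INT16_MAX = -32768, 32767

def to_int16_unchecked(value):
    """Mimic a conversion with no exception handler: out-of-range values
    raise, which in the IRS led to the whole system shutting itself down."""
    i = int(value)
    if not INT16_MIN <= i <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a signed 16-bit integer")
    return i

def to_int16_saturating(value):
    """A 'best effort' alternative (cf. the lessons learned): clamp the
    value to the representable range instead of shutting down."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))

# A horizontal velocity within Ariane 4's envelope converts cleanly...
print(to_int16_unchecked(20000.0))   # -> 20000
# ...but an Ariane 5 value beyond 32767 overflows:
print(to_int16_saturating(40000.0))  # -> 32767 (degraded, but usable)
try:
    to_int16_unchecked(40000.0)
except OverflowError as err:
    print("IRS shutdown path:", err)
```

The sketch shows why "redundant but not diverse" did not help: since the backup ran identical conversion code on the same input, both computers took the exception path at the same moment.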
More Resources:

The Ariane 5 Accident: A Programming Problem?
(http://www.rvs.uni-bielefeld.de/publications/Reports/ariane.html)
ARIANE 5 – Flight 501 Failure – Report by the Inquiry Board (1996)
(http://sunnyday.mit.edu/accidents/Ariane5accidentreport.html)
Nuseibeh, B. (1997) Ariane 5: Who Dunnit?, IEEE Software, 14 (3).
A short article that explains the complex causes of the failure of the Ariane 5 software in the inertial navigation system.
Download Ariane 5: Who Dunnit? (PDF 439KB)


The 1993 Warsaw Airbus accident

Explains how software in the braking control system on the Airbus operated as specified but not in a safe way. It shows that reliable systems can be unsafe.
YouTube Video – Warsaw airbus accident 1993 – Ian Sommerville
(https://www.youtube.com/watch?v=wzoxek74RTs)
https://www.youtube.com/watch?v=wzoxek74RTs
Published on 15 Jan 2014
Explains how the software controlling the braking system on an Airbus was a causal factor in an accident in which the plane ran off the end of the runway and into an embankment.

Presentation Slides – Warsaw Airbus accident – Ian Sommerville
(http://www.slideshare.net/sommerville-videos/warsaw-airbus-accident)

Transcript

  1. Warsaw airbus accident 1993 Ian Sommerville Warsaw aircraft accident, 1993 Slide 1
  2. 2. What happened • A Lufthansa Airbus on a flight from Frankfurt landed at Warsaw Airport in bad weather (rain and strong winds) • On landing, the aircraft’s software controlled braking system did not deploy when activated by the flight crew and it was about 9 seconds before the braking system activated Warsaw aircraft accident, 1993 Slide 2
  3. 3. • There was insufficient runway remaining to stop the plane and the aircraft ran into a glass embankment • Two people were killed and 54 injured Warsaw aircraft accident, 1993 Slide 3
  4. 4. Causes of the accident • As with most accidents, there were multiple factors that contributed to this accident. The three main contributory causes were: – The aircraft pilots were given outdated information on the wind speed and direction by the landing controllers – The aircrew failed to notice that the on-board information about the wind direction was inconsistent with that provided by the controllers Warsaw aircraft accident, 1993 and that their approach speed was higher than Slide 4
  5. 5. • The aircraft braking control software specification had failed to take into account the landing conditions encountered Warsaw aircraft accident, 1993 Slide 5
  6. Focus on software • The braking control system on the Airbus behaved exactly as specified • There were no bugs or errors in the software • This is an example of a situation where a reliable software system was unsafe Warsaw aircraft accident, 1993 Slide 6
  7. Aircraft braking • Aircraft braking depends on the deployment of spoilers, which are flaps on the wings that are deployed to slow down the plane • It also makes use of ‘reverse thrust’, which means that the engines are run ‘backwards’ so that their effect is to slow the aircraft down Warsaw aircraft accident, 1993 Slide 7
  8. It is critical to the safety of the flight that neither the spoilers nor the reverse thrust is deployed while the plane is in the air • Therefore, the braking system software includes checks to ensure that the plane has landed before the braking system is deployed Warsaw aircraft accident, 1993 Slide 8
  9. Weight on wheels • The landing gear includes sensors that can detect if the wheel struts are compressed i.e. that there is weight on the wheels. • The software specification was that landing could be recognised if there was weight on both wheels Warsaw aircraft accident, 1993 Slide 9
  10. Wheel rotation • Each wheel included sensors that checked whether the wheel was rotating or not. • The software specification was that the aircraft had landed if the speed of wheel rotation was greater than 72 knots Warsaw aircraft accident, 1993 Slide 10
  11. The braking system could be deployed if either of these conditions were true • This was checked by the braking system control software Warsaw aircraft accident, 1993 Slide 11
  12. The software specification did not anticipate a situation where neither of these conditions would hold during landing IF weight-on-both-wheels OR (left-wheel-turning OR right-wheel-turning) THEN braking-system-deployment := permitted Warsaw aircraft accident, 1993 Slide 12
  13. In this case, because of the weather conditions, the plane landed at an angle so that one wheel touched the runway first • The runway was wet and that wheel ‘aquaplaned’ and skidded along the runway without turning Warsaw aircraft accident, 1993 Slide 13
  14. What went wrong? • The pilots were told that there was a crosswind across the runway • Standard procedure for a crosswind landing is to bank the aircraft so that initial touchdown is on one wheel and the crosswind then acts on the wing to push the other wheel onto the runway Warsaw aircraft accident, 1993 Slide 14
  15. However, in this case, the wind had changed direction so that it was a tailwind rather than a crosswind • This meant that the landing speed was higher than normal and there was no need for a single wheel touchdown • This was not noticed by the pilots and the higher speed was a contributory factor to the accident Warsaw aircraft accident, 1993 Slide 15
  16. Warsaw aircraft accident, 1993 Slide 16
  17. Warsaw aircraft accident, 1993 Slide 17
  18. The Warsaw Airbus landed on one wheel but there was no crosswind to push down the other wheel so, for 9 seconds, the plane was landing on a single wheel • Because there was only weight on a single wheel, the on-ground condition of weight on both wheels in the braking system did not hold Warsaw aircraft accident, 1993 Slide 18
  19. Aquaplaning Warsaw aircraft accident, 1993 Slide 19
  20. The single wheel on the ground was aquaplaning rather than turning, so the condition that one or both wheels should be rotating at more than 72 knots did not hold • After about 9 seconds, the second wheel made contact with the runway and the braking system deployed • But it was too late to stop the aircraft and the accident occurred Warsaw aircraft accident, 1993 Slide 20
  21. Warsaw aircraft accident, 1993 Slide 21
  22. Conclusions • In practice, it is impossible to make any system completely safe • It is impossible for system designers to anticipate every possible condition and they have to make assumptions such as the pilots being given correct wind information • No blame in this case was associated with the software but it was modified to take this particular situation into account should it happen again Warsaw aircraft accident, 1993 Slide 22
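The on-ground checks described on slides 9 to 12 amount to a single boolean predicate. The sketch below shows why neither branch held during the Warsaw landing. It is an illustration only: the function and parameter names are invented, the structure is inferred from the slides, and this is not the actual A320 braking software.

```python
# Sketch of the landing-detection logic from the slides (illustrative only;
# names and structure are assumptions, not the real braking software).

ROTATION_THRESHOLD_KNOTS = 72  # wheel-rotation threshold stated on slide 10

def braking_permitted(weight_on_left: bool, weight_on_right: bool,
                      left_wheel_knots: float, right_wheel_knots: float) -> bool:
    """On-ground check: weight on both wheels, OR a wheel rotating fast enough."""
    weight_on_both_wheels = weight_on_left and weight_on_right
    wheel_turning = (left_wheel_knots > ROTATION_THRESHOLD_KNOTS
                     or right_wheel_knots > ROTATION_THRESHOLD_KNOTS)
    return weight_on_both_wheels or wheel_turning

# Warsaw landing: one wheel down, and that wheel aquaplaning (not turning).
# Neither condition holds, so braking is not permitted despite touchdown.
print(braking_permitted(True, False, 0.0, 0.0))   # False

# About 9 seconds later the second wheel touched down and braking deployed.
print(braking_permitted(True, True, 150.0, 0.0))  # True
```

The unanticipated case is the first call: the specification assumed that at least one of the two conditions would always hold during a landing, which was false for a one-wheel, aquaplaning touchdown.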

Main Commission Aircraft Accident Investigation Warsaw (1994) Report on the Accident to Airbus A320-211 Aircraft in Warsaw on 14 September 1993
(http://www.rvs.uni-bielefeld.de/publications/Incidents/DOCS/ComAndRep/Warsaw/warsaw-report.html)
The inquiry report published after the accident. It sets out the (complex) causes of the accident and discusses how the software behaviour contributed to it.


Kegworth Air Crash, 1989
This was an air accident that illustrates the complexity of system failure.

Kegworth air disaster
(https://en.wikipedia.org/wiki/Kegworth_air_disaster)
The Kegworth air disaster occurred when a Boeing 737-400 crashed on to the embankment of the M1 motorway near Kegworth, Leicestershire, England, while attempting to make an emergency landing at East Midlands Airport on 8 January 1989.
British Midland Flight 92 was on a scheduled flight from London Heathrow Airport to Belfast Airport, when a fan-blade broke in the left engine, disrupting the air-conditioning and filling the flight-deck with smoke. The pilots believed that this indicated a fault in the right engine, since earlier models of the 737 ventilated the flight-deck from the right, and they were unaware that the 400 used a different system. The crew mistakenly shut down the good engine, and pumped more fuel into the malfunctioning one, which burst into flames. Of the 126 people aboard, 47 died and 74 sustained serious injuries.
The inquiry attributed the blade fracture to metal fatigue, caused by heavy vibration in the newly upgraded engines, which had only been tested in the laboratory and not under representative flight conditions.

Presentation Slides – Case study Kegworth air disaster – Ian Sommerville
(http://www.slideshare.net/sommervi/cs5032-case-study-kegworth-air-disaster)

Transcript

  1. Kegworth air disaster, 2008 Slide 1
  2. The Kegworth Air Disaster • 8th January 1989 • British Midland Flight 92 – Heathrow to Belfast • Boeing 737-400 – New variant of the Boeing 737 • Crashed by the M1 near Kegworth, attempting an emergency landing at East Midlands Airport • 118 passengers, 8 crew – 47 died, and 74 seriously injured Kegworth air disaster, 2008 Slide 2
  3. The Kegworth Air Disaster • The left engine was unable to cope with the vibrations caused when operating under high power settings above 25,000 feet. • A fan blade broke off, causing an increase in vibration, a reduction in power, and a large trail of flame behind the engine. • The pilot shut down the engine on the right. • The plane flew for another 20 minutes until the left engine failed Kegworth air disaster, 2008 Slide 3
  4. Right engine shutdown • Mistake in knowledge-based performance – Smoke in the cabin indicates that the engine from which bleed air (used for heating, pressure, etc.) is taken will have smoke in it. But the pilot thought bleed air was taken from the right engine. This is true of the Boeing 737 but not the new 737-400, which drew bleed air from both. • Design issue – No visibility of the engines, so the crew relied on other information sources to explain the vibrations • Design issue – The vibration sensors were tiny, had a new digital display style and were inaccurate on the 737 (not the 737-400) • Inadequate training – A one-day course, and no simulator training Kegworth air disaster, 2008 Slide 4
  5. Failure to detect the error • Coincidence – The smoke disappeared after shutting down the right engine and the vibrations lessened. “Confirmation bias”. • Lapse in procedure – After shutting down the right engine the pilot began checking all meters and reviewing decisions, but stopped after being interrupted by a transmission from the airport asking him to descend to 12,000 ft. • Lack of communication – The cabin crew and passengers could see the left engine was on fire, but did not inform the pilot, even when the pilot announced he was shutting down the right engine. • Design issue – The vibration meters would have shown a problem with the left engine, but were too difficult to read. There was no alarm. Kegworth air disaster, 2008 Slide 5
  6. Cockpit of a Boeing 737 Kegworth air disaster, 2008 Slide 6
  7. Cockpit of a Boeing 737-400 Kegworth air disaster, 2008 Slide 7
  8. Cockpit of a Boeing 737-400 Kegworth air disaster, 2008 Slide 8
  9. Conclusion • Pilot error? • Crew training? • User interface design? • Aircraft design? • Engineering problems? • Lack of proper training? Kegworth air disaster, 2008 Slide 9
  10. Failures are rarely ever simple! The problem is complexity Kegworth air disaster, 2008 Slide 10

London Ambulance Service Computer Aided Dispatch (LASCAD) Failure

The London Ambulance Service introduced a new computer-aided despatch system in 1992, intended to automate the despatch of ambulances in response to calls from the public and the emergency services. The new system was extremely inefficient and ambulance response times increased markedly. Shortly after its introduction, it failed completely and the LAS reverted to the previous manual system. The system's failure was due not just to technical issues but also to a failure to consider human and organisational factors in the design of the system.

This case study can be used in a discussion of human factors as an illustration of how procurement, human and organisational issues can be major contributors to system failure.

Download Failure of LASCAD PowerPoint Presentation (PPT 432KB)

Download LASCAD Failure PowerPoint Presentation – Sommerville (PPT KB)
An overview of the Case Study

Download LASCAD Case Study (Word 30KB)

Download London Ambulance System Disaster Case Study (Word 86KB)

Download LASCAD Project Management Presentation (PDF KB)

London Ambulance Service
(https://en.wikipedia.org/wiki/London_Ambulance_Service)

Report of the Inquiry Into The London Ambulance Service (February 1993) – International Workshop on Software Specification and Design Case Study
(http://www0.cs.ucl.ac.uk/staff/A.Finkelstein/las/lascase0.9.pdf)
A report of the official inquiry into the system failure

Finkelstein, A. and Dowell, J. () A Comedy of Errors: the London Ambulance Service case study
(http://www0.cs.ucl.ac.uk/staff/a.finkelstein/papers/lascase.pdf)
Download “A Comedy of Errors: the London Ambulance Service case study” (PDF 23KB)
Overview of the LASCAD report

Beynon-Davies, P. (1995) Information systems ‘failure’: the case of the London Ambulance Service’s Computer Aided Despatch project, European Journal of Information Systems, Vol. 4 pp.171-184

Download Beynon-Davies, P. () Human error and information systems failure: the case of the London ambulance service computer-aided despatch system project (PDF 231KB)

Download Beynon-Davies, P. () Human error and risk assessment: the case of the London ambulance service computer-aided despatch system (PDF 2,161KB)

Download Adamu, M., Alkazmi, A., Alsufyani, A., Al Shaigy, B., Chapman, D. and Chappell, J. () London Ambulance Service Software Failure (PDF 498KB)

London Ambulance Service Computer Aided Dispatch (LASCAD) Failure – An analysis of the failure of the London Ambulance Service Computer Aided Dispatch system
(http://www.savive.com/casestudy/londonambulance.html)

CAD Failure LAS. 1992
(http://www.lond.ambulance.freeuk.com/cad.html)

Oct. 26, 1992: Software Glitch Cripples Ambulance Service
(https://www.wired.com/2009/10/1026london-ambulance-computer-meltdown/)

The 1992 London Ambulance Service Computer Aided Dispatch System Failure
(http://erichmusick.com/writings/technology/1992-london-ambulance-cad-failure.html)

London Ambulance Service computer fails again
(http://catless.ncl.ac.uk/Risks/14.02.html#subj9.1)

Ambulance Dispatch System
(http://catless.ncl.ac.uk/Risks/17.39.html#subj1)

London Ambulance Service Inquiry Report (long)
(http://catless.ncl.ac.uk/Risks/14.48.html#subj3)

London Ambulance Service
(http://catless.ncl.ac.uk/Risks/13.88.html#subj1)

Failure of London Ambulance despatch system
(http://catless.ncl.ac.uk/Risks/13.89.html#subj7)