!!! !!! !!! !!! !!! !!! !!! !!! !!! !!! !!! !!! !!! !!! !!! !!! !!! !!! !!! !!!
-------------------------------------------------------------------------------
B O S T O N   U N I V E R S I T Y
Computer Science Department
C O L L O Q U I U M

Architecting for earthquakes: fault tolerant storage

Elizabeth Borowski
Storage Systems Program
Hewlett Packard Labs

Wednesday, March 10th
11:00am (Coffee served at 10:45am)
Seminar Room / MCS 135
-------------------------------------------------------------------------------

In today's information-centric global marketplace, business-critical data must
be available all the time. Rain or shine, earthquake in California or tornado
in Kansas, the data center needs to be on-line meeting quality of service
guarantees twenty-four hours a day, seven days a week, with no exceptions.
This dire need for continual high performance and high availability is best
met by an active system that can monitor, diagnose, and repair itself on the
fly. Key design features are on-line load balancing, fluid scalability, and
automatic recovery from failure.

In the Palladio project at HP Labs we are designing a fault-tolerant
distributed storage system to meet these goals. The advent of high speed
back-end storage networks enables distributed storage to be accessed as
quickly as a local disk. These SANs (storage area networks) facilitate access
to large pools of heterogeneous storage by multiple hosts.

In order to achieve automatic load balancing and fault tolerance, our approach
is to hide the details of the underlying data placement from the hosts. The
upper layers of the system see only a virtual store abstraction of the data.
Maintaining this abstraction, while still providing quality of service
guarantees and data coherency, is not trivial. This is especially true in the
face of network partitions, device crashes, and failures in the storage
management system.

In this talk I will discuss the architectural choices we've made in designing
the storage management system, and present in detail our solution to disaster
recovery. I will prove the liveness and safety properties of the recovery
protocol. Namely, I will show the system eventually recovers from the failure
(if at all possible) and that data coherency is guaranteed throughout.

Host: Azer Bestavros (best@cs.bu.edu)
-------------------------------------------------------------------------------
For colloquium info, including directions, see http://cs-www.bu.edu/colloquium
-------------------------------------------------------------------------------