Fault-Tolerance Techniques for High-Performance Computing
This timely text/reference presents a comprehensive overview of fault-tolerance techniques for high-performance computing (HPC). The text opens with a detailed introduction to the concepts of checkpoint protocols and scheduling algorithms, prediction, repl
Thomas Herault, Yves Robert (Editors)
Fault-Tolerance Techniques for High-Performance Computing
Computer Communications and Networks
Series editor: A.J. Sammes, Centre for Forensic Computing, Cranfield University, Shrivenham Campus, Swindon, UK
The Computer Communications and Networks series is a range of textbooks, monographs and handbooks. It sets out to provide students, researchers, and nonspecialists alike with a sure grounding in current knowledge, together with comprehensible access to the latest developments in computer communications and networking. Emphasis is placed on clear and explanatory styles that support a tutorial approach, so that even the most complex of topics is presented in a lucid and intelligible manner.
More information about this series at http://www.springer.com/series/4198
Editors
Thomas Herault, University of Tennessee, Knoxville, TN, USA
Yves Robert, Ecole Normale Supérieure de Lyon, Lyon, France, and University of Tennessee, Knoxville, TN, USA
ISSN 1617-7975
ISSN 2197-8433 (electronic)
Computer Communications and Networks
ISBN 978-3-319-20942-5
ISBN 978-3-319-20943-2 (eBook)
DOI 10.1007/978-3-319-20943-2
Library of Congress Control Number: 2015942754
Springer Cham Heidelberg New York Dordrecht London
© Springer International Publishing Switzerland 2015 for Chapters 1, 3, 4 and 5
© Springer International Publishing Switzerland (outside the USA) 2015 for Chapter 2
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.
Printed on acid-free paper
Springer International Publishing AG Switzerland is part of Springer Science+Business Media (www.springer.com)
Preface
Objective
The main objective of this monograph is to provide an overview of Fault-Tolerance Techniques for High-Performance Computing (HPC). Resilience has already become a prominent issue on current large-scale platforms. The advent of exascale computers with millions of cores and billion-parallelism