Replication in Distributed Systems: Models, Methods, and Protocols



A. R. Nasibullin^a,* and B. A. Novikov^b,**

^a St. Petersburg State University, Universitetskaya nab. 7/9, St. Petersburg, 199034 Russia
^b National Research University Higher School of Economics, ul. Soyuza Pechatnikov 16, St. Petersburg, 190121 Russia
*e-mail: [email protected]
**e-mail: [email protected]

Received March 10, 2019; revised September 20, 2019; accepted October 25, 2019

Abstract—Data replication is used to enhance the reliability, availability, and throughput of database systems at the price of increased complexity and cost of data updates. In many cases, data storage systems that exploit replication use relaxed consistency criteria. This survey describes different replication schemes and discusses several consistency models, protocols, and techniques designed to support consistency in replicated systems.

DOI: 10.1134/S0361768820050060

1. INTRODUCTION

In this paper, we discuss various consistency models and replication protocols. Data replication is a method of organizing data storage in which each data element is stored in several copies hosted on different servers. Each of these copies is called a replica. The goal of replication is to prevent data loss. Data replication has the following properties.

• Availability: access to data is preserved even when some replicas are unavailable.
• Reliability: data loss is prevented when servers are partially destroyed or some replicas are lost.
• Throughput: the overall data throughput of the system increases.

In this paper, we use the following quantitative characteristics of databases: throughput and scalability. System throughput is defined as the number of actions of varying complexity performed per unit of time. Scalability is defined by the following formula:

S = M / N, (1)

where M is the throughput of a database consisting of many evaluators (nodes) with a certain amount of data and a certain load, while N is the throughput of a system consisting of a single evaluator.

Let us introduce the definition of a distributed database. A distributed database is a data storage system consisting of local databases hosted on various nodes of a computer network. Every local database can have a replica.
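As an illustration of formula (1), the following minimal sketch computes scalability from hypothetical throughput figures (the numbers are invented for the example, not measurements from any real system):

```python
def scalability(multi_node_throughput: float, single_node_throughput: float) -> float:
    """Scalability S = M / N per formula (1): the ratio of the throughput M
    of a multi-evaluator system to the throughput N of a single evaluator."""
    return multi_node_throughput / single_node_throughput

# Hypothetical numbers: a 4-replica cluster serving 3600 queries/s
# versus a single node serving 1000 queries/s.
print(scalability(3600.0, 1000.0))  # -> 3.6
```

A value of S close to the number of nodes indicates near-linear scaling; in practice, consistency maintenance (discussed below) tends to push S well below that ideal.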

When replication is used, the problem of maintaining all replicas in a consistent state arises [1–5]. In other words, the result of query execution can vary depending on which replica the client uses: a query executed on some replica may process outdated data. At the same time, maintaining consistency across all replicas can degrade the performance of a distributed database, reduce the number of queries it can process, and cause scaling problems.

Traditionally, consistency is regarded in the context of a database management system (DBMS) as a means for correlating the values of different data elements and is implemented using transaction management tools. Informally, transactions are often characterized by their ACID properties, while the degree of consistency is described in terms of isolation levels. This understanding of consistency is abstracted
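The stale-read problem described above can be sketched with a toy model of asynchronous replication (this is an illustration only, not the behavior of any particular DBMS; all class and method names are invented for the example):

```python
class Primary:
    """A primary node that records every update in a replication log."""
    def __init__(self):
        self.data = {}
        self.log = []  # ordered list of (key, value) updates

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class Replica:
    """A replica that applies the primary's log asynchronously."""
    def __init__(self):
        self.data = {}
        self.applied = 0  # position reached in the primary's log

    def sync(self, primary, batch=1):
        # Apply at most `batch` pending log entries, simulating replication lag.
        pending = primary.log[self.applied:self.applied + batch]
        for key, value in pending:
            self.data[key] = value
        self.applied += len(pending)


primary = Primary()
replica = Replica()

primary.write("x", 1)
replica.sync(primary)           # replica catches up: x = 1
primary.write("x", 2)           # update not yet propagated to the replica
print(replica.data["x"])        # -> 1: a client reading the replica sees stale data
replica.sync(primary)
print(replica.data["x"])        # -> 2: consistent after synchronization
```

The consistency models surveyed below differ precisely in which of these intermediate states a client is allowed to observe, and at what cost in coordination.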