A Formal Investigation of Diff3

The diff3 algorithm is widely considered the gold standard for merging uncoordinated changes to list-structured data such as text files. Surprisingly, its fundamental properties have never been studied in depth.

  • PDF / 436,609 Bytes
  • 12 Pages / 430 x 660 pts Page_size
  • 40 Downloads / 130 Views

DOWNLOAD

REPORT


University of Pennsylvania 2 Yahoo

Abstract. The diff3 algorithm is widely considered the gold standard for merging uncoordinated changes to list-structured data such as text files. Surprisingly, its fundamental properties have never been studied in depth. We offer a simple, abstract presentation of the diff3 algorithm and investigate its behavior. Despite abundant anecdotal evidence that people find diff3’s behavior intuitive and predictable in practice, characterizing its good properties turns out to be rather delicate: a number of seemingly natural intuitions are incorrect in general. Our main result is a careful analysis of the intuition that edits to “well-separated” regions of the same document are guaranteed never to conflict.

1

Introduction

Users often want to edit a local copy of a replicated data structure, postponing the moment when their changes become visible to others until sometime later— when a set of changes has been finished and tested, when an offline laptop is reconnected to the network, etc. In general, when multiple users can edit at the same time, this reconciliation process requires a tool—a synchronizer — that can propagate non-conflicting changes between different copies of the data, while recognizing and flagging conflicts. Source code management systems, longdistance collaborative editing environments, and file synchronizers are examples. Operation-based synchronizers work by keeping track of the complete sequences of operations that have been applied to each replica and, during reconciliation, attempting to synthesize a single unified view of the data structure’s edit history. By contrast, a state-based synchronizer sees only the current versions of the replicas to be reconciled, together with an archive of the last state they had in common (perhaps saved away at the end of the last synchronization). A crucial problem faced by a state-based synchronizer is how to align the information in the current replicas and the archive, so that it can tell where changes have been made. This can be accomplished in a variety of ways, depending on the nature of the data being synchronized. Where the data is rigidly structured or where keys are available (e.g., in personal information management applications such as address books), the proper alignment is generally clear. For more flexibly structured data, such as semistructured databases, file systems, and text documents, it is less clear how to reliably choose alignments that users consider natural. The issue is particularly vexing for pure textual (or, more generally, list-structured) data, which offers no predetermined points of reference for alignment—the structures are presented to the synchronizer as flat sequences of uninterpreted atoms (characters, V. Arvind and S. Prasad (Eds.): FSTTCS 2007, LNCS 4855, pp. 485–496, 2007. c Springer-Verlag Berlin Heidelberg 2007 

486

S. Khanna, K. Kunal, and B.C. Pierce

words, or lines of text)—and for which common edits include arbitrary insertions, deletions, and rearrangements of existing material. The best known t