Efficient Online String Matching Based on Characters Distance Text Sampling

  • PDF / 1,549,492 Bytes
  • 23 Pages / 439.37 x 666.142 pts Page_size
  • 37 Downloads / 246 Views

DOWNLOAD

REPORT


Efficient Online String Matching Based on Characters Distance Text Sampling Simone Faro1   · Francesco Pio Marino1 · Arianna Pavone2 Received: 23 April 2018 / Accepted: 3 June 2020 © Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract Searching for all occurrences of a pattern in a text is a fundamental problem in computer science with applications in many other fields, like natural language processing, information retrieval and computational biology. Sampled string matching is an efficient approach recently introduced in order to overcome the prohibitive space requirements of an index construction, on the one hand, and drastically reduce searching time for the online solutions, on the other hand. In this paper we present a new algorithm for the sampled string matching problem, based on a characters distance sampling approach. The main idea is to sample the distances between consecutive occurrences of a given pivot character and then to search online the sampled data for any occurrence of the sampled pattern, before verifying the original text. From a theoretical point of view we prove that, under suitable conditions, our solution can achieve both linear worst-case time complexity and optimal averagetime complexity. From a practical point of view it turns out that our solution shows a sub-linear behaviour in practice and speeds up online searching by a factor of up to 9, using limited additional space whose amount goes from 11 to 2.8% of the text size, with a gain up to 50% if compared with previous solutions. Keywords  String matching · Text processing · Efficient searching · Text indexing

* Simone Faro [email protected] Arianna Pavone [email protected] 1

Dipartimento di Matematica e Informatica, Università di Catania, viale A.Doria n.6, 95125 Catania, Italy

2

Dipartimento di Scienze Cognitive, Università di Messina, via Concezione n.6, 98122 Messina, Italy



13

Vol.:(0123456789)

Algorithmica

1 Introduction String matching is a fundamental problem in computer science and in the wide domain of text processing. It consists in finding all the occurrences of a given pattern x, of length m, in a large text y, of length n, where both sequences are composed by characters drawn from an alphabet Σ of size 𝜎 . Although data are memorized in different ways, textual data remains the main form to store information. This is particularly evident in literature and in linguistics where data are in the form of huge corpus and dictionaries. But this applies as well to computer science where large amount of data are stored in linear files. And this is also the case, for instance, in molecular biology where biological molecules are often approximated as sequences of nucleotides or amino acids. Thus the need for more and more faster solutions to text searching problems. Applications require two kinds of solutions: online and offline string matching. Solutions based on the first approach assume that the text is not preprocessed and thus they need to scan the text online, when searching. Their worst case