Efficient Online String Matching Based on Characters Distance Text Sampling

PDF / 1,549,492 Bytes
23 Pages / 439.37 x 666.142 pts Page_size
37 Downloads / 367 Views

Efficient Online String Matching Based on Characters Distance Text Sampling Simone Faro1 · Francesco Pio Marino1 · Arianna Pavone2 Received: 23 April 2018 / Accepted: 3 June 2020 © Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract Searching for all occurrences of a pattern in a text is a fundamental problem in computer science with applications in many other fields, like natural language processing, information retrieval and computational biology. Sampled string matching is an efficient approach recently introduced in order to overcome the prohibitive space requirements of an index construction, on the one hand, and drastically reduce searching time for the online solutions, on the other hand. In this paper we present a new algorithm for the sampled string matching problem, based on a characters distance sampling approach. The main idea is to sample the distances between consecutive occurrences of a given pivot character and then to search online the sampled data for any occurrence of the sampled pattern, before verifying the original text. From a theoretical point of view we prove that, under suitable conditions, our solution can achieve both linear worst-case time complexity and optimal averagetime complexity. From a practical point of view it turns out that our solution shows a sub-linear behaviour in practice and speeds up online searching by a factor of up to 9, using limited additional space whose amount goes from 11 to 2.8% of the text size, with a gain up to 50% if compared with previous solutions. Keywords String matching · Text processing · Efficient searching · Text indexing

* Simone Faro [email protected] Arianna Pavone [email protected] 1

Dipartimento di Matematica e Informatica, Università di Catania, viale A.Doria n.6, 95125 Catania, Italy

2

Dipartimento di Scienze Cognitive, Università di Messina, via Concezione n.6, 98122 Messina, Italy

13

Vol.:(0123456789)

Algorithmica

1 Introduction String matching is a fundamental problem in computer science and in the wide domain of text processing. It consists in finding all the occurrences of a given pattern x, of length m, in a large text y, of length n, where both sequences are composed by characters drawn from an alphabet Σ of size 𝜎 . Although data are memorized in different ways, textual data remains the main form to store information. This is particularly evident in literature and in linguistics where data are in the form of huge corpus and dictionaries. But this applies as well to computer science where large amount of data are stored in linear files. And this is also the case, for instance, in molecular biology where biological molecules are often approximated as sequences of nucleotides or amino acids. Thus the need for more and more faster solutions to text searching problems. Applications require two kinds of solutions: online and offline string matching. Solutions based on the first approach assume that the text is not preprocessed and thus they need to scan the text online, when searching. Their worst case

Data Loading...

Efficient Online String Matching Based on Characters Distance Text Sampling

Recommend Documents

Multidimensional String Matching

Sequential Multiple String Matching

Indexed Approximate String Matching

Sequential Exact String Matching

Sequential Approximate String Matching

Pattern Recognition and String Matching

MM-Stega: Multi-modal Steganography Based on Text-Image Matching

The Distance Matching Problem

Digital Materiality of Chinese Characters in Text-Based Interactive Art

Distance Sampling: Methods and Applications

Role of Distance Measures in Approximate String Matching Algorithms for Face Recognition System

Efficient Large Deviation Estimation Based on Importance Sampling