System Design of Cloud Search Engine Based on Rich Text Content


Hao-peng Chan 1 & Liang Xu 1 & Hui-hui Liu 1 & Run-tian Zhang 1 & Arun Kumar Sangaiah 2,3

Accepted: 20 October 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract To improve search performance over rich text content, this paper designs a cloud search engine system for rich text. On top of a traditional search engine hardware system, several hardware devices are added, including a Solr index server, a collector, a Chinese word segmentation device, and a searcher, and the data interface is adjusted. With this hardware and database support, the system uses the open-source Apache Tika framework to extract the metadata of rich text documents, segments words according to the content and semantics of the rich text, and calculates the weight of each keyword. Given the input search keywords, the system builds a text index, uses the BM25 algorithm to calculate the similarity between the keywords and each text, and outputs the rich text search results according to the similarity scores. The experimental results show that the designed system achieves a high recall rate and high throughput, and that the index construction time for each data item in different files is short, which improves search efficiency and search accuracy.

Keywords Rich text content · Search engines · Solr index · Chinese word segmentation · Weighting factor
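The BM25 similarity step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `bm25_scores` and `rank`, the parameter values `k1=1.2` and `b=0.75` (commonly used settings), and the non-negative IDF variant are all assumptions chosen for illustration, and documents are assumed to be already segmented into lists of tokens.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25.

    `docs` is a list of token lists. k1 and b are assumed, commonly used
    values, not parameters reported in the paper.
    """
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue  # a term absent from the document contributes nothing
            # Non-negative IDF variant (assumption): log(1 + (N - n + 0.5) / (n + 0.5)).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rank(query_terms, docs, **params):
    """Return document indices ordered from most to least similar."""
    scores = bm25_scores(query_terms, docs, **params)
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

For example, `rank(["rich", "text"], docs)` returns document positions in descending order of similarity, which corresponds to outputting search results according to the similarity calculation. In a Solr deployment this scoring would normally be done by the index server itself rather than by application code.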

* Arun Kumar Sangaiah

1 School of Software Engineering, Jinling University of Science and Technology, Nanjing 211169, China
2 School of Computing Science and Engineering, Vellore Institute of Technology (VIT), Vellore 632014, India
3 Department of Industrial Engineering and Management, National Yunlin University of Science and Technology, Douliu, Taiwan

1 Introduction A search engine is a system that automatically collects information from the Internet, organizes it, and makes it available for users to query. Information retrieval is a widely used technology. Its main principle is to search and check existing information by specific means in response to the actual retrieval requirements of users, so as to find the information that meets those requirements. A search engine is an essential feature in website construction for the convenience of users, and it is also an effective tool for studying the behavior of web users. Efficient on-site search lets users find the target information quickly and accurately, which more effectively promotes the sale of products and services. Moreover, in-depth analysis of the search behavior of website visitors is of great value for developing more effective online marketing strategies. The term "information retrieval" was first proposed by Calvin N. Mooers in his master's thesis at MIT in 1948. In 1954, the U.S. Naval Weapons Center built the world's first