Validation of the Astro dataset clustering solutions with external data
Paul Donner. Received: 14 July 2020. © Akadémiai Kiadó, Budapest, Hungary 2020
Abstract We conduct an independent cluster validation study on published clustering solutions of a research testbed corpus, the Astro dataset of publication records from astronomy and astrophysics. We extend the dataset by collecting external validation data that serve as proxies for the latent structure of the corpus. Specifically, we collect (1) grant funding information related to the publications, (2) data on topical special issues, (3) specific journals’ internal topic classifications, and (4) usage data from the main online bibliographic database of the discipline. The latter three types of data are newly introduced for the purpose of clustering validation, and the rationale for using them for this task is set out. We find that one solution based on the global citation network achieves better results than the competitors across three validation data sources, but that another solution based on bibliographic coupling performs best on the special issues data.
Keywords Cluster validation · Document clustering · Structural bibliometrics
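The idea of validating a clustering solution against external groupings can be illustrated with a small sketch. It is not the study's method: this excerpt does not state which agreement measure is used, so the sketch assumes a simple concentration score (the share of each external group, such as a special issue, that falls into its most common cluster). The function names, document identifiers, and toy data are all hypothetical.

```python
from collections import Counter, defaultdict


def group_concentration(cluster_of, group_of):
    """Per external group: share of its documents in the group's most common cluster.

    cluster_of maps document id -> cluster label (one clustering solution).
    group_of maps document id -> external group (e.g. a special issue).
    Both mappings are illustrative assumptions, not the Astro dataset format.
    """
    by_group = defaultdict(list)
    for doc, group in group_of.items():
        if doc in cluster_of:  # skip documents the solution does not cover
            by_group[group].append(cluster_of[doc])
    return {
        group: Counter(clusters).most_common(1)[0][1] / len(clusters)
        for group, clusters in by_group.items()
    }


def mean_concentration(cluster_of, group_of):
    """Unweighted mean of the per-group concentration scores (0.0 if no groups)."""
    scores = group_concentration(cluster_of, group_of)
    return sum(scores.values()) / len(scores) if scores else 0.0


# Toy example: two hypothetical clustering solutions of six documents,
# validated against two hypothetical special issues.
solution_a = {"p1": 0, "p2": 0, "p3": 0, "p4": 1, "p5": 1, "p6": 1}
solution_b = {"p1": 0, "p2": 1, "p3": 0, "p4": 1, "p5": 0, "p6": 1}
special_issue = {"p1": "SI-1", "p2": "SI-1", "p3": "SI-1", "p4": "SI-2", "p5": "SI-2"}

print(mean_concentration(solution_a, special_issue))
print(mean_concentration(solution_b, special_issue))
```

A higher score means the solution keeps externally related documents together. Note that a score of this kind rewards very coarse clusterings, so a real comparison would need a measure that also accounts for cluster sizes and the number of clusters.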
Paul Donner, Deutsches Zentrum für Hochschul- und Wissenschaftsforschung, Schützenstraße 6a, 10117 Berlin, Germany
Journal: Scientometrics
Introduction Given the huge scale of published research and the accelerating growth of new publications, automatic grouping of documents, that is, clustering, has emerged as an important task in structural bibliometrics. Clustering scientific document collections has applications in information retrieval, knowledge organization and field delineation, field normalization of indicators, and the visualization or mapping of document collections. Scientific document clustering is by now a well-established subdiscipline of bibliometrics. New methods are constantly being developed and older methods improved. Until recently, there was no standard dataset on which such contributions could be benchmarked against prior state-of-the-art results. However, with the introduction of the Astro dataset, such a central testbed is now available to the community (Gläser et al. 2017). We contribute to this development by complementing the publication records dataset with external validation data. We also evaluate the publicly released clustering solutions of the Astro dataset against these validation datasets to gain insights into the strengths and weaknesses of the underlying document clustering approaches. Thus, the major contribution of this study is to show that it is possible to judge the quality of publication clustering solutions, as operationalized by the correspondence of solutions to the latent topic structure reflected in several different validation datasets. The introduction proceeds with a discussion of clustering. We outline the motivation for this study and its significance and discuss the background of the clustering comparison initiative of whic