A local cohesion-maximising algorithm for the exploration of publication networks

The dominant approach to the reconstruction of scientific topics is the application of global community detection algorithms. Some of the properties of these algorithms, however, collide with the sociological discussion on topics. We present here for consideration a new local bibliometric algorithm that is in line with sociological definitions of topics and which reconstructs dense regions in bibliometric networks locally.


Introduction
The reconstruction of scientific topics from networks of papers is a main concern of bibliometrics because of its applications in science studies and science policy (e.g. the normalisation of citation counts in evaluations). The dominant approach applies global algorithms (algorithms that partition the whole network by optimising a global quality function) to data models based on direct citation or bibliographic coupling and interprets the resulting clusters as topics. The most popular global algorithms prioritise the separation of clusters over their coherence (Held, 2022). This approach is problematic because it inadvertently decouples the bibliometric reconstruction of topics from the sociological discussion of their role in the production of scientific knowledge. Sociological definitions of topics imply an emphasis on coherence and on the perspective of those contributing to the topic, i.e. on a local perspective. Not surprisingly, established bibliometric approaches to topic reconstruction proved unsuccessful when applied to 'ground truths' that were defined by scientists (Held et al., 2021). These problems and their apparent roots in the currently popular algorithms in bibliometrics are reason enough to look for new algorithms that might be more conducive to bibliometrically operationalising a sociological concept of topics. Local algorithms (algorithms that grow clusters from seed subgraphs until a condition for their termination is met) seem a promising way to meet the requirements of using only local information and of allowing for overlapping topics. Some of these algorithms use quality functions that maximise cohesion, which corresponds to the sociological understanding of topics as shared perspectives of researchers. In this paper, we present a rationale for using local cohesion-maximising algorithms, describe such an algorithm and discuss one application.

A rationale for local density-maximising algorithms

Theory
An important methodological starting point of our search for approaches to topic reconstruction is the demand that these approaches, as procedures of empirical identification, operationalise a theoretical concept. We define topics as "a focus on theoretical, methodological or empirical knowledge that is shared by a number of researchers and thereby provides these researchers with a joint frame of reference for the formulation of problems, the selection of methods or objects, the organisation of empirical data, or the interpretation of data" (Havemann et al., 2017). The researchers who share such a frame of reference form a scientific specialty or scientific community, i.e. a collective that jointly advances the shared knowledge and has a collective identity (self-perception) of jointly advancing that knowledge (Gläser, 2019; Whitley, 2000). This joint activity is based on intense communication because community members' publications contain contributions that are offered to fellow community members for further use (Kuhn, 1962, p. 19). It is also based on, and strengthens, the thematic similarity of community members' work. Finally, it follows from the definition of a topic as a joint frame of reference that a topic is first and foremost a topic to those who work on it. The insider perspective that constitutes a topic is likely to deviate from outsider perspectives, i.e. the perspectives of colleagues from other communities. These theoretically derived properties of topics should correspond to properties of subgraphs in publication networks if these subgraphs are meant to represent topics. Dense communication translates into above-average subgraph cohesion in direct citation networks. Thematic similarity translates into above-average subgraph density in bibliographic coupling networks. The insider perspective translates into the use of local information (information about a subgraph and its environment) for its delineation (Held, 2022).
Taken together, these translations suggest experimenting with local density-maximising algorithms, which can be applied to traditional bibliometric data models like direct citation and bibliographic coupling.
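To make the notion of "above-average subgraph density in bibliographic coupling networks" concrete, the following minimal Python sketch derives a bibliographic coupling network from reference lists and compares the density of one subgraph with that of the whole network. The toy data, the function names and the unweighted rule "two papers are coupled if they share at least one reference" are assumptions made for illustration; they are not taken from the algorithm presented in this paper.

```python
from itertools import combinations


def coupling_edges(references):
    """Return undirected edges between papers that share at least one reference."""
    edges = set()
    for a, b in combinations(sorted(references), 2):
        if references[a] & references[b]:
            edges.add((a, b))
    return edges


def density(nodes, edges):
    """Share of all possible undirected pairs among `nodes` that are connected."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    inside = sum(1 for a, b in edges if a in nodes and b in nodes)
    return inside / (n * (n - 1) / 2)


# Hypothetical toy data: paper id -> set of cited reference ids.
toy_references = {
    "p1": {"r1", "r2", "r3"},
    "p2": {"r2", "r3"},
    "p3": {"r1", "r3"},
    "p4": {"r7"},
    "p5": {"r7", "r8"},
    "p6": {"r9"},
}

edges = coupling_edges(toy_references)
whole_density = density(toy_references, edges)      # all six papers
topic_density = density({"p1", "p2", "p3"}, edges)  # candidate topic subgraph

print(f"whole network: {whole_density:.2f}, subgraph: {topic_density:.2f}")
```

In this toy network the three papers p1 to p3 are fully coupled, so their subgraph is much denser than the network as a whole; this is the kind of region a density-maximising algorithm is meant to find.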

Local algorithms
In network research, many different local community detection algorithms (LCDAs), i.e. algorithms that start locally from a seed in a network and grow a so-called community around it, have been developed (Dilmaghani et al., 2021). LCDAs share the classical idea of a community in a network having "more edges 'inside' [...] than edges linking vertices […] with the rest of the graph" (Fortunato, 2010). This understanding of communities in networks calls for a maximisation of a subgraph's cohesion and separation. While global network partition algorithms must solve this problem by striking a compromise because in a partition neither can be optimised individually (Fortunato & Hric, 2016), local algorithms can focus either on separation or cohesion. Most local algorithms evaluate a community's quality by its separation from its environment, which is measured as conductance (outward edges divided by volume; Hamann et al., 2017) or local modularity (Clauset, 2005: 2). Only a few local algorithms maximise cohesion. These include, among others,
- the local tightness expansion algorithm (LTE), which grows the subgraph by adding nodes that increase the subgraph's "tightness" (shared neighbours of nodes inside the subgraph compared to neighbours of nodes inside and outside of the subgraph) and also uses "tightness" as termination criterion (Huang et al., 2011), and
- the Triangle Based Community Expansion (TCE) algorithm (Hamann et al., 2017), which adds nodes when they have a large share of triangular relationships with the subgraph compared to the nodes' degree but uses a separation-oriented criterion (conductance) for termination.
While some algorithms find their own seed to start from (Dilmaghani et al., 2021b: 762), e.g. by random selection, others have to be provided with a user-defined seed. Some algorithms use user-defined seeds as a starting point for finding a suitable seed in their surroundings, e.g. by searching for clique(-like) structures that include the seeds (Fanrong et al., 2014) or degree-central nodes (Q. Chen et al., 2013), while others start the expansion directly from the user-defined seed. To our knowledge, only two attempts have been made to utilise local algorithms for bibliometric questions. Havemann et al. (2017) used a separation-based memetic local algorithm for topic reconstruction. C. Chen (2018) proposed cascading citation expansion, which is a local approach but not an algorithm for community detection.
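The difference between a separation-oriented and a cohesion-oriented quality measure can be illustrated with a small Python sketch. It computes conductance as described above (outward edges divided by the volume of the subgraph, i.e. the sum of its nodes' degrees) and, for contrast, the internal edge density as one simple cohesion measure. The toy graph and the function names are hypothetical; some definitions of conductance divide by the smaller of the two volumes, and internal density is not the "tightness" used by LTE.

```python
def conductance(adjacency, subgraph):
    """Outward edges of `subgraph` divided by its volume (sum of node degrees)."""
    subgraph = set(subgraph)
    volume = sum(len(adjacency[v]) for v in subgraph)
    outward = sum(1 for v in subgraph for w in adjacency[v] if w not in subgraph)
    return outward / volume if volume else 0.0


def internal_density(adjacency, subgraph):
    """Share of possible node pairs inside `subgraph` that are linked."""
    subgraph = set(subgraph)
    n = len(subgraph)
    if n < 2:
        return 0.0
    # Every internal edge is seen from both of its endpoints, hence the division by 2.
    internal = sum(1 for v in subgraph for w in adjacency[v] if w in subgraph) / 2
    return internal / (n * (n - 1) / 2)


# Hypothetical toy graph: a triangle a-b-c with a chain c-d-e attached.
toy_graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}

for nodes in ({"a", "b", "c"}, {"a", "b", "c", "d"}):
    print(sorted(nodes),
          f"conductance={conductance(toy_graph, nodes):.3f}",
          f"density={internal_density(toy_graph, nodes):.3f}")
```

In this example, adding node d lowers the conductance (the subgraph appears better separated) but also lowers the internal density (the subgraph becomes less cohesive), so the two kinds of criteria would disagree about whether d belongs to the community.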

The MALBA algorithm
We present for consideration a Multilayer Adjustable Local Bibliometric Algorithm (MALBA), which is inspired by the LTE algorithm but applies its ideas to a multi-layered network. MALBA constructs cohesive communities in networks of papers by iteratively growing a subgraph from a seed, i.e. it operates locally. Publications are added to the subgraph if they are densely connected in at least one of the two data models direct citation or bibliographic coupling, i.e. it operates in a multi-layered network (Figure 1). It can also be applied to networks based on only one of the two data models. The thresholds for the density of connections are adjustable by the user. MALBA terminates when no more publications exist whose connections to the subgraph are above one of the density thresholds. The separation of the subgraph from its neighbourhood is considered only collaterally because papers that are not connected well enough to be included are in turn better separated from the subgraph. MALBA can be applied to pre-existing networks, from which subgraphs are selected as seeds. In this mode, MALBA can support the exploration of networks by identifying regions of different density. Alternatively, the algorithm can be used to explore a publication database directly by starting from a seed and searching the database for densely connected publications. In this case, MALBA utilises all information about a subgraph's environment that exists in the database but provides less information about less well-connected publications. The interface works with both approaches.
The user can affect the operation of MALBA in three ways:
1) By deciding to work with a pre-existing network or to explore a database. We already mentioned the main differences between the two approaches.
2) By constructing a seed subgraph as starting point for MALBA. The seed has a strong influence on the subgraph both through its size and through the region of the network in which it is located.
3) By deciding on the thresholds. The interface offers the option of automatically identifying the thresholds that return the largest subgraph that can be grown out of the seed with the algorithm terminating (see Experiments). However, the user can also set thresholds manually to achieve an earlier termination of the algorithm. Higher thresholds focus on reconstructing denser regions; lower thresholds also allow less dense regions to be reconstructed.
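The general principle of growing a subgraph from a seed until no further publication passes a threshold can be sketched in a few lines of Python. The sketch below is not the MALBA implementation: it uses only one layer (the share of a paper's references that lie inside the subgraph), adds all qualifying papers of a round at once, and introduces a hypothetical `max_size` guard for runs that would otherwise keep growing. All names and the toy data are assumptions made for illustration.

```python
def grow_subgraph(references, seed, threshold, max_size=None):
    """Grow `seed` by repeatedly adding papers whose share of references
    inside the current subgraph is at least `threshold`.

    references: paper id -> set of cited paper ids
    Returns (subgraph, terminated); terminated is False if max_size stopped the run.
    """
    subgraph = set(seed)
    while True:
        additions = {
            paper
            for paper, refs in references.items()
            if paper not in subgraph
            and refs
            and len(refs & subgraph) / len(refs) >= threshold
        }
        if not additions:
            return subgraph, True  # no candidate passes the threshold any more
        subgraph |= additions
        if max_size is not None and len(subgraph) >= max_size:
            return subgraph, False  # stopped by the guard, not by the threshold


# Hypothetical toy citation data.
toy_references = {
    "s1": set(),
    "s2": {"s1"},
    "a": {"s1", "s2"},      # cites only seed papers
    "b": {"s1", "a", "x"},  # qualifies at 0.6 only after "a" has been added
    "c": {"b", "x", "y"},   # only one of three references can ever be inside
    "x": set(),
    "y": set(),
}

strict, strict_done = grow_subgraph(toy_references, {"s1", "s2"}, threshold=0.6)
loose, loose_done = grow_subgraph(toy_references, {"s1", "s2"}, threshold=0.3)
print(sorted(strict), strict_done)
print(sorted(loose), loose_done)
```

With the higher threshold the run stops at a smaller subgraph, while the lower threshold additionally pulls in the more weakly connected paper "c". The subgraph returned by one run can be passed as `seed` to a further run with different thresholds.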
Previous experiments with MALBA revealed the following common behaviours in bibliometric networks:
(1) There are small threshold ranges in which the algorithm terminates with a much smaller subgraph than the network or the database because no new nodes can be added. Lower thresholds lead to an exponential growth of the subgraph until it covers the whole network or database.
(2) A minimum size of the seed is necessary for the subgraph to grow at all.
(3) A subgraph grown from a seed can itself be used as a seed for further growth.

Experiments
We report first experiments with MALBA in which we explore publications on a topic from library and information science: the h-index. We chose this topic because it makes it easier to understand publications included in the subgraph and in its environment. We discuss the reasons why publications may be included or excluded, the impact of seed sizes, and the impact of thresholds.
In the experiments presented in this paper, we used MALBA to explore directly the stable version from July 2022 of the bibliometric database provided by the German "Competence Network for Bibliometrics", which consists of Web of Science data. When processing publications indexed in this database, we excluded all non-source items because their influence on bibliographic coupling (thematic similarity) and citation (communication) cannot be unambiguously assessed. The ratio-based thresholds for DCin and BC used by MALBA make it more difficult for publications with many non-source items in their reference lists to be included in the subgraph.

Seeds
The algorithm is started with a seed set of the seven most highly cited bibliometric publications in the WoS that have "h-index" in their title (Figure 3). If only the seminal paper by Hirsch from 2005, from which the h-index topic emerged (and which is not shown in Figure 3), is used as seed, the subgraph does not grow at all. This is not surprising because the original Hirsch paper is a publication outside the field of bibliometrics. When only a subset of the seven publications in Figure 3 is used, the algorithm adds only a few publications and terminates at a maximum of 12 publications (Fig. 4, left column).

Results
When starting from the seed consisting of the 7 publications in Figure 3, the algorithm terminates with a subgraph of 805 publications (Fig. 4, middle column) at the thresholds DCin=0.55, BC=0.95, DCout=11. This means that at each stage of growth, publications were added to the subgraph if at least 55% of their references (source items) were publications included in the subgraph, if they shared at least 95% of their references (source items) with references (source items) of the subgraph's publications, or if they were cited at least 11 times by the subgraph. Lowering any of these thresholds slightly (e.g. DCin to 0.50) leads to an exponential growth of the subgraph without termination. Smaller subgraphs can be obtained by increasing the thresholds. When the seed size is further increased to 15-20 publications and the same thresholds are used, almost the same subgraph emerges. In the surroundings of this subgraph, we find false negatives (FNs, Fig. 4), i.e. publications that address the h-index but are not included, and true negatives (TNs). An example of an FN is the study by Montazerian et al. (2019) "A new parameter for (normalized) evaluation of H-index: countries as a case study". It has only 54% of its references in the subgraph and thus did not pass the threshold of 55%. The FNs demonstrate that any threshold is bound to create "near misses", i.e. that a definitive delineation of a topic is not possible. Most of the publications that were not included in the subgraph were TNs, e.g. the paper by Kosmulski (2018) "Are you in top 1%(1‰)?", which has 45% of its references in the subgraph but is not a clear h-index publication. Another TN is the study by Abramo et al. (2013) "The importance of accounting for the number of co-authors and their order when assessing research performance at the individual level in the life sciences", which has only 6 of 21 references in the subgraph and deals with the h-index only marginally.
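The three inclusion criteria can be written down as a short Python check. The sketch follows the verbal description of the thresholds given above; the function and field names are hypothetical, and the way the BC share is computed here (share of a paper's references that also occur among the references of any subgraph publication) is our reading of that description rather than a documented detail of MALBA. Apart from the ratio of 6 of 21 references, all numbers in the example are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    dc_in: float = 0.55  # minimum share of a paper's references inside the subgraph
    bc: float = 0.95     # minimum share of a paper's references shared with the subgraph
    dc_out: int = 11     # minimum number of citations received from the subgraph


def passed_criteria(paper_refs, subgraph, subgraph_refs, citations_from_subgraph,
                    thresholds=Thresholds()):
    """Return the names of the criteria under which a paper would be added."""
    passed = []
    if paper_refs:
        if len(paper_refs & subgraph) / len(paper_refs) >= thresholds.dc_in:
            passed.append("DCin")
        if len(paper_refs & subgraph_refs) / len(paper_refs) >= thresholds.bc:
            passed.append("BC")
    if citations_from_subgraph >= thresholds.dc_out:
        passed.append("DCout")
    return passed


# A paper with 21 references, 6 of which are in the subgraph: 6/21 is about 0.29 < 0.55.
refs_21 = {f"r{i}" for i in range(21)}
subgraph = {f"r{i}" for i in range(6)}
subgraph_refs = {f"r{i}" for i in range(10)}  # invented: 10 of 21 references shared
print(passed_criteria(refs_21, subgraph, subgraph_refs, citations_from_subgraph=2))

# Invented counter-examples: 12 of 20 references inside the subgraph passes DCin,
# and 11 citations from the subgraph pass DCout even without any reference overlap.
refs_20 = {f"q{i}" for i in range(20)}
inside_12 = {f"q{i}" for i in range(12)}
print(passed_criteria(refs_20, inside_12, set(), citations_from_subgraph=0))
print(passed_criteria(refs_20, set(), set(), citations_from_subgraph=11))
```

Because a single passed criterion suffices, the returned list also documents why a publication was included, and an empty list marks a publication that remains in the subgraph's environment.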
We used the subgraph we obtained with the thresholds given above as seed for a second run of MALBA, with thresholds DCin=0.80, BC=0.90, DCout=11, leading to a termination at 1,320 publications. After this increase by more than 500 publications, some FNs that were found after the first run are still FNs, for example Bertoli-Barsotti and Lando (2015) "On a formula for the h-index". The large increase in publications after the second run led to the inclusion of some previous FNs (the study by Montazerian et al., for example) but also to the addition of false positives. For example, the above-mentioned study by Abramo et al. (2013) is now included in the subgraph of 1,320 publications.

Discussion and future work
Local algorithms like MALBA are fully transparent because for every publication, the reason why it is included in a subgraph can be identified. This makes it possible to explore the match of subgraphs and their environment thematically and to identify true/false positives and true/false negatives with regard to the reconstruction of a topic. While the reconstruction of the h-index topic with MALBA looks promising, we cannot claim yet that MALBA is suitable for reconstructing topics. Further experiments are necessary, including:
- a further exploration of subgraphs of h-index publications and their environments;
- experiments with MALBA and only one data model (direct citation or bibliographic coupling);
- a validation of MALBA with ground truths like the ones used in Held et al. (2021); and
- comparisons between the exploration of pre-existing networks and the exploration of publication databases.