Assessing the agreement in retraction indexing across 4 multidisciplinary sources: Crossref, Retraction Watch, Scopus, and Web of Science

Previous research has posited a correlation between poor indexing and inadvertent post-retraction citation. However, to date, there has been limited systematic study of retraction indexing quality: we are aware of one database-wide comparison of PubMed and Web of Science, and multiple smaller studies highlighting indexing problems for items with the same reason for retraction or same field of study. To assess the agreement between multidisciplinary retraction indexes, we create a union list of 49,924 publications with DOIs from the retraction indices of at least one of Crossref, Retraction Watch, Scopus, and Web of Science. Only 1593 (3%) are deemed retracted by the intersection of all four sources. For 14,743 publications (almost 30%), there is disagreement: at least one source deems them retracted while another lacks retraction indexing. Of the items deemed retracted by at least one source, retraction indexing was lacking for 32% covered in Scopus, 7% covered in Crossref, and 4% covered in Web of Science. We manually examined 201 items from the union list and found that 115/201 (57.21%) DOIs were retracted publications while 59 (29.35%) were retraction notices. In future work we plan to use a validated version of this union list to assess the retraction indexing of subject-specific sources.


Introduction
Retraction has been widely studied in scientometric research, often relying on databases such as PubMed and Web of Science to determine which publications are retracted. Only 5.4% of post-retraction citations in PubMed Central acknowledged that the paper they were citing was retracted (Hsiao & Schneider, 2021), and a case study posited a correlation between poor indexing and inadvertent post-retraction citation (Schneider et al., 2020).
Retraction indexing may also be lacking in some cases. For example, Proescholdt and Schneider (2020) found thousands of examples of apparently retracted papers that were not indexed as such but whose titles started with "RETRACTED:" or a cognate phrase. Early retractions might also pose challenges: many were issued in non-citable ways such as "tip-in" notices (Snodgrass & Pfeifer, 1992), which did not meet PubMed indexing standards (Kotzin & Schuyler, 1989) and would be missed by retraction indexing. Other studies discovered indexing issues in both document titles and the linking of retracted publications and retraction notices (Schmidt, 2018; Suelzer et al., 2021). However, to date, there has been limited systematic study of retraction indexing quality: we are aware of one database-wide comparison of PubMed and Web of Science (Schmidt, 2018), and multiple smaller studies highlighting database indexing problems for items with the same reason for retraction (e.g., Malički et al., 2019) or same field of study (e.g., Bakker & Riegelman, 2018; Dal-Ré & Ayuso, 2020; among many others). An analysis of PubMed's duplicate publication index in 2013 found that 48% (12/25) of retracted publications (identified by publisher notices) did not show retraction status correctly for duplicate publications, and these problems persisted after authors contacted PubMed and editors during a 5-year follow-up period (Malički et al., 2019). Similarly, 38% of mental health articles and 4% of genetics articles marked as retracted in Retraction Watch were not indexed as retracted in PubMed (Bakker & Riegelman, 2018; Dal-Ré & Ayuso, 2020). An analysis of 144 retracted articles in mental health found that only 7% (10/144) of retracted items were marked as such across a variety of publisher sites and database records (i.e., EBSCO databases, MEDLINE and PsycINFO via Ovid, PubMed, Scopus, Web of Science), and of those, the majority only indicated the retraction in one place (Bakker & Riegelman, 2018).
While it is known that retraction indexes are incomplete, there has been no systematic assessment of the extent to which retraction metadata agrees in multidisciplinary databases. This study fills that gap.

Goals and Research Questions
We construct a union list of all DOIs indexed as retracted publications in at least one of four multidisciplinary sources: Crossref, Retraction Watch, Scopus, and Web of Science. We check the extent to which each source agrees with the union list, restricting to each source's coverage.
Our specific research questions are:
RQ1: How many DOIs are indexed as retracted publications in each of Crossref, Retraction Watch, Scopus, and Web of Science? Overall, how many DOIs are indexed as retracted publications in at least one source?
RQ2: How much agreement does each source have with the union list, restricting to its coverage?
RQ3: Does the level of agreement in DOIs indexed as retracted publications vary by field, publication year, or retraction year?
RQ4: For a sample of DOIs with less than 100% agreement in retraction indexing, does the publisher's website indicate that they are retracted publications?
To address RQ1, we create a list of DOIs that are indexed as retracted publications in one or more of our sources. To do this, we extract metadata about retracted publications as shown in Table 1.

Methods and Data
After retrieving DOIs indexed as retracted publications, we deduplicate metadata within each data source, removing duplicate items with the same DOI. For ease of matching, we also remove items without a DOI. Then we combine metadata across the four sources. Each DOI is annotated with a list of the sources that indexed it as a retracted publication, which we call rp_indexed_in. We do not seek to retrieve publications indexed as errata or corrections because, according to the Committee on Publication Ethics (2019), retractions should be distinguished from other types of correction or comment.
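The deduplicate, filter, and merge steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline used in the study: the record layout (dictionaries with a "doi" key), the helper names normalize_doi and build_union_list, and the DOI normalization itself are assumptions made for the example.

```python
from collections import defaultdict


def normalize_doi(doi):
    """Return a lowercased bare DOI, or None when the record has no DOI.

    Normalization is an assumption of this sketch; DOIs are case-insensitive,
    so lowercasing avoids counting the same DOI twice.
    """
    if not doi:
        return None
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi or None


def build_union_list(records_by_source):
    """Map each DOI to the sorted list of sources that index it as retracted."""
    indexed = defaultdict(set)
    for source, records in records_by_source.items():
        for record in records:
            doi = normalize_doi(record.get("doi"))
            if doi is None:
                continue  # items without a DOI are dropped
            indexed[doi].add(source)  # a set removes within-source duplicates
    return {doi: sorted(sources) for doi, sources in indexed.items()}


# Toy records (hypothetical DOIs), one list per source.
records_by_source = {
    "Crossref": [{"doi": "10.1000/a"}, {"doi": "10.1000/A"}],
    "Retraction Watch": [{"doi": "10.1000/a"}, {"doi": None}, {"doi": "10.1000/b"}],
    "Scopus": [{"doi": "https://doi.org/10.1000/b"}],
    "Web of Science": [],
}
rp_indexed_in = build_union_list(records_by_source)
for doi, sources in sorted(rp_indexed_in.items()):
    print(doi, sources)
```

In this toy input, the two Crossref spellings of the same DOI collapse into one entry and the Retraction Watch record without a DOI is discarded, mirroring the deduplication and filtering described above.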

Methods and Data for RQ2: How much agreement does each source have with the union list, restricting to its coverage?
An item might not be found in a given source on a given search date for one of two reasons: the item was not covered by the source, or the item was covered but not indexed as a retracted publication in that source. For a given DOI, we poll each source that is not listed in its rp_indexed_in annotation (using results from RQ1) to see whether the DOI is covered in that source, which we record as covered_in. We use APIs for Crossref, Scopus, and Web of Science; for Retraction Watch, there is nothing to check because our database dump only covers retracted publications.
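The coverage check above can be illustrated for Crossref, whose public REST API returns a record at https://api.crossref.org/works/{DOI} for registered DOIs and an HTTP 404 otherwise. The sketch below is an assumption-laden illustration, not the study's code: the function names are hypothetical, the opener parameter exists only so the logic can be exercised without network access, and the authenticated Scopus and Web of Science APIs are deliberately left out.

```python
import urllib.error
import urllib.parse
import urllib.request


def crossref_works_url(doi):
    """Build the Crossref REST API URL for a single DOI."""
    return "https://api.crossref.org/works/" + urllib.parse.quote(doi, safe="/")


def covered_in_crossref(doi, opener=urllib.request.urlopen):
    """Return True if Crossref has a record for the DOI, False on HTTP 404."""
    try:
        with opener(crossref_works_url(doi)) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # other errors (rate limits, outages) need a retry, not a verdict


def indexing_status(source, rp_indexed_in, covered_in):
    """Classify one DOI in one source into the three states used in this study."""
    if source in rp_indexed_in:
        return "indexed as retracted"
    if source in covered_in:
        return "covered but not indexed as retracted"
    return "not covered"


# Example (hypothetical annotations for one DOI):
print(indexing_status("Scopus", {"Crossref"}, {"Scopus"}))
```

A source only needs to be polled when it is absent from rp_indexed_in, since indexing a DOI as retracted already implies that the source covers it.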
In calculating agreement, we will consider a source to agree if it indexes as retracted a publication that is deemed retracted by any one of our sources (including just itself).
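Considering the coverage, we quantify the extent of the agreement in retraction indexing for each source:
\[ \text{RetractionIndexingAgreement}_{\text{SOURCE}} = \frac{\text{number of union-list DOIs indexed as retracted in SOURCE}}{\text{number of union-list DOIs covered in SOURCE}} \times 100\% \]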
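As a hedged illustration of the source-level metric, the sketch below computes the percentage of covered union-list DOIs that a source indexes as retracted. The data layout (a dictionary with rp_indexed_in and covered_in sets per DOI, where covered_in holds only the sources confirmed by polling) and the function name source_agreement are assumptions of this example.

```python
def source_agreement(union, source):
    """Percentage of union-list DOIs covered in `source` that it indexes as retracted.

    A DOI counts as covered when the source either indexes it as retracted
    or was found to hold a record for it during the coverage check.
    """
    covered = [
        doi for doi, rec in union.items()
        if source in rec["rp_indexed_in"] or source in rec["covered_in"]
    ]
    if not covered:
        return None  # the source covers nothing in the union list
    indexed = [doi for doi in covered if source in union[doi]["rp_indexed_in"]]
    return 100.0 * len(indexed) / len(covered)


# Toy union list with hypothetical DOIs.
union = {
    "10.1000/a": {"rp_indexed_in": {"Crossref", "Retraction Watch"}, "covered_in": {"Scopus"}},
    "10.1000/b": {"rp_indexed_in": {"Scopus"}, "covered_in": {"Crossref"}},
    "10.1000/c": {"rp_indexed_in": {"Scopus", "Web of Science"}, "covered_in": set()},
}
for name in ("Crossref", "Retraction Watch", "Scopus", "Web of Science"):
    print(name, source_agreement(union, name))
```

Because Retraction Watch only supplies retracted publications, every DOI it covers is also indexed as retracted, so its score under this definition is always 100%.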

Methods and Data for RQ3: Does the level of agreement in DOIs indexed as retracted publications vary by field, publication year, or retraction year?
Analogous to the RetractionIndexingAgreement_SOURCE above, we also quantify the extent of the agreement in retraction indexing for each DOI:
\[ \text{RetractionIndexingAgreement}_{\text{DOI}} = \frac{\text{number of sources that index the DOI as retracted}}{\text{number of sources that cover the DOI}} \times 100\% \]
We then analyze the RetractionIndexingAgreement_DOI across field, publication year, and retraction year.
We (JL, JS) categorize DOIs based on the conference or journal in which they appear. We use Scopus's conference and journal categorization when available for titles on the Scopus source list as of January 2023: publications are one or more of Health Science, Life Science, Physical Science, Social Science, or General. For venue titles not in Scopus, we extract an initial set of topic words by using Yet Another Keyword Extractor on the Scopus source list. Then, in an iterative process, we review uncategorized conference and journal titles and manually curate additional keywords in English and close cognates (e.g., Kardiologie). Titles in other languages or using terminology with multiple potential meanings are left uncategorized.
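A DOI-level score can be sketched as the share of covering sources that index the DOI as retracted. This is an illustrative reading of the metric, with a hypothetical function name and input layout; exact fractions are used so that the possible values with four sources can be enumerated.

```python
from fractions import Fraction


def doi_agreement(rp_indexed_in, covered_in):
    """Share of the sources covering a DOI that index it as retracted.

    `rp_indexed_in` must be non-empty (every union-list DOI is indexed
    somewhere); `covered_in` lists further sources that hold the DOI
    without marking it as retracted.
    """
    indexing = set(rp_indexed_in)
    covering = indexing | set(covered_in)
    return Fraction(len(indexing), len(covering))


# Every score that can occur when between one and four sources cover a DOI.
possible_scores = sorted(
    {Fraction(k, n) for n in range(1, 5) for k in range(1, n + 1)}
)
print([str(score) for score in possible_scores])

# One source indexes the DOI as retracted, two more cover it without doing so.
print(doi_agreement({"Crossref"}, {"Scopus", "Web of Science"}))
```

With four sources, only six distinct scores are possible (1/4, 1/3, 1/2, 2/3, 3/4, and 1), which correspond to the 25%, 33%, 50%, 66%, 75%, and 100% agreement levels used in the results.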
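The keyword-matching step for venue titles outside the Scopus source list might look like the sketch below. The keyword sets shown are invented placeholders, not the curated lists from this study, and the sketch covers only the matching step, not the keyword extraction or the manual review.

```python
import re

# Hypothetical keyword lists for illustration only; the curated keywords
# used in the study are shared separately.
FIELD_KEYWORDS = {
    "Health Science": {"medicine", "clinical", "nursing", "kardiologie"},
    "Life Science": {"biology", "genetics", "neuroscience"},
    "Physical Science": {"physics", "chemistry", "engineering"},
    "Social Science": {"sociology", "economics", "education"},
}


def categorize_title(title):
    """Return every field whose keywords appear in the venue title.

    An empty list means the title stays uncategorized.
    """
    tokens = set(re.findall(r"[^\W\d_]+", title.lower()))  # runs of letters
    return sorted(
        field for field, keywords in FIELD_KEYWORDS.items() if tokens & keywords
    )


for title in (
    "Zeitschrift für Kardiologie",
    "Clinical Genetics",
    "Revista de Estudios",
):
    print(title, "->", categorize_title(title))
```

A title can match more than one field, consistent with publications being assigned one or more categories, and titles with no matching keyword remain uncategorized.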

Methods and Data for RQ4: For a sample of DOIs with less than 100% agreement in retraction indexing, does the publisher's website indicate that they are retracted publications?
We (HZ, JS) examine a sample of about 200 DOIs from our union list that are covered in multiple sources that disagree on their retraction indexing (i.e., RetractionIndexingAgreement_DOI < 100%), to check: Does the publisher's website indicate that they are retracted publications?
To select the sample, we first group DOIs using the pair (RetractionIndexingAgreement_DOI score as calculated from RQ2, field as determined by RQ3) and then select items from each group. We keep other aspects as diverse as feasible, particularly the journal or conference title.
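The grouping step above can be sketched as a stratified selection over (agreement score, field) pairs. Random selection within each group is an assumption of this sketch: the sample in this study was chosen with additional attention to venue diversity and to specific features of interest, which a purely random draw does not reproduce. The function name, field names, and DOIs are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(items, per_group, seed=0):
    """Pick up to `per_group` items from every (agreement, field) group."""
    groups = defaultdict(list)
    for item in items:
        groups[(item["agreement"], item["field"])].append(item)
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    sample = []
    for key in sorted(groups):
        members = groups[key]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


# Toy candidates with hypothetical DOIs: three groups of sizes 5, 2, and 1.
group_keys = (
    [(50, "Health Science")] * 5
    + [(50, "Life Science")] * 2
    + [(75, "Health Science")] * 1
)
items = [
    {"doi": f"10.1000/{i}", "agreement": score, "field": field}
    for i, (score, field) in enumerate(group_keys)
]
sample = stratified_sample(items, per_group=2)
print(len(sample), sorted({(s["agreement"], s["field"]) for s in sample}))
```

Capping the draw at the group size keeps small groups represented without failing when a group has fewer members than requested.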
We overselect DOIs with certain features: retraction year earlier than the publication year (especially more than 1 year earlier), having a PubMed ID (since PubMed retraction status is public domain data freely available for reuse), or no retraction year in our data.

Results for RQ1: How many DOIs are indexed as retracted publications in each of Crossref, Retraction Watch, Scopus, and Web of Science? Overall, how many DOIs are indexed as retracted publications in at least one source?
Figures 1 and 2 show the overlap between sources. Among the 49,924 unique DOIs, only 1593 (3%) were found in all four sources, with a total of 24,471 (49%) purportedly retracted publications found in only one source: 9937 (20%) in Crossref, 8443 (17%) in Retraction Watch, 5056 (10%) in Scopus, and 1035 (2%) in Web of Science.

Results for RQ2: How much agreement does each source have with the union list, restricting to its coverage?
The RetractionIndexingAgreement_SOURCE indicates the percentage of covered items, shared with the union dataset, that are indexed as retracted. Agreement is 100% for Retraction Watch, which only provided retracted publications; 95.67% for Web of Science; 92.85% for Crossref; and 62.29% for Scopus. Coverage differs for each database, and Figure 3 compares the number of DOIs from our union list that are indexed as retracted in a source (blue) with those covered but not indexed as retracted (orange) in that source. Coverage was checked April 9, 2023 with the Crossref API, Scopus API, and Web of Science API.

Results for RQ3: Does the level of agreement in DOIs indexed as retracted publications vary by field, publication year, or retraction year?
While publication years range from 1940 to 2023 (Figure 6), interestingly, the first disagreement for DOIs in our union list is in publication year 2016: about 570 DOIs were covered but not indexed as retracted in some source. The highest disagreement, of over 2000 DOIs, was recorded in 2019.
Figure 6: Publication year distribution for our 49,924 DOIs.
The publication year distribution varies by RetractionIndexingAgreement_DOI, and as shown in Figure 7, agreement of 50% and 66% is found from 2016 forward. By contrast, 25% agreement is found only in publications from 2022; 33% agreement is found only in publications from 2021 to 2023; and 75% agreement is found mostly in publications from 2022 with some from 2021. The retraction year distribution (Figure 8) is roughly similar to the publication year distribution.
Figure 9: Field categorization of the 49,924 DOIs.

Results for RQ4: For a sample of DOIs with less than 100% agreement in retraction indexing, does the publisher's website indicate that they are retracted publications?
We confirmed 114/201 (56.72%) DOIs were retracted publications (including withdrawn or removed) as shown in Table 3. The most common indexing error involved retraction notices: 59/201 (29.35%). We group in "Retraction-related publications" expressions of concern, temporary removals, and retracted and republished articles; removed or purportedly retracted publications whose retraction notice we could not immediately locate; and a few retraction-related publications, such as publications whose duplicates had been removed/retracted.

Discussion and Conclusions
We created a union list of DOIs indexed as retracted in one or more of Crossref, Retraction Watch, Scopus, and Web of Science. Among the 49,924 unique DOIs, only 1593 (3%) were found in all four sources, with a total of 24,471 (49%) purportedly retracted publications found in only one source. Agreement with the union list, taking coverage into consideration, is 100% for Retraction Watch, which only provided retracted publications; 95.67% for Web of Science; 92.85% for Crossref; and 62.29% for Scopus. The retraction year and publication year distributions are roughly similar, with disagreements starting in 2016 and most disagreements in publications from 2021 forward with retraction year of 2022 or later.

Limitations
We manually examined only a very small number of articles (201). Some DOIs indexed as retracted publications were not, in fact, retracted, withdrawn, or removed; many were retraction notices.
We removed 7003 records that had no DOI. We estimate that we have lost information about 8-12% of our records because they have no DOI: the range runs from 4753/(4753 + 49,924), where 4753 = 5928 - 1126 - 49, to 7003/(7003 + 49,924). Among our sources, Retraction Watch had 5928 records without DOIs, Scopus had 49, and Web of Science had 1126, as shown in Table 2.
In calculating agreement metrics, we have a choice in how to handle the DOIs that were uniquely contributed by each source. We have defined our agreement metric to focus on the absence of DOIs contributed by any source (including the source under examination). A stronger metric would consider the presence of unique items a disagreement.
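The 8-12% estimate can be checked with a few lines of arithmetic. The sketch assumes that the lower bound is read as 4753 lost records out of 4753 + 49,924 and the upper bound as 7003 lost records out of 7003 + 49,924; the variable and function names are chosen for this example.

```python
UNION_DOIS = 49_924  # unique DOIs kept in the union list

lower_lost = 5928 - 1126 - 49  # lower bound on records lost for lack of a DOI
upper_lost = 7003              # records removed because they had no DOI


def share_lost(lost, kept):
    """Fraction of all records (lost plus kept) that were lost."""
    return lost / (lost + kept)


low = share_lost(lower_lost, UNION_DOIS)
high = share_lost(upper_lost, UNION_DOIS)
print(f"{lower_lost} lost: {low:.1%}; {upper_lost} lost: {high:.1%}")
```

Under these assumptions the bounds come to roughly 8.7% and 12.3%, which is in line with the 8-12% range stated above.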

Discussion and Future Work
Disagreement in retraction indexes seems largely to be due to two types of errors: retracted publications with DOIs missing retraction indexing in a source that covers them; and misindexing of DOIs, especially retraction notices and corrigenda.
In the future we would like to better understand how the metadata flows between sources.
Multiple types of problems in the metadata flow seem likely. In examining the data, we also find discrepancies between publisher websites and metadata; for example, Figure 10 shows that the retraction year is 2022 on the publisher website but 2019 in the Crossref metadata.
Figure 10: Discrepancies in data for DOI:10.1016/j.yexmp.2018.12.005 as of April 15, 2023. Left: publisher page from ScienceDirect, https://doi.org/10.1016/j.yexmp.2018.12.005. Right: data from Crossref, http://api.crossref.org/works/10.1016/j.yexmp.2018.12.005.
Sharing hand-validated metadata as well as metadata quality procedures could be helpful in the future. Currently, only existing public domain data sources such as Crossref and PubMed can be readily shared. License agreements are another mechanism for sharing; for instance, Clarivate, the parent organization of Web of Science, licenses Retraction Watch data for EndNote and presumably could use it for Web of Science as well. More disagreement was found in items retracted in 2022 and 2023, suggesting that data sharing might be helping, but might need more frequent updating. Our results suggest significant room for improvement in retraction indexing quality in these multidisciplinary sources. Fully automatic processes will not be sufficient for creating a comprehensive union list from our current sources, in their current state of data quality.

Open science practices
We have shared the keywords used to manually identify fields at a temporary sharing link and will ultimately register a DOI for this data: https://databank.illinois.edu/datasets/IDB-8847584?code=Shd4NY0xgh7YWpfIMtAooESBKBcEwkV1LZPmPtXSyzc
Data for this study is licensed by each source. Only the Crossref API allows us the right to share the data we've collected. For Retraction Watch data, we used data available from The Center for Scientific Integrity, the parent nonprofit organization of Retraction Watch, subject to a standard data use agreement. Retracted publications listed in Scopus and Web of Science data can be retrieved from the user interface as shown in Table 1, by database subscribers. Note that checking coverage in Scopus requires specific permission since the Academic Use Case of Scopus API is limited to a single subject area.

Figure 2: The overlap between sources, limited to the DOIs indexed as retracted within a named source. The total retrieved (to the right of the source name) were either retrieved as retracted publications only from that source (top left number in each box), or shared with 1, 2, or 3 other sources. Pairwise overlaps are given in the table to the right.

Figure 3: Number of records that are covered but not indexed as retracted; and indexed as retracted in each source.

Figure 4: The proportion of our 49,924 DOIs that are: not covered; covered but not indexed as retracted; and indexed as retracted in each of Crossref, Retraction Watch, Scopus, and Web of Science.

Figure 7: Publication year distribution for each RetractionIndexingAgreement_DOI score.

Figure 8: Retraction year distribution for each RetractionIndexingAgreement_DOI score. Limited to the 43,584 (87%) DOIs with retraction year in our records.

Table 1. Retracted publications identified from multidisciplinary sources.

Table 2.
Our union list has 49,924 unique DOIs that are indexed as a retracted publication by one or more of Crossref, Retraction Watch, Scopus, and Web of Science. As shown in Table 2, these were consolidated and merged from the 91,995 records retrieved. After deduplication and checking for DOIs, we get a merged list of 49,924 unique records with DOIs.

Table 3. Categorization of 201 DOIs we manually checked. Fully distinguishing these categories is difficult because publishers may leave in place the full text of an article described as withdrawn, or take down the full text of an article they describe as retracted. Of the 201 DOIs we checked, 87/201 (43.28%) were retracted articles, 24/201 (11.94%) were withdrawn articles, and 4/201 (1.99%) were removed articles in our judgement.