Generative artificial intelligence (AI) and large language models significantly change how disciplines and communities analyze and report research. Leveraging these new tools, such as ChatGPT or Bard, authors can easily generate text and analyses for research articles. As a result, we have already witnessed several instances in which generative AI was used to write a paper or manuscript which unknowingly contained fake citations or false information. The scholarly community needs new indicators to signal, assess, and evaluate manuscripts and research quality to fortify public trust in research. This paper proposes a set of indicators for research integrity that brings much-needed transparency for published research, including those using generative AI technologies. We have then used AI to train algorithms to detect these indicators and applied them to 33 million full-text research publications. We can now see the key indicators as metrics to understand where various fields of research are in communicating and signalling trust.
Show LessLeslie D. McIntosh*, Simon Porter**, Cynthia Hudson Vitale***
*leslie@digital-science.com
https://orcid.org/0000-0002-3507-7468
Vice President, Research Integrity, Digital Science, UK
** s.porter@digital-science.com
https://orcid.org/0000-0002-6151-8423
Vice President, Research Futures, Digital Science, UK
***cvitale@arl.org
https://orcid.org/0000-0001-5581-5678
Director of Science Policy, Association of Research Libraries, USA
Abstract
Generative artificial intelligence (AI) and large language models significantly change how disciplines and communities analyze and report research. Leveraging these new tools, such as ChatGPT or Bard, authors can easily generate text and analyses for research articles. As a result, we have already witnessed several instances in which generative AI was used to write a paper or manuscript which unknowingly contained fake citations or false information. The scholarly community needs new indicators to signal, assess, and evaluate manuscripts and research quality to fortify public trust in research. This paper proposes a set of indicators for research integrity that brings much-needed transparency for published research, including those using generative AI technologies. We have then used AI to train algorithms to detect these indicators and applied them to 33 million full-text research publications. We can now see the key indicators as metrics to understand where various fields of research are in communicating and signalling trust.
1. Introduction
Large language models (LLM), such as ChatGPT or Bard, are a type of generative artificial intelligence (AI) that use machine learning to generate text based on the statistical likelihood or frequency of those words in a dataset. Unlike web search engines that return websites as results, generative AI pulls together text in response to a prompt. In using web-based sources as their training dataset, LLMs produce texts that often seem to be accurate, but may not be given that the text is produced based on this likelihood-model algorithm.
Questions and concerns about the use of text authored by generative AI tools in scholarly publishing surfaced almost immediately after the emergence of ChatGPT and Bard with problems highlighting the difficulty in identifying AI-generated text. According to research conducted by Dr. Catherine Goa, abstracts created by ChatGPT were submitted to academic reviewers, who could only spot ChatGPT generated abstracts 68% of the time. The reviewers also incorrectly identified 14% of real abstracts as being AI generated (Paul, 2023; Gao et. al., 2022).
These reporting guidelines are a critical indicator to ensure research is adequately reported to verify its integrity and potential reproducibility. In many ways, these guidelines and best practices represent a knowledge system for responsible reporting (MIT, 2020). They allow checks on the quality of article reporting rather than the language authenticity.
In the case of research reporting two primary problems have emerged: i) Its knowledge system has developed without an ontology or set of key indicators. Thus, normalising and standardising classes, subclasses, and relationships, through an ontology is a critical first step in alleviating much confusion and burden on users - while also ensuring the quality of the scientific reporting (Jones, 1998). and, ii) Due to the sheer volume of publications even before generative AI, an automated approach to detecting the key indicators is needed.
In our work, we have synthesized the primary reporting indicators based on community engagement to define trust markers of publications. Then we trained, tested, and validated algorithms to automatically detect text within articles (McIntosh et.al., 2023).
2. Methods
Developing Research Integrity Indicators
Based on previous research and building from the Repeatability Assessment Tool (RepeAT) Framework (McIntosh, Juehne & Vitale, 2017). The RepeAT framework was developed through a multi-phase process that involved coding and extracting recommendations and practices for improving reproducibility from publications and reports across the biomedical and statistical sciences, field testing the instrument, and refining variables. This Framework surfaced 5 key classes for assessing reproducibility and 119 subclasses that could be used to evaluate the quality of scientific reporting. While frameworks such as this are thorough, the criticalness of each variable for quality reporting was not evaluated. Further, this early research and framework focused heavily on the reproducibility of research re-using electronic health records.
Starting with the RepeAT Framework, the research team then conducted a comparative analysis of these variables and seven reporting guidelines and one publisher-based research reporting tool. Table 1 includes a list of all guidelines included in the comparative analysis.
Table 1: Reporting Guidelines and Checklist
Reporting Guidelines/Checklist | Study Design |
---|---|
MDAR | Establishes a minimum set of requirements in transparent reporting applicable to studies in the life sciences |
ARRIVE 2.0 | Animal pre-clinical research |
CONSORT 2010 | Clinical trials, Experimental studies |
STARD 2015 | Clinical trials, Diagnostic and prognostic studies, Experimental studies, Observational studies |
SPIRIT 2013 | Clinical trials, Experimental studies, Study protocols |
STROBE | Observational studies |
PRISMA 2020 | Systematic reviews, Meta-analyses, Reviews, HTA, Overviews |
AGREE | Clinical practice guidelines |
SQUIRE | Quality improvement studies |
The goal of this phase in the research was to determine the key elements or critical elements of reporting - now called trust markers. To do this, the research team mapped indicators across these reporting guidelines and collated all sections of the reporting guidelines into a full set of classes.
Trust Marker Algorithm Development
Trust marker development begins by reviewing a corpus of relevant articles (initially around 100) to develop a sense of common practices and locations for reporting the indicator. These definitions form the basis of a data dictionary to create annotation guidelines, which in turn provides accurate and consistent training data for our models.
Once a corpus of relevant papers have been identified we use a tool called Prodigy (https://prodi.gy/) to present the publication text extracted to annotate. Prodigy is an annotation and training tool that allows for rapid NLP model development. The system uses active learning, encouraging human annotators to annotate the documents the current model is least sure about. As a human answers the prompts in Prodigy, the model is updated. After the first round of annotations, we will train alpha models. Our experience indicates that the first round of annotations often requires us to clarify or refine the data dictionary described above. A second round of annotations with a new corpus of approximately 500 articles will then be completed.
The algorithms produced are then applied to the full-text publications within Dimensions.
3. Results
Compiling section names and variables across the RepeAT Framework, the seven reporting guidelines, and the publisher research reporting tool, resulted in over 178 subclasses and 17 classes. These classes and subclasses were then compared across the nine reporting guidelines and the one publisher checklist. These classifications were then organised into three categories of article quality: authorship and attribution, transparency, and reproducibility. The taxonomy of trust markers are shown in Figure 1.
Figure 1: Key Indicators as Trust Markers
The transparency category represents necessary subclasses to support transparency and fidelity of critical research reporting practices. It creates a set of indicators to represent good standards of practice for research communications, and may be external to the manuscript, such as the publishing of a protocol or registration.
The reproducibility category is centered around the elements of a paper which may facilitate a future researcher’s ability to achieve the same results when replicating the original study. The presence or lack of these elements and the subclasses below does not definitively determine the reliability of the authors, merely presents a likelihood.
Trust Marker Algorithm Implementation
When these trust marker algorithms are applied to 33 million full-text journal articles located in the Dimensions database, adoption practices of the trust markers across fields of research become more apparent. Based on the adoption percentages for the calendar year 2021, fields of research are assigned policy implementation bands (Figure 2). For Fields in band 1, there is already well-established practice of reporting the trust markers. It would be reasonable to work towards 100% compliance for all papers. For band 2, there is awareness of the trust marker, but more training is required to shift practice. For band 3, low awareness is assumed and significant education or training is needed in those fields. Blank fields indicate limited data for ethical approval statements either due to algorithm detection or the lack of publications on human or animals (thus not warranting ethical approvals).
Figure 2: Field of Research and Adoption Practice
For Fields the bands represent how prevalent the practice in reporting trust markers: (1) well-established, (2) moderate awareness, (3) low awareness. Blanks indicate the lack of data.
4. Conclusion
The scientific community and publishing industry needs more streamlined methods and approaches to ensuring the responsible reporting of research in manuscripts and assessment of manuscripts that may leverage generative AI. While certain study types and research participant types may require specificity, there are many general indicators that can be standardised and normalised across the research communications ecosystem.
All quality indicators can and are being developed to be automatically extracted from publications (McIntosh, 2021). Current and future work includes detecting the quality of the text, not just the presence of it. For example, we classify where data are said to be shared (or not) within the data availability statement. These classifications are occuring on appropriate trust markers (e.g., funding statement, code availability). What they offer is a faster means to check the quality of the vital piece (manuscript) of scholarly communication. They provide more than a metric of attention. They provide a signal of trust.
References
Equator network. (n.d.). Retrieved April 23, 2022, from https://www.equator-network.org/reporting-guidelines/.
Jones, D., Bench-Capon, T., & Visser, P. (1998). Methodologies for ontology development.In Proc. IT&KNOWS; Conference, XV IFIP World Computer Congress, Budapest
Gao, C. A., Howard, F. M., Markov, N. S., Dyer, E. C., Ramesh, S., Luo, Y., & Pearson, A. T. (2022). Comparing scientific abstracts generated by ChatGPT to original abstracts using an artificial intelligence output detector, plagiarism detector, and blinded human reviewers. bioRxiv preprint. https://doi.org/10.1101/2022.12.23.521610
MIT, Knowledge Meta-Networking for Decision and Strategy. (2020). Retrieved April 30, 2022, from https://gssd.mit.edu/knowledge-system.
McIntosh, L. D., Juehne, A., Vitale, C. R., Liu, X., Alcoser, R., Lukas, J. C., & Evanoff, B. (2017). Repeat: A framework to assess empirical reproducibility in biomedical research. BMC Medical Research Methodology, 17(1). https://doi.org/10.1186/s12874-017-0377-6
McIntosh, L. (2021). Automating quality checks in the publishing process. Advances in Data Mining and Database Management, 251–264. https://doi.org/10.4018/978-1-7998-5589-7.ch013
McIntosh, L.D., Whittam, R., Porter, S., Vitale, CR.H., Kidambi, M.; Science, Digital (2023): Dimensions Research Integrity White Paper. Digital Science. Report. https://doi.org/10.6084/m9.figshare.21997385.v2
Paul, M., When ChatGPT writes scientific abstracts, can it fool study reviewers? (2023). Retrieved April 21, 2023, from https://news.northwestern.edu/stories/2023/01/chatgpt-writes-convincing-fake-scientific-abstracts-that-fool-reviewers-in-study/.
Open science practices
Data and analysis for the checklist are available on our GitHub webpage: https://github.com/ripeta/Research_Reproducibility.
Competing interests
Leslie McIntosh is an employee of Digital Science and the founder of Ripeta, now part of Digital Science. Simon Porter is an employee of Digital Science. Cynthia Hudson Vitale is co-founder and advisor for Ripeta.
Funding information
This research has received no outside funding.
McIntosh, L., Porter, S. & Hudson-Vitale, C. (2023). Research Integrity Indicators in the Age of Artificial Intelligence [preprint]. 27th International Conference on Science, Technology and Innovation Indicators (STI 2023).
No reviews to show. Please remember to LOG IN as some reviews may be only visible to specific users.