Can ChatGPT predict article retraction through tweets?

Er-Te Zheng 1, Hui-Zhen Fu 2, Zhichao Fang 1,3*

1 School of Information Resource Management, Renmin University of China, China

2 Department of Information Resources Management, Zhejiang University, China

3 Centre for Science and Technology Studies (CWTS), Leiden University, The Netherlands

* z.fang@cwts.leidenuniv.nl

Abstract: Detecting academic misconduct is a vital task, and reader comments have become a key source for identifying such misconduct. This study investigates whether tweets can predict the retraction of articles. A total of 3,505 retracted articles mentioned on Twitter from 2012 to 2021 were collected, and the Coarsened Exact Matching (CEM) method was used to gather 3,505 non-retracted articles with similar features, along with all relevant tweets. The tweets were analyzed through keyword identification, machine learning methods, and ChatGPT. The results show that tweets can predict the retraction of articles to some extent, although the predictive performance of each model still needs improvement. ChatGPT showed higher accuracy in predicting article retractions than the other models and can assist manual prediction, accelerating the detection of academic misconduct. However, the ChatGPT-based prediction method still suffers from logical reasoning errors and over-interpretation, which need to be addressed in future studies.

Keywords: Social media; Machine learning; Prediction; Retraction; ChatGPT

1 Introduction

The rise of social media platforms has transformed the way information is shared and disseminated, including within the academic community. Twitter, in particular, has become a popular platform for scholars to discuss and promote their research. However, along with its benefits, social media also presents challenges, such as the dissemination of flawed or retracted articles.

The ability to predict article retractions is of great importance to maintain the integrity and credibility of scientific research. Traditional methods for identifying problematic articles rely heavily on the scrutiny of journal editors and peer reviewers. However, with the vast amount of information available on social media platforms, researchers have started exploring the potential of utilizing these platforms for early detection of problematic articles (Bornmann & Haunschild, 2018; Haunschild & Bornmann, 2021; Peng et al., 2022).

One emerging approach is the use of large language models (LLMs), such as ChatGPT, which have shown remarkable performance in various natural language processing tasks. ChatGPT is a state-of-the-art language model developed by OpenAI, known for its ability to generate coherent and contextually relevant responses. This prompts us to consider whether ChatGPT can leverage its language processing capabilities to predict article retractions based on Twitter data.

The objective of this study is to investigate the extent to which ChatGPT, guided by tweet analysis, can predict article retractions. We aim to assess the accuracy and effectiveness of ChatGPT's predictions and compare them with human predictions based on Twitter content. Additionally, we will evaluate the limitations and potential for improvement in ChatGPT's predictive capabilities.

The specific research questions of this article include:

(1) Can tweets predict article retraction?

(2) Can ChatGPT and machine learning models predict article retractions through tweets, and how does their predictive effectiveness compare to that of humans?

2 Data and methods

2.1 Data

This article obtained data on 9,364 retracted articles published from 2012 to 2021 from the Web of Science (WoS) and Retraction Watch databases. Based on the altmetric data provided by Altmetric.com, 3,628 (38.7%) of these retracted articles were mentioned in tweets. To judge whether Twitter can predict article retraction, we need not only Twitter data related to retracted articles but also data related to non-retracted articles for comparison. Therefore, this article retrieved non-retracted articles published in the same issue of the same journal as the retracted articles, totaling 749,480 articles. After removing the articles not mentioned in tweets, we finally obtained 432,585 non-retracted articles.

The bibliometric information of retracted and non-retracted articles was collected from the WoS local database of the Centre for Science and Technology Studies (CWTS) at Leiden University in the Netherlands, and the Twitter data was collected in May 2022 using the Twitter API.

2.2 Methods

The research process of this article is as follows (see Figure 1):

First, we collect retracted articles that are mentioned on Twitter, and then use the Coarsened Exact Matching (CEM) method to match retracted and non-retracted articles with similar characteristics one by one based on selected covariates (publishing journal, publication year, number of authors, and number of tweets).

Next, based on the selected articles, we collect related tweets mentioning these articles, divided into tweets related to retracted articles (referred to as "retraction tweets") and tweets related to non-retracted articles (referred to as "non-retraction tweets"). Then, all tweet data is divided into two groups: training set data (tweets of articles published from 2012 to 2020) and test set data (tweets of articles published in 2021).

Finally, we use keyword identification, machine learning, and ChatGPT prediction methods to build models in the training set data, test their prediction performance in the test set, and calculate the models' ability to predict retractions. We manually label the tweet data in the test set and compare the prediction results of each model with the results of manual prediction, thereby comparing the ability of models and human predictions to predict retractions.

Figure 1. Conceptual framework of the study

2.2.1 Coarsened Exact Matching (CEM)

Coarsened Exact Matching (CEM) is a matching method based on coarsening the values of covariates into groups and then performing exact matching within each group (Blackwell et al., 2009; Iacus et al., 2012). It is a method to reduce the imbalance between the treatment group and the control group by pruning observations that have no matches in another group.

In order to test whether Twitter can predict retractions, this study not only uses tweets about retracted articles as research objects but also includes tweets about non-retracted articles. Because tweets about non-retracted articles may also express certain negative sentiments, it is necessary to distinguish them from the negative sentiments of tweets about retracted articles. Therefore, this study uses the CEM method to match non-retracted articles with retracted articles one by one, ensuring that the successfully matched retracted and non-retracted articles are published in the same journal in the same year and controlling for a similar number of authors and tweets.
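As a concrete illustration, the following minimal sketch shows how 1:1 coarsened exact matching could be implemented with pandas: journal and year are matched exactly, count covariates are coarsened into bins, and observations without a match are pruned. The column names (article_id, journal_id, pub_year, n_author, n_tweets) and bin edges are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of 1:1 Coarsened Exact Matching (CEM) with pandas.
# Column names and bin edges are illustrative, not the authors' settings.
import pandas as pd

def coarsen(df: pd.DataFrame) -> pd.Series:
    """Build a stratum key: exact on journal and year, coarsened on counts."""
    author_bin = pd.cut(df["n_author"], bins=[-1, 3, 6, 10, float("inf")])
    tweet_bin = pd.cut(df["n_tweets"], bins=[-1, 2, 5, 20, float("inf")])
    return (df["journal_id"].astype(str) + "|" + df["pub_year"].astype(str)
            + "|" + author_bin.astype(str) + "|" + tweet_bin.astype(str))

def cem_match(retracted: pd.DataFrame, candidates: pd.DataFrame) -> pd.DataFrame:
    """Pair each retracted article with one non-retracted article in the same stratum."""
    retracted = retracted.assign(stratum=coarsen(retracted))
    candidates = candidates.assign(stratum=coarsen(candidates))
    pairs, used = [], set()
    for _, row in retracted.iterrows():
        pool = candidates[(candidates["stratum"] == row["stratum"])
                          & (~candidates["article_id"].isin(used))]
        if not pool.empty:  # retracted articles without a match are pruned
            control_id = pool.iloc[0]["article_id"]
            used.add(control_id)
            pairs.append((row["article_id"], control_id))
    return pd.DataFrame(pairs, columns=["retracted_id", "control_id"])
```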

2.2.2 Similarity calculation

To eliminate the impact of non-body text on the analysis, this study removes URLs and @user mentions from the tweet text and then calculates the similarity between the tweet and the article title. Similarity is measured with a Levenshtein-based score normalized to a 0-100 scale, with values closer to 100 indicating greater similarity between the two texts. After excluding tweets with a similarity score greater than 90 to the article title, the vast majority of the remaining tweet text can be considered the original content of the tweet poster.
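A minimal sketch of this filtering step is shown below, assuming the rapidfuzz library for the Levenshtein-based ratio (the paper does not state which implementation was used); the cleaning regexes and the 90-point threshold mirror the description above.

```python
# Sketch of the tweet-title similarity filter: strip URLs and @mentions,
# then keep only tweets whose similarity to the title is at most 90/100.
import re
from rapidfuzz import fuzz

def clean_tweet(text: str) -> str:
    text = re.sub(r"https?://\S+", "", text)   # drop URLs
    text = re.sub(r"@\w+", "", text)           # drop @user mentions
    return text.strip()

def is_original_content(tweet: str, title: str, threshold: float = 90.0) -> bool:
    """True if the tweet is not merely a copy of the article title."""
    return fuzz.ratio(clean_tweet(tweet).lower(), title.lower()) <= threshold
```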

2.2.3 Keyword analysis

Considering that the frequency of negative words might help to identify scientific misconduct, this study calculates the TF-IDF values of keywords in retracted and non-retracted tweets, and computes the TF-IDF difference of the same keyword in the two types of texts. A difference greater than 0 means that the keyword appears more frequently in retracted tweets than in non-retracted tweets. From the keywords with the highest TF-IDF differences, this study selected 128 main keywords as critical keywords for predicting retractions (see Appendix for details). Based on the selected keywords, this study evaluates the effectiveness of critical keywords in predicting retractions.
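The following sketch illustrates one way to compute such TF-IDF differences with scikit-learn, assuming the tweets are available as two lists of strings; the vectorizer settings and variable names are illustrative, not the authors' exact configuration.

```python
# Rank keywords by the gap between their mean TF-IDF weight in retraction
# tweets and in non-retraction tweets (positive gap = more typical of
# retraction tweets).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def keyword_tfidf_diff(retraction_tweets, non_retraction_tweets, top_k=128):
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(retraction_tweets + non_retraction_tweets)
    n_ret = len(retraction_tweets)
    mean_ret = np.asarray(matrix[:n_ret].mean(axis=0)).ravel()
    mean_non = np.asarray(matrix[n_ret:].mean(axis=0)).ravel()
    diff = mean_ret - mean_non
    terms = np.array(vectorizer.get_feature_names_out())
    order = np.argsort(diff)[::-1][:top_k]
    return list(zip(terms[order], diff[order]))
```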

2.2.4 Machine learning

Machine learning is a method that enables computers to learn from data and make predictions or decisions without being explicitly programmed. This study explores whether machine learning models can predict the retraction of articles based on tweets, and if so, which model has the highest prediction accuracy.

Specifically, this study uses four classical machine learning models in the research, including Naive Bayes (NB), Random Forest (RF), Support Vector Machines (SVM), and Logistic Regression (LR).
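A hedged sketch of how these four classifiers could be trained on TF-IDF features of the tweet text with scikit-learn is shown below; the feature representation and hyperparameters are assumptions, since the paper does not report them.

```python
# Train NB, RF, SVM, and LR on TF-IDF features of tweet text and predict
# retraction labels (1 = retracted, 0 = non-retracted). Settings are
# illustrative defaults, not the authors' configuration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

models = {
    "NB": MultinomialNB(),
    "RF": RandomForestClassifier(n_estimators=300, random_state=42),
    "SVM": LinearSVC(),
    "LR": LogisticRegression(max_iter=1000),
}

def fit_and_predict(train_texts, train_labels, test_texts):
    predictions = {}
    for name, clf in models.items():
        pipeline = make_pipeline(TfidfVectorizer(stop_words="english"), clf)
        pipeline.fit(train_texts, train_labels)
        predictions[name] = pipeline.predict(test_texts)
    return predictions
```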

2.2.5 ChatGPT

In this study, we used prompts to guide ChatGPT to learn from the training set results and predict the retraction of articles in the test set. We also asked ChatGPT to provide reasons for its predictions of retraction. The prompt used in this study is:

"Forget all your previous instructions. You're a scientist with a good read on articles, and you know which articles are likely to be problematic and may be retracted based on their relevant Twitter mentions. Now I give you some tweets related to articles, please answer the question 'Does these tweets suggest the related article will be retracted?' If you think the article will be retracted, answer 'YES'. If you do not think the article will be retracted or you are not sure, answer 'NO'. Then use a short, concise sentence in the next sentence to explain why you make the choice."

2.3 Indicators

This study uses keyword identification, machine learning, and ChatGPT predictions, all of which require evaluation metrics to measure model quality. Common metrics include accuracy, precision, recall (Cleverdon & Keen, 1966), F1-Score (Van Rijsbergen, 1979), ROC, and AUC (Egan, 1975). This study selects four of the most commonly used evaluation metrics (accuracy, precision, recall, and F1-Score) to evaluate the predictive performance of the models and to compare the results of the models with human prediction.

Accuracy refers to the percentage of correct predictions among all samples. Precision, also known as positive predictive value, is the probability that samples predicted as positive are indeed positive; in this study, it reflects the probability that articles predicted as retracted by the model are indeed retracted. Recall, also known as sensitivity, is the probability that actual positive samples are predicted as positive; in this study, it reflects the proportion of retracted articles that the model can identify. The F1-Score considers both precision and recall; its formula is \(F1\text{-}Score = \frac{2 \times Precision \times Recall}{Precision + Recall}\). Its value lies between 0 and 1; the higher the F1-Score, the better the performance of the model.
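These four metrics can be computed directly with scikit-learn, as in the sketch below (label conventions are illustrative: 1 = retracted, 0 = non-retracted).

```python
# Compute the four evaluation metrics used in this study from actual and
# predicted retraction labels.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred) -> dict:
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),  # predicted retracted that truly are
        "recall": recall_score(y_true, y_pred),         # retracted articles that are caught
        "f1": f1_score(y_true, y_pred),                 # harmonic mean of precision and recall
    }
```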

Different metrics measure the effectiveness of the detection model from different aspects, and the comprehensive use of various evaluation metrics is beneficial to better compare the differences in prediction performance between different models.

3 Results

3.1 Statistical Information

3.1.1 Result of CEM

The retracted and non-retracted articles were matched according to four covariates: journal, publication year, number of authors, and number of tweets, as shown in Table 1. The L1 distance is an important indicator of the balance of CEM matching. It ranges from 0 to 1: the closer it is to 0, the smaller the difference in covariates between the treatment group and the control group and the better the matching; the closer it is to 1, the worse the matching. As Table 1 shows, after CEM matching the differences in each covariate between the treatment group and the control group were substantially reduced. In the end, a total of 7,010 articles were matched, comprising 3,505 retracted articles and 3,505 non-retracted articles. The successfully matched articles accumulated 17,134 retraction tweets and 12,389 non-retraction tweets.

Table 1. Covariate Balance Test Before and After CEM Matching

L1 distance               Original Data    CEM Weighted
journal_id                0.57             0
pub_year                  0.11             0
n_author                  0.14             0.06
n_tweets                  0.18             0.03
N (Control; Treatment)    432,585; 3,628   3,505; 3,505
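For reference, the (coarsened) L1 imbalance statistic reported in Table 1 can be computed as half the sum of absolute differences between the treated and control relative frequencies across strata (Iacus et al., 2012); the sketch below assumes each article has already been assigned a stratum or coarsened covariate value.

```python
# L1 imbalance: 0.5 * sum_k |f_k - g_k|, where f_k and g_k are the relative
# frequencies of stratum k in the treated (retracted) and control
# (non-retracted) groups. Inputs are lists of stratum labels.
from collections import Counter

def l1_distance(treated_strata, control_strata) -> float:
    f, g = Counter(treated_strata), Counter(control_strata)
    nf, ng = sum(f.values()), sum(g.values())
    return 0.5 * sum(abs(f[k] / nf - g[k] / ng) for k in set(f) | set(g))
```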

Since this study aims to predict whether an article will be retracted based on tweets, it is necessary to exclude tweets after the article's retraction and select relevant tweets prior to the retraction for prediction. After screening for pre-retraction tweets based on the retraction date, we obtained 15,383 pre-retraction tweets for retracted articles and 11,031 non-retraction tweets for non-retracted articles. A comparison of the number of tweets before and after retraction is shown in Figure 2.

Figure 2. Comparison of the number of tweets before and after 'retraction' for retracted and non-retracted articles.
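A minimal sketch of the pre-retraction screening step described above is given below; the column names (article_id, tweet_date, retraction_date) are assumptions for illustration.

```python
# Keep only tweets posted before the article's retraction date. Tweets of
# non-retracted articles (no retraction date) are all kept.
import pandas as pd

def pre_retraction_tweets(tweets: pd.DataFrame, retractions: pd.DataFrame) -> pd.DataFrame:
    merged = tweets.merge(retractions[["article_id", "retraction_date"]],
                          on="article_id", how="left")
    keep = merged["retraction_date"].isna() | (merged["tweet_date"] < merged["retraction_date"])
    return merged.loc[keep].drop(columns="retraction_date")
```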

3.1.2 Similarity calculation

This article calculates the Levenshtein-based similarity between the text of tweets and the titles of articles in order to exclude tweets that simply copy the article title. Figure 3 shows the distribution of similarity scores between retraction and non-retraction tweets and their corresponding article titles. After excluding tweets with a similarity score greater than 90, the final set of tweets used for analysis comprised 10,932 retraction tweets and 6,962 non-retraction tweets, involving 2,070 retracted articles and 2,053 non-retracted articles, respectively.

Figure 3. Distribution of Levenshtein-based similarity scores between tweet text and corresponding article title.

3.2 Evaluating the prediction accuracy of models on article retractions

The article data collected in this study spans from 2012 to 2021. For prediction purposes, the article and tweet data from the first nine years (2012-2020) is used as a training set, with the data from the final year (2021) used as a test set.

This article uses the training set to adjust parameters and pre-train the models. The actual retraction status of the articles in the test set is then compared with the model's predictions, with the prediction effects of the models evaluated using indicators such as accuracy, precision, recall, and F1-Score. Additionally, this article compares the manually predicted retraction status of the articles in the test set with the model's predictions, calculating the consistency between the model's and human predictions.

As shown in Table 2, this article reports the predictive performance of each model, with manual prediction used as the baseline. During manual prediction, an annotator judges whether an article will be retracted based on whether the tweets in the test set mention academic misconduct related to the article. The precision and recall of manual prediction provide important information: precision exceeds 90%, meaning that when a human judges from the tweets that an article is problematic, the article is retracted in over 90% of cases. This shows that tweets can indeed predict article retraction. However, the recall of manual prediction is around 18%, meaning that only 18% of all retracted articles can be identified from tweets by manual prediction.

Table 2. Accuracy of each model in predicting article retractions

Model     Accuracy (%)   Precision (%)   Recall (%)   F1-Score
Keyword   52.96          48.72           46.63        0.48
LR        60.28          56.88           55.83        0.56
RF        58.31          55.91           43.56        0.49
NB        58.03          53.30           69.33        0.60
SVM       57.18          62.22           17.18        0.27
ChatGPT   58.31          64.15           20.86        0.31
Human     61.69          90.91           18.40        0.31

The analysis of the manual prediction results leads to the conclusion that only a few tweets can predict the retraction of an article. This is in line with our expectations, because most tweets are descriptions and summaries of the article itself. Few tweets express the posters' own opinions about the article, and even fewer identify issues with it. However, once critical tweets about an article appear, they can largely foretell that the article indeed has problems and will be retracted. This has important implications for promoting the discovery of scientific misconduct and strengthening the research integrity of the scientific community.

Taking the results of manual prediction as the benchmark, we can evaluate the performance of each model's predictions. In manual prediction, the two most meaningful indicators are precision and recall. The higher the precision, the more accurate the model's prediction of article retraction; however, a higher recall is not necessarily better. Because only 18% of retracted articles can be predicted through tweets by manual prediction, a model whose recall far exceeds 18% is likely overfitted, that is, it also labels as retracted articles whose tweets carry no retraction signal, which contradicts our initial intention. If a model's recall is close to 18%, the model is more likely to reflect the actual situation.

Therefore, through the above analysis, the two best-performing models are ChatGPT and the SVM machine learning model. Both have precision rates exceeding 60%, indicating that their predictions of retraction are relatively accurate; at the same time, their recall rates are close to the 18% of manual prediction, indicating less overfitting and a lower tendency to label articles as retracted when their tweets carry no such signal. Compared with ChatGPT and SVM, keyword recognition and the other three machine learning models (LR, RF, NB) have low precision and high recall, and their predictive performance is poor.

In addition, this article uses the F1-score as an overall indicator of predictive performance. Interpreting the F1-score requires a benchmark model: assuming that the positive class accounts for a proportion a of the dataset, the benchmark model predicts every sample as positive. According to the formula, the F1-score of this benchmark model is 2a/(a+1). Here, the proportion of retracted articles (the positive class) in the dataset is 0.46, so the benchmark F1-score is 0.63. The F1-score of manual prediction in Table 2 is below 0.63, indicating that, overall, predicting retraction through Twitter remains difficult, in line with the earlier conclusion. The F1-scores of the first four models are clearly higher than that of manual prediction, indicating some overfitting, while the F1-scores of the ChatGPT and SVM models are similar to manual prediction, which again suggests that these two models predict relatively well.
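For reference, the benchmark F1-score follows directly from the metric definitions above: a model that labels every article as retracted has precision equal to the positive-class share a and recall equal to 1.

```latex
% Benchmark F1 for a model that predicts every sample as positive,
% where a is the share of retracted articles in the dataset.
\[
Precision = a, \qquad Recall = 1, \qquad
F1 = \frac{2 \cdot a \cdot 1}{a + 1} = \frac{2a}{a + 1}
\]
% With a = 0.46: F1 = 0.92 / 1.46 ≈ 0.63.
```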

3.3 Comparing the results of models and human predictions on article retractions

This article not only evaluates each model's predictions against manual prediction, but also uses the set of manual predictions as a reference test set. By comparing each model's predictions with the manual predictions, we can determine how similar the model results are to the human results, and thus to what extent the model can replace humans in using Twitter to predict retractions, as shown in Table 3. The prediction consistency here reflects the overall agreement between the model predictions and the manual predictions, analogous to the accuracy defined earlier. Precision, recall, and F1-Score are marked with a prime (') to distinguish them from the indicators in Table 2.

Table 3. Consistency of each model in predicting article retraction with manual prediction

Model     Accuracy' (%)   Precision' (%)   Recall' (%)   F1-Score'
Keyword   61.97           17.31            81.82         0.29
LR        56.90           12.50            60.61         0.21
RF        63.94           12.60            48.48         0.20
NB        46.20           12.74            81.82         0.22
SVM       84.79           26.67            36.36         0.31
ChatGPT   85.92           33.96            54.55         0.42

As can be seen from the table, in terms of prediction consistency, the two best-performing models are still the ChatGPT and SVM models, both of which have a consistency of over 80% with manual prediction. The consistency of keyword recognition and other machine learning models is between 46% and 64%, which is generally acceptable.

In terms of precision', ChatGPT is significantly higher than the other models: among the articles predicted by ChatGPT to be retracted, about one third are also predicted to be retracted manually. The next best performer is the SVM model, with a precision' of about one quarter, while the other models still perform poorly. In terms of recall', the keyword recognition and NB models exceed 80%, indicating that these two models cover manual predictions well: more than 80% of the articles manually predicted to be retracted are also flagged by these two models. However, as analyzed in the previous section, the high recall' of these two models is related to overfitting: they label many articles as retracted whose tweets carry no such signal, thereby inflating recall'. A high recall' therefore does not necessarily mean that a model is close to the manual prediction results. The recall' of the ChatGPT and SVM models, which performed well before, is relatively low, indicating that there is still room for improvement in these two models.

The F1-score' measures the overall consistency between each model and manual prediction. Here, the proportion of articles manually predicted to be retracted in the dataset is 0.09, so the F1-score of the benchmark model is 0.17. All models in the table exceed 0.17, indicating that each model has a certain degree of consistency with the manual prediction results; the higher the F1-score', the better the overall consistency. The results show that ChatGPT's predictions are the most consistent with manual prediction, demonstrating the advantage of large language models over traditional machine learning models. The SVM model performs second best, being the closest to manual prediction among the machine learning models. Keyword recognition achieves consistency similar to SVM, which is expected because manual prediction is itself largely guided by keywords.

4 Discussion

The study found that ChatGPT performed exceptionally well in predicting article retractions and was the closest to human prediction among the models tested. In addition, ChatGPT has an important advantage over other models: it can provide reasoning for its predictions, while other models are black boxes and cannot provide explanatory insights even if they produce results. The following are a few examples of successful predictions made by ChatGPT, along with brief explanations:

Typical cases of accurate predictions:

Original tweets:

"Seen in ......, this is a very bizarre paper. The paper discusses both groundwater seepage and dance training. Good thing we have paywalled journals like ...... making massive profits because they do quality control, right?"

"Why Did a Peer-Reviewed Journal Publish Hundreds of Nonsense Papers?"

The answer of ChatGPT: "YES. These tweets suggest that the related paper will be retracted. They all either mock or express concern about the quality of the paper and its publication, indicating potential issues with the peer review process or the validity of the research itself."

Original tweet:

"what are you doing to science? please stop your nonsense journals! #scienceintegrity"

The answer of ChatGPT: "YES. The tweet indicates that the paper will be retracted because it suggests that the paper is problematic and the journal publisher is not ensuring scientific integrity."

As these examples show, ChatGPT demonstrates a deep understanding of tweet texts and accurately extracts information that could potentially predict article retractions. It provides valuable assistance in using tweets to assess whether an article has underlying issues.

However, there are also cases where the predictions were incorrect, such as:

Original tweet:

"Light-activated Creation of enzymes such as PRMT5, an arginine methyltransferase explains why Nobel Prize-winning works of ...... were retracted prior to her placement on ......'s team of science advisors."

The answer of ChatGPT: "YES. The tweet suggests that the related article will be retracted because it mentions the retraction of ......'s Nobel Prize-winning works and connects it to the publication of a paper about the creation of enzymes using light activation, indicating potential issues with the paper."

As shown in this example, the tweet is an evaluation of the article: it states that the article explains why a Nobel laureate's works were retracted (that is, the article points out issues in the laureate's work). ChatGPT, however, associates the article itself with the retracted works and infers that the article must have underlying issues. This case highlights that ChatGPT is prone to logical reasoning errors and over-interpretation. Therefore, although ChatGPT shows the highest consistency with human predictions among the models tested, it is not perfect and there is still significant room for improvement.

5 Conclusions

The study found that although existing methods can predict academic misconduct through tweets to some extent, their accuracy still needs improvement. Given the heterogeneity of social media users and the entertainment-oriented nature of sharing behavior, readers need to approach articles shared on social media with a critical eye, as potential issues may go unnoticed by Twitter users. Furthermore, the research revealed that ChatGPT exhibits higher consistency with human predictions based on Twitter than other models, indicating that ChatGPT can assist in predicting retracted articles, thereby accelerating the identification of problematic research and supporting a healthy research environment. However, the ChatGPT-based approach still has limitations, such as logical reasoning errors and a tendency toward over-interpretation, which future research is expected to address.

References

Blackwell, M., Iacus, S., King, G., et al. (2009). cem: Coarsened exact matching in Stata. Stata Journal, 9(4), 524-546.

Bornmann, L., & Haunschild, R. (2018). Allegation of scientific misconduct increases Twitter attention. Scientometrics, 115, 1097-1100.

Cleverdon, C., & Keen, E. (1966). Aslib Cranfield research project: Factors determining the performance of indexing systems (Vol. 1).

Egan, J. P. (1975). Signal detection theory and ROC-analysis. Academic Press.

Haunschild, R., & Bornmann, L. (2021). Can tweets be used to detect problems early with scientific papers? A case study of three retracted COVID-19/SARS-CoV-2 papers. Scientometrics, 126(6), 5181-5199.

Iacus, S. M., King, G., & Porro, G. (2012). Causal inference without balance checking: Coarsened exact matching. Political analysis, 20(1), 1-24.

Peng, H., Romero, D. M., & Horvát, E.-Á. (2022). Dynamics of cross-platform attention to retracted papers. Proceedings of the National Academy of Sciences, 119(25), e2119086119.

Van Rijsbergen, C. (1979). Information retrieval: Theory and practice. In Proceedings of the Joint IBM/University of Newcastle upon Tyne Seminar on Data Base Systems.

Appendix

Critical keyword list

risk, resistance, researchretracted, depression, control, retraction, retracted, issue, mortality, design, publishing, investigation, publisher, publish, editor, wrong, plagiarism, failure, concern, issues, problem, standard, duplication, controversial, concerns, falsification, lose, ethical, limited, lying, retract, creator, gods, overload, criticism, bias, proof, nonsense, peerreview, journal, error, proper, unexpected, harmful, crap, fraud, fake, pruittdata, worst, moralizing, wtf, wth, what, wait, joke, unclear, withdrawn, lie, contradiction, claim, flawed, damn, doubt, creatorgate, manipulation, pathetic, manipulated, reputation, awareness, stupid, crackpot, why, shame, dishonesty, controversy, trouble, dumb, raise, authorship, fuck, dangerous, pseudoscience, strange, embarrass, unethical, hate, pretend, questionable, copyediting, mess, bummed, how, unscientific, handofgod, inconclusive, oversight, irresponsible, informed, garbage, lied, absurd, disrespectful, whoa, objections, retractracistsbasically, terrifying, confounder, fucking, badscience, abhorrent, complaint, contradicted, dodgy, worry, troubling, laughable, misconduct, fakedata, confusing, seriously, unsure, disappointing, withdraw, verify, confuse, mortal, debacle, shameful.
