Analyzing the use of email addresses in scholarly publications

Following recent fraud cases and the mass retractions of papers that ensued, authors’ use of institutional versus non-institutional email addresses has gained a lot of attention. A database was set up with all email addresses in the by-line of articles and reviews indexed in the Web of Science database. In the period 2017-2021, the use of institutional email addresses by corresponding authors is much more common in the Anglo-Saxon and Western European countries than in the BRICS countries. In the latter, corresponding authors use a non-institutional email address nearly as often as an institutional one. The journals’ publishing model does not seem to have a large impact on the type of email address used. Most striking is the much more frequent use of non-institutional email addresses in retracted papers compared to non-retracted papers.


Introduction
Since its introduction in the 1970s, 'electronic mail' has become popular, and journals started to include the email address of the corresponding author in the publications' by-line (Wren, Grissom & Conway, 2006). Some journals include not only the email address of the corresponding author but also the email addresses of other authors.
The objective of this study is the construction and analysis of a data set based on the email addresses mentioned in the by-line of publications in the journals indexed in the Web of Science (WoS). The section 'Data and methods' describes the procedure to extract and enrich the data. The subsequent sections present a first exploratory analysis of its characteristics. In the conclusions, these results are briefly discussed in relation to earlier work on the usage of email addresses, and work in progress on these topics is outlined.

Data and methods
Data was collected from the WoS database. We used the in-house version of the WoS database available at the Centre for Science and Technology Studies (CWTS) at Leiden University. This version includes publications starting from 1980 in the Science Citation Index Expanded, the Social Sciences Citation Index, the Arts & Humanities Citation Index, and the Conference Proceedings Citation Index.
We first collected all email addresses available in the database. In a next step, we parsed them to identify the domain name. This is not a trivial task, as illustrated in Figure 1. Selecting everything after the @-sign is not accurate enough because of subdomains. Selecting the parts before and after the last dot is also not accurate enough because of multi-part suffixes. By making use of the Public Suffix List, we were able to correctly identify the domain name of each email address. Of the 664.7 thousand unique domain names identified in this way, 11,608 were associated with more than 10 different email addresses and appeared in more than 100 publications. These 11,608 domain names covered 95% of all publications with an email address.
Using a rule-based approach, we then determined for each of the 11,608 domain names whether it could be linked to a scholarly organization. This approach relies on the Organization Enhanced affiliation data available in WoS. More specifically, for each combination of a domain name and an affiliated organization, we determined whether i) the number of publications in which this combination occurs is greater than 10, ii) the percentage of publications in which this combination occurs relative to the total number of publications of the affiliated organization is greater than 5%, and iii) the percentage of publications in which this combination occurs relative to the total number of publications of the domain name is greater than 30%. If all three criteria were met, we linked the domain name to the organization and classified it as an institutional domain name. The idea behind this approach is as follows. If a certain email domain name appears in the publications of authors from many different organizations, this indicates that it is the domain name of an email service provider. If a certain email domain name is used only in publications by authors of the same organization or a limited number of suborganizations, this indicates that it is the domain name of a scholarly organization.
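The three criteria above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `link_domains`, the input layout (one tuple per publication, domain name, and affiliated organization), and the assumption that the input covers all publications of each organization and each domain name are choices made only for this illustration.

```python
from collections import defaultdict

# Thresholds taken from the three criteria described in the text.
MIN_PUBS = 10               # i) combination occurs in more than 10 publications
MIN_SHARE_OF_ORG = 0.05     # ii) more than 5% of the organization's publications
MIN_SHARE_OF_DOMAIN = 0.30  # iii) more than 30% of the domain name's publications


def link_domains(records):
    """Return {domain: set of organizations} for domains meeting all criteria.

    `records` is an iterable of (publication_id, domain, organization)
    tuples. This layout is an assumption of the sketch, not the WoS schema.
    """
    pubs_by_pair = defaultdict(set)
    pubs_by_org = defaultdict(set)
    pubs_by_domain = defaultdict(set)
    for pub_id, domain, org in records:
        pubs_by_pair[(domain, org)].add(pub_id)
        pubs_by_org[org].add(pub_id)
        pubs_by_domain[domain].add(pub_id)

    links = defaultdict(set)
    for (domain, org), pubs in pubs_by_pair.items():
        n = len(pubs)
        if (n > MIN_PUBS
                and n / len(pubs_by_org[org]) > MIN_SHARE_OF_ORG
                and n / len(pubs_by_domain[domain]) > MIN_SHARE_OF_DOMAIN):
            links[domain].add(org)
    return dict(links)
```

In this sketch, a domain name spread evenly over five organizations reaches only 20% of its publications per organization, fails criterion iii), and is therefore left unlinked, which matches the intuition given for email service providers.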
After applying the rule-based approach, we manually validated and corrected the domain name assignments using Website Informer, an online tool by Informer Technologies, Inc. that can be used to gather detailed information on domain names. We used the free version of the tool to collect information on the domain name and its owner. Where the tool returned no or insufficient information, the Microsoft Edge browser was used to further determine the organization associated with a domain name.
First, the robustness of the rule-based assignments of domain names to scholarly organizations was manually tested. These assignments turned out to be very accurate. Based on a 5% random sample, only 7 anomalies in the links between domain names and organizations were detected. We then focused on domain names that were not assigned to an organization. It turned out that the rule-based approach was less accurate in this case. We therefore decided to check all unassigned domain names and tried to assign them manually to a scholarly organization or an email service provider. Only 164 (1.4%) could not be assigned, representing only 0.2% of the total number of email addresses linked to the 11,608 domain names taken into account.
After applying the rule-based approach and the manual correction, 10,838 domain names were assigned to a scholarly organization and 606 domain names to an email service provider. Based on these assigned domain names, we finally classified all email addresses in the WoS. Email addresses linked to a domain name of a scholarly organization were classified as institutional, and email addresses linked to a domain name of an email service provider were classified as non-institutional. All other email addresses were classified as unknown.
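The parsing and classification steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the code used in the study: `PUBLIC_SUFFIXES` is a tiny hypothetical stand-in for the full Public Suffix List, the wildcard and exception rules of that list are ignored, and the domain names in the usage example are fictional.

```python
# Tiny stand-in for the Public Suffix List (assumption: a real run would
# load the complete list, including its wildcard and exception rules).
PUBLIC_SUFFIXES = {"com", "edu", "nl", "uk", "ac.uk", "cn", "edu.cn"}


def registrable_domain(email, suffixes=PUBLIC_SUFFIXES):
    """Return the longest matching public suffix plus one label, or None."""
    local, sep, host = email.strip().lower().rpartition("@")
    if not sep or not local or not host:
        return None  # no @-sign, or an empty local part or host
    if host in suffixes:
        return None  # the host is itself a public suffix
    labels = host.split(".")
    # Try the longest candidate suffix first so that 'ac.uk' wins over 'uk'.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in suffixes:
            return ".".join(labels[i - 1:])
    return None


def classify(email, institutional, providers):
    """Classify an email address using two sets of assigned domain names."""
    domain = registrable_domain(email)
    if domain in institutional:
        return "institutional"
    if domain in providers:
        return "non-institutional"
    return "unknown"


# Usage with fictional domain names.
institutional = {"exampleuni.ac.uk"}
providers = {"mailprovider.com"}
label = classify("j.doe@cs.exampleuni.ac.uk", institutional, providers)
```

Matching the longest suffix first is what handles both problems mentioned in the text: the subdomain `cs` is discarded, and the multi-part suffix `ac.uk` is not mistaken for the organization's own domain name.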

Authorships with email addresses over time
In this section, changes over time in the availability of email addresses in publications in WoS are analyzed. Only publications classified as 'article' or 'review' and published in the period 2004-2021 were taken into account, representing 26.3 million publications and 136.7 million authorships (i.e., publication-author combinations). Figure 2 shows the gradual increase of the average number of authors per publication from 4 in 2004 to 6 in 2021. As could be expected, nearly all publications have a corresponding author, referred to in the WoS as the reprint author. During this period, the number of authors with an email address doubled. In 2021, an email address is available in WoS for 1 in 3 authors of a publication. Figure 3 shows for the same period that the percentage of reprint authorships with an email address has increased from 80% in 2004 to almost 100% in 2021. In 2021, the last author's email address is available in 60% of the publications, slightly more often than that of the first author. After remaining around 25%, the percentage of authorships with an email address has increased to slightly above 30% in the last two years. In the period 2004-2021, the use of institutional email addresses by reprint authors has decreased by 10% (Figure 4). In 2021, about 22% of reprint authors use a non-institutional email address. The share of email addresses of reprint authors that could not be linked to one of the two categories decreased slightly between 2004 and 2010, to about 5% in more recent years.

Usage of institutional and non-institutional email addresses by reprint authors
In this section, the usage of institutional and non-institutional email addresses by reprint authors in the 9.8 million publications from the period 2017-2021 is analyzed in more detail. In this period, an email address is available for 97% of the 10.4 million reprint authorships. Of the reprint authorships with an email address, 73% are identified as institutional and 22% as non-institutional; for 4% of the authorships, our approach was unable to assign one of these two categories.
For the non-institutional email addresses used by reprint authors, Table 1 provides an overview of the most frequently used email domain names. This list is dominated by three global players (Google, Yahoo, and Microsoft), and by Chinese and to a lesser extent Russian email service providers. Figure 5 shows the differences between disciplines. About 79% of reprint authors in the physical and engineering sciences use an institutional email address, while in the biomedical and health sciences this is about 70%. The difference in the usage of institutional and non-institutional email addresses by reprint authors is even larger between countries (Figure 6). Among the 25 countries with the most reprint authorships, the share of institutional email addresses varies from about 90% for countries like Sweden, Canada, the United States, and the United Kingdom to about 40% for India and Russia. This enormous difference cannot be explained by the share of email addresses that could not be assigned to a scholarly organization or an email service provider: for all these countries, this information is missing for about 5% of the reprint authorships with an email address. Major differences in the use of institutional and non-institutional email addresses by reprint authors are also found when looking at the publications' language (Figure 7). For publications in Russian and Portuguese, the use of non-institutional email addresses dominates.
Almost as pronounced as the differences between countries are those between publishers (Figure 8). Among the top 20 publishers, there are three American learned societies and one British learned society for which 85% or more of the reprint authorships have an institutional email address. At the other end of the spectrum, in publications from the Hindawi Publishing Group, acquired by John Wiley & Sons in 2021, reprint authors are as likely to use a non-institutional as an institutional email address.
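The shares reported in this section are computed per group (discipline, country, language, or publisher) over the reprint authorships with an email address. A minimal sketch of that aggregation is given below. The function name `shares_by_group` and the input layout are assumptions made for illustration, and the numbers in the usage example are invented rather than taken from the data.

```python
from collections import Counter, defaultdict


def shares_by_group(rows):
    """Return {group: {category: share}} from (group, category) pairs.

    Each row stands for one reprint authorship with an email address;
    `category` is 'institutional', 'non-institutional', or 'unknown'.
    """
    counts = defaultdict(Counter)
    for group, category in rows:
        counts[group][category] += 1
    return {
        group: {cat: n / sum(c.values()) for cat, n in c.items()}
        for group, c in counts.items()
    }


# Usage with invented counts for two hypothetical countries.
rows = (
    [("Country A", "institutional")] * 9
    + [("Country A", "non-institutional")] * 1
    + [("Country B", "institutional")] * 2
    + [("Country B", "non-institutional")] * 3
)
shares = shares_by_group(rows)
```

Because the 'unknown' category is kept as a separate share in the denominator, differences between groups in the institutional share can be checked against differences in the unassigned share, as is done for countries in the text.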

Conclusion
A database was set up with all email addresses included in the by-line of articles and reviews indexed in the WoS. For about 95% of the authorships with an email address, the associated domain name could be linked to either a scholarly organization or an email service provider.
Email addresses have gradually become available for the publications in the WoS. Today, an email address is available for one in three authors. Striking is the increase of more than 10% in the use of non-institutional email addresses by corresponding authors in the period 2004-2021. A similar trend was observed by Kozak et al. (2015) in their analysis of the usage of email addresses in the 1,000 most-cited and the 1,000 least-cited articles published in 2000, 2005 and 2010.
Limiting the analysis to the period 2017-2021, where the email address is available for more than 97% of the reprint authorships, three results stand out:
• The use of institutional email addresses is much more common in the Anglo-Saxon and Western European countries than in the BRICS countries. In the latter, corresponding authors use a non-institutional email address nearly as often as an institutional one. These results are in line with those obtained by Shen, Rousseau, & Wang (2018) and Rousseau (2018) for the period 2008-2012.
• The journals' publishing model does not seem to have a large impact on the type of email address used.
• Most striking is the large difference in the use of non-institutional email addresses by corresponding authors of retracted versus non-retracted papers. The usage of institutional versus non-institutional email addresses by corresponding authors and authors of retracted publications has been studied by several authors (see Liu & Chen (2021) and references therein). Compared to the results presented in this paper, these studies are limited to case studies or small samples of retracted publications.
In ongoing work, a more comprehensive analysis of the relationship between the usage of institutional versus non-institutional email addresses and retractions will be made, considering the authors' country, the research field, and the reason for the retraction. As is often overlooked, not all retractions are due to fraud or other misconduct by one or more authors.
Another unsettled research question is the relationship between the use of the two types of email addresses and the number of citations received. According to Kozak et al. (2015), there is no influence on the number of citations, contrary to Shen, Rousseau, & Wang (2018), who conclude that publications with institutional email addresses tend to be cited more. Using the database, a more elaborate analysis of citation rates is planned.

Open science practices
The classification of email domain names created and used in this paper has been made available in Zenodo (Luwel and Van Eck, 2023). The data used to create this classification and the data underlying the analysis of this paper were obtained from the WoS database produced by Clarivate Analytics. Due to license restrictions, this data cannot be made openly available. The source code used in the analysis of this paper is available in the following GitHub repository: https://github.com/neesjanvaneck/WoS-email-address-analysis.