AI Use in Manuscript Preparation for Academic Journals: Discussion

31 Jul 2024

Authors:

(1) Nir Chemaya, University of California, Santa Barbara (e-mail: nir@ucsb.edu);

(2) Daniel Martin, University of California, Santa Barbara and Kellogg School of Management, Northwestern University (e-mail: danielmartin@ucsb.edu).

4 Discussion

Our findings are a starting point for future research and suggest several issues that the academic community needs to consider carefully. For instance, the community needs to decide which sources of assistance should be reported, whether ChatGPT, another AI tool, or help from an RA. It also needs to decide which types of assistance should be reported, be it fixing grammar or something more extensive. We found more consensus that using ChatGPT to fix grammar does not need to be reported than we did for rewriting text. One avenue could be to disclose the actual prompts used to revise the paper. Along these lines, Grammarly has a new feature that allows users to acknowledge their use of AI and disclose the actual prompts they used.

In addition, our research raises the question of how journals, conferences, and associations can enforce distinctions in what must be reported. While the detection tool that we employed was able to detect relatively accurately whether AI was used at all, abstracts that were rewritten by GPT-3.5 were sometimes given a lower chance of being written by AI than the grammar-fixed abstracts. This also raises the question of whether using ChatGPT to fix grammar might inadvertently make more substantial changes than desired. How can researchers be sure that they are using AI only in the way they intend? One solution, sketched below, could be to use ChatGPT to point out grammar errors but to fix them manually, so that ChatGPT never actually revises the text.
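
As a minimal illustration of this workflow, the sketch below uses the OpenAI Python client to ask the model only to list grammar issues rather than return revised text. The model name, system prompt, and helper function are our own hypothetical choices, not a configuration taken from this paper.

```python
# Minimal sketch: have the model point out grammar issues without rewriting the text.
# Assumes the openai Python package (>= 1.0) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def list_grammar_issues(text: str) -> str:
    """Return a numbered list of grammar issues, leaving the text itself untouched."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # hypothetical choice; any chat model would do
        temperature=0,          # keep the output as deterministic as possible
        messages=[
            {"role": "system",
             "content": ("List any grammar errors in the user's text as a numbered "
                         "list. Do NOT rewrite or rephrase the text.")},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

print(list_grammar_issues("This are a example abstract with two error."))
```

Because the model only reports issues, the author decides which fixes to accept and types them in manually, so the submitted text never passes through ChatGPT itself.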

These tools and our perceptions of them will surely evolve, but the aim of this paper is to capture how they are perceived and detected at this moment in time, which appears to be an inflection point both in AI capability and in its use to revise text.

4.1 Limitations and Future Directions

In terms of our survey design, the use of a convenience sample may have introduced selection issues that complicate our comparisons by role and English language background. In addition, our convenience sample was largely composed of economists, and because views might differ across fields, it would be valuable to consider what perceptions look like more generally and to illuminate differences across fields. An alternative approach that allows for an assessment of perceptions across fields is offered by Bringula (2023), who runs a sentiment analysis on papers written about AI use in manuscript preparation and finds that the sentiment in those papers is generally positive. Also, the use of a convenience sample did not lend itself to randomizing over the form of assistance (ChatGPT, RA assistance, etc.).

Another limitation of our survey is that we do not dig deeply into the nature of ethical perceptions. Since these perceptions were such an important predictor of reporting preferences, it would be insightful to know why academics feel that using AI tools for manuscript preparation is unethical. For example, is someone harmed by their use (such as other academics, science in general, or the authors whose material is used to train the AI), or is there a deeper moral question at play? One way to tease apart these subtleties would be to have academics evaluate a number of detailed vignettes. It might also be interesting to determine the role of payment in ethical considerations, as ChatGPT, RAs, and proofreaders sometimes require payment and sometimes do not.

In terms of our detection design, we only considered papers published at a top journal in the field of management. To understand whether these results hold more generally, it would be necessary to look at papers published in other fields, perhaps using a service like Scopus, and to look at working papers, perhaps sourced from SSRN or arXiv. In addition, it might be valuable to consider full papers instead of just abstracts, or, as a middle ground between abstracts and full papers, to examine introductions.

Additionally, our results are limited to one AI detection service, so we do not know if they extend to other services, such as GPTZero. It might also be of interest to see whether other AI-based revision services, such as Grammarly, are flagged by AI detectors. Along these lines, it might be insightful to consider forms of writing assistance beyond fixing grammar and rewriting text, or specific forms of rewriting text. One important dimension could be whether the use of chat-based versus completion-based AI tools leads to different detection rates.

Finally, and related to the last point, it would be valuable to consider a range of different prompts, especially given the sensitivity that we found to relatively small changes in prompt language. One systematic way to choose the prompts could be to hold focus groups of researchers or ChatGPT users, or to let researchers vote on the prompts that are tested. Another dimension of investigation could be to see if our results are sensitive to other important features of the GPT API inputs, such as upgrading the model to GPT-4, increasing the temperature to raise the chance of hallucinations, or using other system prompts. Given the black-box nature of LLMs, a robust empirical analysis is needed to inform policymakers at associations, journals, and conferences about the link between how AI is used and how it is flagged by detection software.
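
One way such a robustness exercise could be organized is as a simple sweep over API settings, with each revised abstract then scored by one or more detectors. The sketch below is only illustrative: the model names, temperature values, system prompts, and function names are placeholders we chose, not the configurations studied in this paper.

```python
# Illustrative sketch of a robustness sweep over GPT API inputs
# (model, temperature, system prompt). Assumes openai >= 1.0 and an API key set.
from itertools import product
from openai import OpenAI

client = OpenAI()

MODELS = ["gpt-3.5-turbo", "gpt-4"]      # placeholder model names
TEMPERATURES = [0.0, 0.7, 1.2]           # low to high sampling randomness
SYSTEM_PROMPTS = [
    "Fix any grammar errors in the user's text.",
    "Rewrite the user's text to improve clarity and flow.",
]

def revise(abstract: str, model: str, temperature: float, system_prompt: str) -> str:
    """Return one revised version of the abstract under a given configuration."""
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": abstract},
        ],
    )
    return response.choices[0].message.content

def sweep(abstract: str) -> list[dict]:
    """Collect revised abstracts across all configurations for later detector scoring."""
    results = []
    for model, temp, prompt in product(MODELS, TEMPERATURES, SYSTEM_PROMPTS):
        results.append({
            "model": model,
            "temperature": temp,
            "system_prompt": prompt,
            "revised": revise(abstract, model, temp, prompt),
        })
    return results
```

Feeding each configuration's output to a detector would make the link between API settings and detection rates directly measurable.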

This paper is available on arXiv under a CC 4.0 license.