Large language models (LLMs), such as ChatGPT, stand to revolutionize the way users acquire information1 and complete tasks. OpenAI released ChatGPT in November 2022, and by January 2023 it had already attracted more than 100 million users2 based on its abilities to answer questions, generate convincingly appropriate responses and carry on human-like conversations.3
The wide availability of no-cost artificial intelligence (AI) models stands to quickly change the way work is done.4 Organizations around the world will likely come to regard this technology as a new member of their teams, and LLMs could become the smartest team members, capable of completing multiple tasks quickly. However, these new members have demonstrated a propensity to be emotionally unintelligent,5 biased6 and willing to make up information,7 and they can potentially even go rogue.8 Organizations must understand how to use these new bots without creating negative societal impacts.
LLM Development
Language models are defined as systems that are trained on string prediction tasks: that is, predicting the likelihood of a token (character, word or string) given either its preceding context or its surrounding context.9 Language models have been evolving since the early 1980s for automated speech recognition, machine translation and document classification.10 However, the ultimate goal of language models is to produce natural language text.
Current language modeling is usually framed as unsupervised distribution estimation: generating a conditional probability distribution of an output given an input, p(output|input).11 Figure 1 shows how an LLM system uses inputs (natural language prompts from the user) and processes (conditional probabilities of the desired output, based on the input) to generate outputs. For example, the user inputs a question or other prompt, and the LLM predicts a natural language response based on conditional probabilities generated from matching the prompt to information in the LLM’s training data.
Significant model advances have come from combining pretraining on input data with supervised fine-tuning of the model based on its outputs. Training the models in two phases on existing data starts with forward propagation and finishes with backpropagation. During the forward propagation phase, the model predicts the next token from the preceding context. Then, during backpropagation, the model parameters are adjusted to improve prediction performance. This is often done by restricting the context or task. From a practical perspective, the conditional probability of the next word, character or string can be predicted more accurately if the task is specified, p(output|input, task), or if additional constraints are added, p(output|input, constraints).
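To make the p(output|input) framing concrete, the following is a minimal sketch that estimates the conditional probability of the next word by counting word pairs in a tiny, made-up corpus. Production LLMs use transformer neural networks rather than raw counts, so this illustrates only the probabilistic framing; the corpus text and function names are hypothetical.

```python
from collections import defaultdict

# Tiny made-up corpus standing in for training data (hypothetical example).
corpus = (
    "the model predicts the next word "
    "the model predicts the next token "
    "the user writes the next prompt"
).split()

# Count how often each word follows each context word: counts[context][next_word]
counts = defaultdict(lambda: defaultdict(int))
for context, nxt in zip(corpus, corpus[1:]):
    counts[context][nxt] += 1

def p_next(context: str) -> dict:
    """Estimate p(next_word | context) from the toy corpus."""
    following = counts[context]
    total = sum(following.values())
    return {word: count / total for word, count in following.items()}

# p(output|input): the likeliest continuations of the word "next"
print(p_next("next"))  # "word", "token" and "prompt" each appear with probability 1/3
```

Specifying a task or adding constraints, as in p(output|input, task), amounts to conditioning on more information than the preceding words alone, which is effectively what prompts and fine-tuning provide.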
One group of highly successful LLMs is built on transformer models12 that have benefited from increasing the quantity of data and using larger architectures.13 Generative pretrained transformer (GPT) models have become more sophisticated, and there is now a push toward developing models that can perform as general, instead of specific, systems. For example, LLMs are being used as electronic assistants to answer emails, write essays and generate weekly menus along with shopping lists.
Researchers and developers have found that LLMs can be trained, or can train themselves, to complete a multitude of tasks. LLMs are no longer merely translators. Instead, they can play chess, write novels and answer any academic essay question. There appears to be no limit to what these models may be capable of accomplishing, given their ability to train themselves on enormous amounts of readily available data. However, many concerns have been raised.
Large Language Model Concerns
Concerns over model biases are not new. For example, researchers from OpenAI (GPT model developers), the Stanford Institute for Human-Centered Artificial Intelligence (HAI) and other universities met on 14 October 2020 to discuss questions about GPT-3. One participant stated that because GPT-3 takes in arbitrary inputs, it is a priori impossible to anticipate all the model behaviors.14 General discussions about LLMs suggested the models had biases and sometimes generated misinformation because of their training methods. The reality is that AI model concerns go beyond bias and misinformation (figure 2). The six areas of concern in AI systems are trust, bias, security, safety, fairness and transparency/explainability.15
Trust—Does the System Seem Credible, and Are the Responses Accurate?
Even though LLMs simply predict each word given the previous sequence of words, transformer LMs (e.g., ChatGPT) produce text that is fluent and coherent over many paragraphs. When LLMs are used for their designed purposes, they seem to do well. In one empirical analysis, ChatGPT responses were judged by humans as more likely to have been generated by other humans than by computers (68.5 percent human vs. 31.5 percent computer, 30 ratings out of 2,040 Amazon MTurk responders).16 ChatGPT also seemed to engage in objective reasoning, to be highly analytical and to exhibit a low level of emotion.17 It has been so successful in producing human-like conversation that it is being used in many customer service chatbots.18
LLMs are also being used to answer questions and generate language that requires facts. This is beyond the initial scope of the models, and the accuracy of the responses does not match their eloquence. Figure 3 displays the results from an empirical study conducted in 2023.19 ChatGPT (GPT-3.5) produced correct answers to only 84.1 percent of the yes/no questions (BoolQ dataset) asked of it. Similarly, four separate datasets were used to evaluate ChatGPT’s accuracy on multiple-choice questions; results ranged from 75.9 percent to 92.4 percent, with an average of 82.8 percent.20 More complex questions came from two extractive datasets, in which ChatGPT had to extract information from a given paragraph. Accuracy on these extractive questions dropped to 57.3 percent. Similarly, abstractive question responses averaged just 33.5 percent accuracy. The abstractive questions either asked for simplified explanations, such as “Explain like I am 5” (ELI5 dataset), or tested the model’s ability to avoid generating false answers (TQA dataset). These rates will likely climb as models become more sophisticated, but current models have considerable room for improvement when it comes to accuracy.
Source: Adapted from Shen, X.; Z. Chen; M. Backes; Y. Zhang; “In ChatGPT We Trust? Measuring and Characterizing the Reliability of ChatGPT,” Cornell University arXiv, 18 April 2023, http://browse.arxiv.org/pdf/2304.08979.pdf
Perhaps more important than the lack of accuracy is the likelihood that LLMs will make up information to fulfill a response. Such responses are often referred to as hallucinations. In one study, ChatGPT identified only 27.8 percent of unanswerable questions and rarely acknowledged that there was not enough information to answer a question. Instead, LLMs force answers and then often use faulty logic or make up data to support them.21 This does not imply malice; it results from the model returning the highest conditional probability response based on the training dataset. Language models are very good at mimicking linguistic form, but they often fail because they cannot determine the user’s intent or extract the correct answer.
The bottom line is that LLMs are sophisticated chatbots with the ability to manipulate linguistic form and perform a variety of functions. However, the text they generate is not grounded in communicative intent. Instead, sequences of words found in the enormous training dataset are strung together based on conditional probabilities of how words combine, without any reference to what the words mean. LLMs are stochastic parrots.22
Bias—Does the System Generate Preference for One Category Over Another?
LLMs have the potential for multiple biases generated by the natural language dataset, the selected variables, the processes used inside the model and the decisions that are made based on the model’s output.23 These biases can occur across sensitive domains (e.g., race, gender, ethnicity) or areas subject to laws and regulations. Multiple empirical research studies have demonstrated that LLMs have embedded biases.24 In a study examining ChatGPT’s responses to 630 political statements associated with elections in Germany and the Netherlands, ChatGPT was shown to have a pro-environmental, left-libertarian orientation. Its responses aligned closely with the principles of the German Green Party and its Dutch equivalent; in reality, however, the German Green Party received only 14.8 percent of the vote and the Dutch equivalent party an even lower 5.2 percent.25 Figure 4 shows how a system with no emotional or reasoning capability can generate biased responses that may not align with the views of the general public.
One problem for LLMs is the inherent bias associated with the training data. Empirical studies indicate that GPT-2 delivered responses that reflected social biases based on gender, race and ethnicity.26 It has been suggested these biases may have come from the training data because GPT-2 used outbound links from Reddit.27 Therefore, any bias present in the Reddit links could become part of the LLM’s output because the model predicts the next word, or set of words, based on the conditional probability of the words used in the training dataset. A 2016 Pew Internet Research survey showed 67 percent of Reddit users were male and 64 percent were between the ages of 18 and 29.28 Therefore, GPT-2 responses would likely correspond with the viewpoints of those demographics.
Another significant issue is that, despite awareness of social biases, it is difficult to remove them when making model improvements. A study of GPT-3.5 shows that many of the biases evident in earlier models still exist,29 despite the transition away from the use of Reddit links. GPT-3 used a much larger dataset developed using Common Crawl to gather data from available online documents. However, bias was possibly introduced to the dataset because developers trained a classifier to pick documents that looked similar to those used in GPT-2.30 LLMs process what humans have generated on the Internet, including authors’ worldviews and biases, which means LLMs will inherently produce biased outputs because their inputs contain biases. Organizations must evaluate an LLM’s outputs because they may be subject to negative impacts (e.g., financial loss, diminished trust, regulatory backlash)31 if their use of LLMs contributes to biased decisions.
Security—How Vulnerable Is the System to an Attack?
Many of the advancements in LLMs have come from the use of larger and more diverse datasets. Web scrapes, such as Common Crawl, have opened the door to an almost limitless amount of data. However, the source of the data is often unknown. Filters are currently being used to try to remove data that could be harmful; however, they are not always effective. It is still possible for bad actors to overwhelm an LLM with information the model believes to be relevant. This information can then be used either to train the model or to produce desired, harmful outputs.32
LLMs are currently being used for tasks outside the domain for which they were originally developed, namely natural language processing. Organizations and the general public are now using these models to write code, generate images and perform a multitude of other computer-assisted activities. This opens the door for bad actors to embed harmful code or gain access to personal information. LLMs rely on large datasets that often include sensitive user information gathered from chat logs or personal data. Security discussions are significantly lacking, and organizations should consider these concerns in their decision making.
Safety—Does the System Do No Harm?
Researchers have suggested that LLMs can be used to spread large volumes of false information and to manipulate individuals.33 This is in part due to the phenomenon of catastrophic forgetting, an effect that causes the LLM to forget older information in light of newer information.34 Just as Microsoft’s chatbot Tay was influenced by a small group of actors,35 it is very possible for LLMs to be trained, or influenced, to produce negative output that could cause social harm. This is especially true because LLM-generated synthetic text can enter online outlets without any person or entity being accountable for it.36 The outputs from these models become the inputs for users who may be susceptible to manipulation. Worldviews could start to be influenced by just a few bad actors or as a result of the systemic use of misinformation across many domains.
Fairness—Are System Outcomes Unbiased?
Although bias and fairness are often used interchangeably, there are subtle differences. Bias concerns whether one outcome is favored over another; fairness adds the idea that such bias will lead to an outcome that is not fair to all parties. Developers continue to address LLM bias by further filtering information and training models. The overhaul of GPT-3 included supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to address some of these issues. Output from the LLM was reviewed by humans, who focused on helpful, harmless and honest responses.37 Such efforts are commendable and begin to address the concept of fairness, but the models still carry the potential of reflecting the individual biases of the humans judging what is helpful, harmless and honest. Many issues of fairness may not be fully witnessed until several months (or years) after implementation.
Transparency/Explainability—Is It Clear to an External Observer How the System’s Output Was Produced?
LLMs are unsupervised multitask learners that employ an auto-magic black box to transform inputs into outputs with little visibility into how they do so. Their ability to accomplish many different tasks results from allowing the system to choose among multiple variables and processes. Developers may be able to explain some of the processes, but they can rarely explain why particular decisions are made. This has led to some models being abandoned because the developers could not remove biases that had become embedded in the model.38 Many machine learning algorithms, including LLMs, produce outputs that are difficult or even impossible to explain.39
Harnessing the Potential of LLMs
Despite the number of concerns, LLMs offer a multitude of potential good uses. The goal for organizations should be to learn how to ethically use these systems to increase the efficiency and effectiveness of their staff members. Organizations should consider three things when incorporating LLMs into daily activities: the type of task being performed, the complexity of the prompt and the evaluation of the output.
Type of Task
LLMs were designed to produce natural language responses; therefore, tasks should be separated into language tasks and nonlanguage tasks (figure 5). Many office assignments involve language/communication tasks. LLMs are natural assistants for tasks such as generating emails, writing product descriptions, drafting policies and replying to requests for proposals. LLMs can quickly generate these items to improve employee efficiency, enabling staff to spend more time addressing the intent and emotion of the communication instead of generating the text.
Nonlanguage tasks may encompass the more exciting, and riskier, uses of LLMs because these capabilities did not drive their development. Potential uses include treating the models as highly interactive search engines capable of returning information that can be easily modified. Functions such as “explain this to me like I am five years old” enable information to be provided at various levels of understanding to assist employees in gaining new knowledge needed in their jobs. However, employees may need to spend significant time ensuring that the LLMs produce accurate, reliable and unbiased responses.
Prompt Complexity
LLMs’ chatbot capabilities enable users to interact with them iteratively to ensure that the information retrieved, or the product developed, meets the needs of the organization. From a system design standpoint, users can improve the quality of the output by providing more input in the prompt to guide the processes inside the black box. Placing constraints on the data being searched can focus the results on more relevant data and ensure that the conditional probabilities being generated are more relevant to the user. Figure 6 offers several suggestions for designing prompts that improve the quality and accuracy of responses.
Source: a) Shen, X.; Z. Chen; M. Backes; Y. Zhang; “In ChatGPT We Trust? Measuring and Characterizing the Reliability of ChatGPT,” Cornell University arXiv, 18 April 2023, http://browse.arxiv.org/pdf/2304.08979.pdf
It is important to engage with LLMs as though they are assistants that can finish tasks very quickly. This includes creating a dialogue between the user and the assistant. Refining prompts and redirecting the LLM will produce significantly better results.
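As a simple illustration of how such prompt constraints can be expressed, the following sketch assembles a prompt from explicit role, task, constraint and format fields before it is sent to an LLM. The field names and example values are hypothetical and are not drawn from figure 6.

```python
# Minimal sketch of a constrained prompt template (hypothetical fields and values).
PROMPT_TEMPLATE = """\
Role: {role}
Task: {task}
Constraints:
- Use only the information provided in the context below.
- If the context does not contain the answer, say "insufficient information."
- Limit the response to {max_words} words.
Output format: {output_format}

Context:
{context}

Question: {question}
"""

def build_prompt(role, task, context, question,
                 max_words=150, output_format="a short bulleted list"):
    """Combine the user's question with explicit constraints to narrow
    the conditional probabilities the model draws on."""
    return PROMPT_TEMPLATE.format(
        role=role, task=task, context=context, question=question,
        max_words=max_words, output_format=output_format,
    )

prompt = build_prompt(
    role="You are a policy analyst for a mid-sized manufacturer.",
    task="Summarize the data-retention requirements relevant to the context.",
    context="<paste the relevant policy excerpt here>",
    question="Which records must be kept for more than five years?",
)
print(prompt)  # Send this text to the LLM of choice via its API or chat interface.
```

Keeping the constraints explicit in a template also makes prompt refinements easy to track as the dialogue with the model progresses.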
Review the Output
LLMs return responses to prompts based on the conditional probabilities they obtain from their training datasets. Therefore, they can produce eloquent responses that contain biases and inaccuracies, lack emotional intelligence and do not necessarily address the user’s intent. This places the responsibility for removing these problems on the user. Figure 7 lists recommendations for output verification.
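Some of this review can be partially pre-screened before a human takes over. The following sketch is a hypothetical checklist, not the contents of figure 7; it simply flags the parts of a response a reviewer should verify first.

```python
import re

# Hypothetical pre-screening helper; figure 7's recommendations still require human judgment.
def review_output(response: str, required_terms=None) -> list:
    """Return a list of items a human reviewer should verify before the
    LLM response is used."""
    flags = []

    # Numbers and percentages are common hallucination points; verify against sources.
    numeric_claims = re.findall(r"\b\d[\d,.]*\s*(?:percent|%)?", response)
    if numeric_claims:
        flags.append(f"Verify {len(numeric_claims)} numeric claim(s) against source data.")

    # Confident absolutes often signal overstatement.
    if re.search(r"\b(always|never|guaranteed|proven)\b", response, re.IGNORECASE):
        flags.append("Response uses absolute language; check for overstatement.")

    # Confirm that required topics or caveats actually appear in the response.
    for term in (required_terms or []):
        if term.lower() not in response.lower():
            flags.append(f"Expected topic or caveat missing: '{term}'")

    return flags or ["No automatic flags; still perform a human review for bias and intent."]

print(review_output(
    "Retention is always 7 years for all records.",
    required_terms=["exceptions"],
))
```

Automated flags like these only narrow the reviewer’s attention; judgments about accuracy, bias, fairness and intent still rest with a human.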
Conclusion
LLMs show tremendous potential in assisting humans with their daily tasks. However, they come with a multitude of concerns. Therefore, organizations must find ways to use these systems in ethical and productive ways. This begins with understanding the operations occurring inside the auto-magic black box. That understanding enables users to select tasks that are likely to be carried out with the highest degree of quality (language tasks) and to design prompts that return the desired information (increasing conditional probability). Finally, users need to verify that the LLM output is accurate, fair and unbiased. Taking these actions will increase organizational efficiency without increasing risk.
Endnotes
1 Shen, X.; Z. Chen; M. Backes; Y. Zhang; “In ChatGPT We Trust? Measuring and Characterizing the Reliability of ChatGPT,” Cornell University arXiv, 18 April 2023, http://browse.arxiv.org/pdf/2304.08979.pdf
2 Ibid.
3 Koubaa, A.; W. Boulila; L. Ghouti; A. Alzahem; S. Latif; “Exploring ChatGPT Capabilities and Limitations: A Critical Review of the NLP Game Changer,” Preprints, 27 March 2023, http://doi.org/10.20944/preprints202303.0438.v1
4 Schlembach, K.; N. Csavajda; G. Burch; J. Burch; “Developing Lifelong Learners to Ride the AI Wave,” ISACA® Journal, vol. 6, 2023, http://cli5.hpbvtv.com/archives
5 Op cit Shen
6 Hartmann, J.; J. Schwenzow; M. Witte; “The Political Ideology of Conversational AI: Converging Evidence on ChatGPT’s Pro-Environmental, Left-Libertarian Orientation,” Cornell University arXiv, 5 January 2023, http://doi.org/10.48550/arXiv.2301.01768; Kirk, H.; Y. Jun; H. Iqbal; et al.; “Bias Out-of-the-Box: An Empirical Analysis of Intersectional Occupational Biases in Popular Generative Language Models,” 35th Conference on Neural Information Processing Systems, 27 October 2021, http://doi.org/10.48550/arXiv.2102.04130; Liang, P.; C. Wu; L. Morency; R. Salakhutdinov; “Towards Understanding and Mitigating Social Biases in Language Models,” Proceedings of the 38th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 139, 2021, http://proceedings.mlr.press/v139/liang21a.html
7 Op cit Shen
8 Bender, E.; T. Gebru; A. McMillan-Major; S. Shmitchell; “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” FAccT ’21: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, March 2021, http://doi.org/10.1145/3442188.3445922
9 Ibid.
10 Op cit Hartmann
11 Radford, A.; J. Wu; R. Child; et al.; “Language Models Are Unsupervised Multitask Learners,” 2019, http://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf
12 Op cit Koubaa
13 Op cit Bender
14 Tamkin, A.; M. Brundage; J. Clark; D. Ganguli; “Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models,” Cornell University arXiv, 4 February 2021, http://doi.org/10.48550/arXiv.2102.02503
15 Axelrod, W.; “Reducing Human and AI Risk in Autonomous Systems,” ISACA Journal, vol. 3, 2023, http://cli5.hpbvtv.com/archives
16 Op cit Bender
17 Ibid.
18 Op cit Koubaa
19 Op cit Shen
20 Ibid.
21 Ibid.
22 Op cit Bender
23 Aich, S.; G. Burch; “Looking Inside the Magical Black Box: A Systems Theory Guide to Managing AI,” ISACA Journal, vol. 1, 2023, http://cli5.hpbvtv.com/archives
24 Op cit Kirk
25 Op cit Hartmann
26 Op cit Liang
27 Op cit Radford
28 Op cit Bender
29 Op cit Hartmann
30 Op cit Bender
31 Scarpino, J.; “Evaluating Ethical Challenges in AI and ML,” ISACA Journal, vol. 4, 2022, http://cli5.hpbvtv.com/archives
32 Op cit Bender
33 Op cit Koubaa
34 Aiyappa, R.; J. An; H. Kwak; Y. Ahn; “Can We Trust the Evaluation on ChatGPT?” Cornell University arXiv, 22 March 2023, http://doi.org/10.48550/arXiv.2303.12767
35 Mathur, V.; Y. Stavrakas; S. Singh; “Intelligence Analysis of Tay Twitter Bot,” 2016 2nd International Conference on Contemporary Computing and Informatics (IC3I), Greater Noida, India, December 2016, http://doi.org/10.1109/IC3I.2016.7917966
36 Op cit Bender
37 Op cit Koubaa
38 Op cit Radford
39 Op cit Scarpino
MEIKE HOFMANN
Is a participant in a cooperative study program with BASF SE, a global chemical enterprise. She is an undergraduate student combining business and IT at Ludwigshafen University of Business and Society (Ludwigshafen, Germany). Her research interests include the risk and the potential of artificial intelligence (AI). She has successfully completed internships in logistics, supplier management, controlling and strategic marketing.
GERALD F. BURCH | PH.D.
Is an assistant professor at the University of West Florida (Pensacola, Florida, USA). He teaches courses in information systems and business analytics at both the graduate and undergraduate levels. He regularly publishes in the ISACA® Journal and several other leading peer-reviewed journals. He has helped more than 100 enterprises with his strategic management consulting and can be reached at gburch@uwf.edu.
JANA J. BURCH | EDD
Is a faculty member at the University of West Florida (Pensacola, Florida, USA) where she teaches undergraduate business courses in communication, ethics, management and entrepreneurship. In addition to teaching, Burch works with organizations to provide business development support and to help them develop innovative business solutions. Her research interests include workforce development, innovation, creativity and entrepreneurship. Burch is dedicated to helping her students and clients develop the skills and knowledge necessary to succeed in the business world.