“You shall know a word by the company it keeps” (J.R. Firth, 1957)
Here at SIoE, we ask big questions about how people learn, teach, and communicate – and now that includes how machines are learning to ‘chat’ like us. Language is the beating heart of education, so it is not surprising that chatbots, powered by generative AI, are causing palpitations. With generative AI shaking up classrooms, assessments, and everyday communication, a look at the linguistic roots of large language models can tell us more about both the possibilities and the limitations of this technology.
Have you ever asked ChatGPT, Gemini or Jen.AI a question and thought, “Wait… did a bot really write that?” You’re not alone. These days, bots don’t just chat, they charm, explain and even argue like the best of us. When you read an AI-generated answer, do you believe that a human with access to the same information might have written those lines? If so, generative AI has passed the “Turing Test”. This long-standing benchmark for machine intelligence – the test of whether a machine can trick a human into thinking they are interacting with another human – has been swept aside thanks to large language models. While the furore in the media shows this is a surprise to many, applied linguists have been waiting for this day since the 1950s, when Firth, the UK’s first professor of linguistics, announced that “You shall know a word by the company it keeps.”
Linguists have long known that there is something mathematical (and magical) about language. Since 1932, “Zipf’s law” has told us that a word’s frequency in text is inversely proportional to its rank, so that around 1,500 of the most common words (starting with the, of, and, to) account for about 80% of all English text. We know that, overall, any body of texts shows the same pattern: a small number of highly frequent words (or lexical items) do most of the heavy grammatical lifting in language, while a very large number of low frequency lexical items specify topic areas. We can evaluate evidence for Zipf’s law because, since the technology became available in the 1980s, corpus linguistics has analysed increasingly large bodies of machine-readable language, expanded exponentially in the last three decades by the proliferation of texts on the internet. Corpus linguists use the search and pattern-matching abilities of computers to find grammatical and lexical (or lexico-grammatical) patterns. By examining language empirically, it is corpus linguistics that laid the theoretical and practical foundations for the large language models that generative AI depends on.
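Zipf’s pattern is easy to glimpse for yourself. The short Python sketch below uses a toy passage of my own (illustration only – any large body of English text shows the pattern far more clearly) to count word frequencies and show how a handful of very common words dominate:

```python
from collections import Counter

# A toy passage for illustration only; a real corpus shows the
# Zipfian pattern far more clearly.
text = ("the quick brown fox jumps over the lazy dog and the dog "
        "sleeps while the fox runs to the river and the river runs on")

words = text.split()
ranked = Counter(words).most_common()

# Zipf's law: a word's frequency is roughly inversely proportional
# to its rank, so rank * frequency stays in the same ballpark.
for rank, (word, freq) in enumerate(ranked[:5], start=1):
    print(f"{rank:>2}  {word:<6} {freq}")

# A few very frequent words cover a large share of all the tokens.
top_share = sum(freq for _, freq in ranked[:5]) / len(words)
print(f"top 5 words cover {top_share:.0%} of the passage")
```

Even in this tiny passage, *the* alone accounts for almost a quarter of the tokens – exactly the “heavy lifting by a few frequent words” that corpus linguistics documents at scale.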
Corpus studies have provided the data for words that, as Firth predicted in the 1950s, can be known through the company they keep – the words that regularly share the same locations, their collocations. Words that appear to be very similar in meaning (like start and begin) will invariably collocate with different sets of words (Sinclair, 1991). Words that have multiple senses (like post and wear) will collocate with different words in their different meanings. However, the location of a word may be even more consequential than previously expected – it seems that words are also ‘primed’ to appear at the start, middle or end of a sentence, a paragraph or a text (Hoey, 2004). In fact, with a large enough dataset of texts (known as a corpus), it is possible to predict the probability of any two words appearing together. This is the principle that drives generative AI “speech”.
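That principle can be sketched in a few lines of Python. The toy corpus and the simple maximum-likelihood estimate below are my own illustration, not how any production model is built, but the core idea – estimating the probability of the next word from co-occurrence counts – is the same one that, at vastly larger scale and with far richer context, drives generative AI:

```python
from collections import Counter

# Toy corpus; real language models are trained on billions of words.
corpus = "the dog barked . the dog slept . the cat slept .".split()

# Count bigrams (adjacent word pairs) and how often each word
# appears as the first element of a pair.
bigrams = Counter(zip(corpus, corpus[1:]))
totals = Counter(corpus[:-1])

def next_word_prob(w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1) from bigram counts."""
    return bigrams[(w1, w2)] / totals[w1] if totals[w1] else 0.0

# Two of the three words following 'the' in this corpus are 'dog'.
print(next_word_prob("the", "dog"))
```

A large language model replaces these raw counts with learned weights and conditions on thousands of preceding words rather than one, but the move from “what usually follows this?” to a probability distribution over next words is the same.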
While this may sound computationally straightforward, machines struggle with their predictions because it is the social context of the text that decides, when collocations are equally likely, which collocate should be chosen. Words keep the company of other words, but Firth also discussed how they keep the company of a social context. Generative AI does not know the difference between arguing with a taxi driver and consoling a grandparent; it does not understand register. Register is also magically mathematical – probabilities in lexico-grammar vary according to social relations, the topic and the role of the text – but this is exactly the point where generative AI fails. It is only through human training that AI bots learn to produce responses that humans find meaningful and contextually appropriate, which is why the tech companies spent big on human trainers. Without this very human input, generative AI remains totally ignorant of the human experience.
By recombining our words, generative AI may dazzle us with its fluency, but it still lacks something fundamental: lived human context. It didn’t learn language through experience and education. It can’t know language because it doesn’t know context; it can only calculate the most probable next word. While that may be enough to pass the Turing Test, it’s not enough to understand meaning or intention, or to care, to love or inspire. As applied linguistics has long shown us, language isn’t just about words, but about relationships: relationships between words, between people, and within society. Generative AI may have learned how to mimic those patterns, but it hasn’t learned how to live them. This is exactly where our curiosity lies, not just in how language works, but in how it lives in us as humans, learners, and communicators. As Gen AI increases its hold on our languaging lives, it’s our human relationships that will always remain beyond its grasp and will continue to separate human and bot chat.
Firth, J.R. (1957) A synopsis of linguistic theory. In F.R. Palmer (Ed.) (1968) Selected Papers of J.R. Firth 1952–59. London: Longmans.
Hoey, M. (2004) Lexical Priming: A New Theory of Words and Language. London: Routledge.
Sinclair, J.M. (1991) Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Dr Nick Moore is a senior lecturer in TESOL in the SIoE.
