This letter summarizes first impressions of OpenAI’s GPT-4 Large Language Model.1 The model was released yesterday.
GPT-42 is the next step on the path to Artificial General Intelligence (AGI). GPT-4 is larger than its predecessors: it is more reliable with facts, more capable at reasoning, it introduces multimodality – the ability to work with images – and it has a much, much larger context window. It is not AGI, but as with the sunrise that opens Virginia Woolf’s The Waves, we are watching something take shape, acquire contours, and then “the air becomes fibrous…”3 – a fitting image for a process we do not fully understand.
I am summarizing information from OpenAI’s own blog post (▷), the live-stream of the GPT-4 presentation by Greg Brockman (▷),4 the GPT-4 technical report (▷), and my own first experiments with GPT-4 through the ChatGPT Plus interface.
Features
As OpenAI writes (2023a):
In a casual conversation, the distinction between GPT-3.5 and GPT-4 can be subtle. The difference comes out when the complexity of the task reaches a sufficient threshold—GPT-4 is more reliable, creative, and able to handle much more nuanced instructions than GPT-3.5.
This entirely matches my experience so far.
A number of performance measures illustrate what this means:
GPT-4 passes a number of structured, standardized exams, often in the 90th percentile or higher. This includes the Uniform Bar Exam, the LSAT, various SATs and GREs, and AP exams across a wide range of subjects (OpenAI, 2023b p.5).
This performance is not confined to English: although English scores are highest, in a comparison on translated versions of the MMLU multiple-choice benchmark, which spans a variety of subject domains, GPT-4 outperformed the English-language scores of other LLMs in 24 of 26 languages, including Indonesian, Arabic, Chinese, Japanese, and Swahili, each in its respective script (OpenAI, 2023b p.8).5
GPT-4 sets new state-of-the-art performance levels on practically all benchmarks, outperforming all other language models and most models that were specifically trained for the respective benchmark. Generalized intelligence appears to outperform specialization.
OpenAI has introduced image interpretation capabilities. More on that below.
An interesting perspective is upcoming: the interface will allow users to add “system messages” – fixed, persistent prompts that condition the AI’s behaviour. Why this is an important development is discussed under Perspectives below.
GPT-4, strikingly stretching standards, stumps GPT-3.5’s status
The strange heading above is a variation of a task demonstrated by Brockman: the ability of GPT-4 to write text using only words that begin with a specific letter or letter combination. This might seem silly – but it is not: given how a transformer language model works, it is not obvious that this ability should exist at all, nor is it obvious how it is achieved. The native representation of text in ChatGPT is a token,6 not a word; therefore some tokens need to be modified and others left as-is. The beginning of a word is almost never the beginning of a token (tokens usually start with a blank), so even identifying what should be sought out is an achievement of abstraction. It is not at all clear how such tokens are decomposed into letters, and how the letters are then analyzed, by a program that was trained to do nothing but predict the next token in a string. From this training, GPT-4 has acquired a striking ability to introspect its context window – an ability that GPT-3.5 did not possess to the same degree. This introspection allows it to interpret properties like character counts in words, to delimit syllables, to identify the letters that words begin with, and to manipulate groups of tokens. For example:
Prompt:
Please act as a character counter. As a character counter you count the number of letters in each word you receive and you return that number. For example if the prompt is "A beginning is the time for taking the most delicate care that the balances are correct." then your answer is:
1 9 2 3 4 3 6 3 4 8 4 4 3 8 3 7
Here is your first input:
Consciousness is what makes the mind-body problem really intractable.
GPT-4:
13 2 4 5 3 9 7 6 11
This is correct. To the same prompt, ChatGPT answered “12 2 4 5 3 7 7 6 6 8 7 9 11 11” – which is not correct, not at all. In a similar challenge, GPT-4 can sort the words in its input alphabetically.
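For reference, here is how such counts would be computed conventionally – a minimal Python sketch (my own check, not part of the model’s output):

# Count the characters of each word, stripping sentence punctuation,
# following the convention implied by the example in the prompt.
sentence = "Consciousness is what makes the mind-body problem really intractable."
print([len(word.strip(".,;!?")) for word in sentence.split()])
# -> [13, 2, 4, 5, 3, 9, 7, 6, 11]

GPT-4 arrives at the same numbers without any such explicit procedure. Remarkably, it is also able to scan text for meter: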
Prompt:
Please act as an expert prosody analyzer. For each request, you first indicate the metrical pattern. Then you provide the stress pattern for the given text by separating syllables with hyphens, capitalizing stressed syllables, and writing unstressed syllables in lowercase. Be very careful and make sure the pattern you write matches the meter you have identified. You do not need to confirm or explain. Here's an example:
Request:
Analyze the first line of Poe's Raven.
Response:
(Trochaic Octameter)
ONCE u-PON a MID-night DREA-ry, WHILE i PON-dered, WEAK and WEA-ry
Your new request:
Analyze the first two lines of Sonnet 18.
GPT-4:
(Iambic Pentameter)
Shall I com-PARE thee TO a SUM-mer's DAY?
Thou ART more LOVE-ly AND more TEM-per-ATE:7
The answer is not always perfect, but by comparison GPT-3.5 almost always gets meter wrong. How is this even possible? GPT output is assembled token-by-token and the tokens have no implicit stress value on their own. Also – obviously – the algorithm has never heard spoken language. And while Shakespeare’s original was certainly part of the training data, the text only contributes implicitly to the answer: Sonnet 18 is never actually retrieved as an intermediate step.
It is clear that such processing must involve several hierarchical levels, but it is not clear how such hierarchies map to the layers of the neural network. It is also not at all obvious why such abilities would emerge in a system that is trained to predict tokens; after all, the system has no explicit knowledge of what its tokens refer to – all it knows are relationships between tokens.
But these simple abilities correlate with generally improved abilities to introspect, and this in turn is a prerequisite for planning, for reasoning, for constructing logical arguments, and for producing coherent, task-oriented output. GPT-4 has gotten noticeably better at all of that.
Calculations
Word problems are examples of tasks that require step-by-step reasoning, and GPT-4 can interpret, structure, and solve them:
Prompt:
There is an explosion that sends a 55 kg anvil flying into the air. I recorded it on a video camera, and it shot straight up and was airborne for 4.3 seconds. How high did it fly?
GPT-4:
To find the height the anvil reached, we first need to determine the time it took to reach its peak height. Since the anvil was airborne for 4.3 seconds in total, it means it took half of that time to reach the peak and the other half to fall back down. So the time it took to reach the peak is:
t = 4.3 seconds / 2 = 2.15 seconds
Next, we'll use the following kinematic equation to find the height:
h = v₀t + 0.5at²
Where:
h is the height
v₀ is the initial velocity
t is the time taken to reach the peak (2.15 seconds)
a is the acceleration due to gravity (approximately -9.81 m/s², negative because it acts downward)
We can determine the initial velocity (v₀) using the equation:
v = v₀ + at
Since the anvil reaches its peak height, its final velocity (v) at the peak is 0 m/s. We can rearrange the equation to solve for v₀:
v₀ = v - at
v₀ = 0 - (-9.81 m/s²)(2.15 seconds)
v₀ ≈ 21.09 m/s
Now that we have the initial velocity, we can find the height:
h ≈ (21.09 m/s)(2.15 seconds) + 0.5(-9.81 m/s²)(2.15 seconds)²
h ≈ 45.29 m + (-22.77 m)
h ≈ 22.52 m
The anvil reached a peak height of approximately 22.52 meters.
Note that the information about the weight of the anvil is not relevant to the solution – and it is correctly ignored. The interpretation, procedure, and explanations are correct – but the final result is only approximately correct; there is some odd rounding going on at the end, and the actual height should be ≈ 45.34 - 22.67 = 22.67 m. Nevertheless, given that the result is not calculated by a computational subsystem but crystallizes out of GPT-4’s general language abilities, it is remarkable that such calculations are possible at all. And the clear decomposition of the problem, the identification of relevant variables and applicable expressions, and the explanation of what is happening make for a very didactic presentation. Further experimentation shows that the accuracy can be improved by various exhortations to be careful, to double-check computations, etc. – which is useful, but again somewhat surprising, since no explicit intermediate results are created in this process that such “checking” could operate on.8
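For reference, the arithmetic can be verified in a few lines of Python (my own check – GPT-4, of course, runs no such code):

g = 9.81                        # gravitational acceleration, m/s²
t = 4.3 / 2                     # time to reach the peak, s
v0 = g * t                      # initial velocity, ≈ 21.09 m/s
h = v0 * t - 0.5 * g * t**2     # height at the peak, m
print(round(v0, 2), round(h, 2))   # 21.09 22.67

Equivalently, h = ½·g·t², since the ascent mirrors a free fall from the peak.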
By comparison, GPT-3.5 failed several times to find the right approach at all, and Wolfram Alpha was not even able to interpret the problem, but gave me information on video camera prices instead.9
All About Context
The most significant advance of GPT-4 may be the greatly increased context window of the model. As discussed in our last newsletter, there are three levels of input that shape what ChatGPT does. The training determines its language-based abilities. The tuning determines its behaviour. These two together make the device. But it is the prompt that determines its action. And prompt and response together need to fit into the “context window” – the string that grows token by token as the system assembles the output. For ChatGPT the context window was 2,048 tokens, and we started coming up against that limit when we were creating detailed prompts for individualized assignments in our last newsletter (▷). GPT-4 is announced to have a context-window size of 8,192 tokens (6,000 words, or 12 pages of single-spaced text). But even larger models are available.
For GPT-4 32k, the context window size is 32,768 tokens. That is about 24,000 words, or 48 pages of single-spaced text.
This tremendously increases the scope of possible tasks.
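To make these numbers concrete: OpenAI’s tiktoken library counts tokens, and one can check whether a given prompt fits a window. A minimal sketch (the window size is the announced 8,192 tokens; the 1,000-token reply budget is my own placeholder):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by the GPT-4 family

def fits_window(prompt: str, reply_budget: int = 1000, window: int = 8192) -> bool:
    # Prompt and response share the context window, so reserve
    # room for the tokens of the expected reply.
    return len(enc.encode(prompt)) + reply_budget <= window

print(fits_window("Summarize the following article: ..."))  # True for short prompts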
In the OpenAI live-stream, Greg Brockman demonstrated how he could fix an error in a bit of Python code that he developed on the fly, during the demonstration. The error occurred because a piece of imported code had slightly changed its methods after the training-data cutoff date: GPT-4 could not have known about this change. No matter: Brockman simply copied and pasted the current documentation from the module’s website into the GPT-4 interface; the model interpreted the whole document on the fly, identified the error, and fixed it.
This means you can interactively teach the system new information that is not part of the training data, which it can then summarize, compare, review (!), and critique. It can ingest entire academic papers, or book chapters. Or write them.
There is a caveat: although ChatGPT Plus can now actually use the GPT-4 language model, it appears to be limited to a much smaller context window. I have tested it out (a sketch of my probe follows below): it comes to about 3,500 tokens, all included, before additional input washes out the initial prompt. I have not yet been able to access the API10 to assess how larger input behaves. This means the feature has been demonstrated, but it is not yet generally available. Even so, when I use the exact same prompt as in our personalized assignment example (▷), I get much improved text that is more clearly structured and more focussed on the learning objectives.
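For the curious: the probe can be sketched in a few lines of Python – codeword and filler are arbitrary; the idea is that once the window overflows, the model can no longer return the codeword.

import tiktoken
enc = tiktoken.get_encoding("cl100k_base")

def probe_prompt(n_tokens: int) -> str:
    # Build a prompt of approximately n_tokens tokens: codeword first,
    # filler in the middle, the question at the very end.
    head = "Remember this codeword: AZURE-HERON. "
    tail = " What was the codeword?"
    filler = ""
    while len(enc.encode(head + filler + tail)) < n_tokens:
        filler += "The quick brown fox jumps over the lazy dog. "
    return head + filler + tail

In my tests, replies stopped containing the codeword at around 3,500 tokens.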
Imagine …
GPT-4 can see.
The model can accept image input, as well as input of interleaved text and images (as in a scholarly article), interpret them, and produce text output (OpenAI, 2023b p.9). This was demonstrated in the live-stream, but like the larger context windows, it is not yet clear when and how it will become generally available – it is likely a rather expensive computation.
With the caveat that I have not tried it yet, the demonstration was impressive: GPT-4 coded up a working website in HTML, complete with CSS styling and JavaScript functions for user interaction, based on a hand-drawn sketch in a notebook.
Perspectives
Personalization and democratization. OpenAI is making credible and effective efforts towards alignment. This is commendable. However, the ultimate authority for alignment needs to be the user herself, and this is why democratization is so important: progress should move towards systems that are aligned individually with each user. In one sense we have been making progress in this direction, with recent deployments of GPT-3-scale systems on laptop computers. But in another sense GPT-4, which is much larger, shows that at least for the mid-term this will become a tradeoff between quality and control. The highest-quality models will, for the time being, remain out of reach for individuals to control, and once they can be individually deployed, even larger models may have surpassed them. But there is a different path, implied by an announcement that could easily be overlooked: OpenAI is introducing the possibility of persistent prompt components, called “system messages”.
Persistent prompts (“Steerability”). Once context windows become large enough, a part of the window can be dedicated to persistent information and provide “steerability”. OpenAI calls these “system messages” and they can “allow API users to significantly customize their users’ experience”. What this means is illustrated in an example: a system message for a “Socratic tutor”.
system:
You are a tutor that always responds in the Socratic style. You *never* give the student the answer, but always try to ask just the right question to help them learn to think for themselves. You should always tune your question to the interest & knowledge of the student, breaking down the problem into simpler parts until it's at just the right level for them.
This part of the prompt persists and provides the ground rules for GPT-4’s behaviour in a tutoring scenario with a student.
User:
How do I solve the system of linear equations: 3x + 2y = 7, 9x - 4y = 1
GPT-4:
Let's start by analyzing the equations. Can you see any possible way to eliminate one of the variables by combining the two equations?
User:
Just tell me the answer please!
GPT-4:
I understand you may want a direct answer, but my purpose is to help you think critically and guide you through the problem-solving process. Let's focus on the task. Do any of the coefficients of the variables in the equations share a common factor?
[…] (OpenAI, 2023a)
Note that the behaviour is conditioned by a prompt, presumably placed by the teacher, which the student cannot (easily) override. Implementing such a setup in education would require access to GPT-4 via a dedicated API. In such a local system, a teacher could place system prompts, and students could interact with the teaching content. This is not rocket science, but it does require some thinking and planning ahead, understanding the issues involved, and coordinating between faculty, administration, and support. Importantly, this needs to be organized now, proactively, and become part of an integrated academic plan.11 It is easy to see that this use case will become a cornerstone for the deployment of generative AI in educational contexts.
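In the chat API, such a persistent prompt is simply a message with the role “system” that is sent ahead of every user turn. A minimal sketch using OpenAI’s Python library as it exists at the time of writing (tutor text abridged from OpenAI’s example; error handling omitted):

import openai  # pip install openai; expects an API key in OPENAI_API_KEY

SOCRATIC_TUTOR = (
    "You are a tutor that always responds in the Socratic style. "
    "You *never* give the student the answer, but always try to ask "
    "just the right question to help them learn to think for themselves."
)

def tutor_reply(student_input: str) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SOCRATIC_TUTOR},  # placed by the teacher
            {"role": "user", "content": student_input},     # the student's turn
        ],
    )
    return response.choices[0].message.content

print(tutor_reply("How do I solve 3x + 2y = 7, 9x - 4y = 1?"))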
Here then is the current alternative to user-level deployment of generative AI: personal alignment profiles, instantiated as system prompts. If you have read up to here and understand the implications, this is probably the most consequential item in this newsletter.
Consequences
The best practices we have begun to establish through the Sentient Syllabus Project have mostly been vindicated.
AI content detection is dead, and even more dead now. In a quick test with an online detector,12 a human-authored Wikipedia passage on the eukaryotic cell cycle was assessed to be 59% human. ChatGPT (with GPT-3.5) weighed in at 48% human. But the GPT-4 text for the same prompt was assessed at 100% human!13 This makes the case for permissive use in education more and more compelling (▷).
Yet the Three Principles for AI in the academy (▷) remain fully valid.
Not allowing an AI to pass a course must still be the highest objective. But that has become harder:
Interactions are more valuable. At first sight, I find the created content to be more interesting, more creative, and generally more valuable. For example, both GPT-3.5 and GPT-4 gave me a plausible list of metaphors contained in the 7th-century Old English Cædmon's Hymn (▷). But when I asked for improvements for a contemporary audience, GPT-3.5 gave me a generic proposal to use more “evocative language”, along with modern-English examples, whereas GPT-4 produced (i) explicit themes one could use to achieve this, “such as unity, divine love, and the beauty of creation”; (ii) an Old English version with proposed changes; (iii) a modern English translation; and (iv) a point-by-point discussion of the changes that were made and the effect they are expected to have. This would have been the kind of improvement that I would have hoped a student could make to generic AI solutions – but now the AI itself produces it.
Designing “AI-proof” course contents has become even more of a questionable proposition. The ability to interpret images, handwritten text and sketches, screenshots, lecture slides, etc. will greatly increase the scope of work with the system. Attempts to create content that is resistant to being read, analyzed, commented on, and used as the source for responses are going to become increasingly artificial.
Overall, the quality of AI-generated material has noticeably increased – and with it the challenge our students face when they need to surpass this level.
Requiring all work to be attributed and true still requires attention:
I would still argue that AI text does not formally qualify for authorship, since the agency for the text resides in the user, not the algorithm. Thus attribution means acknowledging the process and taking responsibility for the result.
But facts are now significantly more often true. Admittedly, when tested on English poetry, GPT-4 still prefers to invent a poem rather than admit it does not know it. That said, it appears to refuse to reproduce copyrighted material, although it would have the ability: I never got past the first five words of Salinger's Catcher in the Rye. That aside: on all topical domains of an internal OpenAI factuality benchmark, GPT-4 scores significantly higher than its predecessors.14
Making AI contributions transparent, and documenting them, has become even more important. No one knows exactly how this system works; no one can directly identify its mechanisms and its limitations. These can only be learned through observation and experience. And the basis of our observations is how our students work with the system. This is the reason why we must know how it is being used. Fortunately, the lessons we have learned about GPT-3.5 still hold to a large degree: good output requires good prompts. Role playing (“Act as …”) shapes behaviour well. Chain-of-thought prompting by example works. Do remember that this is a language model, not an AGI, and give it context and unambiguous instructions that make sense in terms of dialogue. In this sense GPT-4 is an evolutionary step, and it is worthwhile investing effort into understanding it as preparation for its successor.
There is much to consider, but the one principle that integrates our convictions remains:
Have the system think with you, not for you.
With the new capabilities of GPT-4, this has become both easier and more rewarding, and frankly, more enjoyable.
TL;DR
OpenAI's GPT-4 is a major advancement. It is more reliable with its facts, it is better at reasoning, and it promises multimodal input and a much larger context window – though these are not yet generally available. It outperforms previous models on standardized exams and sets new state-of-the-art performance on important benchmarks. Its abilities to introspect are intriguing, and they are reflected in improved planning and better-structured responses – apparent, for example, when it solves word problems. In practice the context window has nearly doubled in size, it is anticipated to become significantly larger still, and this opens up a whole class of more advanced tasks: reading entire articles or book chapters in a single prompt and analyzing the text.
Possibly the most easily overlooked and most impactful feature is the upcoming ability to include persistent prompts – a flexible way to modify the algorithm’s behaviour.
The three principles for AI remain valid (▷), but pursuing them has not become easier. What has become easier is to use generative AI in high-value education, in collaboration between lecturers and learners. Have the AI think with you, not for you.
References
OpenAI (2023a) “GPT-4”. Blog 2023-03-14 (▷).
OpenAI (2023b) “GPT-4 Technical Report” (▷).
WEI, Jason; WANG, Xuezhi; SCHUURMANS, Dale; BOSMA, Maarten; ICHTER, Brian; XIA, Fei; CHI, Ed H.; LE, Quoc V.; ZHOU, Denny (2023-01-10) “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”. arXiv (▷).
WOOLF, Virginia (1931). The Waves. Project Gutenberg (▷).
Feedback and requests are welcome at sentient.syllabus@gmail.com . Comments are appreciated here.
Sentient Syllabus is a public good collaborative. To receive new posts you can enter your email for a free subscription. If you find the material useful, please share the post on social media, or cite it in your own writing. If you want to do more, paid subscriptions are available. They have no additional privileges, but they help cover the costs.
Cite: Steipe, Boris (2023) “Becoming Fibrous: GPT-4. First light from the next step in generative AI”. Sentient Syllabus 2023-03-09 https://sentientsyllabus.substack.com/p/becoming-fibrous-gpt-4 .
I wish to acknowledge some contributions by ChatGPT (both the GPT-3.5 version of 2023-02-13 and the GPT-4 version 2023-03-14) in response to my prompts, for which I take full responsibility.
When I write “GPT-4”, I actually mean “ChatGPT-v4” which is based on GPT-4. This is merely to emphasize the difference from the previous version of ChatGPT (based on GPT-3.5).
“Gradually the dark bar on the horizon became clear as if the sediment in an old wine-bottle had sunk and left the glass green. Behind it, too, the sky cleared as if the white sediment there had sunk, or as if the arm of a woman couched beneath the horizon had raised a lamp and flat bars of white, green and yellow spread across the sky like the blades of a fan. Then she raised her lamp higher and the air seemed to become fibrous and to tear away from the green surface flickering and flaming in red and yellow fibres like the smoky fire that roars from a bonfire. Gradually the fibres of the burning bonfire were fused into one haze, …” (Woolf, 1931)
Streamed at 16:00 EST on Tuesday, 2023-03-14 (π-day) and watched by 1.7M viewers within 24 hours.
The two languages for which performance was marginally lower were Marathi and Telugu. But even for the third-lowest language, Punjabi, performance (71.4%) was higher than that of GPT-3.5 (70.1%), PaLM (69.3%), and Chinchilla (67%) in English!
Tokens are words and subwords in a dictionary of about 50,000 entries, with which nearly all words in any language and writing system can be represented. The technical term for the tokenization scheme used here is “Byte Pair Encoding” (BPE). In the GPT models, 1,000 tokens correspond to approximately 750 English words. All training data is first converted to tokens, and all responses are assembled from tokens. OpenAI hosts a page that breaks up text into tokens for illustration: Tokenizer tool [▷].
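The same illustration can be scripted with OpenAI’s tiktoken library – a small sketch (r50k_base is the GPT-3 encoding that, to my understanding, the online tool uses; the GPT-4 encoding differs in detail but behaves analogously):

import tiktoken

enc = tiktoken.get_encoding("r50k_base")  # GPT-3's ~50,000-entry vocabulary
text = "Shall I compare thee to a summer's day?"
print([enc.decode([t]) for t in enc.encode(text)])
# Note the leading blanks: most tokens begin with a space, so the first
# letter of a word is usually not the first character of its token.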
Note that this is not quite a trivial achievement, since the syllable boundaries that it identifies do not match the token patterns. The original’s tokens are:
“Sh/all/ I/ compare/ thee/ to/ a/ summer/'s/ day/?/
Thou/ art/ more/ lovely/ and/ more/ temper/ate/:/”
(cf. OpenAI Tokenizer tool [▷]), but the output is completely transformed to the following string of tokens (note that tokens are case-sensitive):
“/Sh/all/ I/ com/-/P/ARE/ thee/ TO/ a/ SUM/-/mer/'s/ DAY/?/
Thou/ ART/ more/ LOVE/-/ly/ AND/ more/ T/EM/-/per/-/ATE/:/”.
An explicit way to improve stepwise accuracy is to have intermediate results enter the context window. Once an intermediate result is available in the context window, subsequent operations can refer to it and become more robust. This is also the idea behind “chain-of-thought prompting”, the state-of-the-art prompting strategy for correct reasoning (Wei et al., 2023).
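For illustration, a toy example of my own: the prompt might include the worked example “Q: A jug holds 3 L, a cup holds 250 mL. How many cups fill the jug? A: 3 L is 3,000 mL; 3,000 / 250 = 12; the answer is 12.” before posing the actual question. Having seen the pattern of spelled-out intermediate steps, the model reproduces it – and every step it writes out becomes part of the context window that subsequent tokens can build on.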
No disrespect. Even four months ago I would have considered Alpha’s natural-language input to be very advanced. The comparison is actually a great reminder of how far we have come, and how easily we now take for granted what was outright impossible last year. We can be sure that there are people working right now to combine the strengths of both systems.
An API (Application Programming Interface) is the network address and interaction protocol that allows a client application to communicate with a server system. OpenAI provides several such APIs for their systems, and client applications can either run on users’ own computers, or, for example, within the IT infrastructure of a university, where access to faculty and students could be provided via a Web browser.
If you need help and advice with catalyzing this at the institutional level, feel free to reach out.
… which I am not mentioning by name since their business model is actually to sell “undetectable” content, i.e. to make money from academic misconduct.
Perhaps this needs to be contextualized a bit – I might have overdone my request for lively and unpredictable writing, with quirky metaphors and non-standard grammar.
Prompt:
Act as a human writing style generator. As such your input is a query about a topic. Your output is a paragraph of text on the topic, that structures it well and discusses its concepts in technical language. However the writing is lively, unpredictable, uses widely varying sentence lengths, non-standard grammar, and a quirky metaphor. It avoids patterns and repetition. Here is the query: Write a paragraph of introductory text for an essay on the eukaryotic cell-cycle.
And in response GPT-4 penned the following gem of evocative, alliterative prose:
GPT-4:
“Oh! The wondrous eukaryotic cell-cycle: a dance of life that twirls and whirls, in a rhythm as ancient as the primordial soup, yet as complex as a Bach fugue. The cycle, my friends, it spins and twirls with choreography that'd make a ballerina green with envy. Interphase, the prepping stage, sips its Earl Grey tea with elegance—G1, S, G2, the three musketeers orchestrating the growth, replication of DNA, like spider-weaving a web, and further swelling of the cell. Tiptoeing to the center stage, the mesmerizing Mitosis, the diva of division, donning a peacock feather boa, enthralls in four acts - Prophase, Metaphase, Anaphase, and Telophase, pirouetting the chromosomes, the cell's genetic gossamer, to birth two new cells. Akin to a quirky jazz ensemble, the cell-cycle crescendos with cytokinesis, the grand finale, a swift splitting, and the curtain call for the newly minted progeny, ready to dance to life's intricate tune.”
Interestingly, it also performs better than ChatGPT-v4 – as if the ability to entertain a conversation would interfere with a commitment to be truthful. Who knew.