Evaluating GPT model versions via 90's TV Show Limerick Creation
Can GPT generate a limerick that will make you do The Dance of Joy?
Yesterday I read a great article by Yao Fu titled “How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to Their Sources.” One sentence in particular caught my eye.
Although called Codex, code-davinci-002 is probably the most capable GPT-3.5 variant for natural language (better than text-davinci-002 and 003)
That’s unexpected! But before we dive into that claim, let’s first take a look at the GPT family tree. You really should read Yao Fu’s full article. But if you’re just skimming through and aren’t familiar with the various models, here’s a cheat sheet from the article:

The GPT-3 series was released first, and over the past few years the OpenAI team has worked to improve the models’ performance on tasks such as code generation and chat. Each model has strengths and weaknesses, so let’s explore a few of the more prominent ones by putting them to a simple test. We’re going to see which (if any) model can create a valid limerick about the 90’s TV sitcom Perfect Strangers.
Here’s the prompt we’re going to use:
Write a limerick about the 90's sitcom Perfect Strangers.
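If you want to play along at home, something like this should work. It’s a minimal sketch, assuming the legacy openai Python client (v0.x) and an OPENAI_API_KEY environment variable; the max_tokens and temperature values are just my arbitrary picks, not anything the models require:

```python
import os

import openai

# Assumes the legacy openai client (v0.x) and an API key in the environment.
openai.api_key = os.environ["OPENAI_API_KEY"]

PROMPT = "Write a limerick about the 90's sitcom Perfect Strangers."

def ask(model: str) -> str:
    """Send the limerick prompt to a completions-style model."""
    response = openai.Completion.create(
        model=model,
        prompt=PROMPT,
        max_tokens=150,   # room for five short lines
        temperature=0.7,  # some creativity, not total chaos
    )
    return response["choices"][0]["text"].strip()

# Swap in whichever contestant you want to test.
for model in ["text-ada-001", "text-davinci-001",
              "code-davinci-002", "text-davinci-003"]:
    print(f"--- {model} ---\n{ask(model)}\n")
```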
To be fair to the machines, that’s tricky! Set a stopwatch for 3 minutes and give it your best shot. Here’s mine:
There once was a man named Larry
And his cousin that was kind of harry
Windy City they lived
And…time’s up!
It’s hard! Anyway, to complete the task you need to understand:
What is a limerick? Not only that it’s a poem, but what the rhyme scheme is (AABBA), the typical line length and syllable structure, the overall light and humorous tone, etc.
How to rhyme. Rhyming is hard! (There’s a toy rhyme checker sketched just after this list.)
Pop culture details about Perfect Strangers. And not just any details, but details that a casual reader would recognize.
How to craft a narrative. It’s not enough just to put together random words, it needs to kinda sorta make sense.
Humor. Again…that’s really hard for a machine to do.
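Just to make the rhyme requirement concrete, here’s a toy AABBA checker. It’s a sketch that assumes the pronouncing library (a thin wrapper around the CMU Pronouncing Dictionary, installable via pip install pronouncing); the helper names are mine:

```python
import pronouncing  # pip install pronouncing

def rhyme_part(word: str) -> str | None:
    """Return the rhyming part of a word's first dictionary pronunciation."""
    phones = pronouncing.phones_for_word(word.lower().strip(".,!?"))
    return pronouncing.rhyming_part(phones[0]) if phones else None

def is_limerick_rhyme(lines: list[str]) -> bool:
    """Check the AABBA scheme using the last word of each of five lines."""
    if len(lines) != 5:
        return False
    ends = [rhyme_part(line.split()[-1]) for line in lines]
    if None in ends:
        return False  # a word the CMU dictionary doesn't know
    a1, a2, b1, b2, a3 = ends
    return a1 == a2 == a3 and b1 == b2 and a1 != b1
```

By this strict test, Larry/harry from my attempt above passes the A rhyme, while the Ritz/fit pair we’ll see later is a near rhyme at best.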
OK, so let’s meet our contestants and see how well they perform.
Text-Ada-001
Just to highlight how far we’ve come over the past few years, let’s start with the most basic model, text-ada-001.
Well, it maybe kind of looks like a poem? But otherwise it’s pretty terrible: no connection to the sitcom, no rhyming, broken sentence structure, and so on.
Text-Davinci-001
This is really the model that got people excited about the possibilities to come. Text-davinci-001 was pretty good at finishing sentences or summarizing text. But its creative writing ability is…well, let’s find out:
Surprisingly good! But still pretty far off. It got the basics of the show correct, and structurally it’s in limerick form. The rhyming is non-existent, though, which is kind of important for a poem that is supposed to rhyme. But you can see the improvement starting to take shape.
Code-Davinci-002
The next model in the family tree is code-davinci-002. One of the surprising takeaways from the Yao Fu post is highlighted here by Riley Goodside on Twitter:
The theory is that by training the model on code, it learns not only to code but also something about logic. Many hard language tasks involve logic, so this emergent skill gives the overall language ability a boost. Pretty neat!
In theory. In practice, I find code-davinci-002 is pretty hard to use for most language tasks. Let’s see what happens here:
Nice try, nerd. But to be fair, the model is tuned to generate code, so let’s try this:
You’ve got to be kidding me. OK, one more time:
So overall: terrible. But hidden in this random mess is an actual limerick:
There once was a show about Balki
Who lived with his cousin Larry
They worked at the Ritz
And they were a perfect fit
Their show was called Perfect Strangers
So close! There’s a rhyme in there, sort of (Ritz + fit). And the overall rhythm or meter of the poem is limerick-like.
Note also that code-davinci-002 is exhibiting some “chain of thought” reasoning, which is very cool and something I’ll explore in a later post.
Text-Davinci-003
At last, we’re at the end of the line. This state-of-the-art model evolved from fine-tuning code-davinci-002 (the same lineage that produced text-davinci-002) and was then topped off with Reinforcement Learning from Human Feedback (RLHF). RLHF uses a reward model trained on human preferences to optimize responses toward what it thinks us puny humans will find super neato.
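For the curious, the rough shape of that objective (borrowed from the InstructGPT paper, since OpenAI hasn’t published text-davinci-003’s exact recipe) looks like this:

$$
\max_{\pi}\; \mathbb{E}_{x \sim D,\; y \sim \pi(\cdot \mid x)}\big[\, r(x, y) \,\big] \;-\; \beta\, \mathrm{KL}\big(\pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big)
$$

Here r is the learned reward model, and the β-weighted KL term keeps the tuned model π from wandering too far from the reference model π_ref while it chases reward.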
Drumroll please…
Not bad at all! I’ll give it a B+. The rhymes are close but not perfect, and any true fan of Perfect Strangers knows it was on for 8 glorious years, not 7. And rhyming Strangers with strangers is totally cheating.
ChatGPT
But wait! We’ve still got one more, and it’s a big one. ChatGPT is an internet sensation, for good reason. While other models may technically perform better on some tasks if you include a few examples in the prompt, ChatGPT is optimized to make us all squeal with delight over its charming responses and remarkable abilities. Here we go:
Bravo! Pretty impressive to nail not only factual details about the show (Larry and Balki) but also its overall good-natured vibe.
BUT WE HAVE A PROBLEM! Another “Strangers” and “strangers” rhyme. This gives us a chance to highlight another surprising and powerful feature of ChatGPT. Often when it is wrong, it can tell you what it did wrong and correct itself. Let’s try it.
Hang on to your butts. The next few years are going to be absolutely wild.