
Stable Diffusion Pictionary with Bot Ross


Introduction


Before we dive into the project, we should at least provide a brief explanation of Stable Diffusion. Since it’s already been the subject of mass interweb hysteria for quite some time, I’m not going to go into any great detail.

Stable Diffusion is a generative art model that was originally put forth by the Computer Vision research group at Ludwig Maximilian University. Why is this significant? Although papers on generative art systems had been published prior to Stable Diffusion, the key difference is that a proper model has not only been trained but, more importantly, freely released to the public thanks to Stability AI. This is in stark contrast to DALL-E, where OpenAI keeps a tight proprietary lid on its models.

So how good is Stable Diffusion, you ask? Feast your eyes on this stunning 4K render of Julie Andrews from the beloved Disney classic Mary Poppins, when you get too excited and accidentally misspell her last name as “Ploppins”:

Mary Ploppins doesn’t feed the birds, the birds *FEED* Mary Ploppins

Fantastic. Okay, but seriously, how about this one?

An RPG overworld map

As time has progressed, the open-source community has coalesced around making Stable Diffusion more approachable and deployable, even on machines with modestly powerful discrete GPUs.

Ever since seeing Midjourney, a popular AI generative art system fully accessible from Discord, I’d played around with the idea of having our own generative art system available in a similarly accessible manner, but with the added goal of building a game around it.

Stable Diffusion and other models are particularly strong at taking existing concepts and applying different art styles to them.

realism, line drawing, cubism, and pop art

Here we have four different takes on the Vulcan Spock in realism, line drawing, cubism, and pop art respectively.

This looks like a perfect application for the classic game of Pictionary. Pictionary, for the uninitiated, is a game where one person draws the word on a given cue card and the other players attempt to guess what it is.

Let’s get started. First things first, we need to decide how we’re going to get access to Stable Diffusion.

Stable Diffusion API


We have a couple of options.

Make use of SD as a service

The most popular service is probably Dream Studio.

Pros:

  • Simplicity
  • Has a Python API wrapper

Cons:

  • Subject to “big AI” terms and conditions
  • Cost

Run on a dedicated ML platform

Paperspace is a platform that lets you spin up discrete GPU hardware (Teslas, etc).

Pros:

  • Can run whatever model I want on it

Cons:

  • Would need to expose a REST service through their infrastructure
  • Cost again

Host it ourselves

I finally went with door number 3, and decided to repurpose an old RTX 2060 Linux laptop from a few years ago to be our dedicated diffusion machine.

So we fetch the latest version of AUTOMATIC1111, one of the most active and popular web UI frontends for Stable Diffusion. Bonus: it’s capable of running on relatively modest hardware, in my case, an NVIDIA card with 6 GB of VRAM.

Unfortunately for us, it uses Gradio to generate its web interface, which means all the endpoints are obscurely named.

Digging around for a bit, we find that TomJamesPearce put together a simple proof-of-concept API using uvicorn and FastAPI on top of it. Not knowing whether this was going to be extended and having more familiarity with JavaScript, I wrote a minimal Node Express server with super basic header-based authentication that would call this API.
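In rough strokes, the communicator looks something like the sketch below. Be warned: the upstream /txt2img route, its payload shape, and the env var names here are my own placeholders, not anything the proof-of-concept API guarantees.

// server.js - minimal communicator sketch (Node 18+ for global fetch);
// the upstream /txt2img route, payload shape, and env var names are assumptions
const express = require('express');
const app = express();
app.use(express.json());

// Super basic header-based auth: reject anything without our shared secret
app.use((req, res, next) => {
  if (req.get('x-api-key') !== process.env.BOT_ROSS_KEY) return res.sendStatus(401);
  next();
});

// Proxy a prompt through to the local Stable Diffusion API
app.post('/generate', async (req, res) => {
  const upstream = await fetch('http://localhost:8000/txt2img', {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({ prompt: req.body.prompt }),
  });
  const { image } = await upstream.json(); // base64-encoded PNG
  res.json({ image });
});

app.listen(3000);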

Using this API, we can get back a base64 encoded image. Looking good, let’s move on to the…

Discord Bot


Registration

First, let’s get our bot registered with Discord. As this was my first time building a Discord bot, I was a bit apprehensive, but it turns out it’s as simple as logging into the developer portal, registering an app, and then adding a bot. Here’s our initial bot’s profile:

From there, we take our bot token, plop it into an environment variable, and bring in the discord.js npm package.

Discord bots are relatively simple. Register a set of slash commands, create a client, subscribe to the appropriate events, and you’re off to the races.
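As a sketch, the skeleton looks roughly like this (assuming discord.js v14; the command name and env var names are mine):

// bot.js - skeleton sketch (discord.js v14 assumed; env var names are mine)
const { Client, GatewayIntentBits, REST, Routes, SlashCommandBuilder } = require('discord.js');

const commands = [
  new SlashCommandBuilder().setName('pictionary').setDescription('Start a round').toJSON(),
];

async function main() {
  // One-time registration of our slash command with Discord
  const rest = new REST({ version: '10' }).setToken(process.env.BOT_TOKEN);
  await rest.put(Routes.applicationCommands(process.env.APP_ID), { body: commands });

  // MessageContent is needed so we can read the players' guesses
  const client = new Client({
    intents: [GatewayIntentBits.Guilds, GatewayIntentBits.GuildMessages, GatewayIntentBits.MessageContent],
  });

  client.on('interactionCreate', async (interaction) => {
    if (!interaction.isChatInputCommand() || interaction.commandName !== 'pictionary') return;
    await interaction.reply('Starting a new round…');
  });

  await client.login(process.env.BOT_TOKEN);
}

main();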

Schema

We’ll need a set of “pictionary cards”, so let’s create a simple schema for a pack:

/**
 * A category pack
 * @typedef {Object} Pack
 * @property {string} name - pack name
 * @property {string[]} words - words in the pack
 * @property {string} description - pack description
 * @property {string[]} tags - suffixes appended to a word to steer generation
 */

What are the tags for? Some concepts are a little too general, so tags are there to add guidance to SD while not interfering with the words themselves. If we had a pack for Disney characters, doing a render for the word “Mickey Mouse” would be simple since it’s unambiguous, but what about this one:

That’s what happened when we generated “Basil”, the name of Disney’s mouse detective, without any tags.

Here’s another render with the prompt Basil, Disney:

I’m going to pretend the marshmallow on a stick is a clue that he found.

That’s a little more guessable. So we separate the word from the tags.
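Here’s a hypothetical pack to make that concrete (the words and tag are purely illustrative):

// Hypothetical Disney pack - the tag steers SD toward the right "Basil"
// without becoming part of the answer players have to guess
const disneyPack = {
  name: 'disney',
  description: 'Disney characters, from the obvious to the obscure',
  words: ['Mickey Mouse', 'Basil', 'Mulan'],
  tags: ['Disney'],
};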

So a player types pictionary to start a round, we pick a random word, generate the image, and wait for a player to type the correct response. Simple enough.
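In code, a round might look roughly like this sketch (it assumes the /generate proxy from earlier; the 60-second guess window is arbitrary):

// Rough shape of a round - assumes the /generate proxy and a loaded pack
async function playRound(channel, pack) {
  const word = pack.words[Math.floor(Math.random() * pack.words.length)];
  const prompt = [word, ...pack.tags].join(', ');

  const res = await fetch('http://localhost:3000/generate', {
    method: 'POST',
    headers: { 'content-type': 'application/json', 'x-api-key': process.env.BOT_ROSS_KEY },
    body: JSON.stringify({ prompt }),
  });
  const { image } = await res.json();
  await channel.send({ files: [{ attachment: Buffer.from(image, 'base64'), name: 'round.png' }] });

  // First message matching the word wins the round
  const guesses = await channel.awaitMessages({
    filter: (m) => m.content.toLowerCase() === word.toLowerCase(),
    max: 1,
    time: 60_000,
  });
  const winner = guesses.first();
  await channel.send(winner ? `${winner.author} got it: ${word}!` : `Time's up! It was ${word}.`);
}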

Styles

Now this is fine as a proof of concept and works well enough, but let’s leverage Stable Diffusion’s ability to render images in different styles. Furthermore, let’s add a random set of modifiers: stuff like highly realistic, intricate details, steampunk, etc.

In the end, we landed on the following template:

[word] in the style of [artist], [medium], [modifiers]

Here’s an example of something generated:

Joe Pesci in the style of Caravaggio
Meerkat in the style of Leonardo da Vinci, stained glass
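Prompt assembly is then just a matter of gluing the pieces together. A minimal sketch, with illustrative artist/medium/modifier lists rather than the bot’s actual ones:

// Prompt builder sketch - the lists here are illustrative, not the real ones
const artists = ['Caravaggio', 'Leonardo da Vinci'];
const media = ['oil painting', 'stained glass', 'line drawing'];
const modifiers = ['highly realistic', 'intricate details', 'steampunk'];

const pick = (arr) => arr[Math.floor(Math.random() * arr.length)];

// [word] in the style of [artist], [medium], [modifiers]
function buildPrompt(word, tags = []) {
  return [
    `${[word, ...tags].join(', ')} in the style of ${pick(artists)}`,
    pick(media),
    pick(modifiers),
  ].join(', ');
}

// e.g. "Meerkat, Disney in the style of Leonardo da Vinci, stained glass, steampunk"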

Play testing


Let’s give it a shot on Discord:

Sigh. It takes about 10-15 seconds to generate a 512×512 image at 30 steps, or iterations. This is not terrible, but since it is a measurable amount of time, the players are just kind of twiddling their virtual thumbs until Stable Diffusion spits out our image. What if we could make this slowness an asset? That got me thinking about college bowl trivia. In contrast to your local trivia night, a college bowl question is often an entire paragraph consisting of highly specific clues which gradually become more general as the card is read. In this manner, it rewards players with deep knowledge of the subject, who can buzz in earlier as the clue is read aloud.

For example: “Early in this novel, a rule is established that only the person holding a conch shell may talk during group meetings. Glasses belonging to (*) Piggy are broken by the choirboy Jack. School-aged children are stranded on an island in—for 10 points—name this novel by William Golding.”

So… what if we just started displaying the image immediately? Each time a step is generated, we’ll update the existing image in Discord, giving the players an interactive slideshow that gets progressively more detailed as the round continues.
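Here’s a rough sketch of that loop. Note that a /progress endpoint returning the latest intermediate image is an assumption on my part about the API surface, not something it shipped with:

// Progressive reveal sketch - /progress returning the latest intermediate
// image is an assumption about the API
const postJson = (path, body = {}) =>
  fetch(`http://localhost:3000${path}`, {
    method: 'POST',
    headers: { 'content-type': 'application/json', 'x-api-key': process.env.BOT_ROSS_KEY },
    body: JSON.stringify(body),
  }).then((r) => r.json());

async function revealProgressively(channel, prompt) {
  // Kick off the full generation but don't await it yet
  const finalResult = postJson('/generate', { prompt });
  let done = false;
  finalResult.then(() => { done = true; }, () => { done = true; });

  let preview = await channel.send('Generating…');
  while (!done) {
    const { image } = await postJson('/progress');
    if (image) {
      // Swap the attachment in place so players watch one evolving image
      preview = await preview.edit({
        content: 'Guess away!',
        attachments: [], // drop the previous attachment before adding the new one
        files: [{ attachment: Buffer.from(image, 'base64'), name: 'step.png' }],
      });
    }
    await new Promise((resolve) => setTimeout(resolve, 1500));
  }

  const { image } = await finalResult;
  await preview.edit({
    attachments: [],
    files: [{ attachment: Buffer.from(image, 'base64'), name: 'final.png' }],
  });
}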

Daniel Craig -> Anderson Cooper

Success! Our overall logic looks like this:

And now we’re finally ready to play a few rounds!

Post mortem


Coherence

Originally, we had thought that using a constant seed would be better, since it would help ensure a certain level of image consistency between successive iterations. Unfortunately, this has the side effect that if a specified seed isn’t particularly relevant to the prompt, it can make the experience frustrating for the users. Take a look at this image sequence.

Guess who this is supposed to be.

It’s like some unholy fusion of Oswald Cobblepot and Shrek. None of these iterative images seem to have any connection to the prompt. Care to guess what the original prompt was? If you get it, you could probably take out the Gamemaster as well.

Duh, it’s Shan Yu from Mulan; the resemblance is uncanny.

If we randomize the seed every iteration, it’s more likely that at least one of the generated images will bear a passing resemblance to our original prompt.
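In practice that’s a one-line change to the request body (assuming the proxy forwards a seed field through to SD):

// Fresh seed every generation - assumes the proxy passes `seed` through to SD
const body = {
  prompt,
  seed: Math.floor(Math.random() * 2 ** 32),
};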

Extending Python API

In the end, all three components (the Discord bot, the node communicator, and Stable Diffusion) are running on the same machine. We should probably scrap the communicator and just work on making the Python server more robust.

Adding Queueing

If we ever want to use this SD web API for anything else, we’ll probably need to stick a queueing mechanism in front of it so that requests don’t get rejected outright while it’s in the middle of a generation.
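A minimal sketch of what that could look like in the node layer: chain each request onto the previous one so the GPU only ever runs one generation at a time.

// Serialize SD requests - one generation at a time on the GPU
let chain = Promise.resolve();

function enqueue(task) {
  const result = chain.then(task, task); // run even if the previous job failed
  chain = result.catch(() => {});        // swallow errors so the chain stays alive
  return result;
}

// Callers funnel through the queue instead of hitting SD directly:
// const { image } = await enqueue(() => postJson('/generate', { prompt }));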

Deployment

We should really stick this entire setup into a Docker image to make it a bit easier to deploy.

What’s left for Bot Ross?


There’s a lot of stuff we could add. For one, it would be great to make this bot publicly available so that anyone could invite it to their servers, but it doesn’t really scale… to put it delicately.

We generate our pictionary images “on demand” for any given game, and since I don’t have a load balancer sitting in front of a couple dozen RTX builds just lying around the house, I would have to revise the design.

To build it in a scalable manner, we could go one of two routes.

Route 1: Continual pregeneration

Instead of having the Discord bot request images in realtime, we’d have a worker thread in our node communicator generate a complete list of “prompt permutations” and churn out images endlessly, uploading them to an S3 bucket stacked like cordwood. Then our communicator would parse an incoming prompt and fetch the “cached” image sequence from S3.
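Sketched out, assuming @aws-sdk/client-s3, the postJson helper from earlier, and a bucket name of my choosing (the permutation loop is simplified to word × artist):

// Route 1 sketch - pregenerate every word x artist permutation into S3
const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');

const s3 = new S3Client({});

async function pregenerate(pack, artists) {
  for (const word of pack.words) {
    for (const artist of artists) {
      const prompt = `${word} in the style of ${artist}`;
      const { image } = await postJson('/generate', { prompt });
      // Key the object by prompt so incoming requests can be looked up later
      await s3.send(new PutObjectCommand({
        Bucket: 'bot-ross-cache',
        Key: `${encodeURIComponent(prompt)}.png`,
        Body: Buffer.from(image, 'base64'),
        ContentType: 'image/png',
      }));
    }
  }
}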

This feels… fine I guess, but also kind of a copout, and not as cool as a “realtime diffusion pictionary game”.

Route 2: Multi-game Sync

Our discord bot would take the first channel to “click” New Round, and kick off a sequence of generations. Any other subsequent guild/channel would effectively be joining that existing round, though they would be unaware of it.

This is fine but does come with the caveat that channels cannot pick specific category packs, since everybody is locked into the same ongoing game.

How can I play it?


If you really really really want to play it, shoot me a message via Discord, wunderbaba#0001 and I’ll invite you to my guild to try it out.

All the code (such as it is) is fully available on my GitHub. It was written as a proof of concept, and is about as robust as Samuel L. Jackson’s character from the movie Unbreakable.

Credits


This project would not have been possible without the tireless efforts of the AI open-source community at large. Big shoutout to AUTOMATIC1111 who currently hosts one of the best Gradio UI/UX frontends for running Stable Diffusion.
