
Stable Diffusion Pictionary with Bot Ross

Intro

Before we dive into the project, we should at least provide a brief explanation of Stable Diffusion. Since it’s already been the subject of mass interweb hysteria for quite some time, I’m not going to go into any great detail.

Stable Diffusion is a generative art model and algorithm originally put forth by the Computer Vision research group at Ludwig Maximilian University. Why is this significant? Although similar papers and algorithms had been published before, the crux here is that a properly trained model was not only produced but, more importantly, released to the public, thanks to Stability AI. This is in stark contrast to DALL-E, where OpenAI keeps a tight proprietary lid on its models.

So how good is Stable Diffusion, you ask? Feast your eyes on this stunning 4K render of Julie Andrews from the beloved Disney classic Mary Poppins, made when you get too excited and accidentally misspell her last name as “Ploppins”:

Mary Ploppins doesn’t feed the birds, the birds *FEED* Mary Ploppins

Fantastic. Okay, but seriously, how about this one?

An RPG overworld map

As time has progressed, the open source community has coalesced around making Stable Diffusion more approachable and deployable, even on machines with modestly powerful discrete GPUs.

Ever since seeing Midjourney, a popular AI generative art system fully accessible from Discord, I’d played around with the idea of having our own generative art system available in a similarly accessible manner, but with the added goal of building a game around it.

Stable Diffusion and other models are particularly strong at taking existing concepts and applying different art styles to them.

realism, line drawing, cubism, and pop art

Here we have four different takes on the Vulcan Spock in realism, line drawing, cubism, and pop art respectively.

This looks like a perfect application for the classic game of Pictionary. Pictionary, for the uninitiated, is a game where one person draws the subject on a given cue card and the other players attempt to guess what it is.

Let’s get started. First things first, we need to decide how we’re going to get access to Stable Diffusion.

Stable Diffusion API

We have a couple options.

1. Make use of SD as a service

The most popular service is probably Dream Studio.

Pros:

  • Simplicity
  • Has a Python API wrapper

Cons:

  • Subject to “big AI” terms and conditions
  • Cost

2. Run on a dedicated ML platform

Paperspace is a platform that lets you spin up discrete GPU hardware (Teslas, etc.).

Pros:

  • Can run whatever model I want on it

Cons:

  • Exposing a REST service through their servers
  • Cost again

3. Host it yourself

I finally went with door number 3, and decided to repurpose an old RTX 2060 Linux laptop from a few years ago as our dedicated diffusion machine.

So we fetch the latest version of AUTOMATIC1111, one of the most active and popular forks of Stable Diffusion. Bonus: it’s capable of running on relatively modest hardware, in my case an NVIDIA card with 6GB of VRAM.

Unfortunately for us, it uses Gradio to generate its web interface, which means all the endpoints are obscurely named.

Digging around for a bit, we find that TomJamesPearce put together a simple proof-of-concept API using uvicorn and FastAPI on top of it. Not knowing whether this was going to be extended, and having more familiarity with JavaScript, I wrote a minimal Node Express server with super basic header-based authentication that would call this API.
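
Sketched out, the communicator looks something like this (the route, port, and header name are illustrative assumptions, not the proof-of-concept API’s actual endpoints):

const express = require('express');
const app = express();
app.use(express.json());

app.post('/generate', async (req, res) => {
  // Super basic header-based authentication
  if (req.headers['x-api-key'] !== process.env.API_KEY) {
    return res.status(401).send('Unauthorized');
  }
  // Forward the prompt to the local SD API (Node 18+ has a global fetch)
  const sdResponse = await fetch('http://localhost:8000/txt2img', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt: req.body.prompt }),
  });
  // Relay the result, which contains a base64-encoded image
  res.json(await sdResponse.json());
});

app.listen(3000);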

Using this API, we can get back a base64-encoded image. Looking good, let’s move on to…

Discord Bot

Registration

First, let’s get our bot registered with Discord. As this was my first time building a Discord bot, I was a bit apprehensive, but it turns out it’s as simple as logging into the developer portal, registering an app, and then adding a bot. Here’s our initial bot’s profile:

From there, we take our bot token, plop it into an environment variable, and bring in the discord.js npm package.

Discord bots are relatively simple. Register a set of slash (/) commands, create a client, subscribe to the appropriate events, and you’re off to the races.
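
A minimal sketch with discord.js v14 (the command name here is a placeholder):

const { Client, GatewayIntentBits, Events, REST, Routes, SlashCommandBuilder } = require('discord.js');

const commands = [
  new SlashCommandBuilder()
    .setName('pictionary')
    .setDescription('Start a new pictionary round')
    .toJSON(),
];

(async () => {
  // Register our slash commands with Discord
  const rest = new REST().setToken(process.env.BOT_TOKEN);
  await rest.put(Routes.applicationCommands(process.env.APP_ID), { body: commands });

  // Create a client; MessageContent lets us read the players' guesses
  const client = new Client({
    intents: [GatewayIntentBits.Guilds, GatewayIntentBits.GuildMessages, GatewayIntentBits.MessageContent],
  });

  client.on(Events.InteractionCreate, async (interaction) => {
    if (interaction.isChatInputCommand() && interaction.commandName === 'pictionary') {
      await interaction.reply('Starting a new round…');
    }
  });

  await client.login(process.env.BOT_TOKEN);
})();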

Schema

We’ll need a set of “pictionary cards”, so let’s create a simple schema for a pack:

/**
 * A Category pack
 * @typedef {Object} Pack
 * @property {string} name - The name of the pack
 * @property {string[]} words - The words in the pack
 * @property {string} description - The description of the pack
 * @property {string[]} tags - List of words that are suffixed to a word in the pack
 */

What are the tags for, you ask? Some concepts are a little too general, so tags are there to add guidance to SD while at the same time not interfering with the words themselves. If we had a pack of Disney characters, doing a render for the word “Mickey Mouse” would be simple since it’s unambiguous, but what about this one:

That’s what happened when we generated “Basil” (the name of Disney’s mouse detective) without any tags.

Here’s another render with the prompt “Basil, Disney”:

I’m going to pretend the marshmallow on a stick is a clue that he found.

That’s a little more guessable. So we separate the word from the tags.
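
A hypothetical pack instance (the words here are just examples):

/** @type {Pack} */
const disneyPack = {
  name: 'Disney',
  description: 'Characters from Disney animated films',
  words: ['Mickey Mouse', 'Basil', 'Mulan'],
  // Suffixed to the chosen word at prompt time, e.g. "Basil, Disney"
  tags: ['Disney'],
};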

So: a player types the pictionary command to start a round, we pick a random word, generate the image, and wait for a player to type the correct response. Simple enough.
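
Drawing a card is then just a couple of random picks (a sketch; the helper name is made up):

// Pick a random word from the pack and suffix a random tag
function drawCard(pack) {
  const word = pack.words[Math.floor(Math.random() * pack.words.length)];
  const tag = pack.tags[Math.floor(Math.random() * pack.tags.length)];
  return { word, prompt: `${word}, ${tag}` };
}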

Styles

Now this is fine as a proof of concept and works well enough, but let’s leverage Stable Diffusion’s ability to render images in different styles. Furthermore, let’s add a random set of modifiers: stuff like highly realistic, intricate details, steampunk, etc.

In the end, we landed on the following template:

[word] in the style of [artist], [medium], [modifiers]

Here’s an example of something generated:

Joe Pesci in the style of Caravaggio
Meerkat in the style of Leonardo da Vinci, stained glass
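
As a sketch, filling in the template might look like this (the artist, medium, and modifier lists below are illustrative placeholders, not the actual ones used):

const pick = (arr) => arr[Math.floor(Math.random() * arr.length)];

const artists = ['Caravaggio', 'Leonardo da Vinci', 'Andy Warhol'];
const mediums = ['oil painting', 'stained glass', 'line drawing'];
const modifiers = ['highly realistic', 'intricate details', 'steampunk'];

// [word] in the style of [artist], [medium], [modifiers]
function buildPrompt(word) {
  return `${word} in the style of ${pick(artists)}, ${pick(mediums)}, ${pick(modifiers)}`;
}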

Play testing

Let’s give it a shot on Discord:

Sigh. It takes about 10-15 seconds to generate a 512×512 image at 30 steps (iterations). This is not terrible, but since it is a measurable amount of time, the players are just kind of twiddling their virtual thumbs until Stable Diffusion spits out our image. What if we could make this slowness an asset?

That got me thinking about college bowl trivia. In contrast to your local trivia night, a college bowl question is often an entire paragraph of highly specific clues which gradually become more general as the card is read. In this manner, it rewards players with deep knowledge of the subject, who can buzz in earlier as the clue is read aloud.

For example: “Early in this novel, a rule is established that only the person holding a conch shell may talk during group meetings. Glasses belonging to (*) Piggy are broken by the choirboy Jack. School-aged children are stranded on an island in—for 10 points—name this novel by William Golding.”

So… what if we just started displaying the image immediately? Each time a step completes, we’ll update the existing image in Discord, giving the players an interactive slideshow that gets progressively more detailed as the round continues.

Daniel Craig -> Anderson Cooper

Success! Our overall logic looks like this:
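
In rough JavaScript (the generation and progress-polling helpers here are assumptions, not the actual implementation):

const { AttachmentBuilder } = require('discord.js');

async function runRound(interaction, prompt) {
  await interaction.reply('Generating…');
  const job = await startGeneration(prompt); // hypothetical: kicks off txt2img

  while (!(await job.isDone())) {
    const png = await job.latestStepImage(); // hypothetical: base64 progress image
    if (png) {
      const file = new AttachmentBuilder(Buffer.from(png, 'base64'), { name: 'round.png' });
      await interaction.editReply({ files: [file] }); // swap in the newer image
    }
    await new Promise((resolve) => setTimeout(resolve, 1000)); // poll about once a second
  }
}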

And now we’re finally ready to play a few rounds!

Post mortem

Coherence

Originally, we had thought that using a constant seed would be better, since it would help ensure a certain level of image consistency between successive iterations. Unfortunately, this has the side effect that if the specified seed isn’t particularly relevant to the prompt, it can make the experience frustrating for the players. Take a look at this image sequence.

Guess who this is supposed to be.

It’s like some unholy fusion of Oswald Cobblepot and Shrek. None of these iterative images seem to have any connection to the prompt. Care to guess what the original prompt was? If you get it, you could probably take out the Gamemaster as well.

Duh, it’s Shan Yu from Mulan, the resemblance is uncanny.

If we randomize the seed on every iteration, it’s more likely that at least one of the generated images will bear a passing resemblance to our original prompt.
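
In practice that’s just a change to the request body (assuming the API follows the common convention where -1 means “pick a random seed”):

// Request a fresh random seed for each generation
// (seed: -1 as "randomize" is an assumption about this API)
const payload = { prompt, steps: 30, seed: -1 };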

Extending Python API

In the end, all three components (the Discord bot, the Node communicator, and Stable Diffusion) are running on the same machine. We should probably scrap the communicator and just work on making the Python server more robust.

Adding Queueing

If we ever want to use this SD web API for anything else, we’ll probably need to stick a queueing mechanism in front of it so that requests don’t get rejected outright if it’s in the middle of a generation.
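
Even a naive promise chain would do the trick; here’s a sketch, with generateImage standing in for the actual call:

// Serialize generation requests so they wait their turn instead of
// being rejected while the GPU is busy
let queue = Promise.resolve();

function enqueueGeneration(prompt) {
  const task = queue.then(() => generateImage(prompt));
  queue = task.catch(() => {}); // keep the chain alive even if a job fails
  return task;
}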

Deployment

We should really stick this entire setup into a Docker image to make it a bit easier to deploy.

What’s left for Bot Ross?

There’s a lot of stuff that we could add – for one, it would be great to make this bot publicly available so that anyone could invite it to their servers, but it doesn’t really scale… to put it delicately.

We generate our pictionary images “on demand” for any given game, and since I don’t have a load balancer sitting in front of a couple dozen RTX builds just lying around the house, I would have to revise the design.

To build it in a scalable manner, we could go one of two routes.

Route 1: Continual pregeneration

Instead of having the Discord bot request images in real time, we’d have a worker thread in our Node communicator that would generate a complete list of “prompt permutations” and churn out images to be uploaded to an S3 bucket, stacked like cordwood. Then our communicator would parse an incoming prompt and fetch the “cached” image sequence from S3.
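
Enumerating the permutations is the easy part; a sketch:

// Enumerate every word/artist/medium combination for pregeneration
function* promptPermutations(words, artists, mediums) {
  for (const word of words)
    for (const artist of artists)
      for (const medium of mediums)
        yield `${word} in the style of ${artist}, ${medium}`;
}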

This feels… fine, I guess, but also kind of a cop-out, and not as cool as a “realtime diffusion pictionary game”.

Route 2: Multi-game Sync

Our Discord bot would take the first channel to “click” New Round and kick off a sequence of generations. Any subsequent guild/channel would effectively join that existing round, though they would be unaware of it.

This is fine but does come with the caveat that channels cannot pick specific category packs, since everybody is locked into the same ongoing game.

How can I play it?

If you really really really want to play it, shoot me a message via Discord, wunderbaba#0001 and I’ll invite you to my guild to try it out.

All the code (such as it is) is fully available on my GitHub. It was written as a proof of concept, and is about as robust as Samuel L. Jackson’s character from the movie Unbreakable.

Credits

This project would not have been possible without the tireless efforts of the AI open-source community at large. Big shoutout to AUTOMATIC1111 who currently hosts one of the best Gradio UI/UX frontends for running Stable Diffusion.


Strangest Things

At this point, a lot of people have probably seen the Cthulhu-style dread-horror series Stranger Things. If not, that is Lemon Grab unacceptable, but I’ll let it slide for now. You will still be able to follow the gist of this blog post.

As I was catching up on the second season of Stranger Things, there was a brief scene involving Sean Astin’s character, who as an actor is probably one of the most quintessentially 80’s people still in existence [1]. Astin needed to regain control of a building’s security system in order to unlock the entrances. To do so, he had to brute-force a four-digit PIN… by writing a computer program in BASIC.

For readers unfamiliar with BASIC, it was a high-level programming language that was ported, in varying degrees of standardization, to almost every computer system imaginable, from Commodores to IBMs to Apples. Like Prince of Persia, you could find BASIC everywhere – my personal favorite version was BBQ-BASIC, which is what the George Foreman grill’s firmware was written in. The idea was that you could *theoretically* write a single BASIC program with a high level of cross-platform support.

Naturally, I paused the show and immediately took a screenshot of what looked like the finished BASIC program.

By TV and film standards, this is relatively legit BASIC code simulating a brute-force attack, though there’s a bit of handwaving going on here. You can see it checks the password against a subroutine called checkPasswordMatch, but that subroutine isn’t actually defined anywhere. Additionally, since FourDigitPassword has been dimensioned as an integer, it won’t be padded correctly; e.g., getFourDigits(0, 0, 0, 1) would cast to 1.

Here’s a recreation of the code with some minor modifications to allow it to run reasonably correctly.
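
Something along these lines, with a hard-coded stand-in PIN in place of the undefined checkPasswordMatch:

CLS
FOR D1 = 0 TO 9
  FOR D2 = 0 TO 9
    FOR D3 = 0 TO 9
      FOR D4 = 0 TO 9
        ' Build the PIN as a string so leading zeros survive
        A$ = LTRIM$(STR$(D1)) + LTRIM$(STR$(D2)) + LTRIM$(STR$(D3)) + LTRIM$(STR$(D4))
        PRINT "Trying "; A$
        ' Crude busy-wait; SLEEP only accepts whole seconds
        FOR T = 1 TO 50000: NEXT T
        ' "1234" is a stand-in for the show's PIN check
        IF A$ = "1234" THEN PRINT "Access granted: "; A$: END
      NEXT D4
    NEXT D3
  NEXT D2
NEXT D1
PRINT "No match found"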

Astute observers will note the dollar sign suffixed to the variable A. The Microsoft variants of BASIC used a dollar sign to indicate that a variable’s type was a string; otherwise, it would default to numeric.

There’s also a delay placed in the innermost loop, since on any halfway modern machine this code would otherwise execute instantaneously. We can’t really use the SLEEP command, since it only takes an integer number of seconds. The solution is either to add an arbitrarily long busy-wait for-loop, or to build the program as an Electron app. (BURN BURN BURN TO THE GROUND)

And finally, this is what the program looks like when executed:

Now, normally you might question BASIC as a practical language for pen-testing purposes, but as can be seen from this short CSI clip, I think it’s fairly evident that BASIC has always had a rich history of being employed by hackers.

The first thing you learn as a software engineer is that a button’s functionality is directly determined by its label.

How did you learn BASIC?

Like most people, I learned how to program in BASIC from a mustachioed, machete-wielding British gentleman whilst on safari. If I remember my history correctly, it was Livingstone who first sighted a copy of QBasic in the wild during his expeditions through the dark continent of Africa.

And I don’t want to brag, but I had a copy of Microsoft BASIC PDS 7.1 (Professional Development System), and that was major nerd bling with its OS/2 compatibility. I still list it on my CV under skills, alongside my D&D character level and my high score in Galaga.

How do I get started?

The fastest way to get set up is to download QB64, available for Windows, Linux, and Mac. Be aware that it’s not entirely faithful to the original, since QB64 produces compiled executables and does not (to my knowledge) support running line-by-line in an interpretive fashion, which is part of what made QB seem so incredibly magical; it was an editor, compiler, documentation, and sandboxed environment all at once. Worry not, however: all your favorite command statements are still there (I’m looking at you, BEEP). Alternatively, if you’re willing to do a little more legwork, you could also install QuickBASIC 4.5 using DOSBox.

[1] With the exception of RGB2 from Regular Show who could only survive by breathing cans of air from the 1980s.


Specular Realms founded.

With blood, sweat and tears, this company is finally off the ground.

Now let the money start rolling in, baby, because I have the combined business acumen of the Underpants Gnomes and Sam the Eagle.

Sam the Eagle