Segment Anything Model

Look, SAM! I see everything!

Facebook (I can never bring myself to say Meta unironically) finally descended from its castle in the Metaverse and entered the AI discourse. And boy did they do it in style.

SAM

FB released a paper describing their new Segment Anything Model, or SAM. This is an image segmentation model: given an image as input, it detects every distinct object in the image and produces a mask for each one, so that the different objects can be highlighted in different colors.

Wait, don’t we have such models already?

Not really. Historically, segmentation models have been trained to detect certain types of objects, and they do a decent job of it, but only for those objects. Person/face detection models are in common use, for example, and hundreds of bespoke models exist that detect anything from cars to tumors. Models trained to segment anything and everything in an image, however, haven’t worked nearly as well.

The really impressive thing about SAM is that it does zero-shot segmentation, i.e. it is not specific to an object type but segments any and every object in the image. SAM beats not only other general models, but even many of the bespoke models that segment specific objects. The Segment Anything Model is built around what the authors call “promptable segmentation”, which they credit as the secret to the model’s success. Let’s unpack this.

Intuition 1 : Text is not the only prompt

If you are familiar with ChatGPT or Stable Diffusion, you know these models produce their output when prompted by a text input.

The SAM model can be prompted in four different ways - using 1) a pointer, 2) a bounding box, 3) text or 4) a mask.

The pointer prompt is a point that marks a spot in the image, and the model’s task is to segment the object that contains this point. Similarly, the box prompt is a bounding box containing an object, and the model has to segment that object. Text is a prompt that describes the object the model should segment. The mask is a dense prompt: a rough mask of an object that the model has to refine into a precise segmentation.
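To make the prompt types concrete, here is a minimal sketch of point and box prompting, assuming the SamPredictor API from Meta’s open-source segment-anything repo; the checkpoint path, image and coordinates are placeholders.

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load a pretrained SAM checkpoint (the path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

# Stand-in for a real photo: an HxWx3 uint8 RGB array.
image = np.zeros((480, 640, 3), dtype=np.uint8)
predictor.set_image(image)

# 1) Pointer prompt: segment the object containing this point.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),  # (x, y) pixel coordinates
    point_labels=np.array([1]),           # 1 = foreground point
)

# 2) Box prompt: segment the object inside this bounding box.
masks, scores, _ = predictor.predict(
    box=np.array([100, 100, 400, 400]),   # (x0, y0, x1, y1)
)
```

Each call returns candidate masks along with the model’s confidence score for each.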

This paper shows how almost any constraint can be used as a prompt; I am sure we are going to see a large variety of prompt types beyond just text in the future. For convenience I am going to use pointers in the rest of the post, but the same logic applies to the other prompt types.

Intuition 2 : Man+Machine, Labels Extreme

The authors started with a dataset of images annotated with around 20 masks per image and ended up with a model that generates, on average, 100 masks per image. How can a model be trained to surpass its training data? In three easy steps -

Step 1 : First they trained versions of SAM on both public segmentation datasets and a small dataset (120k images) they collected with human-annotated masks.

Step 2 : Next, they collected another small set of images (180k images). These they annotated first using the SAM trained in Step 1, to get all the confident masks out of the way. Then they asked human labelers to annotate the ambiguous and low-confidence objects that SAM missed.

Step 3 : The final version of SAM was trained on the combined dataset from Steps 1 and 2. Because this final version was a) trained on this mix of labels, including the low-confidence objects, and b) used an additional prompting technique during inference, it vastly outperformed all previous versions. They ran it on a large dataset of 11M images and found that it returned on average 100 masks per image, despite humans having labeled only approximately 20 masks per image on less than 4% of the data.
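In code, one round of the Step 2 model-assisted labeling might look something like the sketch below. This is schematic, not the paper’s pipeline: model_predict and human_annotate are hypothetical callables standing in for SAM inference and the human labeling workflow.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Mask:
    pixels: object       # a boolean array in a real system
    confidence: float    # the model's confidence in this mask

def assisted_labeling_round(model_predict: Callable, human_annotate: Callable,
                            images, threshold: float = 0.9):
    """Keep the model's confident masks automatically; hand only the
    ambiguous, low-confidence leftovers to human labelers."""
    dataset = []
    for img in images:
        auto = [m for m in model_predict(img) if m.confidence >= threshold]
        manual = human_annotate(img, existing=auto)  # humans fill the gaps
        dataset.append((img, auto + manual))
    return dataset
```

The payoff is leverage: the model handles the easy cases, so the human labeling budget is spent entirely on the hard examples the model needs most.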

Now, if you were paying attention, you will have noticed that I snuck an “additional prompting technique” in there. Well, you have a good eye, because that is the most important intuition to understand.

Intuition 3 : Promptable models enable better promptable models

SAM is more powerful than any other segmentation model because it is promptable. In my earlier post “Infusing imagination into intelligence” I discuss how text prompts make text-to-image models imaginative. But segmentation is not an imaginative task; here we value accuracy and coverage. So how does prompting help?

During the training of SAM, the authors use the human-annotated labels to create prompts. For example, they can pick a random point from a human-annotated mask and use it as a pointer prompt. This trains the model to segment whatever object the pointer lies on. When using SAM to make predictions on new images, the trick is to apply a special prompt: a grid of pointers covering the whole image. The model then tries to find an object at every pointer and returns a mask for each one. A post-processing step merges all the overlapping masks - if two pointers lie on the same object, their segmentation masks are merged - and voilà, you are left with one distinct mask for every single object in the image.
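Here is a simplified sketch of that inference trick, again assuming the SamPredictor API. The merge step below is a crude IoU-based deduplication; the released repo packages a more careful version of this loop (with confidence filtering and non-maximum suppression) as SamAutomaticMaskGenerator.

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    # Overlap between two boolean masks, as intersection-over-union.
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def segment_everything(predictor, image, points_per_side=32, iou_thresh=0.9):
    """Prompt SAM with a grid of pointers covering the whole image,
    then collapse near-duplicate masks into one mask per object."""
    predictor.set_image(image)
    h, w = image.shape[:2]
    xs = np.linspace(0, w - 1, points_per_side)   # evenly spaced grid
    ys = np.linspace(0, h - 1, points_per_side)
    kept = []
    for y in ys:
        for x in xs:
            masks, scores, _ = predictor.predict(
                point_coords=np.array([[x, y]]),
                point_labels=np.array([1]),       # 1 = foreground point
                multimask_output=True,
            )
            best = masks[np.argmax(scores)]       # highest-confidence mask
            # Two pointers on the same object yield near-identical masks;
            # keep only the first copy of each object.
            if all(mask_iou(best, m) < iou_thresh for m in kept):
                kept.append(best)
    return kept
```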

Let’s summarize the above magic trick to make it clearer.

What you want : a model that segments any and every object in an image

What you need : large dataset with images that have every object segmented

What you have : small dataset of images with some objects segmented

Trick

The Pledge : find a prompt that is easy to generate, e.g. pointers

The Turn : during training, generate pointer prompts derived from the labels (a sketch follows this list) and train the model to detect the object that has a pointer prompt on it.

The Prestige : during inference, put lots of pointer prompts all over the input image; the model will automatically detect every object in the image, because it is trained to detect any object that has a pointer in it.
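And here is what “The Turn” might look like in code: a schematic of sampling a pointer prompt from a ground-truth mask during training (my own sketch, not the paper’s implementation).

```python
import numpy as np

def point_prompt_from_mask(gt_mask: np.ndarray, rng=None):
    """Turn a human-annotated mask into a training-time pointer prompt
    by picking a random pixel inside the object."""
    rng = rng or np.random.default_rng()
    ys, xs = np.nonzero(gt_mask)           # pixels belonging to the object
    i = rng.integers(len(xs))
    coords = np.array([[xs[i], ys[i]]])    # (x, y) pointer location
    labels = np.array([1])                 # 1 = foreground point
    return coords, labels
```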

Beautiful!

My comments and conjectures

I am totally floored by the elegance and power of prompting demonstrated in this paper. Deep learning models learn a lot of implicit information during training, but that information is not easy to access. Prompting gives a model’s builders and users a way to guide it and tap into those latent capabilities.

A ‘normie’ segmentation model can already detect objects. But without a way to guide it, it just does what it sees most often in the training data. Increasing model capability therefore becomes the difficult challenge of creating a large dataset labeled with the exact task we want the model to do. Such models are also brittle, since they can only do one task. Prompting changes that: it gives us control over the model’s actions without changing the training data.

SAM shows that prompting is more general than previously thought. Firstly, anything can be a prompt. Secondly, different types of prompts confer different powers on the model. The vast complexity, interrelationships and ambiguity of language prompts let models extrapolate concepts, which is fantastic for imaginative tasks like image generation but not for tasks that require precision. In contrast, pointers are easy to replicate and precisely located, which can imbue a segmentation model with granular coverage and precision. We can imagine that other, more complex prompts could capture more complex concepts. And I bet that is how the human brain stores and retrieves concepts.

The smell of rain is technically known as “petrichor”. Weird, totally unfitting name.

Consider, for example, how a song or a smell can become associated with a complex memory or time of our lives. You might go years or decades without ever thinking about that memory, but catching that smell again or hearing that song triggers the memory and you remember it like it was yesterday. Such associations can also be negative, for example PTSD can cause patients to associate loud noises with pain or trauma even when they are no longer in the war zone. That smell, the song, loud noises - these are all prompts that our brain learnt to associate with a complex set of emotions and memories. As a result, when we are deciding our next action, we don’t just do what we always do. The ever-changing environment, our emotional/hormonal states, recent events and even the time of day - all act as a complex web of prompts that push our memories, emotions and actions in new interesting directions. This irreducible complexity of prompts is at least partly responsible for our creativity, imagination and free will. I conjecture that with the recent advances in prompting (or steerable conditional inference), we have unlocked an important piece in the fascinating puzzle that is intelligence.

————————————————————————————————————
There you have it, my intuitions on how the new Segment Anything paper demonstrates the power of prompting to unlock the capabilities of deep learning models and how it might be an important piece of the intelligence puzzle. For more such intuitions on AI/ML, subscribe and follow me on Twitter. You can also check out my other projects on nirsd.com.