Four-panel Grep and Grok cartoon about reading a hot dog sign as if it meant a dog.
Reading text in an image is not image understanding.

Episode

Image Recognition

Grep reads the visible word and follows it literally. Grok looks at the stand, the bun, the menu, and the situation.

The point is that reading text in an image is not the same as understanding an image.

Grep
Reads the visible word.
Grok
Reads the scene around the word.