We have all noticed the transformative impact of generative AI on tech-related industries in the last few years, from coding to customer service. As a technical editor at Mindrift, I have the privilege of a front-row seat to this change. At Mindrift, we train generative AI models for a variety of tasks, ranging from parsing code to interpreting images. It is this latter task – image processing, or 'visual recognition' – that I want to talk about.
The last few months I have spent working with Mindrift, teaching AI models to 'see' – teaching them to draw connections between the parts of a chart, the text on a poster, or the objects in an image – have given me new insights into the workings of human intelligence.
What we do at Mindrift
To explain how my work at Mindrift changed my outlook on human intelligence, I first need to go over what we do there. Large Language Models (LLMs) lie at the heart of generative AI. LLMs need to be trained on large sets of sample interactions before they can understand and answer questions asked in everyday language. My role as a technical editor on Mindrift's global team mainly involves developing and curating conversations about various kinds of data, which are then used to train LLMs.
Here's how it works: depending on the exact needs of a given project, we are given some sample content (CSV files, for instance, or a set of infographics) and we write conversations between a human user and an AI agent discussing this content. After editing and QA (quality assurance), these conversations serve as templates that show AI models how to interpret data and discuss it with the user in clear, accessible language.
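To make the workflow a little more tangible, here is a minimal sketch (in Python) of how one of these reviewed conversations might be represented as data. The field names and structure below are purely my own illustration – assumptions for the sake of the example, not Mindrift's actual schema or tooling.

# A purely illustrative sketch, not Mindrift's real schema: one way a
# reviewed training conversation about a piece of content could be stored.
from dataclasses import dataclass, field

@dataclass
class Turn:
    role: str   # "user" or "assistant"
    text: str   # what was said in this turn

@dataclass
class TrainingConversation:
    source_content: str                # e.g. the image or CSV file being discussed
    turns: list[Turn] = field(default_factory=list)
    qa_approved: bool = False          # flipped to True once quality assurance signs off

example = TrainingConversation(
    source_content="dog_on_grass.jpg",  # hypothetical file name
    turns=[
        Turn(role="user", text="What is this dog doing?"),
        Turn(role="assistant", text="The dog is lying on the grass with its tongue out..."),
    ],
    qa_approved=True,
)

Real projects vary in format, but the basic ingredients are the ones described above: the content being discussed, the turns of the conversation, and a QA sign-off.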
Writing conversations around images
The structure and content of the conversations used to train a given LLM strongly shape the resulting model's performance and tone, so how we build our training conversations matters a great deal.
Our decisions about what kinds of interactions to train the LLM on depend greatly on the kind of data being processed.
When it comes to 'visual recognition' projects, we want our trainee LLM to analyse and understand both the individual elements of an image and how those elements fit together: it needs to be able to distinguish a dog from the grass it is lying on, for example. The LLM also needs to understand fundamental visual concepts like colour, shape and perspective.
To make this concrete, here's an example of what a Mindrift conversation about an image might look like:
PROMPT: What is this dog doing? Can you guess anything about when this photo was taken?
RESPONSE: The dog is lying on the grass and its tongue is hanging out, perhaps because it is hot and the dog is trying to cool down. The grass is really healthy with a vibrant green colour and there is a lot of sunlight in the photo. There are flowers here and there. These details suggest that the photo was likely taken during spring or summer, when daylight is brightest and plants are at their most verdant.
PROMPT: How does this picture compare to the previous one?
RESPONSE: The other picture was a photograph, but this is a shot from a cartoon, where the dog is drawn in a stylised way. This dog is sitting, whereas the dog in the last picture was lying down. Instead of having its tongue out, this dog is holding a bone between its teeth. Instead of being outdoors, this dog is indoors.
Intuition and photos of dogs
As people, when we try to understand an image or draw conclusions from data, we have some level of intuition to rely on: we all know what a dog 'looks like', even if it might take us a minute to describe precisely what a dog is. With an AI, however, we cannot count on such intuition (yet).
This is where Mindrift's datasets come in.
The conversations we create are designed to give AI models a general (even intuitive) understanding of what a dog is. We achieve this by creating datasets large enough that the LLMs trained on these conversations encounter the many different contexts in which we talk about dogs. One dialogue might discuss a photo of a dog, another a cartoon drawing of one, and a third an infographic on the sales of different dog food brands.
LLMs can compare all these different conversations and work out what we mean by 'dog' from the diverse ways we refer to the idea (in images and in conversation). This is why LLMs are so powerful: they can infer knowledge from connections between data. The key point is that images or conversations by themselves are not enough – understanding comes from analysing how different images give rise to different conversations. And to do that, you need to be able to compare and infer, which is exactly what an LLM excels at.
Through this process, the AI model learns to distinguish a picture of a dog lying on grass from a picture of a sheep lying on grass, and to recognise the difference between a dog lying on grass and a dog sitting in a car. Two different users are talking about a dog in two different images: what do these images have in common? Where do they differ? How do these differences change the ensuing conversations? With sufficient complexity and enough examples, questions like these form the backbone of the visual recognition capabilities of modern generative AI models. Our data supplies the answers. In other words, this is how we teach LLMs to 'see the dog'.
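As a rough, back-of-the-envelope illustration of that comparison idea, the sketch below pairs up hypothetical (image, conversation) examples drawn from different contexts so that each pair can prompt a compare-and-contrast discussion. Every detail here – the file names, the contexts, the pairing logic – is an assumption I have made for illustration, not a description of Mindrift's actual pipeline.

from itertools import combinations

# Hypothetical examples: each pairs a visual with the kind of conversation
# written about it. The variety of contexts is the point.
examples = [
    {"visual": "photo_dog_on_grass.jpg",    "context": "photograph",  "topic": "dog"},
    {"visual": "cartoon_dog_with_bone.png", "context": "cartoon",     "topic": "dog"},
    {"visual": "dog_food_sales_chart.png",  "context": "infographic", "topic": "dog"},
    {"visual": "photo_sheep_on_grass.jpg",  "context": "photograph",  "topic": "sheep"},
]

# Turn every pair of examples into a compare-and-contrast prompt, nudging the
# model to articulate what two visuals share and how their differences change
# the conversation.
for a, b in combinations(examples, 2):
    print(
        f"Compare {a['visual']} ({a['context']}) with {b['visual']} ({b['context']}): "
        "what do they have in common, and how do the differences change the conversation?"
    )

Even in this toy form, it is the pairing step that mirrors the idea above: the comparisons, rather than any single image, carry the signal.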
Artificial intuition and the nature of intelligence
At Mindrift, however, we do not only teach the AI how to recognise objects in images; we also teach it to comment on them and on how they relate to the other components of a visual. From fundamental properties like lighting and perspective to subtle aspects like facial expressions, we discuss every part of an image – and we show the AI how to deduce complex information from these components. For instance, if there is a photo of a road with street signs, we might have the AI agent use the text on the signs to guess at the location of the road in the photo.
In creating such conversations, I inevitably found myself wondering how closely this process emulates the formation of our own visual intuition. As humans, we learn a lot by comparison and inference. The first time you saw a flag waving in the wind as a child, you probably didn't immediately know what it was. Yet if you see a colourful, rectangular piece of fabric attached to a long post in front of a building today, you'll probably think you are looking at a flag. Seeing a flag by itself is not enough – you need to see many flags in many places before you start to understand what a flag is.
This process parallels how LLMs are trained to develop visual recognition (by processing many conversations), and it is natural to ask how deep the similarity goes. The fact that developing LLM visual recognition requires not just processing images and discussions of images, but also comparing those images and discussions with each other, might tell us that our human habits of comparing, contrasting and inferring are fundamental to the very notion of intelligence. In other words, perhaps generative AI models are not merely emulating human intuition, but developing intuition from scratch. It is not inconceivable that, with enough depth and at a large enough scale, a modern-day AI's visual understanding will mimic or even truly mirror our own.
Conclusion
How human intelligence comes into being is presently an open question (in cognitive science, philosophy, neurology and myriad other fields), but imbuing LLMs with the capacity for visual recognition perhaps points to one way in which our own intuition came about: comparison. My work at Mindrift has been instrumental in giving me this insight, precisely because an AI cannot intuit things on its own.
Since AIs lack that intuition, we are training them to develop it, and my part in this training process has shifted how I view human visual (re)cognition and intuition.
Article by
Yigit Ozcelik