This is an introductory post on how automatic image captioning works. The reader is not assumed to be proficient in machine learning. The post describes one of the methods used to generate image captions.
One of the most fascinating and challenging problems in machine learning is image captioning. Given an image, we want to automatically generate a natural language description of the image and its regions. This task is simple for humans, who readily form rich interpretations of what they see, but it is not simple for computers, which only understand binary and “see” the image as a matrix of pixels.
Since computers understand only numbers, we have to use computation to make them understand images. A computational model is an experimental device that helps us understand the world we live in through computation and lets us predict the future with some confidence.
Using computational models, we can represent an image, as well as language, as a bunch of numbers. For example:
When an image is given as input to the computational model, let us assume it outputs a list of numbers. When we give a different image as input, it outputs a different list of numbers.
Notice that representing the image as a bunch of numbers does not lose its original meaning: if two images are similar, their numerical representations are also similar. In the same way, when words are represented as bunches of numbers, the numbers representing the word ‘king’ should be very similar to the numbers representing the word ‘prince’. Likewise, if two images show a common activity, say eating, their representations should be similar.
All the words in the vocabulary should then be converted to such bunches of numbers so that the computer can capture the meaning of the language. From a computer science perspective, such a bunch of numbers, or more formally a list of numbers, is called a vector.
Mathematics enables us to see vectors from a geometrical perspective. For example, given two vectors, we can fix a point called the origin and visualize each vector as an arrow from the origin to the point whose coordinates are the vector's numbers.
With this new perspective, we can do more with vectors. Linear algebra is the branch of mathematics that deals with vectors and their geometrical properties. It is one of the many branches of mathematics on which machine learning depends heavily.
Now, similarity is an abstract notion. Linear algebra gives us techniques to quantify similarity, removing that abstractness. Once similarity is quantified, we can develop algorithms that answer questions about how similar two images, or two words, are.
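One common way to quantify the similarity of two vectors is cosine similarity, which measures the angle between them. It is offered here only as one possible measure, not necessarily the one used in any particular captioning system. The three-number word vectors below are invented purely for illustration.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between vectors u and v.

    A value close to 1.0 means the vectors point in nearly the same direction.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical word vectors, made up for this example.
king = [0.9, 0.8, 0.1]
prince = [0.8, 0.9, 0.2]
water = [0.1, 0.2, 0.9]

# 'king' should score higher against 'prince' than against 'water'.
print(cosine_similarity(king, prince))
print(cosine_similarity(king, water))
```

With these made-up numbers, the first score comes out much closer to 1 than the second, which matches the intuition that ‘king’ and ‘prince’ are related words.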
How does this help us understand captioning? First, let us focus on a simpler task: generating a single-word caption.
Suppose we have an image of a cat. After we convert the image to a vector, we can ask this question:
Which words in my vocabulary have vector representations closest to this vector?
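Answering this question amounts to a nearest-neighbour search over the vocabulary. Here is a minimal sketch, assuming a tiny hypothetical vocabulary with hand-made vectors and using cosine similarity as the closeness measure (an assumption; other measures would work too). In a real system both the image vector and the word vectors would come from trained models.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_word(image_vector, word_vectors):
    """Return the vocabulary word whose vector is most similar to image_vector."""
    return max(
        word_vectors,
        key=lambda word: cosine_similarity(image_vector, word_vectors[word]),
    )


# Hypothetical word vectors, invented for illustration.
word_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.0],
    "car": [0.0, 0.1, 0.9],
}

# Hypothetical vector that an image model might output for a cat photo.
cat_image_vector = [0.8, 0.2, 0.1]

caption = nearest_word(cat_image_vector, word_vectors)
print(caption)  # cat
```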
If our process of converting images and words to vectors is perfect, the vector of the word cat and the vector of the image of a cat must be close together, and hence the computer outputs the word cat. Now, how do we generate captions of several words?
A caption of several words must also be grammatically correct. But how can we make a computer understand the grammar of a language? One approach is to use a probabilistic model. This model relies on the Distributional Hypothesis, which states that words appearing in similar contexts tend to have similar meanings.
Using this core hypothesis, one can make probabilistic predictions of the next word given the preceding ones.
Example: suppose the model has learnt from a large corpus of text and its vocabulary is ['I', 'START', 'WATER', 'STOP', 'DRINK'], where START and STOP are special symbols that denote the start and end of a sentence respectively.
If it has learned the distribution of the words correctly, it will generate the sequence START I DRINK WATER STOP one word at a time, each time choosing a likely next word given the preceding ones.
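To make the example concrete, here is a minimal sketch of a bigram model, which predicts each word from only the single word before it. This is a deliberate simplification chosen for illustration; real captioning systems typically use far richer language models. The three-sentence corpus is made up, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["START"] + sentence.split() + ["STOP"]
        for previous, following in zip(tokens, tokens[1:]):
            counts[previous][following] += 1
    return counts


def generate(counts, max_len=10):
    """Greedily pick the most frequent next word until STOP is produced."""
    token = "START"
    output = [token]
    while token != "STOP" and len(output) < max_len:
        followers = counts.get(token)
        if not followers:
            break  # no known continuation for this word
        token = followers.most_common(1)[0][0]
        output.append(token)
    return output


# Tiny made-up corpus standing in for "a large corpus of text".
corpus = ["I DRINK WATER", "I DRINK WATER", "I DRINK"]

model = train_bigram_model(corpus)
sentence = generate(model)
print(" ".join(sentence))  # START I DRINK WATER STOP
```

Greedy selection always takes the single most frequent next word. Sampling from the counts instead would give more varied sentences.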
How can we make a computer learn the distribution of words?
As human beings, in kindergarten we were shown a picture of a cat and told that the object in the picture is called a cat. We then developed richer interpretations of the image, and whenever we see an object similar to the one we were shown, we infer that it is a cat.
This approach of learning by example is called supervised learning in machine learning. It is data driven: the more data we have, the better we can learn to distinguish cats from non-cats, provided the way we learn is good enough. Learning for computers is also a computational process, drawing on a variety of fields in mathematics such as optimization and multivariable calculus.
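As a rough illustration of learning from labeled examples, the sketch below trains a tiny logistic regression classifier with gradient descent, one simple optimization technique among many. The two-number "image vectors" and their labels are invented, and the learning rate and epoch count are arbitrary choices, not values taken from any real system.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Estimated probability that vector x is a cat."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train(examples, labels, learning_rate=0.5, epochs=500):
    """Fit weights and bias by gradient descent on the logistic loss."""
    weights = [0.0] * len(examples[0])
    bias = 0.0
    n = len(examples)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(examples, labels):
            error = predict(weights, bias, x) - y
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        # Step against the average gradient.
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


# Hypothetical labeled data: 1 means cat, 0 means non-cat.
examples = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3],
            [0.1, 0.9], [0.2, 0.8], [0.3, 0.7]]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train(examples, labels)
print(predict(weights, bias, [0.85, 0.15]))  # above 0.5: looks like a cat
print(predict(weights, bias, [0.15, 0.85]))  # below 0.5: looks like a non-cat
```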
To summarize, when we have a good computational model and a good labeled dataset, the vector representations of images and words become accurate. If these representations are accurate and we have a good language model, we can generate words probabilistically, depending on both the given image and our vocabulary.
In this way, we generate captions for an image using computational models. To make the concepts even clearer, I recommend watching this video.
Hope you enjoyed reading this post.
Edit: Part 2: Image Search Engine using Image Captioning