Before reading this post, I suggest having a look at the Introduction to automatic image captioning post if you are not familiar with how image captioning works. Assuming you are now comfortable with how captioning works, let's discuss the second part: caption matching.
Finding relevant documents for a user-given query has been studied for a long time, and the techniques used to solve this problem fall broadly under text mining.
Since we are dealing with vectors here, we need to convert our captions back into vectors somehow. An astute reader might question the purpose of converting the image vector to a caption if, in the end, we need vectors again. The reason is that the size of the image vector is fixed by our choice of CNN architecture, whereas with captions we can grow the vocabulary and capture similar images more precisely. Furthermore, we can use the Boolean model to discard any stored captions that cannot possibly match the generated query caption. This step is simple and fast, and it discards a lot of junk. Once we have the list of matching captions, we need to rank them by relevance.
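The Boolean filtering step can be sketched as follows. This is a minimal, hypothetical illustration (the function name and the tokenization by whitespace are my assumptions, not the repository's actual code): keep only the stored captions that share at least one term with the query caption.

```python
# Hypothetical sketch of the Boolean filtering step: discard any stored
# caption that shares no term with the generated query caption.
def boolean_filter(query_caption, captions):
    query_terms = set(query_caption.lower().split())
    return [c for c in captions
            if query_terms & set(c.lower().split())]

captions = ["A child playing football",
            "A dog sleeping on a couch",
            "Two men playing chess"]
matches = boolean_filter("man playing football", captions)
# "A dog sleeping on a couch" shares no term with the query, so it is dropped
```

Only the surviving captions move on to the (more expensive) relevance-ranking stage.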
We use a combination of tf-idf and the field length norm to evaluate the weight of a single term in a particular document. Let's discuss each of them.
Term Frequency (tf): How often does the term appear in this document? The more often, the higher the weight.
Inverse Document Frequency (idf): How often does the term appear in all documents in the collection? The more often, the lower the weight.
Field length norm: How long is the field? The shorter the field, the higher the weight. If a term appears in a short field, such as a title field, it is more likely that the content of that field is about the term than if the same term appears in a much bigger body field.
Multiplying these three factors gives the weight of the term under consideration. We calculate and store the weight of each term in the generated caption. To compare against a multi-term query, we use the vector space model. For the sake of simplicity, we can consider a vector to be a one-dimensional array of numbers.
For each caption in the list of matching captions, if the caption contains a term from the query caption, that term's weight is set in the caption's vector. For example, if a caption in our database is "A child playing football" and the query is "man playing football", the vector representation of that caption has nonzero components only for the shared terms "playing" and "football".
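A small sketch of that vector construction, with illustrative weights (the numbers below are made up for the example, not real tf-idf values):

```python
# Build the caption's vector over the query's terms: a term contributes its
# precomputed weight if the caption contains it, and 0.0 otherwise.
query_terms = ["man", "playing", "football"]
weights = {"playing": 0.42, "football": 0.65}   # assumed weights for shared terms
caption_terms = {"a", "child", "playing", "football"}

vector = [weights.get(t, 0.0) if t in caption_terms else 0.0
          for t in query_terms]
# "man" is absent from the caption, so its component stays 0.0
```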
We repeat this procedure for all the matching captions to generate their vector representations, then use cosine similarity to compute a similarity score for each.
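Cosine similarity itself is standard: the dot product of the two vectors divided by the product of their magnitudes. A minimal sketch, using illustrative vectors that continue the "man playing football" example (the component values are assumptions for demonstration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b, in [0, 1] for nonnegative weights."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

query_vec = [0.90, 0.42, 0.65]    # weights of "man", "playing", "football"
caption_vec = [0.00, 0.42, 0.65]  # "A child playing football": no "man"
score = cosine_similarity(query_vec, caption_vec)
```

Ranking the matching captions by this score, highest first, gives the final ordering of results.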
Check out sethuiyer/Image-to-Image-search for the implementation of the reverse image search engine. The following slide presents an overview of these two blog posts.