Informatics Institute

Published 10 November 2009

Focus on Research: computer scientist Cees Snoek

Photo: Bob Bronshoff

Trying to find a specific video clip on the internet can be a time-consuming affair. In many cases, images will not have been tagged with an adequate description. Computer scientist Cees Snoek is working to make video retrieval a lot easier. He is currently developing a search engine capable of recognising specific images. 'I'm working to translate pixels into text.'

Cees Snoek, a staff member at the Informatics Institute's Intelligent Systems Lab Amsterdam (ISLA), has spent the past few years developing a computer-based video recognition system. His efforts so far have certainly been successful: Snoek and his colleagues have consistently placed first in the annual competition in which universities and major commercial players in the sector participate. So just how does their MediaMill video search engine work?

The fragments found by the MediaMill video search engine are presented to the user in a clear overview by means of the CrossBrowser.

Four thousand distinguishing features

In order to recognise an object or setting in a photograph or video, a computer needs to know what it is looking for. This is why novel video search engines need a large number of learning examples. Snoek feeds the search engine a huge quantity of image fragments that can be linked to a specific search query. The search engine then assesses each image in terms of approximately 4000 distinguishing features, such as variations in colour, texture, and shape. Based on this analysis, the search engine determines a characteristic correlation between the specific distinguishing features and the search query entered by the user.

The statistical model derived from this analysis, known as a concept detector, can then be utilised to search an enormous database for other images corresponding to this model.
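The idea of a concept detector can be illustrated with a minimal sketch. The article does not say which statistical model MediaMill uses, so the nearest-centroid scoring below is a stand-in chosen for brevity, and the three toy features, the function names and the shot identifiers are all hypothetical. A real system would work with roughly 4000 features per image.

```python
import math

def centroid(vectors):
    """Average a list of equally long feature vectors, column by column."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def train_detector(positives, negatives):
    """Build a toy 'concept detector': one centroid per class (an assumption,
    not the model used by MediaMill)."""
    return centroid(positives), centroid(negatives)

def score(detector, features):
    """Higher score means closer to the positive examples than to the negative ones."""
    positive_centroid, negative_centroid = detector
    return math.dist(features, negative_centroid) - math.dist(features, positive_centroid)

def search(detector, database, top_k=3):
    """Rank every (shot_id, features) pair in the database and return the best ids."""
    ranked = sorted(database, key=lambda item: score(detector, item[1]), reverse=True)
    return [shot_id for shot_id, _ in ranked[:top_k]]

# Hypothetical toy features: [blueness, smooth texture, edge density]
boat_examples = [[0.9, 0.8, 0.2], [0.8, 0.7, 0.3]]
other_examples = [[0.1, 0.2, 0.9], [0.2, 0.3, 0.8]]
detector = train_detector(boat_examples, other_examples)

database = [
    ("shot_a", [0.85, 0.75, 0.25]),
    ("shot_b", [0.15, 0.25, 0.85]),
    ("shot_c", [0.60, 0.60, 0.40]),
]
results = search(detector, database, top_k=2)
print(results)  # ['shot_a', 'shot_c']
```

Because the score depends only on colour and texture style features, a shot showing plenty of blue, smooth water would rank highly whether or not it contains a boat, which is the kind of mistake described later in the article.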

Images that correlate with the model must then be presented to the user. This involves what is known as a CrossBrowser. The vertical axis shows the video fragments identified by the system, while the horizontal axis displays the timeline of a single video clip. This feature is extremely useful, as each video clip consists of a large number of individual shots. When the image search engine finds a suitable result, the shots preceding and following this result also tend to match the search criteria.
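The horizontal axis of the CrossBrowser can be thought of as a window of neighbouring shots around each hit. The sketch below only illustrates that idea; it is not the CrossBrowser's actual implementation, and the function name, the `window` parameter and the shot identifiers are hypothetical.

```python
def timeline_context(video_shots, hit, window=2):
    """Return the hit together with up to `window` shots before and after it,
    in timeline order. `video_shots` is the ordered list of shots in one clip."""
    index = video_shots.index(hit)
    start = max(0, index - window)
    return video_shots[start:index + window + 1]

# Hypothetical clip made up of six consecutive shots
clip = ["s1", "s2", "s3", "s4", "s5", "s6"]

middle = timeline_context(clip, "s3", window=1)
edge = timeline_context(clip, "s1", window=2)
print(middle)  # ['s2', 's3', 's4']
print(edge)    # ['s1', 's2', 's3']
```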

How do we go about recognising a Ferrari? Do we focus on the colour red, the shiny texture, or does shape play a crucial role in the recognition process? Surprisingly enough, current video recognition software mainly operates on the basis of colour and textural aspects, while largely ignoring shape.

Snoek offers a demonstration. He enters the relatively simple search query 'boat'. The programme successfully identifies a large number of boats from the enormous dataset. The search results also include an offshore drilling platform and a car negotiating a flooded road. 'As you can see, it has still made a few mistakes', Snoek admits. 'The software mainly focuses on texture and colour, whereas people tend to focus on shape. The software hasn't been developed to the point where it can apply this aspect as much as we'd like. On the whole, though, it is quite successful in picking out the right images.'

Search engine competitions

'The ability to search on the basis of images rather than having to depend on textual tags is incredibly useful', Snoek explains. This is clearly borne out by widespread interest in the problem: in addition to the ISLA search engine, some 50 teams from various research institutes, universities and companies are currently working to develop video search engines. There is even an annual competition.

Participants all use the same test set, for example an enormous quantity of video material from the Netherlands Institute for Sound and Vision archive. The objective is to carry out a specific search query as quickly and accurately as possible, such as identifying fragments that feature a kitchen. Snoek's approach works as follows. To begin with, he labels all shots from the set as either 'kitchen' or 'not kitchen'. He then divides the set into two parts, a test set and a training set. He uses the training set to identify how the 4000 image features correlate with the 'kitchen' label. This yields a concept detector, which is then applied to the test set. Finally, he verifies the model's accuracy: in what percentage of cases did the system actually identify a kitchen rather than, say, a bathroom? He also assesses how often the computer failed to identify images of a kitchen.
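The two checks at the end of this procedure correspond to what are commonly called precision (how many of the shots flagged as 'kitchen' really show a kitchen) and recall (how many of the real kitchen shots were found). The sketch below shows the split and both measures on made-up data. The function names, the 50/50 split and the example labels are assumptions for illustration, not details taken from the competition.

```python
import random

def split_shots(labelled_shots, train_fraction=0.5, seed=0):
    """Shuffle labelled shots and divide them into a training set and a test set."""
    shuffled = list(labelled_shots)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def precision_recall(predicted, actual):
    """Compare detector output with the hand-made labels of the test set."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)
    false_pos = sum(1 for p, a in pairs if p and not a)   # e.g. a bathroom flagged as kitchen
    false_neg = sum(1 for p, a in pairs if not p and a)   # a kitchen the detector missed
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall

# Hypothetical labelled set: (shot id, is_kitchen)
labelled = [(f"shot_{i}", i % 3 == 0) for i in range(12)]
train_set, test_set = split_shots(labelled)

# Hypothetical detector output on eight test shots
actual    = [True, True, True,  False, False, False, True, False]
predicted = [True, True, False, True,  False, False, True, False]
precision, recall = precision_recall(predicted, actual)
print(precision, recall)  # 0.75 0.75
```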

The 101 Lexicon

A time-consuming task

The process of labelling images from a training set is incredibly labour-intensive. Snoek spent the mornings and evenings of an entire summer month labelling pictures, arriving at a total of 101 individual categories. He then assessed the correlation between the number of learning examples and the system's performance. As it turned out, the system had relatively little difficulty finding 'boat' on the basis of a small number of examples. 'Mobile phone', however, proved to be a lot more difficult. 'The main problem is the image background. A boat is basically a hole in the water. Mobile phones, on the other hand, can be used anywhere and are more difficult to identify.'

Snoek and his group have proven extremely successful in recognising images. However, Snoek admits their success in certain areas of the competition cannot be attributed solely to the quality of the software. 'If you've spent entire summers tagging video material, as I have, you become extremely adept at recognising images; you're trained for the job. In other words, your success in retrieval also depends on the person behind the keyboard.'

Photo-sharing websites

In view of the extremely time-consuming nature of photo tagging, Snoek and his team are constantly trying to find new ways of tagging images. One such method involves using photo-sharing websites such as Flickr, where users tag their own photos. Snoek will then run a computer check to verify whether the user tags are suitable for his purposes. He can then apply these images to teach the software to search more efficiently. However, there are also a number of other methods available online, such as Google's Image Labeler game, and an application known as the ESP game, in which the user and another player have to tag an image as quickly as they can. If both users apply the same description, they will be awarded points and progress to the next image. This is a fun way of coming up with accurate image descriptions.
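The agreement rule behind such tagging games can be sketched in a few lines. This is a simplified illustration and not the actual rules or code of the ESP game or Google's Image Labeler: the real games are played against the clock, whereas the sketch simply compares two finished lists of guesses. The function names and the points value are hypothetical.

```python
def first_agreed_tag(tags_a, tags_b):
    """Return the first tag from player A that player B also entered, or None.
    Tags are compared case-insensitively and without surrounding spaces."""
    normalised_b = {tag.strip().lower() for tag in tags_b}
    for tag in tags_a:
        candidate = tag.strip().lower()
        if candidate in normalised_b:
            return candidate
    return None

def play_round(tags_a, tags_b, points_per_match=10):
    """Award points only when both players independently chose the same tag."""
    label = first_agreed_tag(tags_a, tags_b)
    return label, (points_per_match if label else 0)

label, points = play_round(["ship", "water", "Boat"], ["boat", "sea"])
print(label, points)  # boat 10

no_label, no_points = play_round(["car"], ["road", "flood"])
print(no_label, no_points)  # None 0
```

Requiring two independent players to agree is what makes the resulting tag more trustworthy than a single user's description.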

However, this method of information exchange has its drawbacks. 'It is becoming increasingly clear to us that learning examples derived from one dataset are not necessarily effective when applied to another dataset. For example, definitions derived from consumer photos on Flickr are often difficult to apply to images from the Institute for Sound and Vision archive.' As yet, Snoek has not been able to determine exactly why this is the case. 'It may have something to do with the fact that material taken from television is often filmed from the same camera positions, and lit in a specific way, while images on Flickr are much less consistent. In addition, the software tends to focus on the entire image, while humans tend to filter out a specific shape in the foreground. We're still working to find out exactly how this process works.'

Pinkpop

1 December will see the launch of a website showcasing the achievements of Snoek and his researchers. Visitors can search the site for images from Pinkpop music festival television broadcasts. 'Due to copyright issues, the search function will be limited to recordings of Dutch artists.' Users can provide feedback on the search engine's effectiveness, benefiting both parties: users can use the site to search for video material, while at the same time helping Snoek to improve his search engine. At the time of writing, it is still unclear how long it will take to develop the search engine for general use.

Source: Communicatie FNWI