03 – Captions and the COCO Dataset

The first thing to know about an image captioning model is how it's trained. Your model will learn from a dataset of images, each paired with captions that describe the content of that image. Say you're asked to write a caption that describes this image; how would you approach the task? First, you might look at the image and take note of a bunch of different objects, like different people, kites, and the blue sky. Then, based on how these objects are placed in the image and their relationship to each other, you might infer that these people are flying kites. They're in a big grassy area, so they may also be in a park. After collecting these visual observations, you could put together a phrase that describes the image: "People flying kites in a park." You used a combination of spatial observation and sequential text description to write a caption, and this is exactly the kind of flow we'll aim to create in a captioning model that uses CNN and RNN architectures. A common dataset used to train captioning models is the COCO dataset. COCO stands for Common Objects in Context, and it contains a large variety of images. Each image has a set of about five associated captions, and you can see a few examples of those captions here. Next, you'll get a chance to explore this data on your own.
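To make the image-caption pairing concrete, here is a minimal sketch of the kind of structure COCO's caption annotations use: a list of images and a list of annotations, where each annotation links a caption to an image by ID. The IDs, file names, captions, and the `captions_for` helper below are all made up for illustration (in practice you would load the real annotation files, e.g. with the `pycocotools` library), but the shape mirrors the idea that each image carries about five human-written captions.

```python
# Illustrative COCO-style annotation structure (not real COCO data).
# Each annotation ties one caption to one image via "image_id".
coco_style_annotations = {
    "images": [
        {"id": 1, "file_name": "kites.jpg"},
    ],
    "annotations": [
        {"image_id": 1, "caption": "People flying kites in a park."},
        {"image_id": 1, "caption": "Several kites soar over a grassy field."},
        {"image_id": 1, "caption": "A crowd gathers outdoors to fly kites."},
        {"image_id": 1, "caption": "Colorful kites fill a blue sky above a park."},
        {"image_id": 1, "caption": "People standing on grass watching kites."},
    ],
}

def captions_for(image_id, data):
    """Collect every caption attached to one image (about five per image in COCO)."""
    return [a["caption"] for a in data["annotations"] if a["image_id"] == image_id]

print(len(captions_for(1, coco_style_annotations)))  # 5 captions for this image
```

Training pairs for a captioning model are then simply (image, caption) combinations drawn from this mapping, so one image contributes several training examples.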
