Open Annotations

High-quality Computer Vision Open Datasets


ImageNet

ImageNet is an image dataset for developing new algorithms, organised according to the WordNet hierarchy, in which each node of the hierarchy is depicted by hundreds or thousands of images.

Downloading the datasets isn’t instant, though: you’ll have to register on the site, hover over the ‘download’ menu dropdown, then select ‘original images’. Provided you’re using the datasets for educational or personal use, you can apply for access to download the original images.

ImageNet is also currently running a competition on Kaggle.

About ImageNet

Welcome to the ImageNet project! ImageNet is an ongoing research effort to provide researchers around the world an easily accessible image database. On this page, you will find some useful information about the database, the ImageNet community, and the background of this project. Please feel free to contact us if you have comments or questions. We’d love to hear from researchers on ideas to improve ImageNet.

What is ImageNet?

ImageNet is an image dataset organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a “synonym set” or “synset”. There are more than 100,000 synsets in WordNet; the majority of them (80,000+) are nouns. In ImageNet, we aim to provide on average 1,000 images to illustrate each synset. Images of each concept are quality-controlled and human-annotated. Upon completion, we hope ImageNet will offer tens of millions of cleanly sorted images for most of the concepts in the WordNet hierarchy.
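The synset hierarchy can be pictured as each concept pointing to its hypernym (its parent concept). A toy sketch of that idea, using hypothetical data rather than the real WordNet database:

```python
# Toy WordNet-style hierarchy (hypothetical data, NOT the real WordNet):
# each synset maps to its hypernym (parent concept).
hypernym = {
    "corgi.n.01": "dog.n.01",
    "dog.n.01": "canine.n.02",
    "canine.n.02": "carnivore.n.01",
    "carnivore.n.01": "mammal.n.01",
}

def hypernym_path(synset):
    """Walk up the hierarchy, collecting the synset and all its ancestors."""
    path = [synset]
    while synset in hypernym:
        synset = hypernym[synset]
        path.append(synset)
    return path

print(hypernym_path("corgi.n.01"))
# ['corgi.n.01', 'dog.n.01', 'canine.n.02', 'carnivore.n.01', 'mammal.n.01']
```

In ImageNet, each node of such a path would be illustrated by its own set of images, so images of a corgi also count as examples of the broader “dog” and “mammal” concepts.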

Category: Image Processing


MNIST

This is one of the most important databases for deep learning. Researchers from Microsoft and Google labs reportedly contributed to this dataset of handwritten digits. It was constructed from NIST databases that contain binary images of handwritten digits.

The MNIST database of handwritten digits, available from this page, has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.

It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting.

The MNIST database was constructed from NIST’s Special Database 3 and Special Database 1, which contain binary images of handwritten digits. NIST originally designated SD-3 as their training set and SD-1 as their test set. However, SD-3 is much cleaner and easier to recognize than SD-1, because SD-3 was collected among Census Bureau employees, while SD-1 was collected among high-school students. Drawing sensible conclusions from learning experiments requires that the result be independent of the choice of training and test sets among the complete set of samples. It was therefore necessary to build a new database by mixing NIST’s datasets.

The MNIST training set is composed of 30,000 patterns from SD-3 and 30,000 patterns from SD-1. The test set is composed of 5,000 patterns from SD-3 and 5,000 patterns from SD-1. The 60,000-pattern training set contains examples from approximately 250 writers, and the sets of writers of the training set and test set are disjoint.
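The files on the MNIST page use a simple IDX binary format documented there: big-endian 32-bit integers giving a magic number, then the dimensions, followed by raw pixel bytes. A minimal header parser, run here on a synthetic header rather than a downloaded file:

```python
import struct

def parse_idx_image_header(data):
    """Parse the 16-byte header of an MNIST IDX image file.

    The format is big-endian: magic number 2051, then image count,
    rows, and columns.
    """
    magic, count, rows, cols = struct.unpack(">IIII", data[:16])
    assert magic == 2051, "not an IDX image file"
    return count, rows, cols

# Synthetic header for a file claiming 60,000 images of 28x28 pixels.
header = struct.pack(">IIII", 2051, 60000, 28, 28)
print(parse_idx_image_header(header))  # (60000, 28, 28)
```

The pixel data that follows the header is just `count * rows * cols` unsigned bytes, one per pixel, so the rest of the file can be read with a single slice.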

Category: Image Processing

COCO (Common Objects in Context)

COCO is a large-scale object detection, segmentation, and captioning dataset. COCO has several features:

  • Object segmentation
  • Recognition in context
  • Superpixel stuff segmentation
  • 330K images (>200K labeled)
  • 1.5 million object instances
  • 80 object categories
  • 91 stuff categories
  • 5 captions per image
  • 250,000 people with keypoints
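COCO annotations ship as a single JSON file whose `images`, `annotations`, and `categories` arrays are linked by IDs. A tiny hand-made sample in that layout (only a few of the real fields are shown), indexed so that all objects in one image are easy to fetch:

```python
import json
from collections import defaultdict

# Hand-made sample in the COCO annotation layout (the real files are
# e.g. instances_train2017.json); bbox is [x, y, width, height].
coco = json.loads("""{
  "images": [{"id": 1, "file_name": "000000000001.jpg"}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 18, "bbox": [10, 20, 50, 40]},
    {"id": 11, "image_id": 1, "category_id": 1,  "bbox": [5, 5, 30, 60]}
  ],
  "categories": [{"id": 18, "name": "dog"}, {"id": 1, "name": "person"}]
}""")

# Index annotations by image_id so each picture's objects are grouped.
by_image = defaultdict(list)
for ann in coco["annotations"]:
    by_image[ann["image_id"]].append(ann)

names = {c["id"]: c["name"] for c in coco["categories"]}
for ann in by_image[1]:
    print(names[ann["category_id"]], ann["bbox"])
```

The same ID-linking pattern applies to the captioning and keypoint annotation files, which reuse the `images` array and differ only in their `annotations` entries.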

Category: Image Processing


YouTube-8M

The YouTube-8M Segments dataset, released in June 2019, extends the YouTube-8M dataset with human-verified segment-level annotations. In addition to annotating videos, the goal is to temporally localize the entities in the videos, i.e., find out when the entities occur.

It consists of human-verified labels on about 237K segments on 1000 classes from the validation set of the YouTube-8M dataset. Each video will again come with time-localized frame-level features so classifier predictions can be made at segment-level granularity.
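As a sketch of segment-level prediction (this is not the official YouTube-8M starter code, and the array shapes are made up for illustration), the frame-level features inside a labeled segment can be mean-pooled into a single vector for a classifier to score:

```python
import numpy as np

# Stand-in for one video's frame-level features: 300 frames, 128-dim each.
# (Real YouTube-8M features are 1024-dim visual + 128-dim audio per frame.)
rng = np.random.default_rng(0)
frame_features = rng.normal(size=(300, 128))

def segment_feature(features, start, end):
    """Mean-pool the frames of one segment [start, end) into one vector."""
    return features[start:end].mean(axis=0)

# Score would be computed per segment rather than per video.
seg = segment_feature(frame_features, start=120, end=145)
print(seg.shape)  # (128,)
```

Because the features are time-localized per frame, the same video can yield many such segment vectors, each receiving its own classifier prediction.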

YouTube-8M Dataset

Category: Image Processing

Yelp Reviews

The Yelp dataset is a subset of Yelp’s businesses, reviews, and user data for use in personal, educational, and academic purposes. Available as JSON files, it can be used to teach students about databases, to learn NLP, or as sample production data while you learn how to make mobile apps.

The dataset contains 1,223,094 tips from 1,637,138 users, over 1.2 million business attributes such as hours, parking, availability, and ambience, and aggregated check-ins over time for each of the 192,609 businesses.
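The Yelp files are in JSON Lines format: one JSON object per line. A tiny made-up sample in the style of the business file (with most fields omitted) shows how little code the format needs:

```python
import json

# Two made-up records in the JSON Lines layout of Yelp's business file;
# each line is a complete, independent JSON object.
sample = """\
{"business_id": "b1", "name": "Cafe A", "attributes": {"WiFi": "free"}, "review_count": 12}
{"business_id": "b2", "name": "Diner B", "attributes": {"Parking": true}, "review_count": 45}
"""

# Parse line by line; on the real multi-gigabyte files you would stream
# the file object instead of holding everything in memory.
businesses = [json.loads(line) for line in sample.splitlines()]
total_reviews = sum(b["review_count"] for b in businesses)
print(total_reviews)  # 57
```

Streaming one line at a time is the reason JSON Lines works well here: the full businesses file does not need to fit in memory to be processed.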

Yelp Reviews Dataset

Category: Natural language processing


LibriSpeech

The LibriSpeech ASR corpus is a large-scale (1,000-hour) corpus of read English speech derived from audiobooks in the LibriVox project. It comes with prepared language-model training data and pre-built language models.

The corpus contains 1,000 hours of speech sampled at 16 kHz. It is distributed via OpenSLR, which also hosts many other speech and language resources, such as yes-no recordings, a Danish pronunciation dictionary, Spanish word lists, and recordings of African Accented French speech.
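LibriSpeech audio is 16 kHz, 16-bit mono. This standard-library sketch writes and reads one second of audio in that format (the tone itself is just placeholder content, not LibriSpeech data):

```python
import io
import math
import struct
import wave

RATE = 16000  # 16 kHz, the LibriSpeech sampling rate

# Write one second of a 440 Hz tone as 16-bit mono PCM, in memory.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)      # mono
    w.setsampwidth(2)      # 16-bit samples
    w.setframerate(RATE)
    samples = (int(20000 * math.sin(2 * math.pi * 440 * t / RATE))
               for t in range(RATE))
    w.writeframes(b"".join(struct.pack("<h", s) for s in samples))

# Read it back and confirm the format: one second at 16 kHz = 16,000 frames.
buf.seek(0)
with wave.open(buf, "rb") as w:
    rate, frames = w.getframerate(), w.getnframes()
print(rate, frames)  # 16000 16000
```

The same arithmetic scales up directly: 1,000 hours at 16 kHz is 1000 × 3600 × 16000 samples, which is why the corpus is distributed as compressed FLAC rather than raw PCM.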

LibriSpeech Open SLR

Category: Natural language processing


CIFAR-10

The CIFAR-10 dataset consists of 60,000 32×32 colour images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images.

The dataset is divided into five training batches and one test batch, each with 10,000 images. The test batch contains exactly 1,000 randomly selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5,000 images from each class.
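The batch files are Python pickles in which each image is a flat 3,072-value row: 1,024 red values, then 1,024 green, then 1,024 blue, each in 32×32 row-major order. Rather than reading a real batch, this sketch builds a stand-in row to show how that layout unflattens:

```python
import numpy as np

# Stand-in for one 3072-byte CIFAR-10 batch row (real rows come from
# unpickling a batch file and taking one row of its "data" array).
row = (np.arange(3072) % 256).astype(np.uint8)

# Channel-first reshape (3, 32, 32), then move channels last for display:
# first 1024 values are red, next 1024 green, last 1024 blue.
img = row.reshape(3, 32, 32).transpose(1, 2, 0)
print(img.shape)  # (32, 32, 3)
```

Getting this reshape order wrong is a classic CIFAR-10 bug: `row.reshape(32, 32, 3)` runs without error but interleaves the channels incorrectly.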


Category: Image Classification

Open Images

Open Images is a dataset of ~9M images annotated with image-level labels, object bounding boxes, object segmentation masks, and visual relationships. It contains a total of 16M bounding boxes for 600 object classes on 1.9M images, making it the largest existing dataset with object location annotations.

The boxes have been largely manually drawn by professional annotators to ensure accuracy and consistency. The images are very diverse and often contain complex scenes with several objects (8.3 per image on average). Open Images also offers visual relationship annotations, indicating pairs of objects in particular relations (e.g. “woman playing guitar”, “beer on table”). In total it has 329 relationship triplets with 391,073 samples. In V5 we added segmentation masks for 2.8M object instances in 350 classes. Segmentation masks mark the outline of objects, which characterizes their spatial extent to a much higher level of detail. Finally, the dataset is annotated with 36.5M image-level labels spanning 19,969 classes.
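Open Images box annotations are distributed as CSV files whose coordinates are normalized to [0, 1] (columns include `XMin`, `XMax`, `YMin`, `YMax`), so they must be scaled by each image's size before drawing. A sketch on a made-up row:

```python
import csv
import io

# One made-up row in the Open Images box-annotation CSV layout;
# coordinates are fractions of the image width/height.
sample = """ImageID,LabelName,XMin,XMax,YMin,YMax
img001,/m/04yx4,0.10,0.55,0.20,0.90
"""

def to_pixels(row, width, height):
    """Scale one normalized box to pixel coordinates for a given image size."""
    return (round(float(row["XMin"]) * width),
            round(float(row["XMax"]) * width),
            round(float(row["YMin"]) * height),
            round(float(row["YMax"]) * height))

for row in csv.DictReader(io.StringIO(sample)):
    print(row["ImageID"], to_pixels(row, width=1024, height=768))
# img001 (102, 563, 154, 691)
```

Normalized coordinates are what let the same annotation file serve every resized copy of an image; the scaling step above is the only per-image work required.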

Open Images Dataset

Category: Image Classification


Kinetics-700

Kinetics-700 is a large-scale, high-quality dataset of URL links to approximately 650,000 video clips covering 700 human action classes, including human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands and hugging. Each action class has at least 600 video clips, and each clip is human-annotated with a single action class and lasts around 10 seconds.

Kinetics 700 – Paper


Category: Image Classification


Cityscapes

Cityscapes is a large-scale dataset that contains a diverse set of stereo video sequences recorded in street scenes from 50 different cities, with high-quality pixel-level annotations of 5,000 frames in addition to a larger set of 20,000 weakly annotated frames. The dataset is thus an order of magnitude larger than similar previous attempts. Details on the annotated classes and examples of the annotations are available on the Cityscapes website.

The Cityscapes Dataset is intended for

  1. assessing the performance of vision algorithms for major tasks of semantic urban scene understanding: pixel-level, instance-level, and panoptic semantic labeling;
  2. supporting research that aims to exploit large volumes of (weakly) annotated data, e.g. for training deep neural networks.
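A common preprocessing step for pixel-level labels like these is remapping raw label IDs to a contiguous set of training IDs, with classes excluded from evaluation sent to an ignore value. A minimal sketch using a few id-to-trainId pairs from the official cityscapesScripts label table (road 7→0, sidewalk 8→1, building 11→2):

```python
import numpy as np

# Lookup table: raw Cityscapes label ID -> training ID; 255 marks classes
# that are ignored during training. Only three of the real pairs are shown
# (road 7->0, sidewalk 8->1, building 11->2, per cityscapesScripts).
id_to_train = np.full(34, 255, dtype=np.uint8)
for label_id, train_id in [(7, 0), (8, 1), (11, 2)]:
    id_to_train[label_id] = train_id

# Toy 2x3 "label image"; real Cityscapes labelId maps are 2048x1024 PNGs.
label_map = np.array([[7, 7, 8], [11, 0, 7]], dtype=np.uint8)
train_map = id_to_train[label_map]  # vectorized per-pixel remapping
print(train_map)
```

Indexing the lookup table with the whole array remaps every pixel at once, which is why this table-based approach is preferred over per-pixel loops for megapixel label maps.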

Category: Image Classification