ultralytics 8.0.149
add Open Images V7 training (#4178)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: AdiEcho <30563671+AdiEcho@users.noreply.github.com>
@ -8,6 +8,11 @@ keywords: Argoverse dataset, autonomous driving, YOLO, 3D tracking, motion forec
The [Argoverse](https://www.argoverse.org/) dataset is a collection of data designed to support research in autonomous driving tasks, such as 3D tracking, motion forecasting, and stereo depth estimation. Developed by Argo AI, the dataset provides a wide range of high-quality sensor data, including high-resolution images, LiDAR point clouds, and map data.

!!! note

    The Argoverse dataset *.zip file required for training was removed from Amazon S3 after the shutdown of Argo AI by Ford, but we have made it available for manual download on [Google Drive](https://drive.google.com/file/d/1st9qW3BeIwQsnR0t8mRpvbsSWIo16ACi/view?usp=drive_link).

## Key Features

- Argoverse contains over 290K labeled 3D object tracks and 5 million object instances across 1,263 distinct scenes.
@ -32,7 +32,7 @@ names:
  79: toothbrush
```

Labels for this format should be exported to YOLO format with one `*.txt` file per image. If there are no objects in an image, no `*.txt` file is required. The `*.txt` file should be formatted with one row per object in `class x_center y_center width height` format. Box coordinates must be in **normalized xywh** format (from 0 to 1). If your boxes are in pixels, you should divide `x_center` and `width` by image width, and `y_center` and `height` by image height. Class numbers should be zero-indexed (start with 0).
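As an aside, this conversion is easy to script. The following is a minimal sketch with hypothetical values (the class id, box corners, and image size are illustrative, not from any dataset):

```python
# Minimal sketch of the pixel-to-normalized conversion described above.
def to_yolo_row(cls, x_min, y_min, x_max, y_max, img_w, img_h):
    """Return one 'class x_center y_center width height' row, normalized to 0-1."""
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{cls} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

print(to_yolo_row(0, 98, 345, 420, 462, 640, 480))  # e.g. a class-0 box in a 640x480 image
```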
<p align="center"><img width="750" src="https://user-images.githubusercontent.com/26833433/91506361-c7965000-e886-11ea-8291-c72b98c25eec.jpg"></p>
@ -59,13 +59,13 @@ Here's how you can use these formats to train your model:

        model = YOLO('yolov8n.pt')  # load a pretrained model (recommended for training)

        # Train the model
        model.train(data='coco8.yaml', epochs=100, imgsz=640)
        ```

    === "CLI"

        ```bash
        # Start training from a pretrained *.pt model
        yolo detect train data=coco8.yaml model=yolov8n.pt epochs=100 imgsz=640
        ```
## Supported Datasets

@ -77,6 +77,7 @@ Here is a list of the supported datasets and a brief description for each:

- [**COCO8**](./coco8.md): A smaller subset of the COCO dataset, COCO8 is more lightweight and faster to train.
- [**GlobalWheat2020**](./globalwheat2020.md): A dataset containing images of wheat heads for the Global Wheat Challenge 2020.
- [**Objects365**](./objects365.md): A large-scale object detection dataset with 365 object categories and 600k images, aimed at advancing object detection research.
- [**OpenImagesV7**](./open-images-v7.md): A comprehensive dataset by Google with 1.7M train images and 42k validation images.
- [**SKU-110K**](./sku-110k.md): A dataset containing images of densely packed retail products, intended for retail environment object detection.
- [**VisDrone**](./visdrone.md): A dataset focusing on drone-based images, containing various object categories like cars, pedestrians, and cyclists.
- [**VOC**](./voc.md): PASCAL VOC is a popular object detection dataset with 20 object categories including vehicles, animals, and furniture.

docs/datasets/detect/open-images-v7.md (new file, 106 lines)
@ -0,0 +1,106 @@
---
comments: true
description: Dive into Google's Open Images V7, a comprehensive dataset offering a broad scope for computer vision research. Understand its usage with deep learning models.
keywords: Open Images V7, object detection, segmentation masks, visual relationships, localized narratives, computer vision, deep learning, annotations, bounding boxes
---

# Open Images V7 Dataset

[Open Images V7](https://storage.googleapis.com/openimages/web/index.html) is a versatile and expansive dataset championed by Google. Aimed at propelling research in the realm of computer vision, it boasts a vast collection of images annotated with a plethora of data, including image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localized narratives.

![Open Images V7 classes visual](https://user-images.githubusercontent.com/26833433/258660358-2dc07771-ec08-4d11-b24a-f66e07550050.png)

## Key Features

- Encompasses ~9M images annotated in various ways to suit multiple computer vision tasks.
- Houses a staggering 16M bounding boxes across 600 object classes in 1.9M images. These boxes are primarily hand-drawn by experts, ensuring high precision.
- Visual relationship annotations totaling 3.3M are available, detailing 1,466 unique relationship triplets, object properties, and human activities.
- V5 introduced segmentation masks for 2.8M objects across 350 classes.
- V6 introduced 675k localized narratives that amalgamate voice, text, and mouse traces highlighting described objects.
- V7 introduced 66.4M point-level labels on 1.4M images, spanning 5,827 classes.
- Encompasses 61.4M image-level labels across a diverse set of 20,638 classes.
- Provides a unified platform for image classification, object detection, relationship detection, instance segmentation, and multimodal image descriptions.
## Dataset Structure

Open Images V7 is structured in multiple components catering to varied computer vision challenges:

- **Images**: About 9 million images, often showcasing intricate scenes with an average of 8.3 objects per image.
- **Bounding Boxes**: Over 16 million boxes that demarcate objects across 600 categories.
- **Segmentation Masks**: These detail the exact boundary of 2.8M objects across 350 classes.
- **Visual Relationships**: 3.3M annotations indicating object relationships, properties, and actions.
- **Localized Narratives**: 675k descriptions combining voice, text, and mouse traces.
- **Point-Level Labels**: 66.4M labels across 1.4M images, suitable for zero/few-shot semantic segmentation.

## Applications

Open Images V7 is a cornerstone for training and evaluating state-of-the-art models in various computer vision tasks. The dataset's broad scope and high-quality annotations make it indispensable for researchers and developers specializing in computer vision.

## Dataset YAML

Typically, datasets come with a YAML (YAML Ain't Markup Language) file that delineates the dataset's configuration. For Open Images V7, the `open-images-v7.yaml` file shown below defines the dataset. For accurate paths and configurations, refer to the dataset's official repository or documentation.

!!! example "OpenImagesV7.yaml"

    ```yaml
    --8<-- "ultralytics/cfg/datasets/open-images-v7.yaml"
    ```
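Since the YAML is included from the repository rather than reproduced inline, a quick way to sanity-check the configuration before training is to load a local copy directly. This is a minimal sketch assuming the usual Ultralytics dataset fields (`path`, `train`, `val`, `names`) and a hypothetical local file path:

```python
# Hypothetical sketch: inspect a local copy of the dataset YAML before training.
# Field names follow the usual Ultralytics dataset convention; verify against the real file.
import yaml

with open('open-images-v7.yaml') as f:  # adjust to your local path
    cfg = yaml.safe_load(f)

print(cfg.get('path'))                   # dataset root directory
print(cfg.get('train'), cfg.get('val'))  # train/val image locations
print(len(cfg.get('names', {})))         # number of object classes
```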
## Usage

To train a YOLOv8n model on the Open Images V7 dataset for 100 epochs with an image size of 640, you can use the following code snippets. For a comprehensive list of available arguments, refer to the model [Training](../../modes/train.md) page.

!!! warning

    The complete Open Images V7 dataset comprises 1,743,042 training images and 41,620 validation images, requiring approximately **561 GB of storage space** upon download.

    Executing the commands provided below will trigger an automatic download of the full dataset if it's not already present locally. Before running the example below, it's crucial to:

    - Verify that your device has enough storage capacity (a quick check is sketched after this warning).
    - Ensure a robust and speedy internet connection.
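In line with the first checklist item, a pre-flight check along these lines can help avoid a failed download; the 561 GB figure comes from the warning above, while the checked path is an assumption to adjust:

```python
# Minimal sketch: verify free disk space before triggering the ~561 GB download.
# The path below is an assumption; point it at the drive that will hold the dataset.
import shutil

REQUIRED_GB = 561
free_gb = shutil.disk_usage('.').free / 1e9
print(f'{free_gb:.0f} GB free, {REQUIRED_GB} GB required')
if free_gb < REQUIRED_GB:
    raise SystemExit('Not enough free space for the full Open Images V7 download.')
```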
!!! example "Train Example"

    === "Python"

        ```python
        from ultralytics import YOLO

        # Load a COCO-pretrained YOLOv8n model
        model = YOLO('yolov8n.pt')

        # Train the model on the Open Images V7 dataset
        model.train(data='open-images-v7.yaml', epochs=100, imgsz=640)
        ```

    === "CLI"

        ```bash
        # Train a COCO-pretrained YOLOv8n model on the Open Images V7 dataset
        yolo detect train data=open-images-v7.yaml model=yolov8n.pt epochs=100 imgsz=640
        ```
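Once training finishes, a natural follow-up is to validate the saved weights. This is a hedged sketch: `runs/detect/train/weights/best.pt` is the typical default save location, but the actual path depends on your run:

```python
# Hypothetical follow-up: evaluate trained weights on the Open Images V7 validation split.
from ultralytics import YOLO

model = YOLO('runs/detect/train/weights/best.pt')  # adjust to your actual weights path
metrics = model.val(data='open-images-v7.yaml')
print(metrics.box.map)  # mAP50-95 on the validation set
```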
## Sample Data and Annotations

Illustrations of the dataset help provide insights into its richness:

![Open Images V7 sample image](https://storage.googleapis.com/openimages/web/images/oidv7_all-in-one_example_ab.jpg)

- **Open Images V7**: This image exemplifies the depth and detail of annotations available, including bounding boxes, relationships, and segmentation masks.

Researchers can gain invaluable insights into the array of computer vision challenges that the dataset addresses, from basic object detection to intricate relationship identification.

## Citations and Acknowledgments

For those employing Open Images V7 in their work, it's prudent to cite the relevant papers and acknowledge the creators:

```bibtex
@article{OpenImages,
  author = {Alina Kuznetsova and Hassan Rom and Neil Alldrin and Jasper Uijlings and Ivan Krasin and Jordi Pont-Tuset and Shahab Kamali and Stefan Popov and Matteo Malloci and Alexander Kolesnikov and Tom Duerig and Vittorio Ferrari},
  title = {The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale},
  year = {2020},
  journal = {IJCV}
}
```

A heartfelt acknowledgment goes out to the Google AI team for creating and maintaining the Open Images V7 dataset. For a deep dive into the dataset and its offerings, navigate to the [official Open Images V7 website](https://storage.googleapis.com/openimages/web/index.html).
@ -12,48 +12,49 @@ Ultralytics provides support for various datasets to facilitate computer vision
Bounding box object detection is a computer vision technique that involves detecting and localizing objects in an image by drawing a bounding box around each object.
- [Argoverse](detect/argoverse.md): A dataset containing 3D tracking and motion forecasting data from urban environments with rich annotations.
- [COCO](detect/coco.md): A large-scale dataset designed for object detection, segmentation, and captioning with over 200K labeled images.
- [COCO8](detect/coco8.md): Contains the first 4 images each from COCO train and COCO val, suitable for quick tests.
- [Global Wheat 2020](detect/globalwheat2020.md): A dataset of wheat head images collected from around the world for object detection and localization tasks.
- [Objects365](detect/objects365.md): A high-quality, large-scale dataset for object detection with 365 object categories and over 600K annotated images.
- [OpenImagesV7](detect/open-images-v7.md): A comprehensive dataset by Google with 1.7M train images and 42k validation images.
- [SKU-110K](detect/sku-110k.md): A dataset featuring dense object detection in retail environments with over 11K images and 1.7 million bounding boxes.
- [VisDrone](detect/visdrone.md): A dataset containing object detection and multi-object tracking data from drone-captured imagery with over 10K images and video sequences.
- [VOC](detect/voc.md): The Pascal Visual Object Classes (VOC) dataset for object detection and segmentation with 20 object classes and over 11K images.
- [xView](detect/xview.md): A dataset for object detection in overhead imagery with 60 object categories and over 1 million annotated objects.
## [Instance Segmentation Datasets](segment/index.md)
Instance segmentation is a computer vision technique that involves identifying and localizing objects in an image at the pixel level.

- [COCO](segment/coco.md): A large-scale dataset designed for object detection, segmentation, and captioning tasks with over 200K labeled images.
- [COCO8-seg](segment/coco8-seg.md): A smaller dataset for instance segmentation tasks, containing a subset of 8 COCO images with segmentation annotations.
## [Pose Estimation](pose/index.md)
Pose estimation is a technique used to determine the pose of an object relative to the camera or the world coordinate system.

- [COCO](pose/coco.md): A large-scale dataset with human pose annotations designed for pose estimation tasks.
- [COCO8-pose](pose/coco8-pose.md): A smaller dataset for pose estimation tasks, containing a subset of 8 COCO images with human pose annotations.
## [Classification](classify/index.md)
Image classification is a computer vision task that involves categorizing an image into one or more predefined classes or categories based on its visual content.

- [Caltech 101](classify/caltech101.md): A dataset containing images of 101 object categories for image classification tasks.
- [Caltech 256](classify/caltech256.md): An extended version of Caltech 101 with 256 object categories and more challenging images.
- [CIFAR-10](classify/cifar10.md): A dataset of 60K 32x32 color images in 10 classes, with 6K images per class.
- [CIFAR-100](classify/cifar100.md): An extended version of CIFAR-10 with 100 object categories and 600 images per class.
- [Fashion-MNIST](classify/fashion-mnist.md): A dataset consisting of 70,000 grayscale images of 10 fashion categories for image classification tasks.
- [ImageNet](classify/imagenet.md): A large-scale dataset for object detection and image classification with over 14 million images and 20,000 categories.
- [ImageNet-10](classify/imagenet10.md): A smaller subset of ImageNet with 10 categories for faster experimentation and testing.
- [Imagenette](classify/imagenette.md): A smaller subset of ImageNet that contains 10 easily distinguishable classes for quicker training and testing.
- [Imagewoof](classify/imagewoof.md): A more challenging subset of ImageNet containing 10 dog breed categories for image classification tasks.
- [MNIST](classify/mnist.md): A dataset of 70,000 grayscale images of handwritten digits for image classification tasks.
## [Multi-Object Tracking](track/index.md)
Multi-object tracking is a computer vision technique that involves detecting and tracking multiple objects over time in a video sequence.

- [Argoverse](detect/argoverse.md): A dataset containing 3D tracking and motion forecasting data from urban environments with rich annotations for multi-object tracking tasks.
- [VisDrone](detect/visdrone.md): A dataset containing object detection and multi-object tracking data from drone-captured imagery with over 10K images and video sequences.