Sunday 28 January 2018

FAIR releases Detectron

Facebook AI Research (FAIR) has been working on the problem of object detection, using deep learning to let computers determine which objects are present in a scene. Its object detection platform is called Detectron. The project was started in July 2016 with the goal of creating a fast and flexible object detection system. Detectron implements state-of-the-art object detection algorithms, is written in Python, and is powered by the Caffe2 deep learning framework. The algorithms examine image or video input and estimate which discrete objects make up the scene.

At FAIR, Detectron has enabled numerous research projects, including: 

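  • Feature Pyramid Networks for Object Detection: Feature pyramids are a basic component in recognition systems for detecting objects at different scales. However, recent deep learning object detectors have tended to avoid pyramid representations because they are compute and memory intensive. Feature Pyramid Networks instead build the pyramid from the multi-scale feature hierarchy that a deep convolutional network already computes, at marginal extra cost.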
  • Mask R-CNN: A general framework for object instance segmentation. In object instance segmentation, given an image, the goal is to label each pixel according to its object class as well as its object instance. Instance segmentation is closely related to two important tasks in computer vision, namely semantic segmentation and object detection. The goal of semantic segmentation is to label each pixel according to its object class. However, semantic segmentation does not differentiate between two object instances of the same class. For example, if there are two people in an image, semantic segmentation assigns the same label to the pixels belonging to either person. The goal of object detection is to predict the bounding box and the object class of each object instance in the image. However, object detection does not provide per-pixel labeling of the object instance. Compared with semantic segmentation and object detection, object instance segmentation is strictly more challenging, since it aims to identify each object instance and also provide per-pixel labeling of that instance.
  • Detecting and Recognizing Human-Object Interactions: To understand the visual world, a machine must not only recognize individual object instances but also understand how they interact. Human-object interactions in photos are detected and represented as triplets of the form <human, verb, object>, for example <person, reads, book>.
  • Focal Loss for Dense Object Detection: The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors. The focal loss targets the main cause of this gap, the extreme imbalance between foreground and background examples during training, by down-weighting the loss assigned to well-classified examples. A one-stage detector named RetinaNet was designed to evaluate the effectiveness of this loss. Trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors.
  • Non-local Neural Networks: Non-local means is an algorithm in image processing for image denoising. Unlike "local mean" filters, which take the mean value of a group of pixels surrounding a target pixel to smooth the image, non-local means filtering takes a mean of all pixels in the image, weighted by how similar these pixels are to the target pixel. This results in much greater post-filtering clarity, and less loss of detail in the image compared with local mean algorithms. Inspired by the classical non-local means method in computer vision, the non-local operation computes the response at a position as a weighted sum of the features at all positions. This building block can be plugged into many computer vision architectures.
  • Learning to Segment Every Thing: Existing methods for object instance segmentation require all training instances to be labeled with segmentation masks. This requirement makes it expensive to annotate new categories and has restricted instance segmentation models to ~100 well-annotated classes. A new partially supervised training paradigm is proposed, together with a novel weight transfer function, that enables training instance segmentation models over a large set of categories, all of which have box annotations but only a small fraction of which have mask annotations.
  • Data Distillation: Omni-supervised learning is a special area of semi-supervised learning in which the learner exploits all available labeled data plus internet-scale sources of unlabeled data. Data distillation is a method that ensembles predictions from multiple transformations of unlabeled data, using a single model, to automatically generate new training annotations.
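To make the focal loss idea from the list above more concrete, here is a minimal sketch in plain Python. It is not taken from the Detectron codebase. It uses the binary form of the loss from the focal loss paper, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), with the defaults gamma = 2 and alpha = 0.25 reported there. The function names are made up for illustration.

```python
import math


def cross_entropy(p, y):
    """Standard binary cross entropy for a single prediction.

    p: predicted probability of the positive class, 0 < p < 1
    y: ground-truth label, 1 (object) or 0 (background)
    """
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for a single prediction (illustrative sketch).

    The factor (1 - p_t) ** gamma shrinks the loss of examples the model
    already classifies well, so the many easy background locations do not
    dominate training. alpha balances positive against negative examples.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    # An easy example (confident and correct) versus a hard one.
    for p in (0.95, 0.10):
        print(p, cross_entropy(p, 1), focal_loss(p, 1))
```

With gamma set to 0 the modulating factor disappears and the loss reduces to an alpha-weighted cross entropy, which is a handy sanity check when experimenting with it.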
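The non-local operation mentioned in the list above, where the response at a position is a weighted sum of the features at all positions, can be sketched in a few lines of plain Python. This is a simplified illustration and not the authors' implementation: it assumes the similarity between two positions is exp of their dot product, and it leaves out the learned embeddings and the residual connection that the full building block in the paper uses. The function name is hypothetical.

```python
import math


def non_local(features):
    """Apply a simplified non-local operation.

    features: list of N feature vectors (lists of floats), one per position.
    Returns N vectors; each is an average of ALL input vectors, weighted by
    how similar they are to the vector at that position.
    """
    outputs = []
    for x_i in features:
        # Similarity score between position i and every position j.
        scores = [sum(a * b for a, b in zip(x_i, x_j)) for x_j in features]
        # Subtracting the max keeps exp() numerically stable and does not
        # change the normalized weights.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Weighted sum over all positions, normalized by the total weight.
        y_i = [
            sum(w * x_j[d] for w, x_j in zip(weights, features)) / total
            for d in range(len(x_i))
        ]
        outputs.append(y_i)
    return outputs


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    for row in non_local(feats):
        print(row)
```

Unlike a convolution, which only looks at a small neighborhood, every output here depends on every input position, which is what makes the operation "non-local".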

The goal of Detectron is to provide a high-quality, high-performance codebase for object detection research. It is designed to be flexible in order to support rapid implementation and evaluation of novel research. Detectron includes implementations of the following object detection algorithms:

  • Mask R-CNN
  • RetinaNet
  • Faster R-CNN
  • RPN
  • Fast R-CNN
  • R-FCN


From augmented reality to various computer vision tasks, Detectron has a wide variety of uses. One of the many things this platform can do is object masking. Object masking takes object detection a step further: instead of just drawing a bounding box around an object, it can outline the object with a complex polygon. Detectron is available on GitHub under the Apache 2.0 license. FAIR says it is also releasing extensive performance baselines for more than 70 pre-trained models, which can be downloaded from its model zoo on GitHub. Once a model is trained, it can be deployed in the cloud and even on mobile devices.
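The difference between a bounding box and an object mask can be shown with a toy example. The sketch below is plain Python and does not use Detectron. It assumes a mask is given as a small grid of 0s and 1s, and the function name is made up for illustration. It computes the tightest box around the mask and compares how many pixels each one covers.

```python
def mask_to_box(mask):
    """Return the tightest box around a binary mask.

    mask: 2D list of 0/1 values, indexed as mask[row][column].
    Returns (x_min, y_min, x_max, y_max) with inclusive pixel coordinates,
    or None if the mask is empty.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, value in enumerate(row) if value]
    if not rows:
        return None
    return (min(cols), min(rows), max(cols), max(rows))


# A small diamond-shaped object on a 5x5 image.
mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]

box = mask_to_box(mask)
x_min, y_min, x_max, y_max = box
box_area = (x_max - x_min + 1) * (y_max - y_min + 1)
mask_area = sum(sum(row) for row in mask)

if __name__ == "__main__":
    print("box:", box)
    print("pixels inside box:", box_area)
    print("pixels inside mask:", mask_area)
```

The box includes background pixels in its corners, while the mask marks only the pixels that belong to the object. That extra precision is what instance segmentation models such as Mask R-CNN provide.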

References


  1. https://www.techleer.com/articles/469-facebook-announces-open-sourcing-of-detectron-a-real-time-object-detection/
  2. https://github.com/facebookresearch/Detectron
  3. https://arxiv.org/

