NAFlora-1M: Continental-Scale High-Resolution Fine-Grained Plant Classification Dataset

Abstract

The plant kingdom exhibits remarkable diversity that must be maintained for global ecosystem sustainability. However, plant life is currently disappearing at a disproportionately rapid rate, putting many essential functions, such as ecosystem production, resistance, and resilience, at risk. Plant specimen identification, the first step of plant biodiversity research, is heavily bottlenecked by a shortage of qualified experts. The botanical community has imaged large volumes of annotated physical herbarium specimens, which present a huge potential for building artificial intelligence systems that can assist researchers. In this paper, we present a novel large-scale, fine-grained dataset, NAFlora-1M, which consists of 1,050,182 herbarium images covering 15,501 North American vascular plant species (90% of the known species). Addressing gaps from previous research efforts, NAFlora-1M is the first dataset to closely replicate the real-world task of herbarium specimen identification, as it is intended to cover as many of the taxa in North America as possible. We highlight some key characteristics of NAFlora-1M from a machine learning dataset perspective: high-quality labels rigorously peer-reviewed by experts; hierarchical class structure; long-tailed and imbalanced class distribution; high image resolution; and extensive image quality control for consistent scale and color. In addition, we present several baseline models, along with benchmarking results from a Kaggle competition: a total of 134 teams benchmarked the dataset in 1,663 submissions, and the leading team achieved an 87.66% macro-F1 score with a 1-billion-parameter ensemble model, leaving substantial room for future improvement in both performance and efficiency. We believe that NAFlora-1M is an excellent starting point to encourage the development of botanical AI applications, thereby facilitating enhanced monitoring of plant diversity and conservation efforts.
The dataset and training scripts are available at https://github.com/dpl10/NAFlora-1M.
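The abstract reports results as a macro-averaged F score, a common choice when the class distribution is long-tailed, because every species counts equally regardless of how many specimens it has. The following is a minimal sketch of that metric in plain Python, not the competition's scoring code: the function name `macro_f1`, the toy labels, and the choice to average over the classes present in the ground truth are all assumptions made for illustration.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumption: averaged over classes in y_true)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("y_true must not be empty")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative

    scores = []
    for cls in sorted(set(y_true)):
        denom = 2 * tp[cls] + fp[cls] + fn[cls]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit
        scores.append(2 * tp[cls] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced toy example: 8 specimens of a common class, 2 of a rare one.
toy_true = ["common"] * 8 + ["rare"] * 2
toy_pred = ["common"] * 10  # a model that ignores the rare class entirely
toy_score = macro_f1(toy_true, toy_pred)
```

In this toy case the accuracy is 0.8, but the macro-F1 is only about 0.44, because the rare class contributes an F1 of zero. This is why the metric is informative on a long-tailed dataset, where a model can score well on accuracy while failing on most rare taxa.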

Publication
In Journal of Data-centric Machine Learning Research
Riccardo de Lutio
Research Scientist

I am a Research Scientist at NVIDIA in the Toronto AI Lab, generally interested in 3D computer vision and neural reconstruction.