BIOSCAN’s global biodiversity assessment aims to catalog living organisms worldwide, including the full breadth of insect biodiversity. Insects are a fundamental component of global ecosystems: they are remarkably diverse and contribute significantly to pollination, nutrient cycling, and overall ecosystem stability. In pursuit of this goal, a curated collection of more than one million hand-labelled insect images has been created, with each image classified by taxonomic experts and paired with genetic data. The initial phase of the project has been completed with the release of the BIOSCAN-1M Insect Dataset, presented in the Datasets & Benchmarks Track of Advances in Neural Information Processing Systems (NeurIPS 2023). The published work can be explored on the project website (https://biodiversitygenomics.net/1M_insects/).
We present a curated dataset of one million hand-labelled insect images. Each record carries an expert taxonomic classification and associated genetic information, including raw nucleotide barcode sequences and barcode index numbers (BINs), a genetics-based proxy for species classification. The BIOSCAN-1M Insect Dataset is designed primarily to support the training of computer-vision models for image-based taxonomic assessment. Beyond that application, it exhibits characteristics common to biological datasets, such as a long-tailed class-imbalance distribution. The taxonomic labelling follows a hierarchical classification scheme, which poses a fine-grained classification problem, especially at lower taxonomic levels. These characteristics make the dataset of interest not only for biodiversity research but also to the broader machine learning community.
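To make the hierarchical labelling and the long-tailed class distribution concrete, the following is a minimal Python sketch rather than the dataset's own tooling. The `InsectRecord` layout, its field names, the helper functions, and the toy records are all assumptions introduced here for illustration; they do not reflect the actual BIOSCAN-1M file format, and the species names, barcode strings, and BIN identifiers are placeholders. The sketch counts classes at several taxonomic ranks and reports the ratio between the largest and smallest class, one simple indicator of imbalance.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class InsectRecord:
    """Hypothetical record layout; NOT the dataset's actual schema."""
    image_file: str
    order: str
    family: str
    species: str
    barcode: str  # raw nucleotide barcode sequence (placeholder here)
    bin_id: str   # barcode index number, BIN (placeholder here)


def make_records(n, order, family, species, bin_id):
    """Create n toy records that share one set of labels."""
    return [
        InsectRecord(f"{bin_id}_{i}.jpg", order, family, species, "ACGT", bin_id)
        for i in range(n)
    ]


# Invented toy data: one dominant class and several rare ones.
TOY_RECORDS = (
    make_records(4, "Diptera", "Cecidomyiidae", "species_A", "BIN-A")
    + make_records(2, "Diptera", "Chironomidae", "species_B", "BIN-B")
    + make_records(1, "Hymenoptera", "Formicidae", "species_C", "BIN-C")
    + make_records(1, "Coleoptera", "Carabidae", "species_D", "BIN-D")
)


def class_counts(records, rank):
    """Count records per class at the given taxonomic rank."""
    return Counter(getattr(record, rank) for record in records)


def imbalance_ratio(counts):
    """Ratio of the largest class size to the smallest class size."""
    return max(counts.values()) / min(counts.values())


if __name__ == "__main__":
    for rank in ("order", "family", "species"):
        counts = class_counts(TOY_RECORDS, rank)
        print(rank, len(counts), imbalance_ratio(counts))
```

In this toy example, moving from order to family increases the number of classes while each class gets fewer images, which is the pattern that makes the lower taxonomic levels the harder classification problem.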