University of Waterloo

PostDoc Seminar

Zahra Gharaee, Post Doc, VIP Lab

Nov 24, 2023, 11:30 am, EC4-2101A


This seminar presents the BIOSCAN-1M Insect Dataset, a repository of one million labeled insect images. Each record is taxonomically classified by an expert and paired with genetic information, including the raw nucleotide barcode sequence and the barcode index number, a genetics-based proxy for species classification. The dataset’s primary purpose is to support the training of computer-vision models for image-based taxonomic assessment. It also has characteristics that make it a demanding benchmark in its own right: its classes follow the long-tailed, imbalanced distribution commonly observed in biological datasets, and its taxonomic labels follow a hierarchical scheme, which poses a fine-grained classification problem, especially at lower taxonomic levels. These properties make the dataset relevant to the broader machine learning community, not only to insect classification. Beyond machine learning, the BIOSCAN-1M Insect Dataset aims to spark interest in biodiversity research and contributes to the broader goal of BIOSCAN research: laying the groundwork for a comprehensive survey of global biodiversity.