Self-driving car dataset missing labels for pedestrians, cyclists – Naked Security


A popular self-driving car dataset for training machine-learning systems – one that’s used by thousands of students to build an open-source self-driving car – contains critical errors and omissions, including missing labels for hundreds of images of bicyclists and pedestrians.

Machine learning models are only as good as the data on which they’re trained. But when researchers at Roboflow, a firm that writes boilerplate computer vision code, hand-checked the 15,000 images in Udacity Dataset 2, they found problems with 4,986 – that’s 33% – of those images.

From a writeup of Roboflow’s findings, which were published by founder Brad Dwyer on Tuesday:

Amongst these [problematic data] were thousands of unlabeled vehicles, hundreds of unlabeled pedestrians, and dozens of unlabeled cyclists. We also found many instances of phantom annotations, duplicated bounding boxes, and drastically oversized bounding boxes.

Perhaps most egregiously, 217 (1.4%) of the images were completely unlabeled but actually contained cars, trucks, street lights, and/or pedestrians.
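Catching these categories of errors doesn't require anything exotic. As a rough illustration (not Roboflow's actual tooling), the sketch below scans a bounding-box annotation file for images with no labels at all and for exactly duplicated boxes; the CSV column names are assumptions about a typical annotation format, not the dataset's real schema.

```python
# Rough sketch of an annotation audit -- not Roboflow's actual tooling.
# Assumes a CSV of bounding boxes with these (hypothetical) columns:
#   frame, xmin, ymin, xmax, ymax, label
import csv
from collections import defaultdict

def audit_annotations(csv_path, image_names):
    """Flag images with no labels at all, and boxes that are exact duplicates."""
    boxes = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            box = (row["xmin"], row["ymin"], row["xmax"], row["ymax"], row["label"])
            boxes[row["frame"]].append(box)

    # Images present on disk but absent from the annotation file.
    unlabeled = [name for name in image_names if name not in boxes]

    # Images where the same box appears more than once.
    duplicated = {name: bxs for name, bxs in boxes.items()
                  if len(bxs) != len(set(bxs))}
    return unlabeled, duplicated
```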

Junk in, junk out. In the case of the AI behind self-driving cars, junk data could literally lead to deaths. Here's how Dwyer describes the way bad or unlabeled data propagates through a machine learning system:

Generally speaking, machine learning models learn by example. You give it a photo, it makes a prediction, and then you nudge it a little bit in the direction that would have made its prediction more ‘right’. Where ‘right’ is defined as the ‘ground truth’, which is what your training data is.

If your training data’s ground truth is wrong, your model still happily learns from it, it’s just learning the wrong things (eg ‘that blob of pixels is *not* a cyclist’ vs ‘that blob of pixels *is* a cyclist’)

Neural networks do an Ok job of performing well despite *some* errors in their training data, but when 1/3 of the ground truth images have issues it’s definitely going to degrade performance.
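As a much-simplified illustration of that nudging, here is a sketch of a single update step for a tiny logistic-regression classifier. It isn't how a real perception model is trained, but it shows the core point: the update pushes the model toward whatever label the dataset supplies, whether that label is right or wrong.

```python
# Minimal sketch of 'nudging' a model toward its ground-truth label.
# If the label is wrong (e.g. a cyclist marked as background), the same
# update happily pushes the model in the wrong direction.
import numpy as np

def sgd_step(weights, features, label, lr=0.1):
    """One logistic-regression update toward whatever label the dataset claims."""
    prediction = 1.0 / (1.0 + np.exp(-weights @ features))  # model's guess in [0, 1]
    gradient = (prediction - label) * features              # direction of 'more right'
    return weights - lr * gradient

weights = np.zeros(3)
pixels = np.array([0.8, 0.2, 0.5])              # stand-in for 'that blob of pixels'
weights = sgd_step(weights, pixels, label=1.0)  # correct label: 'is a cyclist'
weights = sgd_step(weights, pixels, label=0.0)  # mislabeled: 'is not a cyclist'
```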
