One approach to scene understanding is to train multiple decoders. Each decoder trains on a separate task. We might have one decoder for segmentation and another for depth measurement. This way we can have a single network which not only predicts the class of a pixel but additionally how far away it is. You can imagine using this information to reconstruct a rich 3D scene, kind of like how we human do. This is in fact, the next step to visual perception. For now though, we’ll keep things relatively simple focused just on semantic segmentation.