The Planetscapes benchmark suite evaluates semantic, instance and panoptic segmentation methods on the dataset. Below are the primary evaluation metrics and placeholders for results and leaderboards.
The Planetscapes task involves predicting a per-pixel semantic labeling of the image without considering higher-level object instance or boundary information.
To assess performance, we rely on the standard Jaccard Index, commonly known as the Pascal VOC intersection-over-union metric IoU = TP / (TP + FP + FN), where TP, FP, and FN are the numbers of true positive, false positive, and false negative pixels, respectively, determined over the whole test set. Owing to the two semantic granularities, i.e. classes and categories, we report two separate mean performance scores: IoU_{category} and IoU_{class}. In either case, pixels labeled as void do not contribute to the score.
It is well known that the global IoU measure is biased toward object instances that cover a large image area. In street scenes with their strong scale variation this can be problematic. Specifically for traffic participants, which are the key classes in our scenario, we aim to evaluate how well the individual instances in the scene are represented in the labeling. To address this, we additionally evaluate the semantic labeling using an instance-level intersection-over-union metric iIoU = iTP / (iTP + FP + iFN). Again, iTP, FP, and iFN denote true positive, false positive, and false negative pixels, respectively. However, in contrast to the standard IoU measure, iTP and iFN are computed by weighting the contribution of each pixel by the ratio of the class' average instance size to the size of the respective ground truth instance.
It is important to note that, unlike the instance-level task below, we assume that the methods only yield a standard per-pixel semantic class labeling as output. Therefore, the false positive pixels are not associated with any instance and thus do not require normalization. The final scores, iIoU_{category} and iIoU_{class}, are computed as the means for the two semantic granularities.
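
As a rough illustration of the two metrics, the following is a minimal sketch rather than the official evaluation script. It assumes flat per-pixel lists accumulated over the whole test set, a hypothetical void id of `255`, ground-truth instance ids that are unique across images (`None` for pixels without an instance), and an average instance size taken from the same ground truth. The function names `class_iou` and `class_iiou` are made up for this example.

```python
from collections import Counter, defaultdict

VOID = 255  # hypothetical void label id; the actual id is not given here


def class_iou(gt, pred, void=VOID):
    """Per-class IoU = TP / (TP + FP + FN) over flat per-pixel label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gt, pred):
        if g == void:              # void pixels do not contribute to the score
            continue
        if g == p:
            tp[g] += 1
        else:
            fn[g] += 1
            if p != void:
                fp[p] += 1
    classes = set(tp) | set(fp) | set(fn)
    return {c: tp[c] / (tp[c] + fp[c] + fn[c]) for c in classes}


def class_iiou(gt, pred, inst, void=VOID):
    """Per-class iIoU = iTP / (iTP + FP + iFN) for classes that have instances."""
    inst_size, inst_class = Counter(), {}
    for g, i in zip(gt, inst):
        if g != void and i is not None:
            inst_size[i] += 1
            inst_class[i] = g

    # Average instance size per class (assumption: computed from this ground truth).
    total, count = Counter(), Counter()
    for i, size in inst_size.items():
        total[inst_class[i]] += size
        count[inst_class[i]] += 1
    avg = {c: total[c] / count[c] for c in total}

    itp, ifn, fp = defaultdict(float), defaultdict(float), Counter()
    for g, p, i in zip(gt, pred, inst):
        if g == void:
            continue
        if i is not None:
            weight = avg[g] / inst_size[i]   # small instances weigh more per pixel
            if g == p:
                itp[g] += weight
            else:
                ifn[g] += weight
        if g != p and p != void:
            fp[p] += 1                       # false positives are not normalized
    return {c: itp[c] / (itp[c] + fp[c] + ifn[c]) for c in avg}


# Toy example: class 1 has a large instance "a" (8 px) and a small one "b" (2 px);
# class 0 has no instances. The prediction misses instance "b" entirely.
gt = [1] * 8 + [1] * 2 + [0] * 2
inst = ["a"] * 8 + ["b"] * 2 + [None] * 2
pred = [1] * 8 + [0] * 2 + [0] * 2

print(class_iou(gt, pred)[1])         # 0.8
print(class_iiou(gt, pred, inst)[1])  # 0.5
```

In this toy case the missed small instance costs 0.2 under IoU but 0.5 under iIoU, which is the bias the instance-level metric is meant to counter. Averaging the per-class values gives the mean score for that granularity.
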
Leaderboard and detailed per-class scores will be provided here. For now, include your evaluation scripts and upload benchmark results to the repository.
Placeholder: benchmark tables and charts will appear here.
Note: The numbers below are placeholder/demo results (not official).
| Method | Backbone | mIoU | Pixel Acc | AP | AP50 | PQ |
|---|---|---|---|---|---|---|
| Baseline-Seg | ResNet-50 | 0.412 | 0.892 | 0.238 | 0.421 | 0.301 |
| Transformer-Seg | Swin-T | 0.468 | 0.906 | 0.271 | 0.459 | 0.337 |
| Instance-Plus | ResNet-101 | 0.451 | 0.901 | 0.294 | 0.487 | 0.329 |
| Ours (Demo) | ViT-B | 0.502 | 0.917 | 0.315 | 0.512 | 0.361 |
The Planetscapes VQA task evaluates multi-modal understanding of planetary scenes. Given an image and a natural-language question, the goal is to predict the correct answer (e.g., yes/no, counting, or short text).
We report standard VQA-style accuracy: overall accuracy and accuracy by answer type (Yes/No, Number, Other).
A leaderboard and detailed breakdowns will be provided here. For now, we show a demo table.
Note: The numbers below are placeholder/demo results (not official).
| Method | Backbone | Overall Acc | Yes/No | Number | Other |
|---|---|---|---|---|---|
| Baseline-VQA | CLIP-ViT-B/32 | 0.542 | 0.701 | 0.388 | 0.501 |
| Transformer-VQA | ViT-B + Text-T | 0.598 | 0.742 | 0.431 | 0.563 |
| Ours (Demo) | ViT-L | 0.631 | 0.768 | 0.462 | 0.601 |
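
The overall and per-answer-type accuracies in the table columns could be computed along the following lines. This is a sketch under assumptions: one reference answer per question, exact-match scoring after lowercasing and trimming whitespace, and a hypothetical `(answer_type, predicted, reference)` record layout. The text above does not say whether a softer multi-annotator accuracy is intended.

```python
from collections import defaultdict


def vqa_accuracy(records):
    """Return overall and per-answer-type exact-match accuracy.

    records: iterable of (answer_type, predicted, reference) string triples.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for answer_type, predicted, reference in records:
        # Assumed normalization: case-insensitive, surrounding whitespace ignored.
        hit = predicted.strip().lower() == reference.strip().lower()
        for key in ("overall", answer_type):
            total[key] += 1
            correct[key] += hit
    return {key: correct[key] / total[key] for key in total}


# Made-up records for illustration only.
records = [
    ("yes/no", "Yes", "yes"),
    ("yes/no", "no", "yes"),
    ("number", "3", "3"),
    ("other", "crater", "dune"),
]
print(vqa_accuracy(records))
```
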