Benchmark Suite

The Planetscapes benchmark suite evaluates semantic, instance, and panoptic segmentation methods on the dataset. Below are the primary evaluation metrics and placeholders for results and leaderboards.

Pixel-Level Semantic Labeling Task

The Planetscapes task involves predicting a per-pixel semantic labeling of the image without considering higher-level object instance or boundary information.

Metrics

To assess performance, we rely on the standard Jaccard Index, commonly known as the Pascal VOC intersection-over-union metric IoU = TP / (TP + FP + FN), where TP, FP, and FN are the numbers of true positive, false positive, and false negative pixels, respectively, determined over the whole test set. Owing to the two semantic granularities, i.e. classes and categories, we report two separate mean performance scores: IoU_{category} and IoU_{class}. In either case, pixels labeled as void do not contribute to the score.

It is well known that the global IoU measure is biased toward object instances that cover a large image area. In street scenes with their strong scale variation this can be problematic. Specifically for traffic participants, which are the key classes in our scenario, we aim to evaluate how well the individual instances in the scene are represented in the labeling. To address this, we additionally evaluate the semantic labeling using an instance-level intersection-over-union metric iIoU = iTP / (iTP + FP + iFN), where iTP, FP, and iFN denote the numbers of true positive, false positive, and false negative pixels, respectively. In contrast to the standard IoU measure, however, iTP and iFN are computed by weighting the contribution of each pixel by the ratio of the class' average instance size to the size of the respective ground truth instance. It is important to note that, unlike the instance-level task below, we assume that the methods only yield a standard per-pixel semantic class labeling as output. Therefore, the false positive pixels are not associated with any instance and thus do not require normalization. The final scores, iIoU_{category} and iIoU_{class}, are computed as the means for the two semantic granularities.
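
The two metrics above can be sketched for a single class as follows. This is a minimal illustration and not the official evaluation script. The void label id (255), the per-pixel instance-id map, the function names, and the `avg_size` argument (the class' average instance size, which we assume is computed beforehand) are all assumptions made for this example.

```python
import numpy as np

VOID = 255  # assumed void label id (hypothetical, not specified by the benchmark)


def class_iou(gt, pred, class_id, void_id=VOID):
    """IoU = TP / (TP + FP + FN) for one class, ignoring void pixels."""
    valid = gt != void_id
    gt_c = (gt == class_id) & valid
    pred_c = (pred == class_id) & valid
    tp = np.sum(gt_c & pred_c)
    fp = np.sum(~gt_c & pred_c)
    fn = np.sum(gt_c & ~pred_c)
    denom = tp + fp + fn
    return tp / denom if denom > 0 else float("nan")


def class_iiou(gt, pred, instances, class_id, avg_size, void_id=VOID):
    """iIoU = iTP / (iTP + FP + iFN) for one class.

    `instances` is an assumed per-pixel map of ground truth instance ids.
    Each ground truth instance's pixels are weighted by avg_size / instance_size.
    False positives stay unweighted because they belong to no instance.
    """
    valid = gt != void_id
    gt_c = (gt == class_id) & valid
    pred_c = (pred == class_id) & valid
    itp = 0.0
    ifn = 0.0
    for inst_id in np.unique(instances[gt_c]):
        mask = gt_c & (instances == inst_id)
        weight = avg_size / mask.sum()
        itp += weight * np.sum(mask & pred_c)
        ifn += weight * np.sum(mask & ~pred_c)
    fp = float(np.sum(~gt_c & pred_c))
    denom = itp + fp + ifn
    return itp / denom if denom > 0 else float("nan")


# Toy example: one large instance (6 px, found) and one small instance (1 px, missed).
gt = np.array([[1, 1, 1, 0],
               [1, 1, 1, 0],
               [0, 0, 0, 0],
               [1, 0, 0, 255]])
instances = np.array([[1, 1, 1, 0],
                      [1, 1, 1, 0],
                      [0, 0, 0, 0],
                      [2, 0, 0, 0]])
pred = np.array([[1, 1, 1, 0],
                 [1, 1, 1, 0],
                 [0, 0, 0, 1],
                 [0, 0, 0, 1]])

iou = class_iou(gt, pred, class_id=1)
iiou = class_iiou(gt, pred, instances, class_id=1, avg_size=3.5)
```

In this toy example the prediction on the void pixel is ignored, so TP = 6, FP = 1, and FN = 1, giving IoU = 0.75. With an average instance size of 3.5 pixels, the missed small instance is weighted as heavily as the large one, and iIoU drops to 0.4375. The mean scores over classes or categories could then be taken with `np.nanmean` over the per-class values.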

Results

Leaderboard and detailed per-class scores will be provided here. For now, include your evaluation scripts and upload benchmark results to the repository.

Placeholder: benchmark tables and charts will appear here.

Note: The numbers below are placeholder/demo results (not official).

Method	Backbone	mIoU	Pixel Acc	AP	AP50	PQ
Baseline-Seg	ResNet-50	0.412	0.892	0.238	0.421	0.301
Transformer-Seg	Swin-T	0.468	0.906	0.271	0.459	0.337
Instance-Plus	ResNet-101	0.451	0.901	0.294	0.487	0.329
Ours (Demo)	ViT-B	0.502	0.917	0.315	0.512	0.361

Visual Question Answering Task

The Planetscapes VQA task evaluates multi-modal understanding of planetary scenes. Given an image and a natural-language question, the goal is to predict the correct answer (e.g., yes/no, counting, or short text).

Metrics

We report standard VQA-style accuracy, including overall accuracy and accuracy by answer type. The numbers below are placeholders for demonstration.

Overall Acc — overall question answering accuracy
Yes/No — accuracy on binary questions
Number — accuracy on counting/number questions
Other — accuracy on other open-ended questions
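
The breakdown above can be sketched as follows. This is a simplified illustration that assumes a single ground truth answer per question and exact string matching after lowercasing and trimming. Soft scoring over multiple annotator answers, as used in some VQA benchmarks, is not modeled. The record layout and the function name `vqa_accuracy` are hypothetical.

```python
from collections import defaultdict


def vqa_accuracy(records):
    """Compute overall and per-answer-type accuracy.

    `records` is an iterable of (answer_type, predicted, ground_truth) tuples,
    an assumed layout chosen for this sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for answer_type, predicted, truth in records:
        # Exact match after simple normalization (an assumption of this sketch).
        hit = predicted.strip().lower() == truth.strip().lower()
        for key in (answer_type, "overall"):
            total[key] += 1
            correct[key] += int(hit)
    return {key: correct[key] / total[key] for key in total}


records = [
    ("yes/no", "Yes", "yes"),
    ("yes/no", "no", "yes"),
    ("number", "3", "3"),
    ("other", "crater", "dune"),
]
scores = vqa_accuracy(records)
```

For these four records, `scores` holds 0.5 for "overall" and "yes/no", 1.0 for "number", and 0.0 for "other".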

Results

A leaderboard and detailed breakdowns will be provided here. For now, we show a demo table.

Note: The numbers below are placeholder/demo results (not official).

Method	Backbone	Overall Acc	Yes/No	Number	Other
Baseline-VQA	CLIP-ViT-B/32	0.542	0.701	0.388	0.501
Transformer-VQA	ViT-B + Text-T	0.598	0.742	0.431	0.563
Ours (Demo)	ViT-L	0.631	0.768	0.462	0.601