itk_evaluate

Evaluate segmentation predictions against ground truth by calculating comprehensive metrics and volume statistics.

Usage

itk_evaluate <gt_folder> <pred_folder> <save_folder> [options]

gt_folder: Folder containing ground truth masks (.mha, .nii, .nii.gz, .nrrd files)
pred_folder: Folder containing prediction masks (same format as gt_folder)
save_folder: Folder to save evaluation results (created if not exists)
--format {csv,excel}: Output format (default: excel)
csv: Saves 5 separate CSV files
excel: Saves 1 Excel file with 5 sheets
--mp: Enable multiprocessing for faster evaluation
--workers N: Number of worker processes (default: half of CPU cores)

Automatic Resampling: Predictions are automatically resampled to match ground truth spacing/size if they differ
Consistent Orientation: All samples are oriented to LPI for consistent evaluation
Multi-class Support: Handles multi-class segmentation with per-class metrics
Volume Statistics: Calculates volumes in cubic millimeters for both GT and predictions
Multiple Aggregation Views: Provides metrics in 5 different aggregation formats

For each class, the following metrics are computed:

For each class in each sample:

Volumes are computed as: voxel_count × (spacing_x × spacing_y × spacing_z)

The tool generates 5 different views of the evaluation results:

Mean metrics for each class across all samples.

metric	class_0	class_1	class_2	...
dice	0.95	0.92	0.88	...
iou	0.91	0.85	0.79	...
recall	0.94	0.91	0.87	...

Detailed metrics for each sample and each class (most granular view).

sample	class_0_dice	class_0_iou	class_1_dice	...	accuracy
case001	0.96	0.92	0.94	...	0.95
case002	0.94	0.89	0.90	...	0.93

Mean metric across all classes for each sample.

sample	dice	iou	fscore	recall	precision	accuracy
case001	0.95	0.91	0.95	0.94	0.96	0.95
case002	0.92	0.85	0.92	0.91	0.93	0.93

Volume in cubic millimeters for each class in each sample (from ground truth).

sample	class_0	class_1	class_2	...
case001	1250.5	3420.8	890.2	...
case002	1180.3	3350.1	910.5	...

Volume in cubic millimeters for each class in each sample (from prediction).

sample	class_0	class_1	class_2	...
case001	1245.2	3380.5	885.1	...
case002	1175.8	3320.3	915.2	...