Moving towards quantitative grading of vesicoureteral reflux from voiding cystourethrograms
Adree Khondker1,2, Jethro Kwong3,4, Priyank Yadav2, Justin Chan4, Anuradha Singh5, Marta Skreta6, Lauren Erdman2,3,6, Daniel T. Keefe7, Mandy Rickard2, Armando Lorenzo2.
1Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada; 2Division of Urology, The Hospital for Sick Children, Toronto, ON, Canada; 3Temerty Centre for AI Research and Education in Medicine, University of Toronto, Toronto, ON, Canada; 4Division of Urology, Department of Surgery, University of Toronto, Toronto, ON, Canada; 5Division of Diagnostic Imaging, The Hospital for Sick Children, Toronto, ON, Canada; 6Department of Computer Science, University of Toronto, Toronto, ON, Canada; 7Division of Urology, IWK Hospital, Halifax, NS, Canada
Introduction: Vesicoureteral reflux (VUR) grading from voiding cystourethrograms (VCUGs) shows poor inter-rater agreement among clinicians. We sought to use machine learning (ML) to bring more objective measures into VUR grading, with the aim of standardizing and improving the current grading system.
Methods: We retrospectively reviewed our institutional VCUG imaging repository between January 2013 and December 2019. Each VCUG was split into left and right renal units, each containing the whole ureter and kidney, and then assessed for reflux. Each renal unit was then annotated to generate features for supervised ML. The four features abstracted were ureter tortuosity, proximal ureter width, distal ureter width, and maximum ureter width (Figure 1). Because VUR grading is highly variable, each included renal unit was graded by at least five raters to determine a consensus grade. Inter-rater reliability was calculated to assess the validity of grading. A support vector machine (SVM) model was trained for multiclass classification to distinguish individual VUR grades [1].
Results: A total of 6288 renal units (from 3144 VCUGs) were identified in the study period and screened for VUR. Of these, 1935 renal units had documented VUR, and 1248 were included in the ML model. A total of 7986 independent VUR grades were collected, with at least five raters for each renal unit. The included cohort consisted of a 50/50 male/female split, with a median age at imaging of 0.74 years (interquartile range 0.25 to 3.12). The overall Fleiss' kappa for inter-rater agreement was 0.44. The model achieved 66% accuracy (area under the curve = 0.82) on 80/20 holdout validation. The model identified clinically significant differences among patients with posterior urethral valves (PUV) requiring dialysis and was moderately correlated with hydronephrosis resolution after pyeloplasty.
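For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of how Fleiss' kappa can be computed. It is not the authors' code. It assumes every renal unit has the same number of raters (the study had at least five per unit, so handling unequal counts would need an extension), and it assumes grades are coded as the integers 1 to 5. The function name `fleiss_kappa` and the toy ratings are hypothetical and purely illustrative.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a fixed number of raters per subject.

    ratings    -- list of lists; each inner list holds the grades that the
                  raters assigned to one subject (here, one renal unit)
    categories -- all possible grades, e.g. range(1, 6)
    """
    categories = list(categories)
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every subject needs the same number (>= 2) of ratings")

    # Per-subject tallies of how many raters chose each grade.
    counts = [Counter(r) for r in ratings]

    # Observed agreement: mean proportion of agreeing rater pairs per subject.
    p_i = [
        (sum(c[k] ** 2 for k in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_i) / n_subjects

    # Chance agreement from the overall share of ratings in each grade.
    p_j = [sum(c[k] for c in counts) / (n_subjects * n_raters) for k in categories]
    p_e = sum(p * p for p in p_j)

    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_bar - p_e) / (1.0 - p_e)


# Toy data (made up, not from the study): three renal units, five raters each.
toy_ratings = [
    [2, 2, 3, 2, 2],
    [4, 4, 4, 5, 4],
    [1, 2, 1, 1, 2],
]
kappa = fleiss_kappa(toy_ratings, range(1, 6))
```

Values near 1 indicate near-perfect agreement, while values near 0 indicate agreement no better than chance.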
Conclusions: VUR grading by quantitative metrics is feasible in large datasets and can be supported by ML-based methods. VUR features may be correlated with clinical outcomes but further validation and prospective study are warranted.
[1] Khondker A, Kwong JC, Rickard M, Skreta M, Keefe DT, Lorenzo AJ, Erdman L. A machine learning-based approach for quantitative grading of vesicoureteral reflux from voiding cystourethrograms: Methods and proof of concept. Journal of Pediatric Urology. 2021 Oct 19.