ABSTRACT Renal failure, a public health concern, and the global scarcity of nephrologists have necessitated the development of an AI-based system to auto-diagnose kidney diseases. This research addresses three major categories of renal disease: kidney stones, cysts, and tumors. We gathered and annotated a total of 12,446 whole-abdomen and urogram CT images in order to construct an AI-based kidney disease diagnostic system and to contribute to the AI community’s research scope, e.g., the modeling of digital twins of renal function. The collected images were subjected to exploratory data analysis, which revealed that the images from all of the classes had the same type of mean color distribution. Furthermore, six machine learning models were built: three based on state-of-the-art Vision Transformer variants (EANet, CCT, and Swin Transformer), and three based on the well-known deep learning models ResNet50, VGG16, and Inception v3, adjusted in their last layers. While the VGG16 and CCT models performed admirably, the Swin Transformer outperformed all of them with an accuracy of 99.30%. Comparisons of F1 score, precision, and recall show that the Swin Transformer not only outperforms all other models but is also the quickest to train. The study also opened the black box of the VGG16, ResNet50, and Inception models, demonstrating that VGG16 is superior to ResNet50 and Inception v3 at attending to the relevant anatomical abnormalities. We believe that the superior accuracy of our Swin Transformer-based model and of the VGG16-based model can make both useful in diagnosing kidney tumors, cysts,
and stones. INTRODUCTION Kidney disease is a public health concern since
the disease is spreading despite current control attempts1. Chronic kidney disease affects more than 10% of the world population2, and it was ranked 16th among the leading causes of death
in 2016 and is expected to jump to 5th by 20403. Among the other kidney diseases, cyst formation, nephrolithiasis (kidney stone), and renal cell carcinoma (kidney tumor) are the most
frequent kidney illnesses that impede kidney function. A kidney cyst is a fluid-filled pocket that forms on the surface of the kidney and is enclosed by a thin wall. Within the kidneys, one
or more cysts may develop, with a water density ranging from 0 to 20 Hounsfield units4,5,6. Kidney stone disease is characterized by the formation of crystal concretions within the kidneys, which
affects about 12% of the world population7. Renal cell carcinoma (RCC), often known as kidney tumor, is one of the ten most prevalent cancers in the world8. X-ray, computed tomography (CT),
B-mode ultrasound machines (US), and magnetic resonance imaging (MRI) machines are often used in conjunction with pathology tests to diagnose kidney diseases. The CT machine scans the desired
part of the human anatomy with X-ray beams to obtain a cross-sectional image which provides three-dimensional information about the desired anatomy9. CT scans in kidney examinations are
ideal for study because they provide three-dimensional information and slice-by-slice images. If kidney abnormalities such as cysts, stones, and tumors are not detected and treated early,
they might lead to renal failure10. For this reason, early diagnosis of renal disorders like kidney cysts, stones, and tumors appears to be an important step in preventing kidney failure11.
On the other hand, the number of nephrologists and radiologists is very limited. In South Asia, there is barely one nephrologist per million people, whereas in Europe there are 25.3
nephrologists per million people12. Considering the sufferings of the population due to kidney diseases, the shortage of nephrologists and radiologists around the globe, and the advancement
of deep learning research in vision tasks, it has become imperative to build an AI (artificial intelligence) model to detect kidney radiological findings easily to assist doctors, and reduce
the suffering of people. A few studies have been published in recent years in this domain. However, publicly available datasets are scarce. In addition, most past studies have utilized traditional machine learning algorithms to classify a single disease class only: cysts, tumors, or stones. Some studies utilized ultrasound (US) images. In this
work, we created and annotated the “CT KIDNEY DATASET: Normal-Cyst-Tumor and Stone” dataset13, implemented a total of six models, and evaluated each of them to determine which model is best suited for real-time use. The proposed auto-detection model for the diagnosis of kidney diseases will also help to build a digital twin of renal function at the pathology
level, such as tumor growth. To our knowledge, no study has performed a transformer-model-based analysis for the auto-detection of renal cysts, tumors, and stones. The following are the major
contributions of this work: * A dataset, namely the “CT KIDNEY DATASET: Normal-Cyst-Tumor and Stone”, is collected and annotated with 12,446 images utilizing the whole abdomen and urogram protocols. * Three CNN-based deep learning models (i.e., VGG16, ResNet50, and Inception v3) using a transfer learning approach are applied to detect kidney abnormalities, and a thorough performance study is presented, including an explanation of the black box of the suggested models using gradient-weighted class activation mapping (Grad-CAM). * Three recent state-of-the-art
Vision transformer variants (i.e., EANet, CCT, and Swin transformers) are applied on the CT kidney dataset and the performances of the models are presented using the confusion matrix,
accuracy, sensitivity, specificity, and F1 score. The rest of the paper is organized in the following manner. Section II provides background and details on utilizing deep learning to
identify kidney abnormalities. The methodology for this work is discussed in Section III, which includes the data collection processes, data preprocessing, the neural network models employed in this study, and the result evaluation processes. Section IV deals with the analysis of the results, and the concluding remarks are presented in Section V. BACKGROUND STUDY Because of the advent of deep
learning and its implementation in image processing and classification, a considerable amount of research has grown in deep learning applications, specifically in autodiagnosis of
radiological findings and segmentation tasks. In classification tasks that employ a transfer learning technique, the ResNet14, Inception15, Xception16, and EfficientNet17 networks have grown in
prominence over time. Transfer learning is an approach in deep learning where pre-trained models are used as the starting point for specified tasks. It refers to the application of a
previously learnt model to a new challenge. Recently, the transformer models popular in natural language processing have been introduced into computer vision tasks, where they show supremacy and strong results over other models in classification tasks. The Vision Transformer (ViT)18 and several of its variants, such as the Big Transfer (BiT)19, the External Attention Transformer (EANet)20, the Compact Convolutional Transformer (CCT)21, and the Swin Transformer (Shifted Window Transformer)22, utilize attention-based mechanisms whose basic unit of analysis is the image patch. Numerous deep learning methods are employed in research on kidney disease classification. In23, renal ultrasound images are enhanced with a median filter, a Gaussian filter, and morphological operations; features are then extracted from the images with principal component analysis (PCA) and classified with the
K-nearest neighbor (KNN) classifier. The authors in24 evaluated different traditional ML algorithms, such as Decision Trees (DT), Random Forest (RF), Support Vector Machines (SVM),
Multilayer Perceptron (MLP), K-Nearest Neighbor (KNN), Naive Bayes, and a deep neural network using a Convolutional Neural Network (CNN), and obtained the highest F1 score of 0.853. In25, pre-trained
DNN models such as ResNet-101, ShuffleNet, and MobileNet-v2 are used to extract features from kidney ultrasound images, which are then classified using an SVM, with final predictions made using the majority voting technique; using ultrasound images for the classification problem, the authors achieved their highest accuracy of 95.58%. The residual dual-attention module (RDA
module) was employed for the segmentation of renal cysts in CT images in26. In27, the authors integrated features from conventional and deep transfer learning techniques; finally, the features are used by an SVM classifier to separate normal from abnormal US images. In28, two CNN models are used in sequence, where the first CNN identifies the urinary tract and the second CNN detects the presence of stones, achieving 95% accuracy. An automated detection of kidney stones (i.e., having/not having a stone) was proposed in29 using coronal computed tomography (CT) images and a deep learning technique, yielding a detection accuracy of 96.82%; the authors used a total of 1,799 images to train and validate the model.
The authors in30 proposed two morphology convolution layers and modified feature pyramid networks (FPNs) in the Faster R-CNN, combining four thresholds; they achieved an area under the curve (AUC) value of 0.871. A kidney cyst detection system for abdominal CT scan images using a fully connected CNN was developed in31, and the authors obtained a true-positive rate of 84.3%. In
summary, the efforts utilizing machine learning32 and deep learning33 approaches to classify a few kidney radiological findings have provided promising results, but the majority of the
tasks, we found, are performed on X-ray or ultrasound images. A few approaches used CT scan images, but only for two-class classification. Considering the scarcity of data and the above
findings of the research articles, we created a database of kidney stone, cyst, and tumor CT images. We implemented three deep learning techniques (VGG16, Inception v3, and ResNet50) to classify four classes of kidney findings and demystified the black box of the models to show why a model came to a certain conclusion about a class. We also implemented the latest state-of-the-art
innovations in vision learning (EANet, CCT, and Swin transformer algorithms) to classify the four classes and have shown that our model has promising accuracy which can reduce the suffering
of the world population through early diagnosis of diseases. METHODOLOGY We first collected and annotated the datasets to create a database for Kidney Stone, Tumor, Normal, and Cyst
findings. Data augmentation, image scaling and normalization, and data splitting are among the preprocessing techniques utilized. After that, we employed six models to investigate our data,
including three Vision Transformer variants (EANet, CCT, and Swin Transformer) and three CNNs (Inception v3, VGG16, and ResNet50). The models’ performance was evaluated using previously unseen data. A block diagram with the details of our experiment can be found in Fig. 1. The methodology is presented in this section in the following order: dataset description, image preprocessing,
neural network models, and evaluation strategies of the experiments. DATASET DESCRIPTION The dataset was collected from PACS (Picture archiving and communication system) and workstations
from a hospital in Dhaka, Bangladesh where patients were already diagnosed with having a kidney tumor, cyst, normal or stone findings. All subjects in the dataset volunteered to take part in
the research experiments, and informed consents were obtained from them prior to data collection. The experiments and data collection were pre-approved by the relevant hospital authorities
of Dhaka Central International Medical College and Hospital (DCIMCH). Besides, the data collection and experiments were carried out in accordance with the applicable rules and regulations.
Both coronal and axial cuts were selected from contrast and non-contrast studies acquired with the whole abdomen and urogram protocols. The DICOM studies were then carefully selected, one diagnosis at a time, and from those we created a batch of DICOM images of the region of interest for each radiological finding. Following that, we excluded each patient’s information and metadata from the DICOM images and converted them to a lossless Joint Photographic Experts Group (JPEG/JPG) image format. The Philips IntelliSpace Portal 9.034 application is
used for data annotation, which is an advanced image visualization tool for radiology images, and the Sante DICOM Editor tool35 is used for the conversion of the data to JPG images; it is primarily used as a DICOM viewer with advanced features to assist radiologists in diagnosing specific disease findings. After the manual conversion and annotation of the data, each image finding was
again verified by a doctor and a medical technologist to reconfirm the correctness of the data.
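A minimal sketch of this conversion step is shown below; it uses the open-source pydicom and Pillow libraries with illustrative paths and tags, whereas our actual pipeline used the Philips IntelliSpace Portal and Sante DICOM Editor tools described above.

```python
# Sketch: strip identifying metadata from a DICOM file and export the pixel
# data as a maximum-quality JPEG. Paths and the tag subset are illustrative.
import numpy as np
import pydicom
from PIL import Image

def dicom_to_jpg(dicom_path: str, jpg_path: str) -> None:
    ds = pydicom.dcmread(dicom_path)

    # Exclude the patient's information and metadata before export.
    for tag in ("PatientName", "PatientID", "PatientBirthDate"):
        if tag in ds:
            delattr(ds, tag)

    # Rescale the raw pixel values to an 8-bit range for JPEG export.
    pixels = ds.pixel_array.astype(np.float32)
    pixels = (pixels - pixels.min()) / max(pixels.max() - pixels.min(), 1e-6)
    Image.fromarray((pixels * 255).astype(np.uint8)).save(
        jpg_path, format="JPEG", quality=100)

dicom_to_jpg("study/slice_042.dcm", "dataset/Cyst/slice_042.jpg")
```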
Our dataset contains 12,446 unique images: 3,709 cyst, 5,077 normal, 1,377 stone, and 2,283 tumor images. The dataset was uploaded to Kaggle and made publicly available so that other researchers can reproduce the results and analyze them further. Figure
2 depicts a sample selection of our datasets. The red marks represent the finding area or region of interest that a radiologist uses to reach a conclusion for specific diagnosis classes.
Figures 3 and 4 show the image color mean value distribution for the whole dataset and for each of the four classes, respectively. From these distributions, it can be concluded that the whole dataset’s distribution is very similar to the distributions of the individual normal, stone, cyst, and tumor images. The plot of the mean and standard deviation of the image samples shows that most of the images are centered, whereas stones and cysts have lower means and standard deviations, as visualized in Fig. 5. Since the data distributions of the different renal
disease classes partially overlap, cysts, tumors, and stones cannot be classified by analyzing these statistical features alone.
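A minimal sketch of this exploratory analysis is given below; the directory layout is an assumption about how the published Kaggle dataset is organized.

```python
# Sketch: per-image mean and standard deviation of pixel intensity, grouped
# by class, as used for the distributions in Figs. 3, 4 and 5.
import glob
import numpy as np
from PIL import Image

for label in ("Cyst", "Normal", "Stone", "Tumor"):
    means, stds = [], []
    for path in glob.glob(f"CT-KIDNEY-DATASET/{label}/*.jpg"):
        pixels = np.asarray(Image.open(path).convert("L"), dtype=np.float32)
        means.append(pixels.mean())
        stds.append(pixels.std())
    print(f"{label}: mean of image means = {np.mean(means):.1f}, "
          f"mean of image stds = {np.mean(stds):.1f}")
```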
IMAGE PROCESSING After converting the DICOM images into JPG images, we scaled the images as per the standard size requirements of the neural network models. For all the transformer variant algorithms, we resized each image to 168 by
168 pixels. Images for Inception v3 were resized to 299 by 299 pixels, while images for VGG16 and ResNet were reduced to 224 by 224 pixels. We then randomized all the images and took 1,300
examples of each diagnosis for the models’ consideration to avoid data imbalance problems, as we have 1,377 images available for the kidney stone category. The rotation operation for image
augmentation was performed by rotating the images clockwise at an angle of 15 degrees. We evaluated all the models using a scheme where 80% of the images were taken to train the model and
20% to test it. Within the 80% of training images, we took 20% to validate the model and avoid overfitting. The dataset is normalized using Z-normalization36 as in (1): $$\begin{aligned} {\hat{X}} = \frac{X[:,i]-\mu _i}{\sigma _i} \end{aligned}$$ (1) Here, \(\mu _i\) is the mean and \(\sigma _i\) is the standard deviation of feature \(i\).
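A minimal sketch of this preprocessing pipeline is given below; the variable names paths and labels (the image files and their class indices) and the helper function are illustrative, not our exact implementation.

```python
# Sketch: resize, 80/20 train/test split with a further 20% validation split,
# 15-degree clockwise rotation augmentation of the training set, and
# z-normalization as in Eq. (1).
import numpy as np
from PIL import Image
from sklearn.model_selection import train_test_split

def load(paths, size=(224, 224)):  # 224x224 for VGG16/ResNet, 168x168 for ViTs
    return np.stack([np.asarray(Image.open(p).convert("RGB").resize(size),
                                dtype=np.float32) for p in paths])

train_p, test_p, y_train, y_test = train_test_split(paths, labels, test_size=0.2)
train_p, val_p, y_train, y_val = train_test_split(train_p, y_train, test_size=0.2)
X_train, X_val, X_test = (load(s) for s in (train_p, val_p, test_p))

# Augmentation: add a copy of each training image rotated clockwise by 15 degrees.
rotated = np.stack([np.asarray(Image.fromarray(x.astype(np.uint8)).rotate(-15),
                               dtype=np.float32) for x in X_train])
X_train = np.concatenate([X_train, rotated])
y_train = np.concatenate([y_train, y_train])

# Z-normalization (Eq. 1): per-feature statistics from the training set only.
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0) + 1e-6
X_train, X_val, X_test = ((s - mu) / sigma for s in (X_train, X_val, X_test))
```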
TRANSFER LEARNING BASED NEURAL NETWORK MODELS From the dataset, i.e., the CT KIDNEY DATASET: Normal-Cyst-Tumor and Stone, we randomly chose 1,300 images of each class and trained our six models. All
the neural network models were trained on the Google Colab Pro edition with 26.3 GB of general RAM and 16,160 MB of GPU RAM using CUDA version 11.2. All the models were trained with a batch size of
16 and up to 100 epochs. VGG16 In our experiment, the 16-layer VGG1637 model was tweaked in the last few layers by using the first 13 layers of the original VGG16 model, and we added
average pooling, flattening, and a dense layer with a ReLU activation function. A dropout layer and, finally, another dense layer are added to classify the normal kidney as well as cysts, tumors, and
stones. The total number of parameters in our modified VGG16 is 14,747,780, out of which 4,752,708 are the trainable parameters and 9,995,072 are the non-trainable parameters. Table 1 shows
the number of parameters of the different models used in our study.
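A minimal Keras sketch of this modification is shown below; the dense width and dropout rate are illustrative choices rather than the exact values of our trained model.

```python
# Sketch: VGG16's 13 convolutional layers as a frozen feature extractor with
# a new pooling/dense head for the four findings.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # keep the pre-trained convolutional blocks fixed

model = models.Sequential([
    base,
    layers.AveragePooling2D(),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(4, activation="softmax"),  # cyst / normal / stone / tumor
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# Training used a batch size of 16 for up to 100 epochs:
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           batch_size=16, epochs=100)
```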
RESNET50 To avoid the vanishing gradient problem and the performance degradation of deep neural networks, skip connections are used in the original ResNet model. We utilized the 50-layer ResNet5014 model and modified it in the final few layers, in the same way as VGG16 and Inception v3, to achieve the classification task. The total number of parameters in our modified ResNet50 model is 23,719,108; the trainable and non-trainable parameters are 135,492 and 23,583,616, respectively. INCEPTION V3
Inception v3, a variant of the Inception family of neural networks based on factorized convolutions, is used in our study to classify images. Similar to VGG16, we modified the
original Inception v315 model in the last few layers, by keeping all the layers except the last three. We added average pooling, flattening, a dense layer, a dropout, and finally a dense
layer to perform the classification task. The total number of parameters in Inception v3 is 22,327,396, with 524,612 trainable parameters. The total number of non-trainable parameters is
21,802,784. TRANSFORMER BASED MODELS EXTERNAL ATTENTION TRANSFORMER (EANET) Though transformer-based models first became popular in natural language processing, the more recent vision transformer, which applies the transformer architecture’s self-attention to sequences of image patches18, is gaining popularity over time. The sequence of image patches is the input to a stack of transformer blocks, which use the multi-head attention layer as a self-attention mechanism. The transformer blocks produce a tensor of shape (batch_size, num_patches, projection_dim), which may subsequently be passed to the classifier head with softmax to generate class probabilities. One variant of the Vision Transformer, EANet, is shown in Fig. 6.
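A minimal sketch of this patch extraction and encoding step is given below; the patch size and projection dimension are illustrative.

```python
# Sketch: cut the image into patches, flatten and linearly project them, and
# add a learnable position embedding, producing the
# (batch_size, num_patches, projection_dim) tensor for the transformer blocks.
import tensorflow as tf
from tensorflow.keras import layers

class PatchEncoder(layers.Layer):
    def __init__(self, image_size=168, patch_size=8, projection_dim=64):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = (image_size // patch_size) ** 2
        self.projection = layers.Dense(projection_dim)
        self.position_embedding = layers.Embedding(
            input_dim=self.num_patches, output_dim=projection_dim)

    def call(self, images):
        batch = tf.shape(images)[0]
        patches = tf.image.extract_patches(
            images=images,
            sizes=[1, self.patch_size, self.patch_size, 1],
            strides=[1, self.patch_size, self.patch_size, 1],
            rates=[1, 1, 1, 1], padding="VALID")
        patches = tf.reshape(patches, [batch, self.num_patches, -1])
        positions = tf.range(self.num_patches)
        return self.projection(patches) + self.position_embedding(positions)
```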
EANet20 utilizes external attention, based on two external, small, learnable, and shared memories, \(M_k\) and \(M_v\). The purpose of EANet is to drop patches that contain redundant and
useless information and hence improve performance and computational efficiency. External attention is implemented using two cascaded linear layers and two normalization layers. EANet
computes attention between the input features and the external memory units via the following formulas (2) and (3): $$\begin{aligned} \mathrm {A} =Norm\left( \mathrm {F}\mathrm {M}_\mathrm {k}^\mathrm {T} \right) \end{aligned}$$ (2) Finally, the input features are updated from \(M_v\) using the similarities in the attention map A: $$\begin{aligned} \mathrm {F}_\mathrm {out} =\mathrm {A}\mathrm {M}_\mathrm {v} \end{aligned}$$ (3)
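A minimal sketch of the external attention operation, following Eqs. (2) and (3), is given below; the feature and memory dimensions are illustrative.

```python
# Sketch: external attention with two small learnable memories M_k and M_v,
# implemented as cascaded linear layers with double normalization.
import tensorflow as tf
from tensorflow.keras import layers

class ExternalAttention(layers.Layer):
    def __init__(self, dim=64, memory_size=32):
        super().__init__()
        self.m_k = layers.Dense(memory_size, use_bias=False)  # acts as M_k^T
        self.m_v = layers.Dense(dim, use_bias=False)          # acts as M_v

    def call(self, features):  # features F: (batch, num_patches, dim)
        attention = self.m_k(features)                # A = Norm(F M_k^T), Eq. (2)
        attention = tf.nn.softmax(attention, axis=1)  # softmax over patches...
        attention = attention / (                     # ...then l1 over memory
            tf.reduce_sum(attention, axis=2, keepdims=True) + 1e-9)
        return self.m_v(attention)                    # F_out = A M_v, Eq. (3)
```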
We utilized the TensorFlow Addons packages to implement EANet. After performing data augmentation with random rotation at a scale of 0.1, random contrast with a factor of 0.1, and random zoom with a height and width factor of 0.2, we implemented the patch extraction and encoding layer. Following that, we implemented the external attention block and the transformer block. The output of the transformer block is then provided to the classifier head to produce the probabilities of normal, stone, cyst, and tumor kidney
findings. COMPACT CONVOLUTIONAL TRANSFORMER (CCT) Convolution and transformers are combined in CCT to maximize the benefits of both in vision. Instead of the non-overlapping patches used by the standard vision transformer, CCT21 uses a convolutional tokenization technique in which local information is well exploited. Figure 7 illustrates the CCT
procedure. CCT is run using TensorFlow Addons, where the data is first augmented using random rotation at a scale of 0.1, random contrast with a factor of 0.1, and random zoom with a height and width factor of 0.2. To avoid vanishing gradient problems in CCT, a stochastic depth38 regularization technique is used, which is very similar to dropout except that, in stochastic depth, a set of layers is randomly dropped, as sketched below. In CCT, after the convolutional tokenization, the data is fed to a transformer encoder and then to sequence pooling. Following the sequence pooling, an MLP head gives the probabilities of the different kidney diagnosis classes. Our proposed CCT model has 407,365 parameters, all of which are trainable.
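A minimal sketch of stochastic depth is given below; the drop probability is illustrative.

```python
# Sketch: stochastic depth regularization. During training, the residual
# branch is randomly zeroed per sample and rescaled by the keep probability.
import tensorflow as tf
from tensorflow.keras import layers

class StochasticDepth(layers.Layer):
    def __init__(self, drop_prob=0.1):
        super().__init__()
        self.drop_prob = drop_prob

    def call(self, x, training=False):  # x: output of a residual branch
        if not training or self.drop_prob == 0.0:
            return x
        keep_prob = 1.0 - self.drop_prob
        shape = [tf.shape(x)[0]] + [1] * (len(x.shape) - 1)  # one draw per sample
        mask = tf.floor(keep_prob + tf.random.uniform(shape))
        return x / keep_prob * mask

# Inside a transformer block: x = x + StochasticDepth(0.1)(branch(x))
```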
SHIFTED WINDOW TRANSFORMERS (SWIN TRANSFORMERS) Another variant of the Vision Transformer is the Swin Transformer22, a further powerful tool in computer vision. A detailed block diagram of the Swin transformer is shown in Fig. 8, in which we can see four unique building blocks. First, the input image is split into patches by the patch partition layer. The patches are then passed to the linear embedding layer and the Swin transformer block. The main architecture is divided into four stages, each of which applies a linear embedding layer and a Swin transformer block multiple times. The Swin transformer is built on a modified, window-based self-attention and a block that includes multi-head self-attention (MSA), layer normalization (LN), and a two-layer multi-layer perceptron (MLP). In this paper, we utilized the Swin transformer to tackle the classification problem and diagnose kidney cysts, tumors, stones, and normal findings.
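A minimal sketch of the window mechanics behind these blocks is given below; the window size, shift, and feature-map shape are illustrative.

```python
# Sketch: split a feature map into non-overlapping windows for local
# self-attention, and cyclically shift the map so that alternate blocks'
# windows straddle the previous blocks' window boundaries.
import tensorflow as tf

def window_partition(x, window_size):
    # x: (batch, height, width, channels) -> (num_windows*batch, ws*ws, channels)
    b, h, w, c = x.shape
    x = tf.reshape(x, [b, h // window_size, window_size,
                       w // window_size, window_size, c])
    x = tf.transpose(x, [0, 1, 3, 2, 4, 5])
    return tf.reshape(x, [-1, window_size * window_size, c])

def cyclic_shift(x, shift):
    # Roll the feature map for shifted-window (SW-MSA) attention.
    return tf.roll(x, shift=[-shift, -shift], axis=[1, 2])

x = tf.random.normal([1, 56, 56, 96])               # stage-1 feature map
windows = window_partition(cyclic_shift(x, 3), 7)   # windows of 7x7 tokens
# Each window is then processed by multi-head self-attention, followed by
# layer normalization and a 2-layer MLP, with residual connections.
```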
PERFORMANCE EVALUATION METHODS The quantitative evaluation of all six models is calculated based on accuracy, sensitivity (recall), and precision (PPV). True positive (_TP_), false positive (_FP_), true negative (_TN_), and false negative (_FN_) samples are used to calculate the accuracy (4), precision (5), and sensitivity (6). Recall, also known as
sensitivity, is the model’s ability to identify all relevant cases within a data set. The number of true positives is divided by the number of true positives plus the number of false
negatives. It refers to the study’s capability to appropriately identify sick patients with the disease. Diseases are frequently defined as a positive category in medical diagnosis. Omitting
this (positive category) has serious consequences, such as misdiagnosis, which can lead to patient treatment delays. As a result, high sensitivity or recall is critical in medical image
diagnosis. Precision (PPV) is necessary when, out of all the examples predicted as positive, we want to know how many are really positive. For precision, the number of true positives is divided by the number of true positives plus the number of false positives. High precision is desired in the medical imaging domain. The F1 score (7) of all the models is
calculated using those models’ sensitivity and precision. The following formulas are applied for accuracy, precision, sensitivity, and F1 score: $$\begin{aligned} \mathrm {Accuracy}_\mathrm {i} =\frac{\mathrm {TP}_\mathrm {i} +\mathrm {TN}_\mathrm {i}}{\mathrm {TP}_\mathrm {i} +\mathrm {TN}_\mathrm {i}+\mathrm {FP}_\mathrm {i} +\mathrm {FN}_\mathrm {i}} \times 100 \% \end{aligned}$$ (4) $$\begin{aligned} \mathrm {Precision}_\mathrm {i} =\frac{\mathrm {TP}_\mathrm {i}}{\mathrm {TP}_\mathrm {i} +\mathrm {FP}_\mathrm {i}} \end{aligned}$$ (5) $$\begin{aligned} \mathrm {Sensitivity}_\mathrm {i} =\frac{\mathrm {TP}_\mathrm {i}}{\mathrm {TP}_\mathrm {i} +\mathrm {FN}_\mathrm {i}} \end{aligned}$$ (6) $$\begin{aligned} \mathrm {F1\_score}_\mathrm {i} =2\times \frac{\mathrm {Precision}_\mathrm {i} \times \mathrm {Sensitivity}_\mathrm {i}}{\mathrm {Precision}_\mathrm {i} +\mathrm {Sensitivity}_\mathrm {i}} \end{aligned}$$ (7) Where: * i = Kidney Tumor, Cyst, Normal, or Stone class for the classification task * TP = True Positive * FP = False Positive * FN = False Negative * TN = True Negative
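A minimal sketch of this per-class computation is given below; y_true and y_pred are illustrative stand-ins for the ground-truth and predicted class indices on the unseen test data.

```python
# Sketch: one-vs-rest TP/FP/TN/FN counts per class and the metrics of
# Eqs. (4)-(7).
import numpy as np

def per_class_metrics(y_true, y_pred, class_index):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == class_index) & (y_true == class_index))
    fp = np.sum((y_pred == class_index) & (y_true != class_index))
    fn = np.sum((y_pred != class_index) & (y_true == class_index))
    tn = np.sum((y_pred != class_index) & (y_true != class_index))
    accuracy = (tp + tn) / (tp + tn + fp + fn) * 100              # Eq. (4)
    precision = tp / (tp + fp)                                    # Eq. (5)
    sensitivity = tp / (tp + fn)                                  # Eq. (6)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)  # Eq. (7)
    return accuracy, precision, sensitivity, f1

y_true = np.array([0, 1, 2, 3, 0, 1, 2, 3])  # illustrative labels
y_pred = np.array([0, 1, 2, 3, 0, 1, 3, 3])  # illustrative predictions
for i, name in enumerate(["Cyst", "Normal", "Stone", "Tumor"]):
    print(name, per_class_metrics(y_true, y_pred, i))
```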
Furthermore, we plotted a receiver operating characteristic (ROC) curve with the horizontal axis being the false positive rate (FPR) and the vertical axis being the true positive rate (TPR). The AUC, or area
under the ROC curve, measures the classifier’s ability to separate the classes. The higher the AUC, the better the classification capability of the model. The area under the curve is calculated for each developed model, and finally, all the models are compared to decide which model is superior. This paper used the gradient-weighted
Class Activation Mapping (GradCAM)39 algorithm to make models more transparent by visualizing the input areas crucial for model predictions in the last convolution layers of CNN networks.
Figure 9 describes the complete process of the Grad-CAM analysis in our paper. First, we passed a picture through the model to get a prediction, and then we determined the image’s predicted class based on the prediction value. After that, we computed the gradients of the class score \(y^c\) with respect to the feature map activations \(\hbox {A}^k\) of the last convolutional layer (8): $$\begin{aligned} \frac{\partial y^c}{\partial A^k_{ij}} \end{aligned}$$ (8) These gradients flowing back are global-average-pooled across the width and height dimensions (indexed by i and j, respectively) to calculate the neuron significance weights (9): $$\begin{aligned} \mathrm {w}_\mathrm {k}^\mathrm {c} =\frac{1}{Z}\sum _{j} \sum _{i}\frac{\partial y^c}{\partial A^k_{ij}} \end{aligned}$$ (9) Then the neuron significance weights and feature map activations are combined in a weighted sum, and the ReLU activation is applied to the result to get the Grad-CAM map (10): $$\begin{aligned} \mathrm {L}_\mathrm {Grad\text {-}CAM}^\mathrm {c} =ReLU\left( \sum _k\mathrm {w}_\mathrm {k}^\mathrm {c} \mathrm {A}^\mathrm {k}\right) \end{aligned}$$ (10) Where: * \(\hbox {A}^k\) = feature map activation * \(\mathrm {w}_\mathrm {k}^\mathrm {c}\) = neuron significance weights
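A minimal sketch of this computation for a Keras CNN is given below; "block5_conv3" is the last convolutional layer of a stock VGG16, and for our fine-tuned models the corresponding layer is looked up inside the pre-trained base.

```python
# Sketch: Grad-CAM (Eqs. 8-10) for a functional Keras model.
import numpy as np
import tensorflow as tf

def grad_cam(model, image, last_conv="block5_conv3"):
    grad_model = tf.keras.Model(
        model.inputs, [model.get_layer(last_conv).output, model.output])
    with tf.GradientTape() as tape:
        feature_maps, predictions = grad_model(image[np.newaxis, ...])
        class_score = predictions[:, int(tf.argmax(predictions[0]))]  # y^c
    # Eq. (8): gradients of the class score w.r.t. feature map activations A^k.
    grads = tape.gradient(class_score, feature_maps)
    # Eq. (9): global average pooling over width and height (indices i, j).
    weights = tf.reduce_mean(grads, axis=(1, 2))
    # Eq. (10): ReLU of the weighted sum of the feature maps.
    cam = tf.nn.relu(tf.einsum("bk,bijk->bij", weights, feature_maps))
    return (cam[0] / (tf.reduce_max(cam) + 1e-9)).numpy()

model = tf.keras.applications.VGG16(weights="imagenet")  # stock VGG16 example
heatmap = grad_cam(model, np.random.rand(224, 224, 3).astype(np.float32))
```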
We created a visualization by superimposing the heatmap on the original image. This visualization helps us to determine why our model came to the conclusion that an image may belong to a certain class, like kidney tumor, cyst, normal, or stone. RESULT ANALYSIS The results of the six implemented models under different tests are evaluated by calculating the recall, F1 score (F1), accuracy (Acc), positive predictive value (PPV), and area under the ROC curve (AUC) on unseen data. We used tenfold cross-validation, and the results were averaged to produce the ROC curves, confusion matrices, and evaluation metrics.
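A minimal sketch of this evaluation loop is given below; X, y, and build_model are illustrative stand-ins for the prepared image array, the labels, and a factory that returns a freshly compiled instance of one of the six networks.

```python
# Sketch: tenfold cross-validation with per-fold training and evaluation.
import numpy as np
from sklearn.model_selection import KFold

scores = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True).split(X):
    model = build_model()  # a fresh, compiled model for each fold
    model.fit(X[train_idx], y[train_idx], batch_size=16, epochs=100, verbose=0)
    _, acc = model.evaluate(X[test_idx], y[test_idx], verbose=0)
    scores.append(acc)
print(f"mean accuracy over 10 folds: {np.mean(scores):.4f}")
```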
Table 2 and Figs. 10 and 12 summarize the performance of the six networks studied in this paper. Figure 14 presents the Grad-CAM analysis of the Inception v3, ResNet50, and VGG16 models. Figures 11 and 13 provide the ROC curves for the transfer- and transformer-based models, respectively. Figures 10 and 12 show the normalized confusion matrices for the transfer- and transformer-based models, respectively. From Table 2, we can see that
the Inception v3 model performed worst on our dataset, giving an accuracy of 61.60%. EANet and ResNet50 performed moderately, giving accuracies of 77.02% and 73.80%. CCT, VGG16, and the Swin Transformer provided accuracies of 96.54%, 98.20%, and 99.30%, respectively. The Swin transformer thus outperforms all the other models in terms
of accuracy. The Swin Transformer provides strong recall while detecting cyst, normal, stone, and tumor class images, with recalls of 0.996, 0.981, 0.989, and 1, respectively. Higher recall means a lower chance of misdiagnosing the cyst, normal, stone, and tumor class images. From the table we can see that the Swin transformer provides a
recall of 1 for the kidney tumor class and is thus good at detecting kidney tumors, whereas CCT is good at detecting stone class images, providing a recall of 1 for the stone class images. However, for the other class images, the recall of the CCT model is slightly lower than that of the Swin transformer model, at 0.923, 0.975, and 0.964 for the cyst, normal,
and tumor class images, respectively. From the transfer learning based approaches, VGG16 provides a recall of 0.968, 0.973, 0.988, and 0.996 respectively for Kidney Cyst, Normal, Stone, and
Tumor class images. But Inception v3 and ResNet provide lower recall for all the classes; for instance, the recall for the Kidney Tumor class is 0.295 for the Inception v3 model, and the recall for the Kidney Stone class is 0.462. This means that in our study these models are the least effective at detecting kidney tumors and kidney stones. Since recall is a priority metric to consider in medical image diagnosis, models built on ResNet and Inception v3 cannot be used for diagnosis in our case. Among the transformer-based models, we can see in Table 2 that the precision is
highest for the Swin transformer model, at 0.996, 0.996, 0.981, and 0.993 for the Kidney Cyst, Normal, Stone, and Tumor class images, respectively. Among the transfer-based approaches, we can see that VGG16 provides better precision than Inception v3 and ResNet50. For the cyst, normal, stone, and tumor classes, the highest F1 scores are also provided by the Swin transformer, at 0.996, 0.998, 0.985, and 0.996, respectively. The Swin transformer likewise provides the highest precision for the Stone and Tumor classes, with readings of 0.981 and 0.993. For the cyst class, the Swin transformer and VGG16 provide the same value of 0.996, whereas for the normal class, the Swin transformer performs better, with a reading of 0.996.
Considering the above, the Swin transformer is superior, outperforms all the models, and can be of great use in kidney medical imaging diagnosis. From Figs. 10, 11, 12 and 13, we can see that the area under the ROC curve is higher for CCT, VGG16, and the Swin Transformer than for ResNet50, EANet, and Inception v3. The AUC is close to 1 when diagnosing the Kidney Cyst,
Normal, Stone, and Tumor categories for Swin Transformers, CCT, and VGG16 models. Considering precision, recall, and F1 Score, we can conclude that though VGG16 and CCT are performing well,
the Swin transformer outperformed all the models. Though CCT and VGG16 can be used while diagnosing kidney stones, cysts, and tumors, the Swin Transformer can be considered the most effective
option. After randomly feeding four images of different classes from the CT machine into the Grad-CAM algorithm, we analyzed the Grad-CAM of the last convolutional layer of each transfer-based model. In Fig. 14, the first row shows images that contain cysts. We can see from Fig. 14a, e and i that VGG16 attends to a very small region (high-level features) to make a decision about cyst class images, whereas ResNet50 and Inception v3 look at more dispersed regions, hence lower-level features, to classify. For the stone class images in Fig. 14c, g and k, we can observe that VGG16 attends to the region of interest perfectly; the other models attend to dispersed regions, whereas VGG16 uses a very small region to make a decision. A similar observation applies to the tumor and normal classes as well. In our case, VGG16 predicts all the images as the correct class and attends to the region of interest perfectly, whereas ResNet predicts abnormal findings such as tumors and stones as normal in this case and also does not attend where the model should to make a decision. Inception v3 likewise does not attend to the region of interest perfectly and uses more low-level features, and in this case it predicted the tumor class as the normal class. CONCLUSION For this work, we collected and
annotated a total of 12,446 whole abdomen and urogram CT scan images containing cysts, tumors, normal, and stone findings. Exploratory data analysis of the images was performed and showed
that the images from all the classes had the same type of mean colour distribution. Furthermore, this study developed six models, of which three are based on recent state-of-the-art variants of the Vision Transformer (EANet, CCT, and Swin Transformer), and the other three are based on the popularly known deep learning models ResNet, VGG16, and Inception v3, which were tweaked in the last few layers. A comparison of all the models revealed that, while VGG16 and CCT performed well, the Swin transformer outperformed all the models in terms of accuracy, providing an accuracy of 99.30%. The F1 score, precision, and recall comparisons provide evidence that the Swin transformer outperforms all the models. Besides, compared to all the other models, the Swin transformer took less time to train for the same number of epochs. The study has also tried to reveal the black box of the VGG16, ResNet50, and Inception models and found that the VGG16 model is better than ResNet50 and Inception v3 at highlighting the desired abnormalities in the anatomy. We believe the superior accuracy of our model
based on the Swin transformer and the VGG16-based model can both be of great use in detecting kidney tumors, cysts, and stones, and can reduce the pain and suffering of patients. REFERENCES
* Jacobson, S. Chronic kidney disease-a public health problem?. _Lakartidningen_ 110(21), 1018–1020 (2013). Google Scholar * Jha, V. _et al._ Chronic kidney disease: global dimension and
perspectives. _The Lancet_ 382(9888), 260–272 (2013). Article Google Scholar * Foreman, K. J. _et al._ Forecasting life expectancy, years of life lost, and all-cause and cause-specific
mortality for 250 causes of death: reference and alternative scenarios for 2016–40 for 195 countries and territories. _The Lancet_ 392(10159), 2052–2090 (2018). Article Google Scholar *
Rediger, C. _et al._ Renal cyst evolution in childhood: a contemporary observational study. _J. Pediatric Urol._ 15(2), 188-188e1 (2019). Article Google Scholar * Brownstein, A. J. _et
al._ Simple renal cysts and bovine aortic arch: Markers for aortic disease. _Open Heart_ 6(1), e000862 (2019). Article Google Scholar * Sanna, E. _et al._ Fetal abdominal cysts: Antenatal
course and postnatal outcomes. _J. Perinatal Med._ 47(4), 418–421 (2019). Article Google Scholar * Alelign, T. & Petros, B. Kidney stone disease: an update on current concepts. _Adv.
Urol._ 2018 (2018). * Hsieh, J. J. _et al._ Renal cell carcinoma. _Nat. Rev. Dis. Primers_ 3(1), 1–19 (2017). Article Google Scholar * Saw, K. C. _et al._ Helical CT of urinary calculi:
Effect of stone composition, stone size, and scan collimation. _Am. J. Roentgenol._ 175(2), 329–332 (2000). Article CAS Google Scholar * Gunasekara, T. _et al._ Urinary biomarkers
indicate pediatric renal injury among rural farming communities in Sri Lanka. _Sci. Rep._ 12(1), 1–13 (2022). Article Google Scholar * Bi, Y., Shi, X., Ren, J., Yi, M. & Han, X.
Transarterial chemoembolization of unresectable renal cell carcinoma with doxorubicin-loaded callispheres drug-eluting beads. _Sci. Rep._ 12(1), 1–8 (2022). Article Google Scholar * Sozio,
S.M., Pivert, K.A., Caskey, F.J. & Levin, A. The state of the global nephrology workforce: A joint ASN–ERA-EDTA–ISN investigation. _Kidney Int._ (2021). * Islam, M. CT kidney dataset:
Normal-cyst-tumor and stone 2021. [Online]. Available: https://www.kaggle.com/nazmul0087/ct-kidney-dataset-normal-cyst-tumor-and-stone. * He, K., Zhang, X., Ren, S. & Sun, J. Deep
residual learning for image recognition. in _Proceedings of the IEEE conference on computer vision and pattern recognition_, 2016, pp. 770–778. * Szegedy, C., Liu, W., Jia, Y., Sermanet, P.,
Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A. Going deeper with convolutions. in _Proceedings of the IEEE conference on computer vision and pattern recognition_, 2015,
pp. 1–9. * Chollet, F. Xception: Deep learning with depthwise separable convolutions. in _Proceedings of the IEEE conference on computer vision and pattern recognition_, pp. 1251–1258
(2017). * Tan, M., & Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. in _International Conference on Machine Learning_. PMLR, 2019, pp. 6105–6114. *
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S. _et al._ An image is worth 16x16 words:
Transformers for image recognition at scale. _arXiv preprint_ arXiv:2010.11929, (2020). * Kolesnikov, A. _et al._ Big transfer (BiT): General visual representation learning. in _Computer Vision: ECCV 2020, 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V_, 491–507 (Springer, 2020). * Guo, M.-H., Liu, Z.-N., Mu, T.-J. & Hu, S.-M. Beyond self-attention: External attention using two linear layers for visual tasks. _arXiv
preprint_arXiv:2105.02358, (2021). * Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J. & Shi, H. Escaping the big data paradigm with compact transformers. _arXiv
preprint_arXiv:2104.05704, (2021). * Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S. & Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows.
_arXiv preprint_ arXiv:2103.14030, (2021). * Verma, J., Nath, M., Tripathi, P. & Saini, K. Analysis and identification of kidney stone using Kth nearest neighbour (KNN) and support vector machine (SVM) classification techniques. _Pattern Recognit. Image Anal._ 27(3), 574–580 (2017). Article Google Scholar * Aksakalli, I., Kaçdioğlu, S. & Hanay, Y. S. Kidney X-ray images classification using machine learning and deep learning methods. _Balkan J. Electr. Comput. Eng._ 9(2), 44–551. * Sudharson, S. & Kokil, P. An ensemble of deep neural networks
for kidney ultrasound image classification. _Comput. Methods Progr. Biomed._ 197, 105709 (2020). Article CAS Google Scholar * Fu, X., Liu, H., Bi, X. & Gong, X. Deep-learning-based CT
imaging in the quantitative evaluation of chronic kidney diseases. _J. Healthcare Eng._ (2021). * Zheng, Q., Furth, S. L., Tasian, G. E. & Fan, Y. Computer-aided diagnosis of congenital
abnormalities of the kidney and urinary tract in children based on ultrasound imaging data by integrating texture image features and deep transfer learning image features. _J. Pediatric
Urol._ 15(1), 75-75e1 (2019). Article Google Scholar * Parakh, A. _et al._ Urinary stone detection on CT images using deep convolutional neural networks: evaluation of model performance
and generalization. _Radiol.: Artif. Intell._ 1(4), e180066 (2019). Google Scholar * Yildirim, K. _et al._ Deep learning model for automated kidney stone detection using coronal CT images.
_Comput. Biol. Med._ 104569 (2021). * Zhang, H. _et al._ Automatic kidney lesion detection for CT images using morphological cascade convolutional neural networks. _IEEE Access_ 7, 83001–83011 (2019). Article Google Scholar * Blau, N. _et al._ Fully automatic detection of renal cysts in abdominal CT scans. _Int. J. Comput. Assisted Radiol. Surg._ 13(7), 957–966 (2018).
Article Google Scholar * Siddiqi, M. H., Alam, M. G. R., Hong, C. S., Khan, A. M. & Choo, H. A novel maximum entropy Markov model for human facial expression recognition. _PloS one_ 11(9), e0162702 (2016). Article Google Scholar * Munir, M.S., Abedin, S.F., Alam, M.G.R. & Hong, C.S. _et al._ RNN-based energy demand prediction for smart-home in smart-grid framework.
pp. 437–439, (2017). * Healthcare, P. Radiology and cardiology diagnostic imaging solution | philips healthcare. (2022). [Online]. Available:
https://www.usa.philips.com/healthcare/product/HC881072/intellispace-portal-advanced-visualization-solution. * LTD, S. Sante dicom viewer pro | santesoft ltd. 2022. [Online]. Available:
https://www.santesoft.com/win/sante-dicom-viewer-pro/sante-dicom-viewer-pro.html. * Patro, S., & Sahu, K.K. Normalization: A preprocessing stage. _arXiv preprint_arXiv:1503.06462,
(2015). * Simonyan, K., & Zisserman, A. Very deep convolutional networks for large-scale image recognition. _arXiv preprint_arXiv:1409.1556, (2014). * Huang, G., Sun, Y., Liu, Z., Sedra,
D., & Weinberger, K.Q. Deep networks with stochastic depth. in _European conference on computer vision_. Springer, 2016, pp. 646–661. * Selvaraju, R.R., Cogswell, M., Das, A., Vedantam,
R., Parikh, D. & Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. in _Proceedings of the IEEE international conference on computer vision_,
2017, pp. 618–626. FUNDING Open access funding provided by Norwegian University of Science and Technology. AUTHOR INFORMATION AUTHORS AND AFFILIATIONS * Department of
Computer Science and Engineering, BRAC University, Dhaka, Bangladesh Md Nazmul Islam & Md. Golam Rabiul Alam * Radiology & Imaging Technology, Bangladesh University of Health
Sciences, Dhaka, Bangladesh Mehedi Hasan * Department of Nephrology, Bangabandhu Sheikh Mujib Medical University, Dhaka, Bangladesh Md. Kabir Hossain * Software and Service Innovation,
SINTEF Digital, Oslo, Norway Md Zia Uddin * Department of Computer Science, Norwegian University of Science and Technology, Gjøvik, Norway Ahmet Soylu CONTRIBUTIONS M.N.I. and M.G.R.A. contributed to the design of the novel idea, the experimental results, and the initial draft of the paper. M.H. and M.K.H. contributed to
collecting and validating the data of the datasets for the experiments. M.Z.U. and A.S. contributed to revising and reviewing the idea, the paper, and the results from the experiments, and
coordinated the overall process and study. CORRESPONDING AUTHOR Correspondence to Ahmet Soylu. ETHICS DECLARATIONS COMPETING INTERESTS The authors declare no competing interests. ADDITIONAL
INFORMATION PUBLISHER'S NOTE Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. RIGHTS AND PERMISSIONS OPEN ACCESS
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as
long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third
party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the
article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the
copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. ABOUT THIS ARTICLE CITE THIS ARTICLE Islam, M.N., Hasan, M.,
Hossain, M.K. _et al._ Vision transformer and explainable transfer learning models for auto detection of kidney cyst, stone and tumor from CT-radiography. _Sci Rep_ 12, 11440 (2022).
https://doi.org/10.1038/s41598-022-15634-4 * Received: 25 December 2021 * Accepted: 27 June 2022 * Published: 06 July 2022 * DOI: https://doi.org/10.1038/s41598-022-15634-4