Chest X-rays are a crucial tool for diagnosing diseases such as Cardiomegaly, Pneumothorax, and Effusion, among many others. But while doctors excel at reading these images, the process is often time-consuming and error-prone, especially in busy hospital settings. This project aimed to develop a deep learning model that automates disease detection. By leveraging a dataset of labeled X-ray images (along with bounding box annotations), the goal was to create a model that not only identifies diseases but also highlights them visually on the image.
We used TensorFlow, PyTorch, and various deep learning techniques, training the model on both disease classification and object localization tasks. Here's how we approached the problem and what we learned.
Methodology
(1) Data Preprocessing
The dataset used in this project was obtained from the National Institutes of Health (NIH) and consists of over 100,000 chest X-ray images covering 14 lung diseases as well as a "No Finding" category. The label information was stored in two CSV files: one containing the disease annotations and the other containing bounding box coordinates. We split the dataset into training data (approx. 80%) and validation data (approx. 20%) and normalized the pixel values of the images to match the RGB input scale expected by the network. Because many images show more than one disease, we framed the labels as a multi-label classification problem. Finally, we used generators to efficiently load the images in batches (size: 32-64) during training, which helped manage our limited memory capacities; a sketch of this setup follows below.
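To make the batching concrete, here is a minimal sketch of such a generator in TensorFlow/Keras. It is illustrative rather than our exact pipeline: the column names ("Image Index", "Finding Labels") and the CSV file name follow the public NIH metadata layout, while the paths, batch size, and image size are assumptions.

```python
import numpy as np
import pandas as pd
import tensorflow as tf

# The 14 disease labels of the NIH chest X-ray dataset.
DISEASES = ["Atelectasis", "Cardiomegaly", "Consolidation", "Edema",
            "Effusion", "Emphysema", "Fibrosis", "Hernia", "Infiltration",
            "Mass", "Nodule", "Pleural_Thickening", "Pneumonia", "Pneumothorax"]

class XRaySequence(tf.keras.utils.Sequence):
    """Loads X-ray images in batches and builds multi-hot label vectors."""

    def __init__(self, df, image_dir, batch_size=32, img_size=(224, 224)):
        super().__init__()
        self.df = df.reset_index(drop=True)
        self.image_dir = image_dir
        self.batch_size = batch_size
        self.img_size = img_size

    def __len__(self):
        # Number of batches per epoch.
        return int(np.ceil(len(self.df) / self.batch_size))

    def __getitem__(self, idx):
        rows = self.df.iloc[idx * self.batch_size:(idx + 1) * self.batch_size]
        images, labels = [], []
        for _, row in rows.iterrows():
            # load_img converts the grayscale PNGs to 3-channel RGB.
            img = tf.keras.utils.load_img(
                f"{self.image_dir}/{row['Image Index']}",
                target_size=self.img_size)
            # Scale pixel values from [0, 255] down to [0, 1].
            images.append(tf.keras.utils.img_to_array(img) / 255.0)
            # "Finding Labels" is '|'-separated, e.g. "Effusion|Mass".
            present = set(row["Finding Labels"].split("|"))
            labels.append([1.0 if d in present else 0.0 for d in DISEASES])
        return np.array(images), np.array(labels)

# df = pd.read_csv("Data_Entry_2017.csv")       # NIH metadata file
# train_gen = XRaySequence(df, "images/", 32)   # hypothetical usage
```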
(2) Model Architecture
The model architecture was based on a Convolutional Neural Network (CNN), a common approach for image classification tasks. The network was designed to handle both multi-label classification (predicting the presence of multiple diseases) and regression (predicting the bounding box coordinates for each detected disease). To balance the dual objectives of classification and bounding box prediction, we used a custom loss function that combined binary cross-entropy (for classification) with mean squared error (for bounding box regression). The model architecture included several convolutional layers followed by dense layers that output the predicted bounding boxes and disease probabilities.
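The following is a minimal sketch of such a dual-head design in Keras, not our exact architecture: a small convolutional backbone feeds a sigmoid head for the 14 disease probabilities and a linear head for four bounding-box coordinates, and Keras combines binary cross-entropy and mean squared error into the joint loss. Layer sizes, input size, and loss weights are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(224, 224, 3))
x = layers.Conv2D(32, 3, activation="relu")(inputs)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.MaxPooling2D()(x)
x = layers.Flatten()(x)
x = layers.Dense(128, activation="relu")(x)

# Sigmoid head: one independent probability per disease (multi-label).
disease = layers.Dense(14, activation="sigmoid", name="disease")(x)
# Linear head: four bounding-box coordinates, e.g. (x, y, w, h).
bbox = layers.Dense(4, name="bbox")(x)

model = tf.keras.Model(inputs, {"disease": disease, "bbox": bbox})
# Keras sums the per-head losses into one objective, mirroring the
# combined BCE + MSE loss described above; the weights are assumptions.
model.compile(
    optimizer="adam",
    loss={"disease": "binary_crossentropy", "bbox": "mse"},
    loss_weights={"disease": 1.0, "bbox": 1.0},
)
```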
(3) Training & Optimization
Data augmentation techniques such as rotations and shifts were explored but not fully implemented, in order to keep the initial pipeline simple. Throughout the training process, we focused on minimizing the loss function. For disease classification, we used binary cross-entropy loss, which proved to be the most effective choice for this task. The model's objective was to predict the probability of each disease class being present in an image. For each validation image, the model therefore output an independent probability per disease, allowing us to determine which diseases were most likely present. These predictions were then compared with the ground truth annotations, i.e. the actual diseases associated with each image.
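As a sketch of this step, the snippet below trains a classification-only variant of the model (a 14-unit sigmoid head, so its single output matches the (image, label) batches from the XRaySequence generator above) and thresholds the predicted probabilities. The backbone, epoch count, threshold, and the train_df/val_df splits are assumptions.

```python
import numpy as np
import tensorflow as tf

# Minimal classification-only model; any network ending in a 14-unit
# sigmoid layer fits this sketch.
clf = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu",
                           input_shape=(224, 224, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(14, activation="sigmoid"),
])
clf.compile(optimizer="adam", loss="binary_crossentropy",
            metrics=["accuracy"])

# train_df / val_df: hypothetical 80/20 splits of the NIH metadata.
train_gen = XRaySequence(train_df, "images/", batch_size=32)
val_gen = XRaySequence(val_df, "images/", batch_size=32)
history = clf.fit(train_gen, validation_data=val_gen, epochs=16)

# Sigmoid outputs are independent per-disease probabilities (not a softmax
# distribution): thresholding gives the multi-label predictions, argmax
# picks the single most likely disease per image.
probs = clf.predict(val_gen)
predicted = (probs > 0.5).astype(int)
most_likely = np.argmax(probs, axis=1)
```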
(4) Evaluation and Visualization
After training, we evaluated the model on images from the validation set using two primary metrics: training/validation accuracy and training/validation loss. The training accuracy and loss measure how well the model fits the data during training, while the validation accuracy and loss assess how well the model generalizes to the held-out validation set, which is crucial for detecting overfitting.
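A typical way to inspect this, assuming the History object returned by Keras' fit() with its default metric names:

```python
import matplotlib.pyplot as plt

# `history` is the object returned by fit() above; Keras stores one value
# per epoch under these default keys.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(history.history["accuracy"], label="training")
ax1.plot(history.history["val_accuracy"], label="validation")
ax1.set(title="Accuracy per epoch", xlabel="epoch")
ax1.legend()
ax2.plot(history.history["loss"], label="training")
ax2.plot(history.history["val_loss"], label="validation")
ax2.set(title="Loss per epoch", xlabel="epoch")
ax2.legend()
plt.show()
# A validation curve that diverges from the training curve signals overfitting.
```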
The results were promising but also highlighted areas for improvement. On average, the model achieved an accuracy of around X% in detecting the presence of diseases. Below, we compare ResNet50, EfficientNet, and ConvNeXt.
ResNet50:
Training time per epoch: ~35 minutes
Training accuracy: remains constant at ~86% after 15 epochs.
Validation accuracy: fluctuates between ~70% and ~80% over 15 epochs.
Training loss: decreases from 71% to 38% over 16 epochs.
Validation loss: decreases from 84% to 42% over 16 epochs.
EfficientNet:
Training time per epoch: ~15 minutes
Training accuracy: remains constant at ~90% after 16 epochs.
Validation accuracy: remains constant at ~88% after 11 epochs.
Training loss: decreases from 81% to 35% over 16 epochs.
Validation loss: decreases from 63% to 42% over 16 epochs.
Training accuracy starts at ~69% and validation accuracy at ~83%; both quickly stabilize at high levels (~90% training, ~88% validation), while training and validation loss are relatively low and decrease steadily. This suggests robust, stable training without overfitting, even with limited data. However, accuracy and loss alone may not fully capture performance in a multi-label classification task with class imbalances, as in this project. Due to computational and memory constraints, only a subset of the full dataset was used (30,000+ of 112,000+ images), and some diseases are represented by very few images, which complicates calculating class weights for the CNN to handle the imbalance effectively. To optimize training speed, the pretrained layers were frozen, significantly reducing epoch time (~15 minutes) without substantially impacting model performance; a sketch of this follows below.
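A minimal sketch of this freezing step, assuming an ImageNet-pretrained EfficientNetB0 backbone from tf.keras.applications with a new sigmoid head (the exact head and input size are illustrative):

```python
import tensorflow as tf

# ImageNet-pretrained backbone; freezing it means only the new head is
# trained, which is what cuts the epoch time.
base = tf.keras.applications.EfficientNetB0(
    include_top=False, weights="imagenet",
    input_shape=(224, 224, 3), pooling="avg")
base.trainable = False  # no gradient updates for the pretrained layers

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(14, activation="sigmoid"),  # one unit per disease
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```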
ConvNeXt:
Training time per epoch: ~20 hours
Accuracy: remains constant at ~93% across all 7 epochs.
Training loss: decreases from 17.7% to 11.8% over the 7 epochs.
Validation loss: decreases from 18.2% to 17.5% over the 7 epochs.
Mean ROC AUC: increases steadily from 0.798 to 0.896.
Mean AP: increases steadily from 0.243 to 0.578.
Overall bias: decreases steadily from 4.6% to 2.6%.
Accuracy is relatively high from the beginning of training, while training and validation loss are relatively low, and these values stay nearly constant for the whole run. This may indicate stable training without overfitting, although such metrics may not be ideal for estimating performance on a classification problem with as many classes as in this project. Mean ROC AUC is more sensitive to underrepresented diseases; its steady increase, together with the rising mean AP and the falling overall bias, indicates that the model keeps learning to discriminate between diseases in the images. A sketch of how these metrics can be computed follows below.
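For reference, macro-averaged ROC AUC and average precision can be computed per class with scikit-learn. This sketch assumes y_true is the multi-hot ground truth and y_prob the predicted per-disease probabilities from the validation set:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def mean_roc_auc_and_ap(y_true, y_prob):
    """Macro-average ROC AUC and AP over the disease classes.

    y_true: (n_images, n_classes) multi-hot ground truth
    y_prob: (n_images, n_classes) predicted probabilities
    """
    aucs, aps = [], []
    for c in range(y_true.shape[1]):
        # Both metrics are undefined for classes with only one label value
        # in the validation split, so skip those.
        if y_true[:, c].min() == y_true[:, c].max():
            continue
        aucs.append(roc_auc_score(y_true[:, c], y_prob[:, c]))
        aps.append(average_precision_score(y_true[:, c], y_prob[:, c]))
    return float(np.mean(aucs)), float(np.mean(aps))
```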
Here’s what we learned from this project:
Data preprocessing is crucial: One of the most important lessons we learned was that well-prepared data is the foundation of any successful machine learning model. If the input data is not properly formatted, the model's performance will suffer. Taking extra time to ensure the data is correctly loaded and processed made a noticeable difference in the results.
Small details make a big difference: Many of the biggest issues we encountered could be traced back to small, seemingly insignificant details. A misplaced variable or a minor mistake in how we processed the data could cause the model to misbehave. Paying attention to these small aspects helped improve the overall performance and saved us a lot of troubleshooting time.
Challenges & Future Work: While the results were promising, we faced a few challenges:
Multiple diseases in one image: The model occasionally struggles to accurately detect multiple diseases within a single image. This remains an area for improvement and is actively being worked on.
Limitation of the dataset: The model is trained only to detect the 14 diseases included in the dataset. If an X-ray image shows a condition not represented in the training set, the model might incorrectly classify it as "No Finding" with high confidence.
Class imbalances: Some diseases were significantly more common in the dataset than others, leading to an imbalance. Although we attempted to address this with class weighting and by focusing on underrepresented diseases, further refinement is necessary to improve model performance on rare conditions (a weighting sketch follows below).
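As an illustration of the weighting idea, here is one common heuristic (the "balanced" rule, n_samples / (n_classes * n_positives)) for deriving per-disease weights from the multi-hot training labels. This is a sketch, not our exact scheme:

```python
import numpy as np

def per_disease_weights(y_train):
    """Balanced-style weights: n_samples / (n_classes * n_positives)."""
    positives = y_train.sum(axis=0)        # positive count per disease
    n_samples, n_classes = y_train.shape
    # Guard against division by zero for diseases with no positives.
    return n_samples / (n_classes * np.maximum(positives, 1.0))

# These weights can then be folded into a weighted binary cross-entropy so
# that errors on rare diseases contribute more to the loss.
```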
Conclusion
This project was a great experience in applying deep learning to medical image analysis. The model showed promising results in detecting diseases and predicting their locations, but there’s still room for improvement. By refining the architecture and improving the handling of multiple diseases, the model could be further enhanced.
Links to the project
https://github.com/fabianpeterkoch/NIH-Chest-X-Ray8-Classifier (ResNet50)
https://github.com/tomalla84/XRayLung (ConvNeXt)
https://github.com/SerhatKaraarslann/TechlabsProjekt.git (ResNet50 vs EfficientNetB0)
Team & Roles
Dennis
Involved in all steps of the project.
Fabian Koch
Involved in all steps of the project.
Serhat Karaarslan
Involved in all steps of the project.
Mentor
Hendrik Linn