Eczema Classification: An Exploration of Building Classification Model on Medical Limited Unbalanced Data

1. Introduction

1.1 Background and Motivation

Eczema, also known as atopic dermatitis, is a very common chronic, inflammatory skin disease[5]. It has large global burden[18], and affected about 20% of children and 10% of adult in developed countries[21].

Traditionally, the diagnosis of eczema heavily relies on manual examination by dermatologists, and the diagnosis process involves classification between different eczema subtypes. An accurate classification and diagnosis is crucial in clinical treatment, since different subtype of eczema would need different methods for treatment. The accuracy and constancy of diagnosis therefore became important in clinic[22]. Furthermore, eczema may have similar appearance from other common skin diseases, the similarity causes potential misdiagnosis in clinic if doctors may not have received sufficient professional education in dermatology.

Over last decades, with the development of machine learning algorithms, automatic eczema diagnosis became possible. There are works that focus on building models for eczema diagnosis, including classification of eczema and other skin disease (e.g. acne), automatic severity evaluation and classification between eczema subtypes, among the recent works, deep learning algorithm is the most popular approach used[15].

1.2 Problem Statement

The problem we encountered is a multiclass image classification task. We are trying to build a model that classifies an eczema image into 7 subtypes. The visual diagnosis of eczema subtypes is challenging, caused by their high inter-class similarities. This project thus aims to develop a metric learning based software that classifies clinical skin images into specific categories, which acts as a tool to assist healthcare professionals, in a reliable, professional and a low-cost way, giving a faster and a more accurate diagnosis.

1.3 Objective and Scope

The objective of the project is to design, develop and evaluate software that leverages a metric learning tool to automatically classify images of dermatitis into subtypes. We aim to curate and preprocess clinical images, design and implement robust architecture, train and validate the model on a carefully designed dataset, and evaluate the model’s final performance on a separate test set, to imitate the non-desirous un-embalming conditions.

The project will and only will include disease types, input data, technical focus and output. This dossier will show the challenges, methods and experiments in a circumstantial (quasi-)scientific way. The research will not include definitive medical diagnosis – the final diagnosis must always be made by a medical professional. Treatment recommendations should be limited in the application, where patient management would definitely not be included in any form in the application. No online storage will be involved.

2. Related Work

2.1 Traditional Diagnosis Method

Traditionally, dermatology diagnosis was all done by hand. The medical professional, usually licensed, examines the patient’s detailed patient history first, then conducts a thorough physical examination of the skin. Dermatologists access appearance, morphology and distribution of skin lesions, then give reliable convincing verdict and medical advice. There’s also standardized criteria to ensure consistency and objectivity for dermatitis. A universally recognized framework for atopic dermatitis includes the Hanifin and Rajka Criteria, and the UK Working Party Diagnostic Criteria for Atopic Dermatitis commonly used in the UK. Investigations like skin biopsy might also be necessary to rule out the common dermatitis in extreme cases.

2.2 Previous Automated Approaches

Previous work in automated eczema classification has evolved quite a lot into a more sophisticated hybrid framework. For example, traditional methods like SVMs and CNNs proved to have decent accuracy. DNNs have recently been integrated into dermatology. EczemaNet[16] achieved 96.2% accuracy with such methods as well.

3. Challenges in Eczema Classification

3.1 Data-Related challenges

Data is the fundamental part of machine learning models. For an image classification task in medical field, barriers in data related problems are the major challenge in building models. The failure in preparation of dataset would lead to severe problems in model building and hinder the construction of a robust model. Therefore, in our work, data has become the most important challenge to overcome and the bottleneck of our model’s performance.

The challenge arose from different aspects, including the limited data size, imbalanced data composition, patient privacy concerns, low data quality and the high cost of data annotation.

Firstly, since the annotation of eczema data requires trained experienced dermatologists to create reliable labels, the cost of labelling data is high, which indicates that creating our own dataset from sketch is unrealistic under our limited budget. As a result, the only way to prepare dataset is from online accessible annotated data, which further limits the amount of data we can attain.

Secondly, due to the privacy concerns of patients, not all images online can be used as our training data. In our work, we use only the open data that is publicly accessible. All licenses from our data sources explicitly grant permission to academic researchers’ usage, as it is directly mentioned in all sources that are used to train the aforementioned model. The watermarks aren’t removed from the images, ensuring our attribution for intellectual property.

Thirdly, due to the differences in the incidence of different eczema subtypes, the amount of data available varied largely over different subtypes. For instance, the most common subtype of eczema, Atopic Dermatitis (AD), has the most sufficient data, with over 1.5 thousand images available, however, for the minor subtypes, like Stasis Dermatitis, there are less than a hundred images available. The huge gap between the major and minor subtype shows severe data imbalance problem in our task. The imbalanced data, especially under a small size of dataset, is fatal for model building, which would lead to failure in model’s generalization ability and affect the convergence of model.

Lastly, since the source of data is open databases on internet, the quality of data is low, unreliable labelling like mismatch of labels and images would be common. Since our data size is already small, even a few wrong labelled images will lead to catastrophic result in model. Furthermore, the data collected from internet may include substantial noises, watermarks and irrelevant backgrounds would further hinder the model’s learning.

In conclusion, the data-related challenges are the most important challenges to overcome in our work. Very limited data size with imbalanced composition and low quality would be the obstacle in model’s learning. Thus, solving the data-related problems is the most crucial part of our work.

3.2 Technical Challenges

Apart from the challenges on building machine learning models, challenges from other aspects also exist, these technical challenges in eczema classification need to be solved as well to build a robust model.

Among different subtypes of eczema, there is high inter-class similarity between subtypes. In addition, the coexistence of multiple subtypes on one patient also increases the difficulty to correctly distinguish different eczema subtypes. Thus, the model built must be able to extract high dimensional features that can distinguish different subtypes.

The diversity of eczema subtypes’ appearance in different parts of the body and skin color also set a high requirement of generalization ability of model and data augmentation technique in training.

Furthermore, the technical challenges from the practical usage of the model, including the discrepancy in lighting of images, irrelevant noise in the input images and the skin tone of the patients would also affect the accuracy of diagnosis. Therefore, there is a high requirement in the model’s generalization ability.

The above technical challenges of eczema classification task are also the main challenges we need to resolve.

3.3 Clinical and Practical challenges

The special nature of our work as a clinical tool for dermatology diagnosis conditions is the need for a robust reliable model that can produce precise results.

Given the high imbalance in the incidence of different subtypes, the model should be able to have predict on all classes. Therefore, the model should not only take accuracy of prediction as the benchmark, but also the ability to predict the minor subtype as well. To gain confidence in clinical usage, the model should be able to produce information that is reliable and relevant to the patients’ situation.

4. Methodology

4.1 Data Collection

The data collection work takes the most time of our task. Due to the insufficiency in related data, collecting data is time consuming. There is no verified annotated dataset on internet, the few open datasets about eczema subtypes are occupied by mismatch labeling with extremely low quality that greatly affect the training of the model (see section 5 about the experiment data). Therefore, we have to collect our own dataset from different sources online to attain a capable dataset.

One of our main sources is the open altas on dermatology images, including DermNet, an open database that contains abundant dermatology clinical images, including eczema subtypes. In addition, we use key word on search engines, like Google, to attain more image for minor subtypes.

As a result, we attain a dataset with 2191 images across 7 subtypes. The dataset we attain has unavoidable severe class imbalance, where the largest class has 1152 images while the smallest class has only 77 images.

Figure 1, the composition of our dataset

4.2 Data Preprocessing and Augmentation

Due to the high requirement in model’s generalization ability in medical field, and the considerably low data quality in our dataset with severe imbalance problem and limited data size, data augmentation is necessary to build a robust model. Our approach was designed not merely to increase the volume of data, but to enhance the model's ability to learn invariant, generalizable features, thereby improving its performance on truly unseen images.

The dataset splitting is the fundamental step we have taken to ensure real model performance in our evaluations is provided for estimating the clinical performance of our model. The original 2191 images dataset was split into three isolated subsets before any preprocessing or augmentation, training set, validation set and test set, where only the training set is used in the model’s training, the validation set is used for evaluating model’s performance during training and tuning, the test set is only used once at last to have a picture of model’s final performance and no tuning would be done on the results in test set. Due to the low quality of our dataset images, the separation of different subsets must be done rigorously to gain realistic result in evaluation. Therefore, the validation subset and test subset are hold-out during the whole process of training. Crucially, while attempting to guarantee a sufficient amount of validation and test data for minor classes, we tried to ensure a similar distribution between the three subsets, although there is still a difference between the training and the evaluation sets due to the minimal need of images in minor classes to have a reliable evaluation.

Figure 2, the composition of three different subsets

For the preprocessing and augmentation of data, we resize the image to uniform size of 224x224 pixels and normalize it to fit in the base model. Furthermore, we applied on-fly augmentations to enhance the diversity and size of our model. The augmentations we used include geometric transformations and photometric transformations. For the geometric transformations, we applied horizontal and vertical flips, random zooming (up to 20%) and random rotations (up to 20%) to simulate the geometrical variation in clinical user input (including different orientations, different scaling or camera distances and different filming angles). For photometric transformations, we applied random brightness and contrast adjustment to simulate the wide range of possible lighting conditions in clinical environments.

4.3 Model Building

With the limitations in dataset, model architecture has become paramount importance in our whole development process. The primary challenge in this task is the severe class imbalance inherent in the eczema dataset. Our modeling approach was therefore driven by a systematic evaluation of techniques specifically designed to handle such data scarcity and distribution skew. We explored a wide range of methods, from traditional classification to advanced representation learning, ultimately selecting the approach that proved most effective for this specific problem.

4.3.1 Traditional Convolutional Neural Networks Approach (CNNs)

Firstly, we tried the most traditional approach when dealing with image classification tasks in the industry, using supervised convolutional neural networks (CNNs). It’s one of the most developed algorithms used in computer vison tasks since 2011; it uses multiple layers of convolutional kernels to extract features of input images to classify images. Modern CNN architectures, like Residual Neural Network (ResNet) or EfficientNet, have outstanding results in image classification tasks, often having accuracy beyond 90% in classification works. Intuitively, it has become our first choice to build our eczema model.

Considering the limited data size we have, a transfer learning on pre-trained model is necessary for the model’s convergence, otherwise, a dataset with merely 2k cardinality is impossible to train a model from sketch while learning the features of eczema subtypes well without overfitting in the training dataset. Therefore, we tried using residual neural network (ResNet) pre-trained on the ImageNet dataset as a solid backbone of our CNN models. Fine-tuning on pre-trained convolutional blocks and a classification header added on top of it (In our case ) with categorical cross-entropy loss at first. This approach is the most well-established, straightforward and powerful in dealing with most computer vison task with low computational resources needed. However, although using the CNN approach in our task might seem promising, and it did provide a solid result in terms of accuracy (see section 5), the inherent problem of eczema dataset, severe class imbalance, reveals the fact that using solely traditional approach with categorical cross-entropy loss is not enough. Since the majority types of our dataset constitute more than 75% of our dataset in size, the model can easily find a shortcut to give up the minor classes, thus leading to overfitting in model and catastrophic performance on minor classes, it can be seen from the low F1-score in the result. Therefore, in addition to the traditional CNN approach, machine learning techniques to prevent class imbalances must be used, Thus, we implement class weights and focal losses to prevent the model’s bias. These techniques tried to solve class imbalances by weighing the minor class more in training process. However, in our experiments, with using class weights or focal loss, the model fails from converging, this might be due to the severity of data imbalances in our dataset is too high so the model cannot learn well under a extremely high weight applied on minor classes. Therefore, we need to seek other approaches.

4.3.2 Generative Data Augmentation

Based on the traditional CNN approach, another method was introduced by using generative models to augment the dataset, which is by training image generative model to generate images of minor classes to fix the severe data imbalance in the dataset. In theory, a generative model can create arbitrary sized and distributed dataset, solving the inherent data problem in our dataset fundamentally. Therefore, by acquiring an balanced dataset from this approach, the traditional CNN approach can be used to further create a concrete classification model. The problem thus become a few-shot generative problem, which can be solved by using Generative Adversarial Network (GAN) with pre-trained weight. In our experiment, we use pre-trained state-of-art on small datasets, including FastGAN, StyleGAN2-ADA and ProjectedGAN. The result of the generated images is unsatisfactory, with clear distortion, overfitting or even mode collapse as a result. This is due to the extremely small amount of data for minority classes that cannot support the training of generative models even using pre-train SOTA models. Furthermore, training a generative model is computationally expensive, making it less viable under our limited budget.

4.3.3 Representation Learning Approaches

From above approaches, it can be clearly seen that the underlying primary problem of imbalanced limited data hindered the building of the model. We need to seek a method that deals with the class imbalances fundamentally, instead of trying to augment the existing model or data to fit in the imbalanced situation. Therefore, we change our mind to representation learning, a category of algorithms that transform the high-dimensional input (i.e. images) to a vector in a meaningful embedding vector space. There are two representation learning algorithms we considered in our case, self-supervised learning (SSL) and metric learning. The crucial ration behind these two approaches is that, instead of treating the dataset as an imbalance data with two majority classes take up over 75%, take it as a dataset with over 2000 image as a whole, thus, solves the severe imbalance problem. Representation learning uses the whole dataset to create an encoder map input to an embedding space where the classes are separate geometrically, making it naturally resilient to class imbalances rather than classify it directly. This makes the problem shift from classifying on a highly skewed dataset to creating an embedding space that from more than 2000 images. By doing this, although the impact of imbalance data still exists, the result would likely to be better than sole classification model.

Self-Supervised Learning:

SSL is an unsupervised learning algorithm that does not require labelling of data, its training is based on pairs of images in the dataset and augmented versions of the images, having a loss that rewards putting pair closer to each other but punishes further from other pairs in the embedding space. This method is very powerful in terms of training on unlabeled dataset, which usually has a larger cardinality than the labeled one. However, the nature of SSL learning from discrete images without the concept of classes, it highly relies on the intra-class similarity in the conceptual space. In our case, due to the limited resources in eczema related clinical image, even if we are aiming to unlabeled image, we are not likely to increase the dataset size by much, since distinguishing eczema from other skin diseases is already an annotation. Furthermore, the inherent low intra-class similarity nature of eczema subtypes means that SSL is not likely to produce a good result. Therefore, SSL is not an idea approach in our case.

Metric Learning:

Metric learning is a supervised learning algorithm that is used to predict similarity between images by mapping the input into a vector space, it has a wide application across face recognition, personal re-identification and few-shot classification. It has a loss that rewards putting images in the same class closer and further from the images in different classes in the embedding space, which means the model aims to make images closer to other images belong to same class closer and push it far from the images from other classes. The fundamental theory of metric learning model can still be simple, it’s still a CNN, not a CNN for producing classification results but a CNN to encode semantic content of the image. In our experiments, we tried two different loss functions which have great impact on the result. Firstly, we tried a simple triplet loss, which input a triplet of three images a time during training to calculate the loss. Each time the model iterates three images, the anchor image, positive image and negative image to do a forward propagation through CNNs with shared weight, giving out an embedded vector. The model’s structure can be seen in the following graph.

Figure 3, diagram of triplet network

Notice that a l2 normalization is applied on the output vectors, making the embedding space a hypersphere. Therefore, the model is mapping the high-dimensional image input to a spheric embedding space where the distance means the similarity in semantic meanings.

The loss in the image is defined as:

where represent anchor image, positive image, negative image, encoder (CNN) and a hyperparameter called margin.

The loss reaches zero when

Which means the geometrical distance between the positive and anchor image plus the set margin is less than the distance between the negative images. Therefore, the model’s goal is to push negative image further and pull positive closer in the embedding space.

As a result, the model will learn a way to map images to a meaningful embedding space, where the distance indicates the similarity between images.

However, this loss function is highly inefficient, the sampling of triplets from the dataset is random, which means that in most cases the model is iterating on the ‘easy’ triplets. For instance, an image of atopic dermatitis might be given with an anchor image of nummular dermatitis, which is visually very different. It makes the loss easily drop to zero, while the model cannot learn useful features from it. Thus, the training becomes useless, especially in distinguishing the subtle differences between the real hard classes.

Therefore, we thus dive into using an alternative loss function, batch hard triplet loss, which is a more advanced loss used in metric learning has a similar thinking as the triplet loss but more intelligence in dealing with negative and positive images. The batch hard triple loss instead taking a triple to calculate the loss, it uses up the whole batch of images. The batch of images is designed to contain equal number of images from each class, in our case 5 images per class, giving a total batch size of 35. The loss will calculate a similarity matrix of all pairs in the batch and find out the positive pair (a pair of images belong to same class) that have the furthermost distance and the negative pair having the closest distance. Then the model calculates the triplet loss in the same way as the above but instead of using the distance between a fixed anchor image and the fixed positive and negative images, it uses the distance of the hardest positive pair and negative pair in the whole batch. This ensures that the loss is focusing on the images that are hard to distinguish correctly, making the learning process more challenging for the model while being more efficient in learning the features. Therefore, we have made a huge improvement from our baseline model in solely triplet loss.

Furthermore, the batch hard triplet loss is still just considering the hardest triplet only, we thus tried a state-of-the-art loss in metric learning, circle loss. Circle loss would give an adaptive weight to each pair in the batch. However, the experiments on using circle loss fails to give any improvement to the model.

Moreover, the model we have from either SSL or metric learning is just an encoder to produce a representation of input image. In order to do the classification task, a decoder that transfers from the representation to classes is needed. We have tried many different ways to conduct the classification task, including the shortest distance method, test-time augmentation, k-nearest neighbors, adding logistic regression headers or MLPs on top of the embedding vectors. Different classification methods would lead to different model performance. Our experiment results show that, the shortest distance method is most effective method in classification; it produces the best results among all the classification methods.

4.4 Model Training

Models are trained in Nvidia GPU. We use Adam optimizers with cosine learning rate decay to prevent overfitting, having an initial learning rate of 1e-4. Early stopping is also applied with patient of 10 epoch, the model will stop training if the macro f1 score has not improved for 10 epoch.

4.5 Model Evaluation

The model evaluation is also a crucial part of the whole development, due to the special need of clinical applications, the model must be evaluated carefully to attain a result close to the actual practical performance of the model. The severe data imbalance problem indicates that accuracy is not a reliable metric to evaluate the model since the model can give up the minor classes to achieve high accuracy. Therefore, metrics including recall, precision and F1-score should be used to evaluate models. These metrics considered not only the overall accuracy of the model, but also the specific predictions in individual classes.

Recall, precision and F1-score are metrics that are used widely in medical field, are considered reliable metrics when dealing with class imbalances. It differs from the simple accuracy metric, which is defined as the ratio correct classified sample over total sample, which might be misleading under imbalanced data. Recall is defined as the true positive rate, the ratio between correctly classified actual positive over total actual positive, or true positive over false negative and true positive, it reveals the possibility of detection of each individual class. Precision is defined as the ratio between correctly classified actual positive and the total amount of positive classified. Recall improved with false negative decrease while precision improved with false positive decrease, by taking a hormonic mean of these two metrics, we obtained the F1-score. Notice that these metrics are calculated per class, a macro metric is calculated by averaging the result from each class.

5. Experiment

5.1 CNN models

We tried fine-tuning our dataset on a Resnet 50 model pre-trained on ImageNet.

Firstly, we tried a regular fine-tuning model, we use the Resnet model and adding a fully connected layer with size 256 before putting into a SoftMax layer to give a probability output, training with standard category cross-entropy loss. The standard fine-tuning model’s result is shown below:

Figure 4, training result of standard fine-tuning model

From the result, classic overfitting caused by class imbalanced can be seen. The loss shows that the model’s train loss drops steadily but the validation loss tells a different story, it began to stagnate after an initial drop and even rise in the last half of the training, this is clearly a sign of overfitting. Looking at the performance metrics, the problem is more serious, the training accuracy rapidly increases to almost 1, which means the model is not making any mistakes on the training data, this means the model has memorized the whole training dataset. What make it differ from a normal overfitting is that, the validation accuracy is also in a suboptimal level with about 90% accuracy, however, as abovementioned, the accuracy value can show nothing in a highly imbalanced dataset, thus, when we look at the F1 scores, the validation F1 score is just 40%, the huge gap between accuracy and F1 score shows that the model is not only overfitting, but overfitting by imbalanced data.

The result from standard CNN model leads us to implement state-of-the-art techniques to eliminate class imbalances problem, class weight and focal loss. Both of the techniques are adding weight to minor classes during training, forcing the model to care about failure in minor classes. The result of the model after adding class weight and focal loss is shown below:

Figure 5, The training result of the model with class weight and focal loss

The graph above shows an even more catastrophic result. The loss starts low and stays low throughout the training, which means the model thinks it has already done well in its work, but apparently it is not. The almost straight loss curve shows that there is probably a gradient problem, either vanishing or unstable, that the model can barely learn. From the metrics, we can see the model’s failure, the model is basically just a random guesser, there is no sign of learning from the model. We hypothesized that the failure is likely to be caused by the extreme imbalances in our dataset. Although the class weight and focal loss are techniques designed for dealing with class imbalances, our class might be too extreme, especially with a limited dataset size. The gradient calculated from wrong classification of minor class would likely to be enormous and chaotic, it is several times bigger than classifying wrong a major class image. This would likely lead to optimizer’s incorrect steps and make the model stuck in a wrong local minimum.

From the unsatisfactory results above in both standard and advanced CNN models, we conclude that simple CNN approaches is not viable to deal with our problem.

5.2 Generative models

In terms of the generative models, we have tried three state-of-the-art GAN in dealing with limited data, including FastGAN, StyleGAN2-ADA and ProjectedGAN. For all of these models, we use stasis dermatitis, the class we got least images, as the train dataset to train a generative model on stasis dermatitis. All the models are pre-trained on FFHQ dataset since it’s impossible to train the model from sketch with just about 70 images.

5.2.1 FastGAN

FastGAN model is a lightweight model to train in a short period of time. It includes a self-supervised autoencoder within its discriminator, this prevents the learning from collapsing early, making it suitable for training on small dataset.

The results of FastGAN is shown below, where distortion can be seen clearly, which means the model is not powerful enough to learn the distribution of the dataset, leading us to shift to more powerful models.

Figure 6, output of FastGAN after train for 40000 iterations

5.2.2 StyleGAN2-ADA

StyleGAN2-ADA, is a direct evolution of the influential model StyleGAN2, by adding the Adaptive Discriminator Augmentation (ADA) mechanism, which adaptively applied a series of strong data augmentation to the training data. This technique helps the model to train a stable GAN on a small dataset by preventing overfitting. Therefore, we chose this model to cope with our limited data problem.

StyleGAN2-ADA is the most successful model among the three listed, it can produce recognizable images. It manages to capture the general features of stasis dermatitis. However, from the results shown below, it can be clearly seen that most images are in distortion. We have considered it as underfitting, but the training result from training for more epochs shows the sign of overfitting, where the images all turn red. Thus, the model is still not capable enough to be used to largely augment our dataset since it cannot produce stable correct images.

Figure 7, output images of StyleGAN2-ADA after training for 600kimgs

Figure 8, output images of StyleGAN2-ADA after training for 1600 kimgs

5.2.3 ProjectedGAN

ProjectedGAN is a state-of-the-art model that instead of training the generator to produce images in the high-dimensional pixel space directly, it uses a pre-trained feature extractor to “project” the real and generated image into a semantically meaningful feature space that is more compact. The discriminator then operates in this feature space. This model shows promising results in improving the training stability and the convergence speed. Due to the challenges from the another two models in training, we thus shift to try out the ProjectedGAN.

However, despite the theoretical advantages, the ProjectedGAN performed catastrophically in the experiment. The result shown below shows that the model suffered from mode collapse after a short period of training. The images generated is all identical, which means the generator found a shortcut to trick the discriminator. The model completely fails to learn the distribution of the training data.

Figure 9, output of ProjectedGAN after train for 400kimgs

5.2.4 Conclusion

From the above results from three models, we can conclude that using GAN to augment the size of dataset is unrealistic. The fact that our dataset has a high intra-class variance with limited datasets is impossible for GAN to learn the distribution even with the state-of-the-art models. Furthermore, the GAN models require massive computational resources, even the FastGAN or ProjectedGAN that is designed to converge faster would require more than 15 hours of training, making tunning on the GAN more difficult, the results above might not be tuned to the model’s limit, but it is still unlikely to see the GAN models learn well in our dataset. Therefore, we need to work on the classification architecture to gain more performance.

5.3 Metric Learning models

5.3.1 Baseline model

The first metric learning model we tried is a simple triplet loss model. The model uses pre-trained ResNet50 model as the base model of the encoder, using the shortest distance method to further process the embedding vector output of the image. The model has an output shape of 256, the margin used is 1.

We trained the model for 10 epochs on our original dataset, the experiment result is shown in the graph below:

Figure 10, training result of our base model

From the result, the model achieved the highest macro f1-score of 0.6323, with a loss of 0.0253. The result shows that the model’s train loss drops rapidly in few epochs, reaching a value near to 0 in the end, while the f1-score is not yet satisfactory. This shows that the model is learning too easily, it can easily make the loss low but not really learn the features of data well, and overfitting is clearly shown.

The classification report from the best epochs on the evaluation dataset is shown below:

Class

Precision

Recall

F1-Score

Support

Atopic dermatitis

0.81

0.9

0.85

Contact dermatitis

0.65

0.6

0.63

Dyshidrotic eczema

0.54

0.58

0.56

Neurodermatitis

0.88

0.47

0.61

Nummular eczema

0.5

Seborrheic dermatitis

0.56

0.6

0.58

Stasis dermatitis

0.7

Accuracy

0.71

194

Macro Avg

0.66

0.62

0.63

194

Weighted Avg

0.71

194

From the table, we can clearly see that the class imbalance is still affecting the model’s performance, minor classes have significantly lower performance than the major class. However, it is still a considerable increasement from the CNN models, showing that it’s a right shift to the metric learning approach.

5.3.2 Data augmentation, learning rate scheduler, partial fine tuning

Focusing on the problems shown in the baseline model, which is mainly overfitting, we tried to apply machine learning tricks to solve overfitting and make the training smoother.

To deal with overfitting, we applied data augmentation on the dataset, to increase the diversity of the dataset and prevent the model from memorizing the training data easily. A series of augmentation including random flip, rotation and contrast was applied to the dataset. Furthermore, to make the model training smoother, we applied a learning rate scheduler to the training with cosine decay, to help the model reach the optimum point. In addition, we thought that the ResNet50 model might be too big in our data size which would lead to overfitting, therefore, instead of fine-tuning the whole model, we freeze the first 80 layers of the model during training, where these layers have already been trained well in ImageNet dataset to extract fundamental features that resemble across all modelling of images.

The train result was shown in the below graph:

Figure 11, training result of the augmented model

From the figure, it can be seen that the loss drops less rapidly than the base model, although it’s still dropping at a fast pace. The final result of the model was a macro f1-score of 0.6405 with a loss of 0.0452. The f1-score is almost the same as the baseline model and the loss almost approaches zero as well. This shows that, although the machine learning technique we used is useful in slowing down the loss drop, it is not yet enough to solve the problem.

The classification report of the model on the validation dataset is shown below:

Class

Precision

Recall

F1-Score

Support

Atopic dermatitis

0.89

0.74

0.81

Contact dermatitis

0.56

0.64

0.6

Dyshidrotic eczema

0.5

0.83

0.62

Neurodermatitis

0.56

0.6

0.58

Nummular eczema

0.67

0.8

Seborrheic dermatitis

0.54

0.47

0.5

Stasis dermatitis

0.4

0.57

Accuracy

0.69

194

Macro Avg

0.67

0.64

194

Weighted Avg

0.72

0.69

194

It can be seen that the bias of model on the minor class still exists.

As a result, from the performance of the augmented model, we see that the loss function might not be appropriate. The model can pass the triplet loss easily, leading to overfitting in the result. It indicates that we need to find for a loss function that the model cannot learn too easily, which is our next attempt, using the batch hard triplet loss.

5.3.3 Batch Hard Triplet Loss Model

We analysis that the reason behind the model’s fast convergence is in the iteration of data during training. In each training step in an epoch, the model is training on a batch of random triplet from our original dataset, and from the training result above, it seems that most of these random sampled triplets from our dataset are easy for the model, leading to a low loss value. These easy triplets provide very limited information for the model to learn the features between eczema subtypes.

Therefore, we switch to using batch hard triplet loss, which considered using the hard pairs to calculate the loss. This loss function would make the training of the model harder, forcing it to learn real features that can be used in classifying eczema. Furthermore, we considered that the ResNet50 model, although powerful, might not be enough to cope with our cases and it may cause overfitting of the model. Thus, we applied EfficientNetB0 model, the lightest one in the EfficientNet series as a lightweight modern CNN architecture, to be our new base model.

The training results are shown below:

Figure 12, training result of the batch hard mining triplet model

From the results shown above, the implementation of new loss function is absolutely working, the loss value starts at a much higher level, 1.5232 compared to the 0.3773 of the baseline model, since the value of batch hard triplet loss and triplet loss is calculated in a similar way, this means that the loss now is reflecting the real hard triplet during the training. The drop of loss also becomes gentler and smoother. The best result is a macro f1-score of 0.7318, it’s a big improvement from our baseline model.

The classification report on the validation dataset from the best performed model is shown below:

Class

Precision

Recall

F1-Score

Support

Atopic dermatitis

0.90

0.89

Contact dermatitis

0.60

0.82

0.69

Dyshidrotic eczema

0.67

Neurodermatitis

0.88

0.47

0.61

Nummular eczema

0.91

0.83

0.87

Seborrheic dermatitis

0.80

0.53

0.64

Stasis dermatitis

1.00

0.60

0.75

Accuracy

0.78

194

Macro Avg

0.82

0.69

0.73

194

Weighted Avg

0.80

0.78

194

It can be seen that the model’s performance in minor classes improved as well, most classes can achieve a f1-score over 60%.

The positive results shown above show that switching to batch hard triplet loss is the correct direction. The next step would be to further improve the model’s performance.

5.3.4 Final Model

To continue improving the model, we decided to enhance the model’s capability. Since the former results show that we successfully overcome overfitting, it means that we can increase the model’s capability to extract more useful features. Therefore, we change the base model to EfficientNetB2, a heavier model than the B0, by using this, the encoder capable for learning more features. Moreover, we increase the output dimensions of the model from 128 to 256 so that a vector with more features is extracted.

The result is shown below:

Figure 13, training result of our final model

And the classification report on validation set is shown below:

Class

Precision

Recall

F1-Score

Support

Atopic dermatitis

0.91

0.89

0.90

Contact dermatitis

0.65

0.74

0.69

Dyshidrotic eczema

0.56

0.75

0.64

Neurodermatitis

1.00

0.40

0.57

Nummular eczema

0.62

0.83

0.71

Seborrheic dermatitis

0.91

0.67

0.77

Stasis dermatitis

0.90

Accuracy

0.78

194

Macro Avg

0.79

0.74

194

Weighted Avg

0.81

0.78

194

From the graph above, by training more epochs, we can see the loss reaching the bottom and the f1-scores peak. The best performance happens in epoch 20. The best result is a macro-f1 score of 0.7412. This is a huge improvement, considering the macro f1 of our baseline model is just 0.6323. More importantly, the classification results show that the performance of the model is still comparatively promising than the previous trials, most minor classes have a f1 score reaching 70%. This shows the effectiveness of the technique we used to improve the model’s performance from the baseline model.

As the best model we achieved, we conducted the final test using the test dataset on this model. And the results are shown below.

Figure 14, confusion matrix of the results on test dataset

Class

Precision

Recall

F1-Score

Support

Atopic dermatitis

0.88

0.86

0.87

Contact dermatitis

0.68

0.80

0.73

Dyshidrotic eczema

0.83

Neurodermatitis

0.80

0.27

0.40

Nummular eczema

0.57

0.67

0.62

Seborrheic dermatitis

0.82

0.60

0.69

Stasis dermatitis

0.60

0.90

0.72

Accuracy

0.77

194

Macro Avg

0.74

0.70

194

Weighted Avg

0.78

0.77

0.76

194

The result on test dataset drops by about 5% in terms of macro f1, which is normal since test dataset is not seen by the model until the last evaluation. The model is never trained or tuned on test dataset, so it provides a reliable reference of the model’s performance on unseen data.

5.3.5 Circle Loss

Apart from the above result, we also tried using the state-of-the-art loss function, circle loss, to further improve the model. We use the same set-up as our best model to train. The result is shown below.

Figure 15, training result of model with circle loss (gamma=64)

From above figure, it can be clearly seen that the model is just not learning effectively, the loss is not decreasing and the f1 scores did not increase as well, which means the model cannot calculate a meaningful gradient, probably drops to zero. We doubt that his might be a hyperparameter problem, the above model uses a hyperparameter of 0.75 margin (as a boundary for similarity in calculating the loss) and 64 gamma (as an amplifier in the loss calculation), which is a common starting point for circle loss. The above result might be due to the gamma value is too large, so the gradient calculated is not stable and thus the model fails to learn. Therefore, we tried to start with a much smaller value, gamma=8, and the result is shown below:

Figure 16, training result of model with circle loss(gamma=8)

With a much smaller gamma value, we see the problem still exists, the model cannot find a useful gradient so it cannot converge, probably the gradient vanish to zero. The loss value stopped changing just after few epochs. Therefore, the problem is not likely to come from the hyperparameter tuning, instead, we hypothesize that the problem came from the initial state of the model which is a common cause of gradient instability.

Therefore, we conduct a fine-tuning on our previous best model, and the result is shown below:

Figure 17, training result of the model using circle loss with fine-tuning.

The result from the graph is even more serious, the model is not learning at all, the loss is a straight line. It shows that the model’s failure is probably not due to the initial state of the model, it is likely that the circle loss’s holistic nature is not suitable to our dataset. The circle loss function, unlike the batch hard triplet loss that only looks at the hardest triplets, looks at all triplets in the batches and gives weights to each triplet based on their similarities. Then, the high intra-class variance of the dataset might lead to a complicated loss for each triplet that is all in different directions, which added up to zero gradient as a result, leading to the model’s failure. In conclusion, the circle loss function is not suitable for our dataset.

6. Discussion

6.1 The Significance of Rigorous Dataset Splitting

A rigorous dataset splitting, although easy to implement, is a paramount requirement to be fulfilled to train a robust and reliable model that can be used in practical clinical practice. Without a rigorous dataset splitting, the evaluation result of the model can reflect nothing of the model’s performance but leading the developer to overestimate the model’s ability, leading to catastrophic result in actual usage of models in practice.

We found that many of the works related to this field, although having a tremendous result in both test and train dataset, the data leakage in dataset splitting reveals that the evaluation results are likely to be biased and erroneous. For a model with medical purposes, such biases are toxic, it means an overestimate of the model’s performance. One common data leakage we found from others’ works is that, even though having the right data splitting, the data splitting is conducted after data augmentation, which means that the image in the validation or test dataset might just be a transformed version of the train dataset, the models then naturally have incredible performance, this mistake is widely found in many works, like[16]. We understand splitting out original data that is very limited in medical field is hard, however, the result from the wrong data splitting means nothing, it cannot reflect any real performance of the model, the whole evaluation on the model is invalid.

Therefore, rigorous data splitting that no data leakage happens among the train, validation and test dataset is essential. In our work, we conduct rigorous data splitting randomly from our integrated dataset before the data was processed in any means. Thus, although there is a significant discrepancy between the classification results from different subsets of data (There is about 5% differences between the validation and test dataset), the result is real and valid to reflect the model’s performance.

6.2 Significance of Our Work

In terms of the significance of our work, the model we proposed provides a rapid, automated and quantitative method for phenotyping skin conditions. Instead of relying on the old slow, manual, subjective diagnosis method, our model can accelerate the diagnosis stage of biological research that researchers can get instantly result from our model, it may largely reduce the time taken for experiments.

Moreover, our application of metric learning approaches to biological problems makes our work differ from a purely classification task. The semantically meaningful embedding space created by metric leaning can be considered as a biosensor that extracts the features of images and maps them to the classes.

Furthermore, we considered our work as not only an investigation to create an eczema diagnostic model but a way trying to model on dataset with severe problems, including limited data size, imbalances and high intra-class invariance. Although the result of our model is not perfect, the trials we made are still valuable experience for other problems dealing with similar problems.

In other fields that may require modelling on a bad-conditioned dataset, including synthesis biology, chemistry, or even social sciences, our work sets up an experimental-based, data-driven and authentic example in the challenges and the approaches. Our source code, including both training scripts and the model’s weight is open sourced, any researcher can create a model from their own dataset easily with our codes even if they are not familiar with machine learning.

6.3 Limitation of Our Model

Although having a great improvement from the baseline model, our final model is still not perfect. The macro-f1 we achieved is merely 70%, which is considerably low for a model that aims for practical applications. From the classification report on our best model, although most classes attain a result near 70%, we can see some individual class like Neurodermatitis having worse results. These results show that the model we built is still a prototype and far from practical applications.

However, despite the fact that the model is underperforming, we have tried out a wide range of machine learning approaches to solve the problem. It is not likely that there will be significant improvement from the model side, the main problem we see during the building of model arose from the inherent problem in our dataset. The nature of eczema subtype images having high intra-class variance means a large number of images would be needed for building up a robust model since a limited number of images like ours means some features of certain subtypes would only appear for a few times in the dataset. The model can hardly learn those features well. The high accuracy result from CNN shows that without class imbalances, even the simplest CNN models can solve the task. Therefore, the data side problem is the prime priority problem in our work. Due to our limited budget, we cannot create a dataset with nice labelling and abundant data size, we can only train our model on the data we collected from the internet, it thus becomes the major limitation of our work.

Another limitation arose from our project is that, from the results we see in using SOTA techniques like focal loss, circle loss and ProjectedGAN, all of these SOTA methods fail to produce a satisfactory result. We believe that this is due to over dataset, its problem is too serious that the complex learning algorithms of models would easily go wrong, leading to unstable gradients in training, resulting in the catastrophic results we see above.

6.4 Possible Extensions to This Project

There are multiple extensions to our project in the future, by trying out these extensions, a better performed model may be achieved.

Firstly, acquiring more data. This is definitely the prime priority to improve our result, by making a larger, more diverse, more balanced dataset, most of the training challenges can be eliminated. The model can also be taught easier to classify the subtypes with a better dataset. Furthermore, a better dataset means that more features is included in training, which means that the model would have better practical performance in detecting a wider range of samples. Unfortunately, expanding the dataset would require massive investment in collaborating with patients and dermatologists to gain high-quality data, and due to the nature differences in the incidence of different subtypes, the imbalances problem can hardly be solved as well.

Thus, similar to acquiring more data, another data side extension is reconsidering the generative model. From the above result we achieved, although having most of the majority of generated images being of low quality, some generated images are still considerably useful. By applying massive data cleaning on the almost infinitive generated data, it is still possible to expand the data size by generative models.

Furthermore, although we have tried multiple machine learning approaches, there are still several approaches we have not tried due to the time and budge limitation. These model size changes may be able to boost the model’s performance greatly. For instance, we did not try on self-supervised learning models like MoCo, although we have a theorical ration behind that the SSLs are not likely to produce better results than metric learning, it is still a valid approach to try out. Moreover, in generative models, we focus on only GAN, while in the past few years, diffusion models have emerged to obtain SOTA performance in generating images. Unlike SSLs, diffusion models do have a large chance to outperform GANs, its advantages in not only trying to simulate the distribution in the dataset but also understanding prompts during generating images means it can prevent distortion in images from a higher level. However, due to the expensive cost of running diffusion models, we do not have enough resources and time to use diffusion models to augment our dataset. The use of diffusion models is still a possible extension to our project.

7 Conclusion

The proposed eczema diagnosis platform helps address the global burden on eczema diagnosis. Deep-learning models are used to alleviate the tremendous need for affordable accessible dermatology-related diagnosis. It’s the goal for this project to create a user-friendly cross-platform application that achieves the purpose.

We have achieved the goal quite well. We successfully created a platform that classifies and identifies the type of eczema for the user has, with a large assortment of approaches we tried. This research can also provide insights into deep learning classification models with severely unbalanced datasets and an adequate CNN model with low computational requirements. This can be an anchor for further related research.

The related works are plentiful in the academia. Though only a small amount of CNNs are avant-garde and state-of-the-art enough for a low dataset size. Considering the diverse problems we faced, this can be an advantage for further research in the field.

There is possible future enhancement in related research such as dermatology-based diagnostics and severity assessment. There might even be real-time patient-doctor communication and other dermatological diseases, instead of the 7 types of dermatitis that’s covered in our model.

Reference

[1] Anaconda. 2020 state of data science. https://www.anaconda.com/resources/ whitepaper/state-of-data-science-2020. Accessed 3rd March, 2025.

[2] Rahman Attar, Guillem Hurault, Zihao Wang, Ricardo Mokhtari, Kevin Pan, Bayanne Olabi, Eleanor Earp, Lloyd Steele, Hywel C. Williams, and Reiko J. Tanaka. Reliable detection of eczema areas for fully automated assessment of eczema severity from digital camera images. JID Innovations, 3(5):100213, 2023. [3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[4] Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733, 2016. [5] Ivette AG Deckers, Susannah McLean, Sanne Linssen, Monique Mommers, CP Van Schayck, and Aziz Sheikh. Investigating international time trends in the incidence and prevalence of atopic eczema 1990–2010: a systematic review of epidemiological studies. PloS one, 7(7):e39803, 2012. 22

[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.

[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pretraining of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019.

[8] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.

[9] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.

[10] Glugon. Model zoo. https://cv.gluon.ai/model_zoo/classification.html. Accessed 6th March, 2025.

[11] Google. Architectural overview. https://docs.flutter.dev/resources/ architectural-overview.

[12] Kaiming He, Georgia Gkioxari, Piotr Doll´ar, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961– 2969, 2017. [13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015.

[14] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[15] Leo Huang, Wai Hoh Tang, Rahman Attar, Claudia Gore, Hywel C Williams, Adnan Custovic, and Reiko J Tanaka. Remote assessment of eczema severity via ai-powered skin image analytics: A systematic review. Artificial Intelligence in Medicine, page 102968, 2024.

[16] Masum Shah Junayed, Abu Noman Md Sakib, Nipa Anjum, Md Baharul Islam, and Afsana Ahsan Jeny. Eczemanet: A deep cnn-based eczema diseases classification. In 2020 IEEE 4th international conference on image processing, applications and systems (IPAS), pages 174–179. IEEE, 2020.

[17] Kiki Purnama Juwairi, Dhomas Hatta Fudholi, Aridhanyati Arifin, and Izzati Muhimmah. An efficientnet-based mobile model for classifying eczema and acne. In AIP Conference Proceedings, volume 2508. AIP Publishing, 2023.

[18] MR Laughter, Mayra BC Maymone, Soudeh Mashayekhi, Bernd WM Arents, Chante Karimkhani, SM Langan, RP Dellavalle, and Carsten Flohr. The global burden of atopic dermatitis: lessons from the global burden of disease study 1990–2017. British Journal of Dermatology, 184(2):304–309, 2021. 23

[19] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, ChengYang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14, pages 21–37. Springer, 2016.

[20] Amna Mehboob, Akram Bennour, Fazeel Abid, Emad Chodhri, Jawad Rasheed, Shtwai Alsubai, and Fahad Mahmoud Ghabban. Deep learning-driven skin disease diagnosis: Advancing precision and patient-centered care. Scalable Computing: Practice and Experience, 26(1):388–397, 2025.

[21] Joseph A Odhiambo, Hywel C Williams, Tadd O Clayton, Colin F Robertson, M Innes Asher, ISAAC Phase Three Study Group, et al. Global variations in prevalence of eczema symptoms in children from isaac phase three. Journal of Allergy and Clinical Immunology, 124(6):1251–1258, 2009.

[22] American Academic of Dermatology Association. Eczema resource center. https: //www.aad.org/public/diseases/eczema. Accessed 1st March, 2025.

[23] Keiron O’shea and Ryan Nash. An introduction to convolutional neural networks. arXiv preprint arXiv:1511.08458, 2015.

[24] Kevin Pan, Guillem Hurault, Kai Arulkumaran, Hywel C. Williams, and Reiko J. Tanaka. Eczemanet: Automating detection and severity assessment of atopic dermatitis. In Mingxia Liu, Pingkun Yan, Chunfeng Lian, and Xiaohuan Cao, editors, Machine Learning in Medical Imaging, pages 220–230, Cham, 2020. Springer International Publishing.

[25] Prajwal GM Pathri Mahanthesh, Narravula Raaja Chaithanya Vedh, N Rahul, and Reddy Santosh Kumar. Classification of skin disease images using efficientnet transfer learning technique.

[26] Oona Rainio, Jarmo Teuho, and Riku Kl´en. Evaluation metrics and statistical tests for machine learning. Scientific Reports, 14(1):6086, 2024.

[27] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.

[28] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards realtime object detection with region proposal networks. Advances in neural information processing systems, 28, 2015.

[29] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision, 2015.

[30] Kenneth Thomsen, Anja Liljedahl Christensen, Lars Iversen, Hans Bredsted Lomholt, and Ole Winther. Deep learning for diagnostic binary classification of multiple-lesion skin diseases. Frontiers in medicine, 7:574329, 2020.

[31] Philipp Tschandl. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions, 2018. 24

[32] Jasper RR Uijlings, Koen EA Van De Sande, Theo Gevers, and Arnold WM Smeulders. Selective search for object recognition. International journal of computer vision, 104:154–171, 2013.

[33] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.

[34] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. Cbam: Convolutional block attention module. In Proceedings of the European conference on computer vision (ECCV), pages 3–19, 2018.

[35] Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola. Dive into Deep Learning. Cambridge University Press, 2023. https://D2L.ai.

[36] Hangning Zhou, Fengying Xie, Zhiguo Jiang, Jie Liu, Shiqi Wang, and Chenyu Zhu. Multi-classification of skin diseases for dermoscopy images using deep learning. In 2017 IEEE international conference on imaging systems and techniques (IST), pages 1–5. IEEE, 2017.

The Abstract

The dossier shown on the other tab explains, circumstantially, the process of achieving the said goal. A brief abstract of such, mostly the challenges, methods and experiments, is to be summed up here, for the sake of convenience and expediency.

We ought to inaugurate a connection between clinical dermatology and the avant-garde technology of machine learning, making the theory of a lightweight pragmatic A.I. clinical dermatology much apodictic. We noticed that there’s a lack of costless accessible dermatology identification method, which could be a major problem for the less privileged people. We aimed to create software that can, fundamentally, recognize the type of eczema.

The heterogeneity in the user’s complexity and other visual factors such as glare and illuminance levels would be taken into account of the final model. The application of the software is multifarious.

Challenges

The data size is limited and highly imbalanced due to privacy concerns and other problems. The cost for such data and annotations is extortionate. Being only able to obtain a deficient amount of poxy data, the said data-related problems were the biggest hindrance we faced; hence being the most crucial part of the work.

The multifariousness of the data we’ll face caused by lighting, user’s complexation or the eczema’s location would have to be considered, therefore setting a high bar for the model’s ability to generalize images.

Considering the potentially clinical usage of the software, a reliable piece of relevant information should be provided as per the patients’ requirements.

Methodology

DATA AUGMENTATION

We obtained an assortment of 2191 images across 7 subtypes in various open databases and search engines. There’s a perilously great class imbalance across classes: the largest class has 1152 images, and the smallest class has 77. Due to the high requirements, the disastrous data wouldn’t ‘do the job’. Data augmentation is requisite, in this case, to establish a robust model.

The dataset is first split into 3 subsets – training, validation and test. Training is for the model’s training, validation for evaluating, and test for testing. The test set wouldn’t undergo any tuning to ensure accuracy to imitate when the model is dealing with crude raw images. There’s an even distribution of images throughout the 3 subsets, optimally.

The images are resized to dimensions 224 x 224 pixels and normalized. Geometric transformations – flips, zooming and rotations; photometric transformations – brightness and contrast are all applied to the data randomly, to parrot clinical environments.

MODEL DESIGN

It’s of paramount importance to our development process. A systematic evaluation of techniques drives our modeling approach, and a wide range of methods is explored.

Convolutional neural network was the to-go architecture for our model. We used residual neural networks pre-trained on the ImageNet dataset as a backbone. Fine-tuning on convolutional blocks and a classification header with categorial cross-entropy loss is done at first. Although there’s an ostentatious high accuracy, overfitting appeared. Therefore, class weights and focal losses are implemented to prevent bias. Alas, the model fails to converge when such is implemented due to data imbalances.

Generative models were considered to compensate for the shortage of minor classes. A balanced dataset could be used to create a concrete model with the approach above. We used FastGAN, StyleGAN2-ADA and ProjectedGAN. The generated images were unsatisfactory with clear distortion caused by the small amount of data for minor classes.

Self-supervised learning, despite being important in unlabeled images, relies on intra-class similarities. It’s likely not an ideal approach due to low intra-class similarities, since we probably won’t increase the size of the dataset by much.

Metric learning has a wide range of applications. It has a loss that rewards putting images in the same class closer. We tried 2 different loss functions, that is, a simple triplet loss and a batch hard triplet loss. Instead of using a triple for the loss, it uses the whole batch of images. There’ll be a similarity matrix and hence the positive pair with the furthermost distance. This ensures the loss being focused on the images that are the hardest to distinguish. Hence, there was a big improvement when that’s implemented.

TRAINING AND EVALUATION

Models are trained with Nvidia GPU’s, using Adam optimizers with cosine learning rate decay. Initial learning rate at 0.0001, early stopping at 10 epochs.

The model must be evaluated carefully for it’s clinical usage. Accuracy isn’t reliable for the severe imbalance, hence recall, precision and F1-score are used. Such metrics are more reliable dealing with class imbalances.

Experiments

CNN

Traditional CNN’s are first used as said in theory above. We used ResNet with a 256-size layer and a SoftMax layer with standard cross-entropy loss to give probability output. The occurrence of a typical overfitting is seen. The validation accuracy was at 90% and the validation F1 score at 40%. The results, seen in the dossier, force us to implement other techniques.

We implemented class weight and focal loss on top of the model aforementioned. The results were catastrophic, seen in the graph within the dossier with a consistent low loss and the gradient is flat throughout.

GAN

GAN’s like FastGAN, StyleGAN2-ADA and ProjectedGAN were used as mentioned. We used stasis dermatitis as the train dataset, with all the models pretrained on FFHQ.

In summary, FastGAN’s performance was poor, StyleGAN2-ADA could recognize the general features of stasis dermatitis but there’s clear distortion. Being the GAN with the best performance, it’s still not capable enough to be used to augment the dataset. ProjectedGAN suffered from mode collapse after a short period. To conclude, using GAN to augment the size of dataset is unrealistic.

METRIC LEARNING

Metric learning models, as mentioned, were the most promising theoretically. A simple triplet loss model on ResNet50 is first used. There’s a macro F1-score of 0.6323 and a loss of 0.0253. The loss decreases too rapidly, showing that the model learns easily, without grasping the features. There’s still an observable class imbalance and overfitting problem, though significantly better than CNN models.

We used data augmentation, learning rate scheduler and partial fine-tuning. We tried to increase the diversity of the dataset by randomized flips and rotations. The application of learning rate scheduler helps the model to reach the optimum point. Due to the size requirements of ResNet50, the first 80 layers of the model are frozen and fine-tune the rest to avoid overfitting. The macro F1-score is at 0.6405 and the loss at 0.0452. Again, the loss drops too quickly.

Batch hard mining triplet loss model is used then. Randomly sampled triplets are too easy for the model, leading to overfitting. The loss is calculated with the hard pairs with this. Moreover, the Resnet50 is replaced with EfficientNetB0 as our new base model to avoid overfitting for how lightweight it is. From the figure, the drop of loss is smoother, with the macro F1-score at 0.7318, as by far the best performing model. Hence, this is determined to be our final base model.

CONCLUSION

We changed the base model to EfficientNetB2. From the results, the best macro F1-score at epoch 20 is 0.7412, a prodigious improvement from the initial 0.6323. Minor classes also have a F1-score reaching 0.7. The result on the test dataset has a 5% drop in macro F1-score, still somewhat tolerable. This provides a reliable reference to the model’s actual performance on unseen data.

We tried circle loss too, but the results, despite numerous attempts, were not satisfactory.

This concludes the dossier, along with it's pragmatic approaches used to evince and ameliorate the model.
It's related Discussion and Conclusion are located on the previous tab, named dossier.

Dermeide

Our software, Dermeide, is built with Dart, a client-optimized language by Google. Delivering high-performance apps with smooth animations, it gives a quite tolerable responsive interfaces across platforms. Being cross-platform, it gives an adaptive approach to modern dermatology classification usages, demonstrates it's lightweight and adaptability.

The aforementioned software can be found here or on Google Play with the name Dermeide. Only the Android version of Dermeide is available currently.

On Methodological Rigor: The Critical Importance of a Pristine Test Set

In the pursuit of state-of-the-art (SOTA) performance, particularly in competitive domains like medical image analysis, the pressure to achieve near-perfect metrics is immense. Our extensive research into the classification of eczema subtypes has yielded a strong, reliable model with a Macro F1-score of approximately 70%. However, in reviewing existing literature, we frequently encountered studies reporting models with performance upwards of 95% using seemingly simpler models (Like a simple CNN from sketch) on similarly small datasets. This discrepancy leads us to a critical investigation not into the models themselves, but into the underlying evaluation methodology.

Our investigation revealed a common but scientifically invalid practice: performing the train-test split after applying data augmentation. We believe this methodological flaw is a primary source of inflated and misleading performance claims in the field. Our own experiments confirm that this error can erroneously inflate a model's perceived performance by over 25%, turning a realistic result into a spurious SOTA claim.

The Cardinal Rule of Machine Learning: The Sanctity of the Test Set

The ultimate goal of a machine learning model is to generalize—to make accurate predictions on new, unseen data that it did not encounter during training. Overfitting is always the priority to solve in machine learning works, even a simple MLP can remember all the samples in the world, but memorization makes no use, we need the model to have similar performance on new data, which is what it will face in practical uses. To measure this capability, the hold-out test set is the most important tool we have. For this tool to be effective, it must remain completely pristine and untouched throughout the entire training process. The model must see the test set only once, at the very end, during the final evaluation.

The common, correct, rigorous protocol for using data augmentation is as follows:

Split First: The original, collected dataset is split into training, validation, and test sets. This split happens once and is final. Noted that the validation splitting might be different, in case that is extremely data hunger, k-fold validation might be used, but the validation set is still guaranteed to be untouched during the specific iteration it was used for evaluating the performance during training.

Augment the Training Set: Data augmentation techniques (rotation, flipping, contrast adjustments, etc.) are applied exclusively to the training set. This is often done "on-the-fly," where a slightly different version of each training image is generated in every epoch.

Evaluate on Clean Data: The model is evaluated on the untouched, un-augmented validation and test sets. This is the final evaluation of the model, no modification on the model should be done after this.

This ensures that we are measuring the model's true ability to handle the natural variations of real-world data, not its ability to recognize a slightly modified version of an image it has already memorized. The above procedure is exactly what we have done to build our models, ensuring a rigorous evaluation throughout the investigation.

The Flaw: Data Leakage via Post-Augmentation Splitting

The flawed methodology we observed involves reversing these steps:

Augment First: Start with a small pool of original images. Create a Large, Synthetic Dataset by generating many augmented copies of each original image.

Pool and Split: Combine the original and augmented images into one large dataset, and then randomly split this mixed pool into training and test sets.

This seemingly small change completely invalidates the whole evaluation. It introduces a severe form of data leakage, where information from the training set "leaks" into the test set. For example, the original image_A.jpg may land in the training set, while a rotated_image_A.jpg lands in the test set.

When the model is evaluated on rotated_image_A.jpg, it is no longer performing a generalization task. It is performing a near-duplicate recognition task. The features it needs to learn are trivial, as its core visual features have already been exposed to model or even remembered by the model during training. The problem is transformed from a difficult medical diagnosis challenge into a simple game of matching the augmented image to its original one.

Quantifying the Impact: Our Replication Experiment

To confirm our hypothesis, we conducted a final experiment. We took our best, state-of-the-art metric learning pipeline—the exact same model that rigorously achieved a ~70% Macro F1-score on our rigorous hold-out test dataset. We then retrained it using the flawed post-augmentation splitting methodology.

The results were as serious as they were predictable. The model's performance skyrocketed to a 96.2% Macro F1-score on test dataset (namely unseen during the training), and the model learns incredibly easily, reaching nearly 90% in less than 15 epochs. The results are shown below:

Figure 1, training result with wrong data splitting methodology

Figure 2, confusion matrix of the model on test dataset

Figure 3, classification report on test dataset

This ~25% performance gap is not the result of a better model; it is the methodological error. It represents the difference between solving a real-world generalization problem and solving a trivial, self-referential recognition task. Moreover, the ~25% is not a reliable measurement of the impact brought by the wrong data-splitting method, in fact, with data leakage, nothing about the real performance can be known. Even if the model learns no features about the dataset, it may gain an over 95% performance as well, simply because they remember the few thousand images.

A Call for Rigor

While the pursuit of SOTA is a powerful driver of innovation, it must be coupled with unwavering scientific rigor. A model evaluated with flawed methodology cannot show any real performance, it is extremely toxic for practical application, especially in medical field.

For a scientific work, which champions the principles of robust and reliable engineering and results, this distinction is paramount. Our team made the conscious and critical decision to adhere to the correct, rigorous evaluation protocol. The results we present, therefore, are not inflated, but are an honest and realistic representation of our model's capabilities on this challenging problem. We believe that building trustworthy and reproducible software is the true measure of success, and we advocate for the adoption of strict validation standards as a cornerstone of all future work in this important field.

1. Introduction

1.1 Background and Motivation

1.2 Problem Statement

1.3 Objective and Scope

2. Related Work

2.1 Traditional Diagnosis Method

2.2 Previous Automated Approaches

3. Challenges in Eczema Classification

3.1 Data-Related challenges

3.2 Technical Challenges

3.3 Clinical and Practical challenges

4. Methodology

4.1 Data Collection

4.2 Data Preprocessing and Augmentation

4.3 Model Building

4.3.1 Traditional Convolutional Neural Networks Approach (CNNs)

4.3.2 Generative Data Augmentation

4.3.3 Representation Learning Approaches

4.4 Model Training

4.5 Model Evaluation

5. Experiment

5.1 CNN models

5.2 Generative models

5.2.1 FastGAN

5.2.2 StyleGAN2-ADA

5.2.3 ProjectedGAN

5.2.4 Conclusion

5.3 Metric Learning models

5.3.1 Baseline model

5.3.2 Data augmentation, learning rate scheduler, partial fine tuning

5.3.3 Batch Hard Triplet Loss Model

5.3.4 Final Model

5.3.5 Circle Loss

6. Discussion

6.1 The Significance of Rigorous Dataset Splitting

6.2 Significance of Our Work

6.3 Limitation of Our Model

6.4 Possible Extensions to This Project

7 Conclusion

Reference

The Abstract

The dossier shown on the other tab explains, circumstantially, the process of achieving the said goal. A brief abstract of such, mostly the challenges, methods and experiments, is to be summed up here, for the sake of convenience and expediency.

Challenges

Methodology

DATA AUGMENTATION

MODEL DESIGN

TRAINING AND EVALUATION

Experiments

CNN

GAN

METRIC LEARNING

CONCLUSION

This concludes the dossier, along with it's pragmatic approaches used to evince and ameliorate the model.It's related Discussion and Conclusion are located on the previous tab, named dossier.

Dermeide

On Methodological Rigor: The Critical Importance of a Pristine Test Set

The Cardinal Rule of Machine Learning: The Sanctity of the Test Set

The Flaw: Data Leakage via Post-Augmentation Splitting

Quantifying the Impact: Our Replication Experiment

A Call for Rigor

This concludes the dossier, along with it's pragmatic approaches used to evince and ameliorate the model.
It's related Discussion and Conclusion are located on the previous tab, named dossier.