Background and Motivation

For consumers far from lychee-growing regions, knowledge of the different lychee varieties is often limited. Even for farmers, several varieties may be cultivated on the same mountain, making them difficult to tell apart. Compounding this, lychees spoil quickly after harvest: in some varieties the appearance remains unchanged while the flavor deteriorates rapidly.

Traditional freshness testing usually requires peeling the lychee and measuring enzyme activity or chemical indices. These processes are destructive, time-consuming, and impractical at scale, creating demand for a fast, non-destructive evaluation method.

Our motivation comes from three directions:

  1. Customer Education – Helping consumers better understand lychee quality.
  2. Industry Demand – Providing growers and distributors with reliable tools to classify and evaluate lychees.
  3. Experimental Validation – Supporting our lychee preservation research with an evaluation pipeline.
Freshly picked lychees
Figure 1: Freshly picked lychees for dataset collection.

To address this need, we propose LIHEAP: a recognition and evaluation model designed to non-destructively predict multiple attributes of lychees.

Model Objectives

We initially designed our model to predict a comprehensive set of attributes:

  • Post-harvest variation: Days since picking
  • Flavor metrics: Sugar content (°Brix), acidity (total acidity), and pH value
  • Physical characteristics: Maximum force to spin or pierce the fruit

Through communication with industry stakeholders, we learned that in practice the Brix degree and pH value are the most critical and most commonly measured indicators. This guided the refinement of our model objectives.

Additionally, from our research into fruit testing methods, we found that near-infrared (NIR) imaging can capture internal characteristics such as sugar content, due to the wavelength interaction with sugar molecules. This informed our data collection design.

RGB image of lychee sample
Figure 2: RGB images reveal external features such as texture, gloss, and color.
NIR image of lychee sample
Figure 3: NIR imaging captures internal signals that correlate with sugar and acidity.

Dataset Collection

Version 1 – Pilot Dataset

  • Varieties included: 糯米糍 (Nuomici), 妃子笑 (Feizixiao), 桂味 (Guiwei)
  • Method: Daily RGB images captured with an iPhone for 6 days after picking
  • Records: 441 samples
  • Metadata: Image, variation, days after harvest

This first version established our baseline dataset and showed us the importance of collection efficiency and standardized capture conditions.

Preview of Version 1 dataset
Figure 4: Version 1 dataset – establishing a baseline of lychee images across three variations.

Hardware and Software Upgrade

To enable large-scale dataset collection, we designed both hardware and software tools:

  • Hardware:

    • A custom-built shelf for consistent bird’s-eye view positioning of both iPhone (RGB) and NIR camera
    • A soldered 12V light band with switch and voltage boosting port for even, controlled lighting
Custom hardware setup for lychee dataset collection
Figure 5: Our self-built hardware system ensures consistent angle, distance, and lighting during data capture.
  • Software:

    • A GUI interface for dual-camera preview (RGB + NIR)
    • Crop box drawing and data tagging tools
    • Streamlined data capture and labeling workflow

Version 2 – Full Dataset

  • Varieties included: 桂味 (Guiwei), 槐枝 (Huaizhi), 金钟 (Jinzhong), 和砂 (Hesha), 糯米糍 (Nuomici)
  • Method: RGB + NIR images collected daily for 8 days post-harvest
  • Records: 1735 samples
  • Metadata: RGB image, NIR image, variation, days after harvest
  • Additional Testing: 6 fruits per variation were measured daily for Brix degree and pH value
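
For clarity, one record in this dataset can be sketched as a small data structure. The field names, file paths, and measurement values below are illustrative, not the dataset's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LycheeSample:
    """One Version 2 record (field names and values are illustrative)."""
    rgb_path: str            # path to the RGB image
    nir_path: str            # path to the paired NIR image
    variety: str             # e.g. "Guiwei", "Nuomici"
    days_after_harvest: int  # 0-7 within the 8-day collection window
    brix: Optional[float] = None  # °Brix, measured for 6 fruits per variety daily
    ph: Optional[float] = None    # pH value, same measured subset

# Hypothetical sample record
sample = LycheeSample("rgb/guiwei_d3_012.jpg", "nir/guiwei_d3_012.png",
                      "Guiwei", 3, brix=17.2, ph=3.9)
```

Most samples carry only imaging metadata; the Brix and pH fields are populated for the daily measured subset.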

We have made this dataset Open Access on Zenodo to promote industry development and further research.

Preview of Version 2 dataset
Figure 6: Preview of images captured in version 2 dataset – expanded with more varieties, NIR imaging, and paired chemical testing.

Model Development

Leveraging our comprehensive, self-collected dataset, we developed a robust pipeline for lychee variety recognition. Our approach combines multiple state-of-the-art model architectures and an intelligent pre-processing step to ensure high accuracy and usability.

Figure 7 illustrates our classification model pipeline framework.

Preview of framework
Figure 7: Preview of the model framework

Multi-Architecture Ensemble

To ensure a comprehensive evaluation and leverage the strengths of different architectural paradigms, we trained and combined three distinct models for the classification task.

  • ViT (Vision Transformer): A pure Transformer architecture that excels at capturing global relationships and long-range dependencies within an image. It treats an image as a sequence of patches, applying self-attention mechanisms to learn features.

Preview of sequences
Figure 8: Preview of the image sequences
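
The patch-sequence idea can be sketched in a few lines of NumPy, assuming the standard ViT-Base configuration of 224×224 inputs split into 16×16 patches (an illustration, not our training code):

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (N, p*p*C)
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, patch * patch * c)

img = np.zeros((224, 224, 3))
seq = patchify(img)
print(seq.shape)  # (196, 768): 14x14 patches, each flattened to 768 values
```

Each row of the result is one "token" that the Transformer's self-attention layers then relate to every other token.
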
  • MaxViT: A hybrid architecture that combines the local feature extraction power of Convolutional Neural Networks (CNNs) with the global context modeling of Transformers. This allows the model to be both efficient and effective.

  • ResNet (Residual Network): A classic and powerful pure CNN architecture. It utilizes residual connections to enable the training of very deep networks, making it excellent at learning hierarchical features from pixels up.

Preview of skip connection
Figure 9: Preview of the ResNet's core innovation: skip connections
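
The skip connection itself can be illustrated with a minimal NumPy sketch (a toy stand-in, not our trained ResNet). A residual block computes y = x + F(x), so when the learned transform F contributes nothing, the block falls back to the identity mapping:

```python
import numpy as np

def residual_block(x: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """y = x + F(x): the skip connection adds the input back onto the
    transformed signal, keeping an identity path for gradients."""
    f = np.maximum(w1 @ x, 0.0)  # linear + ReLU stands in for the conv layers
    return x + w2 @ f            # identity shortcut

x = np.array([1.0, -2.0, 0.5])
# With zero weights the block reduces to the identity mapping, which is
# why stacking many residual blocks keeps very deep networks trainable.
y = residual_block(x, np.zeros((3, 3)), np.zeros((3, 3)))
print(np.allclose(y, x))  # True
```
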

Instead of relying on a single model, we employ a soft voting ensemble method. During inference, each of the three models processes the input image and outputs a probability distribution (via softmax) across the lychee varieties. These probability vectors are then averaged to produce a final, combined prediction. This approach enhances robustness and accuracy by mitigating the biases and errors of any single model.
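
The soft-voting step can be sketched as follows; the logits are made-up numbers standing in for the three models' raw outputs over five varieties:

```python
import numpy as np

def soft_vote(logits_per_model: list) -> int:
    """Average each model's softmax distribution, return the winning class."""
    def softmax(z):
        e = np.exp(z - z.max())  # subtract max for numerical stability
        return e / e.sum()
    probs = np.mean([softmax(z) for z in logits_per_model], axis=0)
    return int(np.argmax(probs))

# Illustrative logits over 5 varieties from ViT, MaxViT, and ResNet
vit    = np.array([2.5, 0.1, 0.3, 0.1, 0.2])
maxvit = np.array([1.5, 0.2, 1.4, 0.1, 0.1])
resnet = np.array([0.3, 0.2, 2.1, 0.1, 0.2])
print(soft_vote([vit, maxvit, resnet]))
```

Averaging probabilities (rather than taking a majority of hard labels) lets a confident model outweigh two uncertain ones, which is what mitigates single-model bias.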

Intelligent Inference Pipeline

To make our model practical and user-friendly, we designed an inference pipeline that automates the most difficult parts of image preparation. The demo video below shows the pipeline in action.

  1. User Input & Segmentation: The process begins in a simple graphical user interface (GUI) where the user can upload an image containing lychees. The user selects the lychee of interest by simply clicking on it. This coordinate is then fed as a prompt to a Segment Anything Model (SAM), which precisely isolates the lychee from its background, creating an accurate object mask.

  2. Image Standardization: The segmented lychee is then placed onto a standardized, neutral background. This crucial step removes distracting background elements and normalizes the input, ensuring the model focuses solely on the fruit's characteristics (color, texture, shape) for classification.

  3. Ensemble Classification: Finally, the standardized image is passed to our ViT, MaxViT, and ResNet ensemble. The soft voting mechanism calculates the final probabilities, and the variety with the highest confidence score is presented to the user as the result.
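
Step 2 can be sketched with NumPy, assuming a boolean object mask like the one SAM returns; the neutral background value here is an arbitrary choice for illustration:

```python
import numpy as np

def standardize(image: np.ndarray, mask: np.ndarray,
                bg_value: int = 230) -> np.ndarray:
    """Paste the masked lychee onto a uniform neutral background,
    cropped to the object's bounding box."""
    ys, xs = np.nonzero(mask)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    crop = image[y0:y1, x0:x1]
    m = mask[y0:y1, x0:x1, None]          # broadcast mask over channels
    canvas = np.full_like(crop, bg_value)  # neutral background
    return np.where(m, crop, canvas)

# Toy example: a 2x2 "fruit" inside a 6x6 image
img = np.zeros((6, 6, 3), dtype=np.uint8)
img[2:4, 2:4] = 120
mask = np.zeros((6, 6), dtype=bool)
mask[2:4, 2:4] = True
out = standardize(img, mask)
print(out.shape)  # (2, 2, 3)
```

The classifier then sees only the fruit's own color, texture, and shape, regardless of where or against what background the photo was taken.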

Preview of classification
Figure 10: Preview of the classification results generated by our recognition program

This pipeline transforms a complex computer vision task into a simple, interactive process, making our technology accessible to farmers, distributors, and consumers alike.

Conclusion

Through iterative dataset collection, hardware/software development, and industry communication, we have built a solid foundation for lychee recognition and evaluation. By making our dataset open access, we not only support our own project goals but also contribute to broader industry and academic progress. Our multi-architecture ensemble model and intelligent inference pipeline provide a powerful and accessible tool for the lychee industry.

Next Step: Integrate dataset with LIHEAP model development results and finalize the evaluation of the model's performance on predicting flavor metrics like Brix and pH.
