new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Dec 25

Generalization vs. Specialization: Evaluating Segment Anything Model (SAM3) Zero-Shot Segmentation Against Fine-Tuned YOLO Detectors

Deep learning has advanced two fundamentally different paradigms for instance segmentation: specialized models optimized through task-specific fine-tuning and generalist foundation models capable of zero-shot segmentation. This work presents a comprehensive comparison between SAM3 (Segment Anything Model, also called SAMv3) operating in zero-shot mode and three variants of Ultralytics YOLO11 (nano, medium, and large) fine-tuned for instance segmentation. The evaluation is conducted on the MinneApple dataset, a dense benchmark comprising 670 orchard images with 28,179 annotated apple instances, enabling rigorous validation of model behavior under high object density and occlusion. Our analysis shows IoU choices can inflate performance gaps by up to 30%. At the appropriate IoU = 0.15 threshold, YOLO models achieve 68.9%, 72.2%, and 71.9% F1, while SAM3 reaches 59.8% in pure zero-shot mode. However, YOLO exhibits steep degradation 48-50 points across IoU ranges whereas SAM3 drops only 4 points, revealing 12 times superior boundary stability of SAM3. This highlights the strength of SAMv3 in mask precision versus specialization in detection completeness of YOLO11. We provide open-source code, evaluation pipelines, and methodological recommendations, contributing to a deeper understanding of when specialized fine-tuned models or generalist foundation models are preferable for dense instance segmentation tasks. This project repository is available on GitHub as https://github.com/Applied-AI-Research-Lab/Segment-Anything-Model-SAM3-Zero-Shot-Segmentation-Against-Fine-Tuned-YOLO-Detectors

  • 4 authors
·
Dec 8

Delineate Anything Flow: Fast, Country-Level Field Boundary Detection from Any Source

Accurate delineation of agricultural field boundaries from satellite imagery is essential for land management and crop monitoring, yet existing methods often produce incomplete boundaries, merge adjacent fields, and struggle to scale. We present the Delineate Anything Flow (DelAnyFlow) methodology, a resolution-agnostic approach for large-scale field boundary mapping. DelAnyFlow combines the DelAny instance segmentation model, based on a YOLOv11 backbone and trained on the large-scale Field Boundary Instance Segmentation-22M (FBIS 22M) dataset, with a structured post-processing, merging, and vectorization sequence to generate topologically consistent vector boundaries. FBIS 22M, the largest dataset of its kind, contains 672,909 multi-resolution image patches (0.25-10m) and 22.9million validated field instances. The DelAny model delivers state-of-the-art accuracy with over 100% higher mAP and 400x faster inference than SAM2. DelAny demonstrates strong zero-shot generalization and supports national-scale applications: using Sentinel 2 data for 2024, DelAnyFlow generated a complete field boundary layer for Ukraine (603,000km2) in under six hours on a single workstation. DelAnyFlow outputs significantly improve boundary completeness relative to operational products from Sinergise Solutions and NASA Harvest, particularly in smallholder and fragmented systems (0.25-1ha). For Ukraine, DelAnyFlow delineated 3.75M fields at 5m and 5.15M at 2.5m, compared to 2.66M detected by Sinergise Solutions and 1.69M by NASA Harvest. This work delivers a scalable, cost-effective methodology for field delineation in regions lacking digital cadastral data. A project landing page with links to model weights, code, national-scale vector outputs, and dataset is available at https://lavreniuk.github.io/Delineate-Anything/.

  • 7 authors
·
Nov 17

Transfer learning for galaxy feature detection: Finding Giant Star-forming Clumps in low redshift galaxies using Faster R-CNN

Giant Star-forming Clumps (GSFCs) are areas of intensive star-formation that are commonly observed in high-redshift (z>1) galaxies but their formation and role in galaxy evolution remain unclear. High-resolution observations of low-redshift clumpy galaxy analogues are rare and restricted to a limited set of galaxies but the increasing availability of wide-field galaxy survey data makes the detection of large clumpy galaxy samples increasingly feasible. Deep Learning, and in particular CNNs, have been successfully applied to image classification tasks in astrophysical data analysis. However, one application of DL that remains relatively unexplored is that of automatically identifying and localising specific objects or features in astrophysical imaging data. In this paper we demonstrate the feasibility of using Deep learning-based object detection models to localise GSFCs in astrophysical imaging data. We apply the Faster R-CNN object detection framework (FRCNN) to identify GSFCs in low redshift (z<0.3) galaxies. Unlike other studies, we train different FRCNN models not on simulated images with known labels but on real observational data that was collected by the Sloan Digital Sky Survey Legacy Survey and labelled by volunteers from the citizen science project `Galaxy Zoo: Clump Scout'. The FRCNN model relies on a CNN component as a `backbone' feature extractor. We show that CNNs, that have been pre-trained for image classification using astrophysical images, outperform those that have been pre-trained on terrestrial images. In particular, we compare a domain-specific CNN -`Zoobot' - with a generic classification backbone and find that Zoobot achieves higher detection performance and also requires smaller training data sets to do so. Our final model is capable of producing GSFC detections with a completeness and purity of >=0.8 while only being trained on ~5,000 galaxy images.

  • 11 authors
·
Dec 6, 2023

DetZero: Rethinking Offboard 3D Object Detection with Long-term Sequential Point Clouds

Existing offboard 3D detectors always follow a modular pipeline design to take advantage of unlimited sequential point clouds. We have found that the full potential of offboard 3D detectors is not explored mainly due to two reasons: (1) the onboard multi-object tracker cannot generate sufficient complete object trajectories, and (2) the motion state of objects poses an inevitable challenge for the object-centric refining stage in leveraging the long-term temporal context representation. To tackle these problems, we propose a novel paradigm of offboard 3D object detection, named DetZero. Concretely, an offline tracker coupled with a multi-frame detector is proposed to focus on the completeness of generated object tracks. An attention-mechanism refining module is proposed to strengthen contextual information interaction across long-term sequential point clouds for object refining with decomposed regression methods. Extensive experiments on Waymo Open Dataset show our DetZero outperforms all state-of-the-art onboard and offboard 3D detection methods. Notably, DetZero ranks 1st place on Waymo 3D object detection leaderboard with 85.15 mAPH (L2) detection performance. Further experiments validate the application of taking the place of human labels with such high-quality results. Our empirical study leads to rethinking conventions and interesting findings that can guide future research on offboard 3D object detection.

  • 12 authors
·
Jun 9, 2023

An Overall Real-Time Mechanism for Classification and Quality Evaluation of Rice

Rice is one of the most widely cultivated crops globally and has been developed into numerous varieties. The quality of rice during cultivation is primarily determined by its cultivar and characteristics. Traditionally, rice classification and quality assessment rely on manual visual inspection, a process that is both time-consuming and prone to errors. However, with advancements in machine vision technology, automating rice classification and quality evaluation based on its cultivar and characteristics has become increasingly feasible, enhancing both accuracy and efficiency. This study proposes a real-time evaluation mechanism for comprehensive rice grain assessment, integrating a one-stage object detection approach, a deep convolutional neural network, and traditional machine learning techniques. The proposed framework enables rice variety identification, grain completeness grading, and grain chalkiness evaluation. The rice grain dataset used in this study comprises approximately 20,000 images from six widely cultivated rice varieties in China. Experimental results demonstrate that the proposed mechanism achieves a mean average precision (mAP) of 99.14% in the object detection task and an accuracy of 97.89% in the classification task. Furthermore, the framework attains an average accuracy of 97.56% in grain completeness grading within the same rice variety, contributing to an effective quality evaluation system.

  • 6 authors
·
Feb 19

Can AI Dream of Unseen Galaxies? Conditional Diffusion Model for Galaxy Morphology Augmentation

Observational astronomy relies on visual feature identification to detect critical astrophysical phenomena. While machine learning (ML) increasingly automates this process, models often struggle with generalization in large-scale surveys due to the limited representativeness of labeled datasets -- whether from simulations or human annotation -- a challenge pronounced for rare yet scientifically valuable objects. To address this, we propose a conditional diffusion model to synthesize realistic galaxy images for augmenting ML training data. Leveraging the Galaxy Zoo 2 dataset which contains visual feature -- galaxy image pairs from volunteer annotation, we demonstrate that our model generates diverse, high-fidelity galaxy images closely adhere to the specified morphological feature conditions. Moreover, this model enables generative extrapolation to project well-annotated data into unseen domains and advancing rare object detection. Integrating synthesized images into ML pipelines improves performance in standard morphology classification, boosting completeness and purity by up to 30\% across key metrics. For rare object detection, using early-type galaxies with prominent dust lane features ( sim0.1\% in GZ2 dataset) as a test case, our approach doubled the number of detected instances from 352 to 872, compared to previous studies based on visual inspection. This study highlights the power of generative models to bridge gaps between scarce labeled data and the vast, uncharted parameter space of observational astronomy and sheds insight for future astrophysical foundation model developments. Our project homepage is available at https://galaxysd-webpage.streamlit.app/.

  • 7 authors
·
Jun 19

HMGIE: Hierarchical and Multi-Grained Inconsistency Evaluation for Vision-Language Data Cleansing

Visual-textual inconsistency (VTI) evaluation plays a crucial role in cleansing vision-language data. Its main challenges stem from the high variety of image captioning datasets, where differences in content can create a range of inconsistencies (\eg, inconsistencies in scene, entities, entity attributes, entity numbers, entity interactions). Moreover, variations in caption length can introduce inconsistencies at different levels of granularity as well. To tackle these challenges, we design an adaptive evaluation framework, called Hierarchical and Multi-Grained Inconsistency Evaluation (HMGIE), which can provide multi-grained evaluations covering both accuracy and completeness for various image-caption pairs. Specifically, the HMGIE framework is implemented by three consecutive modules. Firstly, the semantic graph generation module converts the image caption to a semantic graph for building a structural representation of all involved semantic items. Then, the hierarchical inconsistency evaluation module provides a progressive evaluation procedure with a dynamic question-answer generation and evaluation strategy guided by the semantic graph, producing a hierarchical inconsistency evaluation graph (HIEG). Finally, the quantitative evaluation module calculates the accuracy and completeness scores based on the HIEG, followed by a natural language explanation about the detection results. Moreover, to verify the efficacy and flexibility of the proposed framework on handling different image captioning datasets, we construct MVTID, an image-caption dataset with diverse types and granularities of inconsistencies. Extensive experiments on MVTID and other benchmark datasets demonstrate the superior performance of the proposed HMGIE to current state-of-the-art methods.

  • 5 authors
·
Dec 7, 2024