Get trending papers in your email inbox once a day!
Get trending papers in your email inbox!
SubscribeMatching domain experts by training from scratch on domain knowledge
Recently, large language models (LLMs) have outperformed human experts in predicting the results of neuroscience experiments (Luo et al., 2024). What is the basis for this performance? One possibility is that statistical patterns in that specific scientific literature, as opposed to emergent reasoning abilities arising from broader training, underlie LLMs' performance. To evaluate this possibility, we trained (next word prediction) a relatively small 124M-parameter GPT-2 model on 1.3 billion tokens of domain-specific knowledge. Despite being orders of magnitude smaller than larger LLMs trained on trillions of tokens, small models achieved expert-level performance in predicting neuroscience results. Small models trained on the neuroscience literature succeeded when they were trained from scratch using a tokenizer specifically trained on neuroscience text or when the neuroscience literature was used to finetune a pretrained GPT-2. Our results indicate that expert-level performance may be attained by even small LLMs through domain-specific, auto-regressive training approaches.
Image Representations Learned With Unsupervised Pre-Training Contain Human-like Biases
Recent advances in machine learning leverage massive datasets of unlabeled images from the web to learn general-purpose image representations for tasks from image classification to face recognition. But do unsupervised computer vision models automatically learn implicit patterns and embed social biases that could have harmful downstream effects? We develop a novel method for quantifying biased associations between representations of social concepts and attributes in images. We find that state-of-the-art unsupervised models trained on ImageNet, a popular benchmark image dataset curated from internet images, automatically learn racial, gender, and intersectional biases. We replicate 8 previously documented human biases from social psychology, from the innocuous, as with insects and flowers, to the potentially harmful, as with race and gender. Our results closely match three hypotheses about intersectional bias from social psychology. For the first time in unsupervised computer vision, we also quantify implicit human biases about weight, disabilities, and several ethnicities. When compared with statistical patterns in online image datasets, our findings suggest that machine learning models can automatically learn bias from the way people are stereotypically portrayed on the web.
Exposing Numeracy Gaps: A Benchmark to Evaluate Fundamental Numerical Abilities in Large Language Models
Large Language Models (LLMs) have demonstrated impressive capabilities in natural language processing tasks, such as text generation and semantic understanding. However, their performance on numerical reasoning tasks, such as basic arithmetic, numerical retrieval, and magnitude comparison, remains surprisingly poor. This gap arises from their reliance on surface-level statistical patterns rather than understanding numbers as continuous magnitudes. Existing benchmarks primarily focus on either linguistic competence or structured mathematical problem-solving, neglecting fundamental numerical reasoning required in real-world scenarios. To bridge this gap, we propose NumericBench, a comprehensive benchmark to evaluate six fundamental numerical capabilities: number recognition, arithmetic operations, contextual retrieval, comparison, summary, and logical reasoning. NumericBench includes datasets ranging from synthetic number lists to the crawled real-world data, addressing challenges like long contexts, noise, and multi-step reasoning. Extensive experiments on state-of-the-art LLMs, including GPT-4 and DeepSeek, reveal persistent weaknesses in numerical reasoning, highlighting the urgent need to improve numerically-aware language modeling. The benchmark is released in: https://github.com/TreeAI-Lab/NumericBench.
Sources of Hallucination by Large Language Models on Inference Tasks
Large Language Models (LLMs) are claimed to be capable of Natural Language Inference (NLI), necessary for applied tasks like question answering and summarization. We present a series of behavioral studies on several LLM families (LLaMA, GPT-3.5, and PaLM) which probe their behavior using controlled experiments. We establish two biases originating from pretraining which predict much of their behavior, and show that these are major sources of hallucination in generative LLMs. First, memorization at the level of sentences: we show that, regardless of the premise, models falsely label NLI test samples as entailing when the hypothesis is attested in training data, and that entities are used as ``indices'' to access the memorized data. Second, statistical patterns of usage learned at the level of corpora: we further show a similar effect when the premise predicate is less frequent than that of the hypothesis in the training data, a bias following from previous studies. We demonstrate that LLMs perform significantly worse on NLI test samples which do not conform to these biases than those which do, and we offer these as valuable controls for future LLM evaluation.
Should We Fear Large Language Models? A Structural Analysis of the Human Reasoning System for Elucidating LLM Capabilities and Risks Through the Lens of Heidegger's Philosophy
In the rapidly evolving field of Large Language Models (LLMs), there is a critical need to thoroughly analyze their capabilities and risks. Central to our investigation are two novel elements. Firstly, it is the innovative parallels between the statistical patterns of word relationships within LLMs and Martin Heidegger's concepts of "ready-to-hand" and "present-at-hand," which encapsulate the utilitarian and scientific altitudes humans employ in interacting with the world. This comparison lays the groundwork for positioning LLMs as the digital counterpart to the Faculty of Verbal Knowledge, shedding light on their capacity to emulate certain facets of human reasoning. Secondly, a structural analysis of human reasoning, viewed through Heidegger's notion of truth as "unconcealment" is conducted This foundational principle enables us to map out the inputs and outputs of the reasoning system and divide reasoning into four distinct categories. Respective cognitive faculties are delineated, allowing us to place LLMs within the broader schema of human reasoning, thus clarifying their strengths and inherent limitations. Our findings reveal that while LLMs possess the capability for Direct Explicative Reasoning and Pseudo Rational Reasoning, they fall short in authentic rational reasoning and have no creative reasoning capabilities, due to the current lack of many analogous AI models such as the Faculty of Judgement. The potential and risks of LLMs when they are augmented with other AI technologies are also evaluated. The results indicate that although LLMs have achieved proficiency in some reasoning abilities, the aspiration to match or exceed human intellectual capabilities is yet unattained. This research not only enriches our comprehension of LLMs but also propels forward the discourse on AI's potential and its bounds, paving the way for future explorations into AI's evolving landscape.
Beyond Benchmarks: On The False Promise of AI Regulation
The rapid advancement of artificial intelligence (AI) systems in critical domains like healthcare, justice, and social services has sparked numerous regulatory initiatives aimed at ensuring their safe deployment. Current regulatory frameworks, exemplified by recent US and EU efforts, primarily focus on procedural guidelines while presuming that scientific benchmarking can effectively validate AI safety, similar to how crash tests verify vehicle safety or clinical trials validate drug efficacy. However, this approach fundamentally misunderstands the unique technical challenges posed by modern AI systems. Through systematic analysis of successful technology regulation case studies, we demonstrate that effective scientific regulation requires a causal theory linking observable test outcomes to future performance - for instance, how a vehicle's crash resistance at one speed predicts its safety at lower speeds. We show that deep learning models, which learn complex statistical patterns from training data without explicit causal mechanisms, preclude such guarantees. This limitation renders traditional regulatory approaches inadequate for ensuring AI safety. Moving forward, we call for regulators to reckon with this limitation, and propose a preliminary two-tiered regulatory framework that acknowledges these constraints: mandating human oversight for high-risk applications while developing appropriate risk communication strategies for lower-risk uses. Our findings highlight the urgent need to reconsider fundamental assumptions in AI regulation and suggest a concrete path forward for policymakers and researchers.
AStF: Motion Style Transfer via Adaptive Statistics Fusor
Human motion style transfer allows characters to appear less rigidity and more realism with specific style. Traditional arbitrary image style transfer typically process mean and variance which is proved effective. Meanwhile, similar methods have been adapted for motion style transfer. However, due to the fundamental differences between images and motion, relying on mean and variance is insufficient to fully capture the complex dynamic patterns and spatiotemporal coherence properties of motion data. Building upon this, our key insight is to bring two more coefficient, skewness and kurtosis, into the analysis of motion style. Specifically, we propose a novel Adaptive Statistics Fusor (AStF) which consists of Style Disentanglement Module (SDM) and High-Order Multi-Statistics Attention (HOS-Attn). We trained our AStF in conjunction with a Motion Consistency Regularization (MCR) discriminator. Experimental results show that, by providing a more comprehensive model of the spatiotemporal statistical patterns inherent in dynamic styles, our proposed AStF shows proficiency superiority in motion style transfers over state-of-the-arts. Our code and model are available at https://github.com/CHMimilanlan/AStF.
Training-free Truthfulness Detection via Value Vectors in LLMs
Large language models often generate factually incorrect outputs, motivating efforts to detect the truthfulness of their content. Most existing approaches rely on training probes over internal activations, but these methods suffer from scalability and generalization issues. A recent training-free method, NoVo, addresses this challenge by exploiting statistical patterns from the model itself. However, it focuses exclusively on attention mechanisms, potentially overlooking the MLP module-a core component of Transformer models known to support factual recall. In this paper, we show that certain value vectors within MLP modules exhibit truthfulness-related statistical patterns. Building on this insight, we propose TruthV, a simple and interpretable training-free method that detects content truthfulness by leveraging these value vectors. On the NoVo benchmark, TruthV significantly outperforms both NoVo and log-likelihood baselines, demonstrating that MLP modules-despite being neglected in prior training-free efforts-encode rich and useful signals for truthfulness detection. These findings offer new insights into how truthfulness is internally represented in LLMs and motivate further research on scalable and interpretable truthfulness detection.
TimeMKG: Knowledge-Infused Causal Reasoning for Multivariate Time Series Modeling
Multivariate time series data typically comprises two distinct modalities: variable semantics and sampled numerical observations. Traditional time series models treat variables as anonymous statistical signals, overlooking the rich semantic information embedded in variable names and data descriptions. However, these textual descriptors often encode critical domain knowledge that is essential for robust and interpretable modeling. Here we present TimeMKG, a multimodal causal reasoning framework that elevates time series modeling from low-level signal processing to knowledge informed inference. TimeMKG employs large language models to interpret variable semantics and constructs structured Multivariate Knowledge Graphs that capture inter-variable relationships. A dual-modality encoder separately models the semantic prompts, generated from knowledge graph triplets, and the statistical patterns from historical time series. Cross-modality attention aligns and fuses these representations at the variable level, injecting causal priors into downstream tasks such as forecasting and classification, providing explicit and interpretable priors to guide model reasoning. The experiment in diverse datasets demonstrates that incorporating variable-level knowledge significantly improves both predictive performance and generalization.
Large Language Models: The Need for Nuance in Current Debates and a Pragmatic Perspective on Understanding
Current Large Language Models (LLMs) are unparalleled in their ability to generate grammatically correct, fluent text. LLMs are appearing rapidly, and debates on LLM capacities have taken off, but reflection is lagging behind. Thus, in this position paper, we first zoom in on the debate and critically assess three points recurring in critiques of LLM capacities: i) that LLMs only parrot statistical patterns in the training data; ii) that LLMs master formal but not functional language competence; and iii) that language learning in LLMs cannot inform human language learning. Drawing on empirical and theoretical arguments, we show that these points need more nuance. Second, we outline a pragmatic perspective on the issue of `real' understanding and intentionality in LLMs. Understanding and intentionality pertain to unobservable mental states we attribute to other humans because they have pragmatic value: they allow us to abstract away from complex underlying mechanics and predict behaviour effectively. We reflect on the circumstances under which it would make sense for humans to similarly attribute mental states to LLMs, thereby outlining a pragmatic philosophical context for LLMs as an increasingly prominent technology in society.
Sigma: A dataset for text-to-code semantic parsing with statistical analysis
In the domain of semantic parsing, significant progress has been achieved in Text-to-SQL and question-answering tasks, both of which focus on extracting information from data sources in their native formats. However, the inherent constraints of their formal meaning representations, such as SQL programming language or basic logical forms, hinder their ability to analyze data from various perspectives, such as conducting statistical analyses. To address this limitation and inspire research in this field, we design SIGMA, a new dataset for Text-to-Code semantic parsing with statistical analysis. SIGMA comprises 6000 questions with corresponding Python code labels, spanning across 160 databases. Half of the questions involve query types, which return information in its original format, while the remaining 50% are statistical analysis questions, which perform statistical operations on the data. The Python code labels in our dataset cover 4 types of query types and 40 types of statistical analysis patterns. We evaluated the SIGMA dataset using three different baseline models: LGESQL, SmBoP, and SLSQL. The experimental results show that the LGESQL model with ELECTRA outperforms all other models, achieving 83.37% structure accuracy. In terms of execution accuracy, the SmBoP model, when combined with GraPPa and T5, reaches 76.38%.
Causal Discovery in Astrophysics: Unraveling Supermassive Black Hole and Galaxy Coevolution
Correlation does not imply causation, but patterns of statistical association between variables can be exploited to infer a causal structure (even with purely observational data) with the burgeoning field of causal discovery. As a purely observational science, astrophysics has much to gain by exploiting these new methods. The supermassive black hole (SMBH)--galaxy interaction has long been constrained by observed scaling relations, that is low-scatter correlations between variables such as SMBH mass and the central velocity dispersion of stars in a host galaxy's bulge. This study, using advanced causal discovery techniques and an up-to-date dataset, reveals a causal link between galaxy properties and dynamically-measured SMBH masses. We apply a score-based Bayesian framework to compute the exact conditional probabilities of every causal structure that could possibly describe our galaxy sample. With the exact posterior distribution, we determine the most likely causal structures and notice a probable causal reversal when separating galaxies by morphology. In elliptical galaxies, bulge properties (built from major mergers) tend to influence SMBH growth, while in spiral galaxies, SMBHs are seen to affect host galaxy properties, potentially through feedback in gas-rich environments. For spiral galaxies, SMBHs progressively quench star formation, whereas in elliptical galaxies, quenching is complete, and the causal connection has reversed. Our findings support theoretical models of hierarchical assembly of galaxies and active galactic nuclei feedback regulating galaxy evolution. Our study suggests the potentiality for further exploration of causal links in astrophysical and cosmological scaling relations, as well as any other observational science.
Large Reasoning Embedding Models: Towards Next-Generation Dense Retrieval Paradigm
In modern e-commerce search systems, dense retrieval has become an indispensable component. By computing similarities between query and item (product) embeddings, it efficiently selects candidate products from large-scale repositories. With the breakthroughs in large language models (LLMs), mainstream embedding models have gradually shifted from BERT to LLMs for more accurate text modeling. However, these models still adopt direct-embedding methods, and the semantic accuracy of embeddings remains inadequate. Therefore, contrastive learning is heavily employed to achieve tight semantic alignment between positive pairs. Consequently, such models tend to capture statistical co-occurrence patterns in the training data, biasing them toward shallow lexical and semantic matches. For difficult queries exhibiting notable lexical disparity from target items, the performance degrades significantly. In this work, we propose the Large Reasoning Embedding Model (LREM), which novelly integrates reasoning processes into representation learning. For difficult queries, LREM first conducts reasoning to achieve a deep understanding of the original query, and then produces a reasoning-augmented query embedding for retrieval. This reasoning process effectively bridges the semantic gap between original queries and target items, significantly improving retrieval accuracy. Specifically, we adopt a two-stage training process: the first stage optimizes the LLM on carefully curated Query-CoT-Item triplets with SFT and InfoNCE losses to establish preliminary reasoning and embedding capabilities, and the second stage further refines the reasoning trajectories via reinforcement learning (RL). Extensive offline and online experiments validate the effectiveness of LREM, leading to its deployment on China's largest e-commerce platform since August 2025.
Unveiling the Truth: Exploring Human Gaze Patterns in Fake Images
Creating high-quality and realistic images is now possible thanks to the impressive advancements in image generation. A description in natural language of your desired output is all you need to obtain breathtaking results. However, as the use of generative models grows, so do concerns about the propagation of malicious content and misinformation. Consequently, the research community is actively working on the development of novel fake detection techniques, primarily focusing on low-level features and possible fingerprints left by generative models during the image generation process. In a different vein, in our work, we leverage human semantic knowledge to investigate the possibility of being included in frameworks of fake image detection. To achieve this, we collect a novel dataset of partially manipulated images using diffusion models and conduct an eye-tracking experiment to record the eye movements of different observers while viewing real and fake stimuli. A preliminary statistical analysis is conducted to explore the distinctive patterns in how humans perceive genuine and altered images. Statistical findings reveal that, when perceiving counterfeit samples, humans tend to focus on more confined regions of the image, in contrast to the more dispersed observational pattern observed when viewing genuine images. Our dataset is publicly available at: https://github.com/aimagelab/unveiling-the-truth.
Uncovering delayed patterns in noisy and irregularly sampled time series: an astronomy application
We study the problem of estimating the time delay between two signals representing delayed, irregularly sampled and noisy versions of the same underlying pattern. We propose and demonstrate an evolutionary algorithm for the (hyper)parameter estimation of a kernel-based technique in the context of an astronomical problem, namely estimating the time delay between two gravitationally lensed signals from a distant quasar. Mixed types (integer and real) are used to represent variables within the evolutionary algorithm. We test the algorithm on several artificial data sets, and also on real astronomical observations of quasar Q0957+561. By carrying out a statistical analysis of the results we present a detailed comparison of our method with the most popular methods for time delay estimation in astrophysics. Our method yields more accurate and more stable time delay estimates: for Q0957+561, we obtain 419.6 days for the time delay between images A and B. Our methodology can be readily applied to current state-of-the-art optical monitoring data in astronomy, but can also be applied in other disciplines involving similar time series data.
Unified Software Design Patterns for Simulated Annealing
Any optimization algorithm programming interface can be seen as a black-box function with additional free parameters. In this spirit, simulated annealing (SA) can be implemented in pseudo-code within the dimensions of a single slide with free parameters relating to the annealing schedule. Such an implementation, however, necessarily neglects much of the structure necessary to take advantage of advances in computing resources and algorithmic breakthroughs. Simulated annealing is often introduced in myriad disciplines, from discrete examples like the Traveling Salesman Problem (TSP) to molecular cluster potential energy exploration or even explorations of a protein's configurational space. Theoretical guarantees also demand a stricter structure in terms of statistical quantities, which cannot simply be left to the user. We will introduce several standard paradigms and demonstrate how these can be "lifted" into a unified framework using object-oriented programming in Python. We demonstrate how clean, interoperable, reproducible programming libraries can be used to access and rapidly iterate on variants of Simulated Annealing in a manner which can be extended to serve as a best practices blueprint or design pattern for a data-driven optimization library.
Innovative Cybersickness Detection: Exploring Head Movement Patterns in Virtual Reality
Despite the widespread adoption of Virtual Reality (VR) technology, cybersickness remains a barrier for some users. This research investigates head movement patterns as a novel physiological marker for cybersickness detection. Unlike traditional markers, head movements provide a continuous, non-invasive measure that can be easily captured through the sensors embedded in all commercial VR headsets. We used a publicly available dataset from a VR experiment involving 75 participants and analyzed head movements across six axes. An extensive feature extraction process was then performed on the head movement dataset and its derivatives, including velocity, acceleration, and jerk. Three categories of features were extracted, encompassing statistical, temporal, and spectral features. Subsequently, we employed the Recursive Feature Elimination method to select the most important and effective features. In a series of experiments, we trained a variety of machine learning algorithms. The results demonstrate a 76% accuracy and 83% precision in predicting cybersickness in the subjects based on the head movements. This study contribution to the cybersickness literature lies in offering a preliminary analysis of a new source of data and providing insight into the relationship of head movements and cybersickness.
From time-series to complex networks: Application to the cerebrovascular flow patterns in atrial fibrillation
A network-based approach is presented to investigate the cerebrovascular flow patterns during atrial fibrillation (AF) with respect to normal sinus rhythm (NSR). AF, the most common cardiac arrhythmia with faster and irregular beating, has been recently and independently associated with the increased risk of dementia. However, the underlying hemodynamic mechanisms relating the two pathologies remain mainly undetermined so far; thus the contribution of modeling and refined statistical tools is valuable. Pressure and flow rate temporal series in NSR and AF are here evaluated along representative cerebral sites (from carotid arteries to capillary brain circulation), exploiting reliable artificially built signals recently obtained from an in silico approach. The complex network analysis evidences, in a synthetic and original way, a dramatic signal variation towards the distal/capillary cerebral regions during AF, which has no counterpart in NSR conditions. At the large artery level, networks obtained from both AF and NSR hemodynamic signals exhibit elongated and chained features, which are typical of pseudo-periodic series. These aspects are almost completely lost towards the microcirculation during AF, where the networks are topologically more circular and present random-like characteristics. As a consequence, all the physiological phenomena at microcerebral level ruled by periodicity - such as regular perfusion, mean pressure per beat, and average nutrient supply at cellular level - can be strongly compromised, since the AF hemodynamic signals assume irregular behaviour and random-like features. Through a powerful approach which is complementary to the classical statistical tools, the present findings further strengthen the potential link between AF hemodynamic and cognitive decline.
Harnessing Deep Q-Learning for Enhanced Statistical Arbitrage in High-Frequency Trading: A Comprehensive Exploration
The realm of High-Frequency Trading (HFT) is characterized by rapid decision-making processes that capitalize on fleeting market inefficiencies. As the financial markets become increasingly competitive, there is a pressing need for innovative strategies that can adapt and evolve with changing market dynamics. Enter Reinforcement Learning (RL), a branch of machine learning where agents learn by interacting with their environment, making it an intriguing candidate for HFT applications. This paper dives deep into the integration of RL in statistical arbitrage strategies tailored for HFT scenarios. By leveraging the adaptive learning capabilities of RL, we explore its potential to unearth patterns and devise trading strategies that traditional methods might overlook. We delve into the intricate exploration-exploitation trade-offs inherent in RL and how they manifest in the volatile world of HFT. Furthermore, we confront the challenges of applying RL in non-stationary environments, typical of financial markets, and investigate methodologies to mitigate associated risks. Through extensive simulations and backtests, our research reveals that RL not only enhances the adaptability of trading strategies but also shows promise in improving profitability metrics and risk-adjusted returns. This paper, therefore, positions RL as a pivotal tool for the next generation of HFT-based statistical arbitrage, offering insights for both researchers and practitioners in the field.
BlendX: Complex Multi-Intent Detection with Blended Patterns
Task-oriented dialogue (TOD) systems are commonly designed with the presumption that each utterance represents a single intent. However, this assumption may not accurately reflect real-world situations, where users frequently express multiple intents within a single utterance. While there is an emerging interest in multi-intent detection (MID), existing in-domain datasets such as MixATIS and MixSNIPS have limitations in their formulation. To address these issues, we present BlendX, a suite of refined datasets featuring more diverse patterns than their predecessors, elevating both its complexity and diversity. For dataset construction, we utilize both rule-based heuristics as well as a generative tool -- OpenAI's ChatGPT -- which is augmented with a similarity-driven strategy for utterance selection. To ensure the quality of the proposed datasets, we also introduce three novel metrics that assess the statistical properties of an utterance related to word count, conjunction use, and pronoun usage. Extensive experiments on BlendX reveal that state-of-the-art MID models struggle with the challenges posed by the new datasets, highlighting the need to reexamine the current state of the MID field. The dataset is available at https://github.com/HYU-NLP/BlendX.
GarmentCodeData: A Dataset of 3D Made-to-Measure Garments With Sewing Patterns
Recent research interest in the learning-based processing of garments, from virtual fitting to generation and reconstruction, stumbles on a scarcity of high-quality public data in the domain. We contribute to resolving this need by presenting the first large-scale synthetic dataset of 3D made-to-measure garments with sewing patterns, as well as its generation pipeline. GarmentCodeData contains 115,000 data points that cover a variety of designs in many common garment categories: tops, shirts, dresses, jumpsuits, skirts, pants, etc., fitted to a variety of body shapes sampled from a custom statistical body model based on CAESAR, as well as a standard reference body shape, applying three different textile materials. To enable the creation of datasets of such complexity, we introduce a set of algorithms for automatically taking tailor's measures on sampled body shapes, sampling strategies for sewing pattern design, and propose an automatic, open-source 3D garment draping pipeline based on a fast XPBD simulator, while contributing several solutions for collision resolution and drape correctness to enable scalability. Project Page: https://igl.ethz.ch/projects/GarmentCodeData/
PhiloBERTA: A Transformer-Based Cross-Lingual Analysis of Greek and Latin Lexicons
We present PhiloBERTA, a cross-lingual transformer model that measures semantic relationships between ancient Greek and Latin lexicons. Through analysis of selected term pairs from classical texts, we use contextual embeddings and angular similarity metrics to identify precise semantic alignments. Our results show that etymologically related pairs demonstrate significantly higher similarity scores, particularly for abstract philosophical concepts such as epist\=em\=e (scientia) and dikaiosyn\=e (iustitia). Statistical analysis reveals consistent patterns in these relationships (p = 0.012), with etymologically related pairs showing remarkably stable semantic preservation compared to control pairs. These findings establish a quantitative framework for examining how philosophical concepts moved between Greek and Latin traditions, offering new methods for classical philological research.
Sharing emotions at scale: The Vent dataset
The continuous and increasing use of social media has enabled the expression of human thoughts, opinions, and everyday actions publicly at an unprecedented scale. We present the Vent dataset, the largest annotated dataset of text, emotions, and social connections to date. It comprises more than 33 millions of posts by nearly a million of users together with their social connections. Each post has an associated emotion. There are 705 different emotions, organized in 63 "emotion categories", forming a two-level taxonomy of affects. Our initial statistical analysis describes the global patterns of activity in the Vent platform, revealing large heterogenities and certain remarkable regularities regarding the use of the different emotions. We focus on the aggregated use of emotions, the temporal activity, and the social network of users, and outline possible methods to infer emotion networks based on the user activity. We also analyze the text and describe the affective landscape of Vent, finding agreements with existing (small scale) annotated corpus in terms of emotion categories and positive/negative valences. Finally, we discuss possible research questions that can be addressed from this unique dataset.
Is this Dialogue Coherent? Learning from Dialogue Acts and Entities
In this work, we investigate the human perception of coherence in open-domain dialogues. In particular, we address the problem of annotating and modeling the coherence of next-turn candidates while considering the entire history of the dialogue. First, we create the Switchboard Coherence (SWBD-Coh) corpus, a dataset of human-human spoken dialogues annotated with turn coherence ratings, where next-turn candidate utterances ratings are provided considering the full dialogue context. Our statistical analysis of the corpus indicates how turn coherence perception is affected by patterns of distribution of entities previously introduced and the Dialogue Acts used. Second, we experiment with different architectures to model entities, Dialogue Acts and their combination and evaluate their performance in predicting human coherence ratings on SWBD-Coh. We find that models combining both DA and entity information yield the best performances both for response selection and turn coherence rating.
Towards Reliable Neural Specifications
Having reliable specifications is an unavoidable challenge in achieving verifiable correctness, robustness, and interpretability of AI systems. Existing specifications for neural networks are in the paradigm of data as specification. That is, the local neighborhood centering around a reference input is considered to be correct (or robust). While existing specifications contribute to verifying adversarial robustness, a significant problem in many research domains, our empirical study shows that those verified regions are somewhat tight, and thus fail to allow verification of test set inputs, making them impractical for some real-world applications. To this end, we propose a new family of specifications called neural representation as specification, which uses the intrinsic information of neural networks - neural activation patterns (NAPs), rather than input data to specify the correctness and/or robustness of neural network predictions. We present a simple statistical approach to mining neural activation patterns. To show the effectiveness of discovered NAPs, we formally verify several important properties, such as various types of misclassifications will never happen for a given NAP, and there is no ambiguity between different NAPs. We show that by using NAP, we can verify a significant region of the input space, while still recalling 84% of the data on MNIST. Moreover, we can push the verifiable bound to 10 times larger on the CIFAR10 benchmark. Thus, we argue that NAPs can potentially be used as a more reliable and extensible specification for neural network verification.
Hugging Rain Man: A Novel Facial Action Units Dataset for Analyzing Atypical Facial Expressions in Children with Autism Spectrum Disorder
Children with Autism Spectrum Disorder (ASD) often exhibit atypical facial expressions. However, the specific objective facial features that underlie this subjective perception remain unclear. In this paper, we introduce a novel dataset, Hugging Rain Man (HRM), which includes facial action units (AUs) manually annotated by FACS experts for both children with ASD and typical development (TD). The dataset comprises a rich collection of posed and spontaneous facial expressions, totaling approximately 130,000 frames, along with 22 AUs, 10 Action Descriptors (ADs), and atypicality ratings. A statistical analysis of static images from the HRM reveals significant differences between the ASD and TD groups across multiple AUs and ADs when displaying the same emotional expressions, confirming that participants with ASD tend to demonstrate more irregular and diverse expression patterns. Subsequently, a temporal regression method was presented to analyze atypicality of dynamic sequences, thereby bridging the gap between subjective perception and objective facial characteristics. Furthermore, baseline results for AU detection are provided for future research reference. This work not only contributes to our understanding of the unique facial expression characteristics associated with ASD but also provides potential tools for ASD early screening. Portions of the dataset, features, and pretrained models are accessible at: https://github.com/Jonas-DL/Hugging-Rain-Man.
Parallel Learning by Multitasking Neural Networks
A modern challenge of Artificial Intelligence is learning multiple patterns at once (i.e.parallel learning). While this can not be accomplished by standard Hebbian associative neural networks, in this paper we show how the Multitasking Hebbian Network (a variation on theme of the Hopfield model working on sparse data-sets) is naturally able to perform this complex task. We focus on systems processing in parallel a finite (up to logarithmic growth in the size of the network) amount of patterns, mirroring the low-storage level of standard associative neural networks at work with pattern recognition. For mild dilution in the patterns, the network handles them hierarchically, distributing the amplitudes of their signals as power-laws w.r.t. their information content (hierarchical regime), while, for strong dilution, all the signals pertaining to all the patterns are raised with the same strength (parallel regime). Further, confined to the low-storage setting (i.e., far from the spin glass limit), the presence of a teacher neither alters the multitasking performances nor changes the thresholds for learning: the latter are the same whatever the training protocol is supervised or unsupervised. Results obtained through statistical mechanics, signal-to-noise technique and Monte Carlo simulations are overall in perfect agreement and carry interesting insights on multiple learning at once: for instance, whenever the cost-function of the model is minimized in parallel on several patterns (in its description via Statistical Mechanics), the same happens to the standard sum-squared error Loss function (typically used in Machine Learning).
Measuring the Stability of EHR- and EKG-based Predictive Models
Databases of electronic health records (EHRs) are increasingly used to inform clinical decisions. Machine learning methods can find patterns in EHRs that are predictive of future adverse outcomes. However, statistical models may be built upon patterns of health-seeking behavior that vary across patient subpopulations, leading to poor predictive performance when training on one patient population and predicting on another. This note proposes two tests to better measure and understand model generalization. We use these tests to compare models derived from two data sources: (i) historical medical records, and (ii) electrocardiogram (EKG) waveforms. In a predictive task, we show that EKG-based models can be more stable than EHR-based models across different patient populations.
Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse
Chain-of-thought (CoT) prompting has become a widely used strategy for working with large language and multimodal models. While CoT has been shown to improve performance across many tasks, determining the settings in which it is effective remains an ongoing effort. In particular, it is still an open question in what settings CoT systematically reduces model performance. In this paper, we seek to identify the characteristics of tasks where CoT reduces performance by drawing inspiration from cognitive psychology, looking at cases where (i) verbal thinking or deliberation hurts performance in humans, and (ii) the constraints governing human performance generalize to language models. Three such cases are implicit statistical learning, visual recognition, and classifying with patterns containing exceptions. In extensive experiments across all three settings, we find that a diverse collection of state-of-the-art models exhibit significant drop-offs in performance (e.g., up to 36.3% absolute accuracy for OpenAI o1-preview compared to GPT-4o) when using inference-time reasoning compared to zero-shot counterparts. We also identify three tasks that satisfy condition (i) but not (ii), and find that while verbal thinking reduces human performance in these tasks, CoT retains or increases model performance. Overall, our results show that while there is not an exact parallel between the cognitive processes of models and those of humans, considering cases where thinking has negative consequences for human performance can help us identify settings where it negatively impacts models. By connecting the literature on human deliberation with evaluations of CoT, we offer a new tool that can be used in understanding the impact of prompt choices and inference-time reasoning.
PECCARY: A novel approach for characterizing orbital complexity, stochasticity, and regularity
Permutation Entropy and statistiCal Complexity Analysis for astRophYsics (PECCARY) is a computationally inexpensive, statistical method by which any time-series can be characterized as predominantly regular, complex, or stochastic. Elements of the PECCARY method have been used in a variety of physical, biological, economic, and mathematical scenarios, but have not yet gained traction in the astrophysical community. This study introduces the PECCARY technique with the specific aims to motivate its use in and optimize it for the analysis of astrophysical orbital systems. PECCARY works by decomposing a time-dependent measure, such as the x-coordinate or orbital angular momentum time-series, into ordinal patterns. Due to its unique approach and statistical nature, PECCARY is well-suited for detecting preferred and forbidden patterns (a signature of chaos), even when the chaotic behavior is short-lived or when working with a relatively short duration time-series or small sets of time-series data. A variety of examples are used to demonstrate the capabilities of PECCARY. These include mathematical examples (sine waves, varieties of noise, sums of sine waves, well-known chaotic functions), a double pendulum system, and astrophysical tracer particle simulations with potentials of varying intricacies. Since the adopted timescale used to diagnose a given time-series can affect the outcome, a method is presented to identify an ideal sampling scheme, constrained by the overall duration and the natural timescale of the system. The accompanying PECCARY Python package and its usage are discussed.
Facial age estimation using BSIF and LBP
Human face aging is irreversible process causing changes in human face characteristics such us hair whitening, muscles drop and wrinkles. Due to the importance of human face aging in biometrics systems, age estimation became an attractive area for researchers. This paper presents a novel method to estimate the age from face images, using binarized statistical image features (BSIF) and local binary patterns (LBP)histograms as features performed by support vector regression (SVR) and kernel ridge regression (KRR). We applied our method on FG-NET and PAL datasets. Our proposed method has shown superiority to that of the state-of-the-art methods when using the whole PAL database.
LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization
Large reasoning models have achieved remarkable performance through extended chain-of-thought sequences, yet this computational freedom leads to excessive token generation even for simple problems. We present Length-Adaptive Policy Optimization (LAPO), a novel framework that transforms reasoning length control from an external constraint into an intrinsic model capability. Unlike existing approaches that impose rigid limits or rely on post-hoc interventions, LAPO enables models to internalize an understanding of appropriate reasoning depth through a two-stage reinforcement learning process. In the first stage, models learn natural reasoning patterns by discovering the statistical distribution of successful solution lengths. The second stage leverages these patterns as meta-cognitive guidance, embedding them directly within the model's reasoning context to ensure inference-time flexibility. Experiments on mathematical reasoning benchmarks demonstrate that LAPO reduces token usage by up to 40.9\% while improving accuracy by 2.3\%. Our analysis reveals that models trained with LAPO develop emergent abilities to allocate computational resources based on problem complexity, achieving efficient reasoning without sacrificing quality.
Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics
We address the problem of video representation learning without human-annotated labels. While previous efforts address the problem by designing novel self-supervised tasks using video data, the learned features are merely on a frame-by-frame basis, which are not applicable to many video analytic tasks where spatio-temporal features are prevailing. In this paper we propose a novel self-supervised approach to learn spatio-temporal features for video representation. Inspired by the success of two-stream approaches in video classification, we propose to learn visual features by regressing both motion and appearance statistics along spatial and temporal dimensions, given only the input video data. Specifically, we extract statistical concepts (fast-motion region and the corresponding dominant direction, spatio-temporal color diversity, dominant color, etc.) from simple patterns in both spatial and temporal domains. Unlike prior puzzles that are even hard for humans to solve, the proposed approach is consistent with human inherent visual habits and therefore easy to answer. We conduct extensive experiments with C3D to validate the effectiveness of our proposed approach. The experiments show that our approach can significantly improve the performance of C3D when applied to video classification tasks. Code is available at https://github.com/laura-wang/video_repres_mas.
Building Information Modeling and Classification by Visual Learning At A City Scale
In this paper, we provide two case studies to demonstrate how artificial intelligence can empower civil engineering. In the first case, a machine learning-assisted framework, BRAILS, is proposed for city-scale building information modeling. Building information modeling (BIM) is an efficient way of describing buildings, which is essential to architecture, engineering, and construction. Our proposed framework employs deep learning technique to extract visual information of buildings from satellite/street view images. Further, a novel machine learning (ML)-based statistical tool, SURF, is proposed to discover the spatial patterns in building metadata. The second case focuses on the task of soft-story building classification. Soft-story buildings are a type of buildings prone to collapse during a moderate or severe earthquake. Hence, identifying and retrofitting such buildings is vital in the current earthquake preparedness efforts. For this task, we propose an automated deep learning-based procedure for identifying soft-story buildings from street view images at a regional scale. We also create a large-scale building image database and a semi-automated image labeling approach that effectively annotates new database entries. Through extensive computational experiments, we demonstrate the effectiveness of the proposed method.
A Time Series Analysis-Based Stock Price Prediction Using Machine Learning and Deep Learning Models
Prediction of future movement of stock prices has always been a challenging task for the researchers. While the advocates of the efficient market hypothesis (EMH) believe that it is impossible to design any predictive framework that can accurately predict the movement of stock prices, there are seminal work in the literature that have clearly demonstrated that the seemingly random movement patterns in the time series of a stock price can be predicted with a high level of accuracy. Design of such predictive models requires choice of appropriate variables, right transformation methods of the variables, and tuning of the parameters of the models. In this work, we present a very robust and accurate framework of stock price prediction that consists of an agglomeration of statistical, machine learning and deep learning models. We use the daily stock price data, collected at five minutes interval of time, of a very well known company that is listed in the National Stock Exchange (NSE) of India. The granular data is aggregated into three slots in a day, and the aggregated data is used for building and training the forecasting models. We contend that the agglomerative approach of model building that uses a combination of statistical, machine learning, and deep learning approaches, can very effectively learn from the volatile and random movement patterns in a stock price data. We build eight classification and eight regression models based on statistical and machine learning approaches. In addition to these models, a deep learning regression model using a long-and-short-term memory (LSTM) network is also built. Extensive results have been presented on the performance of these models, and the results are critically analyzed.
Fantastic Bugs and Where to Find Them in AI Benchmarks
Benchmarks are pivotal in driving AI progress, and invalid benchmark questions frequently undermine their reliability. Manually identifying and correcting errors among thousands of benchmark questions is not only infeasible but also a critical bottleneck for reliable evaluation. In this work, we introduce a framework for systematic benchmark revision that leverages statistical analysis of response patterns to flag potentially invalid questions for further expert review. Our approach builds on a core assumption commonly used in AI evaluations that the mean score sufficiently summarizes model performance. This implies a unidimensional latent construct underlying the measurement experiment, yielding expected ranges for various statistics for each item. When empirically estimated values for these statistics fall outside the expected range for an item, the item is more likely to be problematic. Across nine widely used benchmarks, our method guides expert review to identify problematic questions with up to 84\% precision. In addition, we introduce an LLM-judge first pass to review questions, further reducing human effort. Together, these components provide an efficient and scalable framework for systematic benchmark revision.
Small but Mighty: Enhancing Time Series Forecasting with Lightweight LLMs
While LLMs have demonstrated remarkable potential in time series forecasting, their practical deployment remains constrained by excessive computational demands and memory footprints. Existing LLM-based approaches typically suffer from three critical limitations: Inefficient parameter utilization in handling numerical time series patterns; Modality misalignment between continuous temporal signals and discrete text embeddings; and Inflexibility for real-time expert knowledge integration. We present SMETimes, the first systematic investigation of sub-3B parameter SLMs for efficient and accurate time series forecasting. Our approach centers on three key innovations: A statistically-enhanced prompting mechanism that bridges numerical time series with textual semantics through descriptive statistical features; A adaptive fusion embedding architecture that aligns temporal patterns with language model token spaces through learnable parameters; And a dynamic mixture-of-experts framework enabled by SLMs' computational efficiency, adaptively combining base predictions with domain-specific models. Extensive evaluations across seven benchmark datasets demonstrate that our 3B-parameter SLM achieves state-of-the-art performance on five primary datasets while maintaining 3.8x faster training and 5.2x lower memory consumption compared to 7B-parameter LLM baselines. Notably, the proposed model exhibits better learning capabilities, achieving 12.3% lower MSE than conventional LLM. Ablation studies validate that our statistical prompting and cross-modal fusion modules respectively contribute 15.7% and 18.2% error reduction in long-horizon forecasting tasks. By redefining the efficiency-accuracy trade-off landscape, this work establishes SLMs as viable alternatives to resource-intensive LLMs for practical time series forecasting. Code and models are available at https://github.com/xiyan1234567/SMETimes.
Do computer vision foundation models learn the low-level characteristics of the human visual system?
Computer vision foundation models, such as DINO or OpenCLIP, are trained in a self-supervised manner on large image datasets. Analogously, substantial evidence suggests that the human visual system (HVS) is influenced by the statistical distribution of colors and patterns in the natural world, characteristics also present in the training data of foundation models. The question we address in this paper is whether foundation models trained on natural images mimic some of the low-level characteristics of the human visual system, such as contrast detection, contrast masking, and contrast constancy. Specifically, we designed a protocol comprising nine test types to evaluate the image encoders of 45 foundation and generative models. Our results indicate that some foundation models (e.g., DINO, DINOv2, and OpenCLIP), share some of the characteristics of human vision, but other models show little resemblance. Foundation models tend to show smaller sensitivity to low contrast and rather irregular responses to contrast across frequencies. The foundation models show the best agreement with human data in terms of contrast masking. Our findings suggest that human vision and computer vision may take both similar and different paths when learning to interpret images of the real world. Overall, while differences remain, foundation models trained on vision tasks start to align with low-level human vision, with DINOv2 showing the closest resemblance.
Hierarchical cycle-tree packing model for $K$-core attack problem
The K-core of a graph is the unique maximum subgraph within which each vertex connects to K or more other vertices. The optimal K-core attack problem asks to delete the minimum number of vertices from the K-core to induce its complete collapse. A hierarchical cycle-tree packing model is introduced here for this challenging combinatorial optimization problem. We convert the temporally long-range correlated K-core pruning dynamics into locally tree-like static patterns and analyze this model through the replica-symmetric cavity method of statistical physics. A set of coarse-grained belief propagation equations are derived to predict single vertex marginal probabilities efficiently. The associated hierarchical cycle-tree guided attack ({\tt hCTGA}) algorithm is able to construct nearly optimal attack solutions for regular random graphs and Erd\"os-R\'enyi random graphs. Our cycle-tree packing model may also be helpful for constructing optimal initial conditions for other irreversible dynamical processes on sparse random graphs.
Superposed Episodic and Semantic Memory via Sparse Distributed Representation
The abilities to perceive, learn, and use generalities, similarities, classes, i.e., semantic memory (SM), is central to cognition. Machine learning (ML), neural network, and AI research has been primarily driven by tasks requiring such abilities. However, another central facet of cognition, single-trial formation of permanent memories of experiences, i.e., episodic memory (EM), has had relatively little focus. Only recently has EM-like functionality been added to Deep Learning (DL) models, e.g., Neural Turing Machine, Memory Networks. However, in these cases: a) EM is implemented as a separate module, which entails substantial data movement (and so, time and power) between the DL net itself and EM; and b) individual items are stored localistically within the EM, precluding realizing the exponential representational efficiency of distributed over localist coding. We describe Sparsey, an unsupervised, hierarchical, spatial/spatiotemporal associative memory model differing fundamentally from mainstream ML models, most crucially, in its use of sparse distributed representations (SDRs), or, cell assemblies, which admits an extremely efficient, single-trial learning algorithm that maps input similarity into code space similarity (measured as intersection). SDRs of individual inputs are stored in superposition and because similarity is preserved, the patterns of intersections over the assigned codes reflect the similarity, i.e., statistical, structure, of all orders, not simply pairwise, over the inputs. Thus, SM, i.e., a generative model, is built as a computationally free side effect of the act of storing episodic memory traces of individual inputs, either spatial patterns or sequences. We report initial results on MNIST and on the Weizmann video event recognition benchmarks. While we have not yet attained SOTA class accuracy, learning takes only minutes on a single CPU.
BLAST: Balanced Sampling Time Series Corpus for Universal Forecasting Models
The advent of universal time series forecasting models has revolutionized zero-shot forecasting across diverse domains, yet the critical role of data diversity in training these models remains underexplored. Existing large-scale time series datasets often suffer from inherent biases and imbalanced distributions, leading to suboptimal model performance and generalization. To address this gap, we introduce BLAST, a novel pre-training corpus designed to enhance data diversity through a balanced sampling strategy. First, BLAST incorporates 321 billion observations from publicly available datasets and employs a comprehensive suite of statistical metrics to characterize time series patterns. Then, to facilitate pattern-oriented sampling, the data is implicitly clustered using grid-based partitioning. Furthermore, by integrating grid sampling and grid mixup techniques, BLAST ensures a balanced and representative coverage of diverse patterns. Experimental results demonstrate that models pre-trained on BLAST achieve state-of-the-art performance with a fraction of the computational resources and training tokens required by existing methods. Our findings highlight the pivotal role of data diversity in improving both training efficiency and model performance for the universal forecasting task.
FailureSensorIQ: A Multi-Choice QA Dataset for Understanding Sensor Relationships and Failure Modes
We introduce FailureSensorIQ, a novel Multi-Choice Question-Answering (MCQA) benchmarking system designed to assess the ability of Large Language Models (LLMs) to reason and understand complex, domain-specific scenarios in Industry 4.0. Unlike traditional QA benchmarks, our system focuses on multiple aspects of reasoning through failure modes, sensor data, and the relationships between them across various industrial assets. Through this work, we envision a paradigm shift where modeling decisions are not only data-driven using statistical tools like correlation analysis and significance tests, but also domain-driven by specialized LLMs which can reason about the key contributors and useful patterns that can be captured with feature engineering. We evaluate the Industrial knowledge of over a dozen LLMs-including GPT-4, Llama, and Mistral-on FailureSensorIQ from different lens using Perturbation-Uncertainty-Complexity analysis, Expert Evaluation study, Asset-Specific Knowledge Gap analysis, ReAct agent using external knowledge-bases. Even though closed-source models with strong reasoning capabilities approach expert-level performance, the comprehensive benchmark reveals a significant drop in performance that is fragile to perturbations, distractions, and inherent knowledge gaps in the models. We also provide a real-world case study of how LLMs can drive the modeling decisions on 3 different failure prediction datasets related to various assets. We release: (a) expert-curated MCQA for various industrial assets, (b) FailureSensorIQ benchmark and Hugging Face leaderboard based on MCQA built from non-textual data found in ISO documents, and (c) LLMFeatureSelector, an LLM-based feature selection scikit-learn pipeline. The software is available at https://github.com/IBM/FailureSensorIQ.
Object Classification in Images of Neoclassical Artifacts Using Deep Learning
In this paper, we report on our efforts for using Deep Learning for classifying artifacts and their features in digital visuals as a part of the Neoclassica framework. It was conceived to provide scholars with new methods for analyzing and classifying artifacts and aesthetic forms from the era of Classicism. The framework accommodates both traditional knowledge representation as a formal ontology and data-driven knowledge discovery, where cultural patterns will be identified by means of algorithms in statistical analysis and machine learning. We created a Deep Learning approach trained on photographs to classify the objects inside these photographs. In a next step, we will apply a different Deep Learning approach. It is capable of locating multiple objects inside an image and classifying them with a high accuracy.
Association rule mining with earthquake data collected from Turkiye region
Earthquakes are evaluated among the most destructive disasters for human beings, as also experienced for Turkiye region. Data science has the property of discovering hidden patterns in case a sufficient volume of data is supplied. Time dependency of events, specifically being defined by co-occurrence in a specific time window, may be handled as an associate rule mining task such as a market-basket analysis application. In this regard, we assumed each day's seismic activity as a single basket of events, leading to discovering the association patterns between these events. Consequently, this study presents the most prominent association rules for the earthquakes recorded in Turkiye region in the last 5 years, each year presented separately. Results indicate statistical inference with events recorded from regions of various distances, which could be further verified with geologic evidence from the field. As a result, we believe that the current study may form a statistical basis for the future works with the aid of machine learning algorithm performed for associate rule mining.
Dichotomic Pattern Mining with Applications to Intent Prediction from Semi-Structured Clickstream Datasets
We introduce a pattern mining framework that operates on semi-structured datasets and exploits the dichotomy between outcomes. Our approach takes advantage of constraint reasoning to find sequential patterns that occur frequently and exhibit desired properties. This allows the creation of novel pattern embeddings that are useful for knowledge extraction and predictive modeling. Finally, we present an application on customer intent prediction from digital clickstream data. Overall, we show that pattern embeddings play an integrator role between semi-structured data and machine learning models, improve the performance of the downstream task and retain interpretability.
On the Notion that Language Models Reason
Language models (LMs) are said to be exhibiting reasoning, but what does this entail? We assess definitions of reasoning and how key papers in the field of natural language processing (NLP) use the notion and argue that the definitions provided are not consistent with how LMs are trained, process information, and generate new tokens. To illustrate this incommensurability we assume the view that transformer-based LMs implement an implicit finite-order Markov kernel mapping contexts to conditional token distributions. In this view, reasoning-like outputs correspond to statistical regularities and approximate statistical invariances in the learned kernel rather than the implementation of explicit logical mechanisms. This view is illustrative of the claim that LMs are "statistical pattern matchers"" and not genuine reasoners and provides a perspective that clarifies why reasoning-like outputs arise in LMs without any guarantees of logical consistency. This distinction is fundamental to how epistemic uncertainty is evaluated in LMs. We invite a discussion on the importance of how the computational processes of the systems we build and analyze in NLP research are described.
WAPITI: A Watermark for Finetuned Open-Source LLMs
Watermarking of large language models (LLMs) generation embeds an imperceptible statistical pattern within texts, making it algorithmically detectable. Watermarking is a promising method for addressing potential harm and biases from LLMs, as it enables traceability, accountability, and detection of manipulated content, helping to mitigate unintended consequences. However, for open-source models, watermarking faces two major challenges: (i) incompatibility with fine-tuned models, and (ii) vulnerability to fine-tuning attacks. In this work, we propose WAPITI, a new method that transfers watermarking from base models to fine-tuned models through parameter integration. To the best of our knowledge, we propose the first watermark for fine-tuned open-source LLMs that preserves their fine-tuned capabilities. Furthermore, our approach offers an effective defense against fine-tuning attacks. We test our method on various model architectures and watermarking strategies. Results demonstrate that our method can successfully inject watermarks and is highly compatible with fine-tuned models. Additionally, we offer an in-depth analysis of how parameter editing influences the watermark strength and overall capabilities of the resulting models.
Solving Data Quality Problems with Desbordante: a Demo
Data profiling is an essential process in modern data-driven industries. One of its critical components is the discovery and validation of complex statistics, including functional dependencies, data constraints, association rules, and others. However, most existing data profiling systems that focus on complex statistics do not provide proper integration with the tools used by contemporary data scientists. This creates a significant barrier to the adoption of these tools in the industry. Moreover, existing systems were not created with industrial-grade workloads in mind. Finally, they do not aim to provide descriptive explanations, i.e. why a given pattern is not found. It is a significant issue as it is essential to understand the underlying reasons for a specific pattern's absence to make informed decisions based on the data. Because of that, these patterns are effectively rest in thin air: their application scope is rather limited, they are rarely used by the broader public. At the same time, as we are going to demonstrate in this presentation, complex statistics can be efficiently used to solve many classic data quality problems. Desbordante is an open-source data profiler that aims to close this gap. It is built with emphasis on industrial application: it is efficient, scalable, resilient to crashes, and provides explanations. Furthermore, it provides seamless Python integration by offloading various costly operations to the C++ core, not only mining. In this demonstration, we show several scenarios that allow end users to solve different data quality problems. Namely, we showcase typo detection, data deduplication, and data anomaly detection scenarios.
A Review of Deep Learning with Special Emphasis on Architectures, Applications and Recent Trends
Deep learning has solved a problem that as little as five years ago was thought by many to be intractable - the automatic recognition of patterns in data; and it can do so with accuracy that often surpasses human beings. It has solved problems beyond the realm of traditional, hand-crafted machine learning algorithms and captured the imagination of practitioners trying to make sense out of the flood of data that now inundates our society. As public awareness of the efficacy of DL increases so does the desire to make use of it. But even for highly trained professionals it can be daunting to approach the rapidly increasing body of knowledge produced by experts in the field. Where does one start? How does one determine if a particular model is applicable to their problem? How does one train and deploy such a network? A primer on the subject can be a good place to start. With that in mind, we present an overview of some of the key multilayer ANNs that comprise DL. We also discuss some new automatic architecture optimization protocols that use multi-agent approaches. Further, since guaranteeing system uptime is becoming critical to many computer applications, we include a section on using neural networks for fault detection and subsequent mitigation. This is followed by an exploratory survey of several application areas where DL has emerged as a game-changing technology: anomalous behavior detection in financial applications or in financial time-series forecasting, predictive and prescriptive analytics, medical image processing and analysis and power systems research. The thrust of this review is to outline emerging areas of application-oriented research within the DL community as well as to provide a reference to researchers seeking to use it in their work for what it does best: statistical pattern recognition with unparalleled learning capacity with the ability to scale with information.
Introduction to Machine Learning
This book introduces the mathematical foundations and techniques that lead to the development and analysis of many of the algorithms that are used in machine learning. It starts with an introductory chapter that describes notation used throughout the book and serve at a reminder of basic concepts in calculus, linear algebra and probability and also introduces some measure theoretic terminology, which can be used as a reading guide for the sections that use these tools. The introductory chapters also provide background material on matrix analysis and optimization. The latter chapter provides theoretical support to many algorithms that are used in the book, including stochastic gradient descent, proximal methods, etc. After discussing basic concepts for statistical prediction, the book includes an introduction to reproducing kernel theory and Hilbert space techniques, which are used in many places, before addressing the description of various algorithms for supervised statistical learning, including linear methods, support vector machines, decision trees, boosting, or neural networks. The subject then switches to generative methods, starting with a chapter that presents sampling methods and an introduction to the theory of Markov chains. The following chapter describe the theory of graphical models, an introduction to variational methods for models with latent variables, and to deep-learning based generative models. The next chapters focus on unsupervised learning methods, for clustering, factor analysis and manifold learning. The final chapter of the book is theory-oriented and discusses concentration inequalities and generalization bounds.
Decomposition of Time Series Data of Stock Markets and its Implications for Prediction: An Application for the Indian Auto Sector
With the rapid development and evolution of sophisticated algorithms for statistical analysis of time series data, the research community has started spending considerable effort in technical analysis of such data. Forecasting is also an area which has witnessed a paradigm shift in its approach. In this work, we have used the time series of the index values of the Auto sector in India during January 2010 to December 2015 for a deeper understanding of the behavior of its three constituent components, e.g., the Trend, the Seasonal component, and the Random component. Based on this structural analysis, we have also designed three approaches for forecasting and also computed their accuracy in prediction using suitably chosen training and test data sets. The results clearly demonstrate the accuracy of our decomposition results and efficiency of our forecasting techniques, even in presence of a dominant Random component in the time series.
Preserving Statistical Validity in Adaptive Data Analysis
A great deal of effort has been devoted to reducing the risk of spurious scientific discoveries, from the use of sophisticated validation techniques, to deep statistical methods for controlling the false discovery rate in multiple hypothesis testing. However, there is a fundamental disconnect between the theoretical results and the practice of data analysis: the theory of statistical inference assumes a fixed collection of hypotheses to be tested, or learning algorithms to be applied, selected non-adaptively before the data are gathered, whereas in practice data is shared and reused with hypotheses and new analyses being generated on the basis of data exploration and the outcomes of previous analyses. In this work we initiate a principled study of how to guarantee the validity of statistical inference in adaptive data analysis. As an instance of this problem, we propose and investigate the question of estimating the expectations of m adaptively chosen functions on an unknown distribution given n random samples. We show that, surprisingly, there is a way to estimate an exponential in n number of expectations accurately even if the functions are chosen adaptively. This gives an exponential improvement over standard empirical estimators that are limited to a linear number of estimates. Our result follows from a general technique that counter-intuitively involves actively perturbing and coordinating the estimates, using techniques developed for privacy preservation. We give additional applications of this technique to our question.
Heaps' law and Heaps functions in tagged texts: Evidences of their linguistic relevance
We study the relationship between vocabulary size and text length in a corpus of 75 literary works in English, authored by six writers, distinguishing between the contributions of three grammatical classes (or ``tags,'' namely, {\it nouns}, {\it verbs}, and {\it others}), and analyze the progressive appearance of new words of each tag along each individual text. While the power-law relation prescribed by Heaps' law is satisfactorily fulfilled by total vocabulary sizes and text lengths, the appearance of new words in each text is on the whole well described by the average of random shufflings of the text, which does not obey a power law. Deviations from this average, however, are statistically significant and show a systematic trend across the corpus. Specifically, they reveal that the appearance of new words along each text is predominantly retarded with respect to the average of random shufflings. Moreover, different tags are shown to add systematically distinct contributions to this tendency, with {\it verbs} and {\it others} being respectively more and less retarded than the mean trend, and {\it nouns} following instead this overall mean. These statistical systematicities are likely to point to the existence of linguistically relevant information stored in the different variants of Heaps' law, a feature that is still in need of extensive assessment.
Bitcoin Price Predictive Modeling Using Expert Correction
The paper studies the linear model for Bitcoin price which includes regression features based on Bitcoin currency statistics, mining processes, Google search trends, Wikipedia pages visits. The pattern of deviation of regression model prediction from real prices is simpler comparing to price time series. It is assumed that this pattern can be predicted by an experienced expert. In such a way, using the combination of the regression model and expert correction, one can receive better results than with either regression model or expert opinion only. It is shown that Bayesian approach makes it possible to utilize the probabilistic approach using distributions with fat tails and take into account the outliers in Bitcoin price time series.
Detecting Dataset Drift and Non-IID Sampling via k-Nearest Neighbors
We present a straightforward statistical test to detect certain violations of the assumption that the data are Independent and Identically Distributed (IID). The specific form of violation considered is common across real-world applications: whether the examples are ordered in the dataset such that almost adjacent examples tend to have more similar feature values (e.g. due to distributional drift, or attractive interactions between datapoints). Based on a k-Nearest Neighbors estimate, our approach can be used to audit any multivariate numeric data as well as other data types (image, text, audio, etc.) that can be numerically represented, perhaps with model embeddings. Compared with existing methods to detect drift or auto-correlation, our approach is both applicable to more types of data and also able to detect a wider variety of IID violations in practice. Code: https://github.com/cleanlab/cleanlab
Frequentism and Bayesianism: A Python-driven Primer
This paper presents a brief, semi-technical comparison of the essential features of the frequentist and Bayesian approaches to statistical inference, with several illustrative examples implemented in Python. The differences between frequentism and Bayesianism fundamentally stem from differing definitions of probability, a philosophical divide which leads to distinct approaches to the solution of statistical problems as well as contrasting ways of asking and answering questions about unknown parameters. After an example-driven discussion of these differences, we briefly compare several leading Python statistical packages which implement frequentist inference using classical methods and Bayesian inference using Markov Chain Monte Carlo.
Delving into the Utilisation of ChatGPT in Scientific Publications in Astronomy
Rapid progress in the capabilities of machine learning approaches in natural language processing has culminated in the rise of large language models over the last two years. Recent works have shown unprecedented adoption of these for academic writing, especially in some fields, but their pervasiveness in astronomy has not been studied sufficiently. To remedy this, we extract words that ChatGPT uses more often than humans when generating academic text and search a total of 1 million articles for them. This way, we assess the frequency of word occurrence in published works in astronomy tracked by the NASA Astrophysics Data System since 2000. We then perform a statistical analysis of the occurrences. We identify a list of words favoured by ChatGPT and find a statistically significant increase for these words against a control group in 2024, which matches the trend in other disciplines. These results suggest a widespread adoption of these models in the writing of astronomy papers. We encourage organisations, publishers, and researchers to work together to identify ethical and pragmatic guidelines to maximise the benefits of these systems while maintaining scientific rigour.
Phase Transitions in the Detection of Correlated Databases
We study the problem of detecting the correlation between two Gaussian databases XinR^{ntimes d} and Y^{ntimes d}, each composed of n users with d features. This problem is relevant in the analysis of social media, computational biology, etc. We formulate this as a hypothesis testing problem: under the null hypothesis, these two databases are statistically independent. Under the alternative, however, there exists an unknown permutation sigma over the set of n users (or, row permutation), such that X is rho-correlated with Y^sigma, a permuted version of Y. We determine sharp thresholds at which optimal testing exhibits a phase transition, depending on the asymptotic regime of n and d. Specifically, we prove that if rho^2dto0, as dtoinfty, then weak detection (performing slightly better than random guessing) is statistically impossible, irrespectively of the value of n. This compliments the performance of a simple test that thresholds the sum all entries of X^TY. Furthermore, when d is fixed, we prove that strong detection (vanishing error probability) is impossible for any rho<rho^star, where rho^star is an explicit function of d, while weak detection is again impossible as long as rho^2dto0. These results close significant gaps in current recent related studies.
Comparing Dataset Characteristics that Favor the Apriori, Eclat or FP-Growth Frequent Itemset Mining Algorithms
Frequent itemset mining is a popular data mining technique. Apriori, Eclat, and FP-Growth are among the most common algorithms for frequent itemset mining. Considerable research has been performed to compare the relative performance between these three algorithms, by evaluating the scalability of each algorithm as the dataset size increases. While scalability as data size increases is important, previous papers have not examined the performance impact of similarly sized datasets that contain different itemset characteristics. This paper explores the effects that two dataset characteristics can have on the performance of these three frequent itemset algorithms. To perform this empirical analysis, a dataset generator is created to measure the effects of frequent item density and the maximum transaction size on performance. The generated datasets contain the same number of rows. This provides some insight into dataset characteristics that are conducive to each algorithm. The results of this paper's research demonstrate Eclat and FP-Growth both handle increases in maximum transaction size and frequent itemset density considerably better than the Apriori algorithm. This paper explores the effects that two dataset characteristics can have on the performance of these three frequent itemset algorithms. To perform this empirical analysis, a dataset generator is created to measure the effects of frequent item density and the maximum transaction size on performance. The generated datasets contain the same number of rows. This provides some insight into dataset characteristics that are conducive to each algorithm. The results of this paper's research demonstrate Eclat and FP-Growth both handle increases in maximum transaction size and frequent itemset density considerably better than the Apriori algorithm.
Pattern Recognition of Illicit E-Waste Misclassification in Global Trade Data
The global trade in electronic and electrical goods is complicated by the challenge of identifying e-waste, which is often misclassified to evade regulations. Traditional analysis methods struggle to discern the underlying patterns of this illicit trade within vast datasets. This research proposes and validates a robust, data-driven framework to segment products and identify goods exhibiting an anomalous "waste signature" a trade pattern defined by a clear 'inverse price-volume'. The core of the framework is an Outlier-Aware Segmentation method, an iterative K-Means approach that first isolates extreme outliers to prevent data skewing and then re-clusters the remaining products to reveal subtle market segments. To quantify risk, a "Waste Score" is developed using a Logistic Regression model that identifies products whose trade signatures are statistically similar to scrap. The findings reveal a consistent four-tier market hierarchy in both Malaysian and global datasets. A key pattern emerged from a comparative analysis: Malaysia's market structure is defined by high-volume bulk commodities, whereas the global market is shaped by high-value capital goods, indicating a unique national specialization. The framework successfully flags finished goods, such as electric generators (HS 8502), that are traded like scrap, providing a targeted list for regulatory scrutiny.
Early warning signals: The charted and uncharted territories
The realization that complex systems such as ecological communities can collapse or shift regimes suddenly and without rapid external forcing poses a serious challenge to our understanding and management of the natural world. The potential to identify early warning signals that would allow researchers and managers to predict such events before they happen has therefore been an invaluable discovery that offers a way forward in spite of such seemingly unpredictable behavior. Research into early warning signals has demonstrated that it is possible to define and detect such early warning signals in advance of a transition in certain contexts. Here we describe the pattern emerging as research continues to explore just how far we can generalize these results. A core of examples emerges that shares three properties: the phenomenon of rapid regime shifts, a pattern of 'critical slowing down' that can be used to detect the approaching shift, and a mechanism of bifurcation driving the sudden change. As research has expanded beyond these core examples, it is becoming clear that not all systems that show regime shifts exhibit critical slowing down, or vice versa. Even when systems exhibit critical slowing down, statistical detection is a challenge. We review the literature that explores these edge cases and highlight the need for (a) new early warning behaviors that can be used in cases where rapid shifts do not exhibit critical slowing down, (b) the development of methods to identify which behavior might be an appropriate signal when encountering a novel system; bearing in mind that a positive indication for some systems is a negative indication in others, and (c) statistical methods that can distinguish between signatures of early warning behaviors and noise.
Statistical Methods in Generative AI
Generative Artificial Intelligence is emerging as an important technology, promising to be transformative in many areas. At the same time, generative AI techniques are based on sampling from probabilistic models, and by default, they come with no guarantees about correctness, safety, fairness, or other properties. Statistical methods offer a promising potential approach to improve the reliability of generative AI techniques. In addition, statistical methods are also promising for improving the quality and efficiency of AI evaluation, as well as for designing interventions and experiments in AI. In this paper, we review some of the existing work on these topics, explaining both the general statistical techniques used, as well as their applications to generative AI. We also discuss limitations and potential future directions.
A Framework For Refining Text Classification and Object Recognition from Academic Articles
With the widespread use of the internet, it has become increasingly crucial to extract specific information from vast amounts of academic articles efficiently. Data mining techniques are generally employed to solve this issue. However, data mining for academic articles is challenging since it requires automatically extracting specific patterns in complex and unstructured layout documents. Current data mining methods for academic articles employ rule-based(RB) or machine learning(ML) approaches. However, using rule-based methods incurs a high coding cost for complex typesetting articles. On the other hand, simply using machine learning methods requires annotation work for complex content types within the paper, which can be costly. Furthermore, only using machine learning can lead to cases where patterns easily recognized by rule-based methods are mistakenly extracted. To overcome these issues, from the perspective of analyzing the standard layout and typesetting used in the specified publication, we emphasize implementing specific methods for specific characteristics in academic articles. We have developed a novel Text Block Refinement Framework (TBRF), a machine learning and rule-based scheme hybrid. We used the well-known ACL proceeding articles as experimental data for the validation experiment. The experiment shows that our approach achieved over 95% classification accuracy and 90% detection accuracy for tables and figures.
Dive into Time-Series Anomaly Detection: A Decade Review
Recent advances in data collection technology, accompanied by the ever-rising volume and velocity of streaming data, underscore the vital need for time series analytics. In this regard, time-series anomaly detection has been an important activity, entailing various applications in fields such as cyber security, financial markets, law enforcement, and health care. While traditional literature on anomaly detection is centered on statistical measures, the increasing number of machine learning algorithms in recent years call for a structured, general characterization of the research methods for time-series anomaly detection. This survey groups and summarizes anomaly detection existing solutions under a process-centric taxonomy in the time series context. In addition to giving an original categorization of anomaly detection methods, we also perform a meta-analysis of the literature and outline general trends in time-series anomaly detection research.
On Meta-Prompting
Certain statistical models are capable of interpreting input strings as instructions, or prompts, and carry out tasks based on them. Many approaches to prompting and pre-training these models involve the automated generation of these prompts. We call these approaches meta-prompting, or prompting to obtain prompts. We propose a theoretical framework based on category theory to generalize and describe them. This framework is flexible enough to account for LLM stochasticity; and allows us to obtain formal results around task agnosticity and equivalence of various meta-prompting approaches. We experiment with meta-prompting in two active areas of model research: creativity and ideation. We find that user preference favors (p < 0.01) the prompts generated under meta-prompting, as well as their corresponding outputs, over a series of hardcoded baseline prompts that include the original task prompt. Using our framework, we argue that meta-prompting is more effective than basic prompting at generating desirable outputs.
Untangling Gaussian Mixtures
Tangles were originally introduced as a concept to formalize regions of high connectivity in graphs. In recent years, they have also been discovered as a link between structural graph theory and data science: when interpreting similarity in data sets as connectivity between points, finding clusters in the data essentially amounts to finding tangles in the underlying graphs. This paper further explores the potential of tangles in data sets as a means for a formal study of clusters. Real-world data often follow a normal distribution. Accounting for this, we develop a quantitative theory of tangles in data sets drawn from Gaussian mixtures. To this end, we equip the data with a graph structure that models similarity between the points and allows us to apply tangle theory to the data. We provide explicit conditions under which tangles associated with the marginal Gaussian distributions exist asymptotically almost surely. This can be considered as a sufficient formal criterion for the separabability of clusters in the data.
Quantifying Network Similarity using Graph Cumulants
How might one test the hypothesis that networks were sampled from the same distribution? Here, we compare two statistical tests that use subgraph counts to address this question. The first uses the empirical subgraph densities themselves as estimates of those of the underlying distribution. The second test uses a new approach that converts these subgraph densities into estimates of the graph cumulants of the distribution (without any increase in computational complexity). We demonstrate -- via theory, simulation, and application to real data -- the superior statistical power of using graph cumulants. In summary, when analyzing data using subgraph/motif densities, we suggest using the corresponding graph cumulants instead.
Early Warning Signals and the Prosecutor's Fallacy
Early warning signals have been proposed to forecast the possibility of a critical transition, such as the eutrophication of a lake, the collapse of a coral reef, or the end of a glacial period. Because such transitions often unfold on temporal and spatial scales that can be difficult to approach by experimental manipulation, research has often relied on historical observations as a source of natural experiments. Here we examine a critical difference between selecting systems for study based on the fact that we have observed a critical transition and those systems for which we wish to forecast the approach of a transition. This difference arises by conditionally selecting systems known to experience a transition of some sort and failing to account for the bias this introduces -- a statistical error often known as the Prosecutor's Fallacy. By analysing simulated systems that have experienced transitions purely by chance, we reveal an elevated rate of false positives in common warning signal statistics. We further demonstrate a model-based approach that is less subject to this bias than these more commonly used summary statistics. We note that experimental studies with replicates avoid this pitfall entirely.
ChronosX: Adapting Pretrained Time Series Models with Exogenous Variables
Covariates provide valuable information on external factors that influence time series and are critical in many real-world time series forecasting tasks. For example, in retail, covariates may indicate promotions or peak dates such as holiday seasons that heavily influence demand forecasts. Recent advances in pretraining large language model architectures for time series forecasting have led to highly accurate forecasters. However, the majority of these models do not readily use covariates as they are often specific to a certain task or domain. This paper introduces a new method to incorporate covariates into pretrained time series forecasting models. Our proposed approach incorporates covariate information into pretrained forecasting models through modular blocks that inject past and future covariate information, without necessarily modifying the pretrained model in consideration. In order to evaluate our approach, we introduce a benchmark composed of 32 different synthetic datasets with varying dynamics to evaluate the effectivity of forecasting models with covariates. Extensive evaluations on both synthetic and real datasets show that our approach effectively incorporates covariate information into pretrained models, outperforming existing baselines.
Can ChatGPT Compute Trustworthy Sentiment Scores from Bloomberg Market Wraps?
We used a dataset of daily Bloomberg Financial Market Summaries from 2010 to 2023, reposted on large financial media, to determine how global news headlines may affect stock market movements using ChatGPT and a two-stage prompt approach. We document a statistically significant positive correlation between the sentiment score and future equity market returns over short to medium term, which reverts to a negative correlation over longer horizons. Validation of this correlation pattern across multiple equity markets indicates its robustness across equity regions and resilience to non-linearity, evidenced by comparison of Pearson and Spearman correlations. Finally, we provide an estimate of the optimal horizon that strikes a balance between reactivity to new information and correlation.
A Framework for Predictive Analysis of Stock Market Indices : A Study of the Indian Auto Sector
Analysis and prediction of stock market time series data has attracted considerable interest from the research community over the last decade. Rapid development and evolution of sophisticated algorithms for statistical analysis of time series data, and availability of high-performance hardware has made it possible to process and analyze high volume stock market time series data effectively, in real-time. Among many other important characteristics and behavior of such data, forecasting is an area which has witnessed considerable focus. In this work, we have used time series of the index values of the Auto sector in India during January 2010 to December 2015 for a deeper understanding of the behavior of its three constituent components, e.g., the trend, the seasonal component, and the random component. Based on this structural analysis, we have also designed five approaches for forecasting and also computed their accuracy in prediction using suitably chosen training and test data sets. Extensive results are presented to demonstrate the effectiveness of our proposed decomposition approaches of time series and the efficiency of our forecasting techniques, even in presence of a random component and a sharply changing trend component in the time-series.
Rare Galaxy Classes Identified In Foundation Model Representations
We identify rare and visually distinctive galaxy populations by searching for structure within the learned representations of pretrained models. We show that these representations arrange galaxies by appearance in patterns beyond those needed to predict the pretraining labels. We design a clustering approach to isolate specific local patterns, revealing groups of galaxies with rare and scientifically-interesting morphologies.
Magnitude of arithmetic scalar and matrix categories
We develop tools for explicitly constructing categories enriched over generating data and that compose via ordinary scalar and matrix arithmetic arithmetic operations. We characterize meaningful size maps, weightings, and magnitude that reveal features analogous to outliers that these same notions have previously been shown to reveal in the context of metric spaces. Throughout, we provide examples of such "outlier detection" relevant to the analysis of computer programs, neural networks, cyber-physical systems, and networks of communications channels.
BanglishRev: A Large-Scale Bangla-English and Code-mixed Dataset of Product Reviews in E-Commerce
This work presents the BanglishRev Dataset, the largest e-commerce product review dataset to date for reviews written in Bengali, English, a mixture of both and Banglish, Bengali words written with English alphabets. The dataset comprises of 1.74 million written reviews from 3.2 million ratings information collected from a total of 128k products being sold in online e-commerce platforms targeting the Bengali population. It includes an extensive array of related metadata for each of the reviews including the rating given by the reviewer, date the review was posted and date of purchase, number of likes, dislikes, response from the seller, images associated with the review etc. With sentiment analysis being the most prominent usage of review datasets, experimentation with a binary sentiment analysis model with the review rating serving as an indicator of positive or negative sentiment was conducted to evaluate the effectiveness of the large amount of data presented in BanglishRev for sentiment analysis tasks. A BanglishBERT model is trained on the data from BanglishRev with reviews being considered labeled positive if the rating is greater than 3 and negative if the rating is less than or equal to 3. The model is evaluated by being testing against a previously published manually annotated dataset for e-commerce reviews written in a mixture of Bangla, English and Banglish. The experimental model achieved an exceptional accuracy of 94\% and F1 score of 0.94, demonstrating the dataset's efficacy for sentiment analysis. Some of the intriguing patterns and observations seen within the dataset and future research directions where the dataset can be utilized is also discussed and explored. The dataset can be accessed through https://huggingface.co/datasets/BanglishRev/bangla-english-and-code-mixed-ecommerce-review-dataset.
Further Generalizations of the Jaccard Index
Quantifying the similarity between two mathematical structures or datasets constitutes a particularly interesting and useful operation in several theoretical and applied problems. Aimed at this specific objective, the Jaccard index has been extensively used in the most diverse types of problems, also motivating some respective generalizations. The present work addresses further generalizations of this index, including its modification into a coincidence index capable of accounting also for the level of relative interiority between the two compared entities, as well as respective extensions for sets in continuous vector spaces, the generalization to multiset addition, densities and generic scalar fields, as well as a means to quantify the joint interdependence between two random variables. The also interesting possibility to take into account more than two sets has also been addressed, including the description of an index capable of quantifying the level of chaining between three structures. Several of the described and suggested eneralizations have been illustrated with respect to numeric case examples. It is also posited that these indices can play an important role while analyzing and integrating datasets in modeling approaches and pattern recognition activities, including as a measurement of clusters similarity or separation and as a resource for representing and analyzing complex networks.
ChatGPT Prompt Patterns for Improving Code Quality, Refactoring, Requirements Elicitation, and Software Design
This paper presents prompt design techniques for software engineering, in the form of patterns, to solve common problems when using large language models (LLMs), such as ChatGPT to automate common software engineering activities, such as ensuring code is decoupled from third-party libraries and simulating a web application API before it is implemented. This paper provides two contributions to research on using LLMs for software engineering. First, it provides a catalog of patterns for software engineering that classifies patterns according to the types of problems they solve. Second, it explores several prompt patterns that have been applied to improve requirements elicitation, rapid prototyping, code quality, refactoring, and system design.
Modular versus Hierarchical: A Structural Signature of Topic Popularity in Mathematical Research
Mathematical researchers, especially those in early-career positions, face critical decisions about topic specialization with limited information about the collaborative environments of different research areas. The aim of this paper is to study how the popularity of a research topic is associated with the structure of that topic's collaboration network, as observed by a suite of measures capturing organizational structure at several scales. We apply these measures to 1,938 algorithmically discovered topics across 121,391 papers sourced from arXiv metadata during the period 2020--2025. Our analysis, which controls for the confounding effects of network size, reveals a structural dichotomy--we find that popular topics organize into modular "schools of thought," while niche topics maintain hierarchical core-periphery structures centered around established experts. This divide is not an artifact of scale, but represents a size-independent structural pattern correlated with popularity. We also document a "constraint reversal": after controlling for size, researchers in popular fields face greater structural constraints on collaboration opportunities, contrary to conventional expectations. Our findings suggest that topic selection is an implicit choice between two fundamentally different collaborative environments, each with distinct implications for a researcher's career. To make these structural patterns transparent to the research community, we developed the Math Research Compass (https://mathresearchcompass.com), an interactive platform providing data on topic popularity and collaboration patterns across mathematical topics.
Algorithmic Writing Assistance on Jobseekers' Resumes Increases Hires
There is a strong association between the quality of the writing in a resume for new labor market entrants and whether those entrants are ultimately hired. We show that this relationship is, at least partially, causal: a field experiment in an online labor market was conducted with nearly half a million jobseekers in which a treated group received algorithmic writing assistance. Treated jobseekers experienced an 8% increase in the probability of getting hired. Contrary to concerns that the assistance is taking away a valuable signal, we find no evidence that employers were less satisfied. We present a model in which better writing is not a signal of ability but helps employers ascertain ability, which rationalizes our findings.
An Investigation of the Structural Characteristics of the Indian IT Sector and the Capital Goods Sector: An Application of the R Programming in Time Series Decomposition and Forecasting
Time series analysis and forecasting of stock market prices has been a very active area of research over the last two decades. Availability of extremely fast and parallel architecture of computing and sophisticated algorithms has made it possible to extract, store, process and analyze high volume stock market time series data very efficiently. In this paper, we have used time series data of the two sectors of the Indian economy: Information Technology and Capital Goods for the period January 2009 till April 2016 and have studied the relationships of these two time series with the time series of DJIA index, NIFTY index and the US Dollar to Indian Rupee exchange rate. We establish by graphical and statistical tests that while the IT sector of India has a strong association with DJIA index and the Dollar to Rupee exchange rate, the Indian CG sector exhibits a strong association with the NIFTY index. We contend that these observations corroborate our hypotheses that the Indian IT sector is strongly coupled with the world economy whereas the CG sector of India reflects internal economic growth of India. We also present several models of regression between the time series which exhibit strong association among them. The effectiveness of these models have been demonstrated by very low values of their forecasting errors.
Geometry of Sample Spaces
In statistics, independent, identically distributed random samples do not carry a natural ordering, and their statistics are typically invariant with respect to permutations of their order. Thus, an n-sample in a space M can be considered as an element of the quotient space of M^n modulo the permutation group. The present paper takes this definition of sample space and the related concept of orbit types as a starting point for developing a geometric perspective on statistics. We aim at deriving a general mathematical setting for studying the behavior of empirical and population means in spaces ranging from smooth Riemannian manifolds to general stratified spaces. We fully describe the orbifold and path-metric structure of the sample space when M is a manifold or path-metric space, respectively. These results are non-trivial even when M is Euclidean. We show that the infinite sample space exists in a Gromov-Hausdorff type sense and coincides with the Wasserstein space of probability distributions on M. We exhibit Fr\'echet means and k-means as metric projections onto 1-skeleta or k-skeleta in Wasserstein space, and we define a new and more general notion of polymeans. This geometric characterization via metric projections applies equally to sample and population means, and we use it to establish asymptotic properties of polymeans such as consistency and asymptotic normality.
SciTextures: Collecting and Connecting Visual Patterns, Models, and Code Across Science and Art
The ability to connect visual patterns with the processes that form them represents one of the deepest forms of visual understanding. Textures of clouds and waves, the growth of cities and forests, or the formation of materials and landscapes are all examples of patterns emerging from underlying mechanisms. We present the Scitextures dataset, a large-scale collection of textures and visual patterns from all domains of science, tech, and art, along with the models and code that generate these images. Covering over 1,200 different models and 100,000 images of patterns and textures from physics, chemistry, biology, sociology, technology, mathematics, and art, this dataset offers a way to explore the connection between the visual patterns that shape our world and the mechanisms that produce them. Created by an agentic AI pipeline that autonomously collects and implements models in standardized form, we use SciTextures to evaluate the ability of leading AI models to link visual patterns to the models and code that generate them, and to identify different patterns that emerged from the same process. We also test AIs ability to infer and recreate the mechanisms behind visual patterns by providing a natural image of a real-world pattern and asking the AI to identify, model, and code the mechanism that formed the pattern, then run this code to generate a simulated image that is compared to the real image. These benchmarks show that vision-language models (VLMs) can understand and simulate the physical system beyond a visual pattern. The dataset and code are available at: https://zenodo.org/records/17485502
Analytical Derivation and Comparison of Alarm Similarity Measures
An industrial process includes many devices, variables, and sub-processes that are physically or electronically interconnected. These interconnections imply some level of correlation between different process variables. Since most of the alarms in a process plant are defined on process variables, alarms are also correlated. However, this can be a nuisance to operators, for one fault might trigger a, sometimes large, number of alarms. So, it is essential to find and correct correlated alarms. In this paper, we study different methods and techniques proposed to measure correlation or similarity between alarms. The similarity indices are first analytically calculated and then studied and compared. The results are also validated using Monte-Carlo simulation.
To Know by the Company Words Keep and What Else Lies in the Vicinity
The development of state-of-the-art (SOTA) Natural Language Processing (NLP) systems has steadily been establishing new techniques to absorb the statistics of linguistic data. These techniques often trace well-known constructs from traditional theories, and we study these connections to close gaps around key NLP methods as a means to orient future work. For this, we introduce an analytic model of the statistics learned by seminal algorithms (including GloVe and Word2Vec), and derive insights for systems that use these algorithms and the statistics of co-occurrence, in general. In this work, we derive -- to the best of our knowledge -- the first known solution to Word2Vec's softmax-optimized, skip-gram algorithm. This result presents exciting potential for future development as a direct solution to a deep learning (DL) language model's (LM's) matrix factorization. However, we use the solution to demonstrate a seemingly-universal existence of a property that word vectors exhibit and which allows for the prophylactic discernment of biases in data -- prior to their absorption by DL models. To qualify our work, we conduct an analysis of independence, i.e., on the density of statistical dependencies in co-occurrence models, which in turn renders insights on the distributional hypothesis' partial fulfillment by co-occurrence statistics.
Extending Mixture of Experts Model to Investigate Heterogeneity of Trajectories: When, Where and How to Add Which Covariates
Researchers are usually interested in examining the impact of covariates when separating heterogeneous samples into latent classes that are more homogeneous. The majority of theoretical and empirical studies with such aims have focused on identifying covariates as predictors of class membership in the structural equation modeling framework. In other words, the covariates only indirectly affect the sample heterogeneity. However, the covariates' influence on between-individual differences can also be direct. This article presents a mixture model that investigates covariates to explain within-cluster and between-cluster heterogeneity simultaneously, known as a mixture-of-experts (MoE) model. This study aims to extend the MoE framework to investigate heterogeneity in nonlinear trajectories: to identify latent classes, covariates as predictors to clusters, and covariates that explain within-cluster differences in change patterns over time. Our simulation studies demonstrate that the proposed model generally estimates the parameters unbiasedly, precisely and exhibits appropriate empirical coverage for a nominal 95% confidence interval. This study also proposes implementing structural equation model forests to shrink the covariate space of the proposed mixture model. We illustrate how to select covariates and construct the proposed model with longitudinal mathematics achievement data. Additionally, we demonstrate that the proposed mixture model can be further extended in the structural equation modeling framework by allowing the covariates that have direct effects to be time-varying.
Temporal-Spatial dependencies ENhanced deep learning model (TSEN) for household leverage series forecasting
Analyzing both temporal and spatial patterns for an accurate forecasting model for financial time series forecasting is a challenge due to the complex nature of temporal-spatial dynamics: time series from different locations often have distinct patterns; and for the same time series, patterns may vary as time goes by. Inspired by the successful applications of deep learning, we propose a new model to resolve the issues of forecasting household leverage in China. Our solution consists of multiple RNN-based layers and an attention layer: each RNN-based layer automatically learns the temporal pattern of a specific series with multivariate exogenous series, and then the attention layer learns the spatial correlative weight and obtains the global representations simultaneously. The results show that the new approach can capture the temporal-spatial dynamics of household leverage well and get more accurate and solid predictive results. More, the simulation also studies show that clustering and choosing correlative series are necessary to obtain accurate forecasting results.
Time Series Analysis for Education: Methods, Applications, and Future Directions
Recent advancements in the collection and analysis of sequential educational data have brought time series analysis to a pivotal position in educational research, highlighting its essential role in facilitating data-driven decision-making. However, there is a lack of comprehensive summaries that consolidate these advancements. To the best of our knowledge, this paper is the first to provide a comprehensive review of time series analysis techniques specifically within the educational context. We begin by exploring the landscape of educational data analytics, categorizing various data sources and types relevant to education. We then review four prominent time series methods-forecasting, classification, clustering, and anomaly detection-illustrating their specific application points in educational settings. Subsequently, we present a range of educational scenarios and applications, focusing on how these methods are employed to address diverse educational tasks, which highlights the practical integration of multiple time series methods to solve complex educational problems. Finally, we conclude with a discussion on future directions, including personalized learning analytics, multimodal data fusion, and the role of large language models (LLMs) in educational time series. The contributions of this paper include a detailed taxonomy of educational data, a synthesis of time series techniques with specific educational applications, and a forward-looking perspective on emerging trends and future research opportunities in educational analysis. The related papers and resources are available and regularly updated at the project page.
Two-parameter superposable S-curves
Straight line equation y=mx with slope m, when singularly perturbed as ay^3+y=mx with a positive parameter a, results in S-shaped curves or S-curves on a real plane. As arightarrow 0, we get back y=mx which is a cumulative distribution function of a continuous uniform distribution that describes the occurrence of every event in an interval to be equally probable. As arightarrowinfty, the derivative of y has finite support only at y=0 resembling a degenerate distribution. Based on these arguments, in this work, we propose that these S-curves can represent maximum entropy uniform distribution to a zero entropy single value. We also argue that these S-curves are superposable as they are only parametrically nonlinear but fundamentally linear. So far, the superposed forms have been used to capture the patterns of natural systems such as nonlinear dynamics of biological growth and kinetics of enzyme reactions. Here, we attempt to use the S-curve and its superposed form as statistical models. We fit the models on a classical dataset containing flower measurements of iris plants and analyze their usefulness in pattern recognition. Based on these models, we claim that any non-uniform pattern can be represented as a singular perturbation to uniform distribution. However, our parametric estimation procedure have some limitations such as sensitivity to initial conditions depending on the data at hand.
A Note on Shumailov et al. (2024): `AI Models Collapse When Trained on Recursively Generated Data'
The study conducted by Shumailov et al. (2024) demonstrates that repeatedly training a generative model on synthetic data leads to model collapse. This finding has generated considerable interest and debate, particularly given that current models have nearly exhausted the available data. In this work, we investigate the effects of fitting a distribution (through Kernel Density Estimation, or KDE) or a model to the data, followed by repeated sampling from it. Our objective is to develop a theoretical understanding of the phenomenon observed by Shumailov et al. (2024). Our results indicate that the outcomes reported are a statistical phenomenon and may be unavoidable.
TiVy: Time Series Visual Summary for Scalable Visualization
Visualizing multiple time series presents fundamental tradeoffs between scalability and visual clarity. Time series capture the behavior of many large-scale real-world processes, from stock market trends to urban activities. Users often gain insights by visualizing them as line charts, juxtaposing or superposing multiple time series to compare them and identify trends and patterns. However, existing representations struggle with scalability: when covering long time spans, leading to visual clutter from too many small multiples or overlapping lines. We propose TiVy, a new algorithm that summarizes time series using sequential patterns. It transforms the series into a set of symbolic sequences based on subsequence visual similarity using Dynamic Time Warping (DTW), then constructs a disjoint grouping of similar subsequences based on the frequent sequential patterns. The grouping result, a visual summary of time series, provides uncluttered superposition with fewer small multiples. Unlike common clustering techniques, TiVy extracts similar subsequences (of varying lengths) aligned in time. We also present an interactive time series visualization that renders large-scale time series in real-time. Our experimental evaluation shows that our algorithm (1) extracts clear and accurate patterns when visualizing time series data, (2) achieves a significant speed-up (1000X) compared to a straightforward DTW clustering. We also demonstrate the efficiency of our approach to explore hidden structures in massive time series data in two usage scenarios.
An Interdisciplinary Comparison of Sequence Modeling Methods for Next-Element Prediction
Data of sequential nature arise in many application domains in forms of, e.g. textual data, DNA sequences, and software execution traces. Different research disciplines have developed methods to learn sequence models from such datasets: (i) in the machine learning field methods such as (hidden) Markov models and recurrent neural networks have been developed and successfully applied to a wide-range of tasks, (ii) in process mining process discovery techniques aim to generate human-interpretable descriptive models, and (iii) in the grammar inference field the focus is on finding descriptive models in the form of formal grammars. Despite their different focuses, these fields share a common goal - learning a model that accurately describes the behavior in the underlying data. Those sequence models are generative, i.e, they can predict what elements are likely to occur after a given unfinished sequence. So far, these fields have developed mainly in isolation from each other and no comparison exists. This paper presents an interdisciplinary experimental evaluation that compares sequence modeling techniques on the task of next-element prediction on four real-life sequence datasets. The results indicate that machine learning techniques that generally have no aim at interpretability in terms of accuracy outperform techniques from the process mining and grammar inference fields that aim to yield interpretable models.
Causal de Finetti: On the Identification of Invariant Causal Structure in Exchangeable Data
Learning causal structure from observational data often assumes that we observe independent and identically distributed (i.\,i.\,d) data. The traditional approach aims to find a graphical representation that encodes the same set of conditional independence relationships as those present in the observed distribution. It is known that under i.\,i.\,d assumption, even with infinite data, there is a limit to how fine-grained a causal structure we can identify. To overcome this limitation, recent work has explored using data originating from different, related environments to learn richer causal structure. These approaches implicitly rely on the independent causal mechanisms (ICM) principle, which postulates that the mechanism giving rise to an effect given its causes and the mechanism which generates the causes do not inform or influence each other. Thus, components of the causal model can independently change from environment to environment. Despite its wide application in machine learning and causal inference, there is a lack of statistical formalization of the ICM principle and how it enables identification of richer causal structures from grouped data. Here we present new causal de Finetti theorems which offer a first statistical formalization of ICM principle and show how causal structure identification is possible from exchangeable data. Our work provides theoretical justification for a broad range of techniques leveraging multi-environment data to learn causal structure.
CSTS: A Benchmark for the Discovery of Correlation Structures in Time Series Clustering
Time series clustering promises to uncover hidden structural patterns in data with applications across healthcare, finance, industrial systems, and other critical domains. However, without validated ground truth information, researchers cannot objectively assess clustering quality or determine whether poor results stem from absent structures in the data, algorithmic limitations, or inappropriate validation methods, raising the question whether clustering is "more art than science" (Guyon et al., 2009). To address these challenges, we introduce CSTS (Correlation Structures in Time Series), a synthetic benchmark for evaluating the discovery of correlation structures in multivariate time series data. CSTS provides a clean benchmark that enables researchers to isolate and identify specific causes of clustering failures by differentiating between correlation structure deterioration and limitations of clustering algorithms and validation methods. Our contributions are: (1) a comprehensive benchmark for correlation structure discovery with distinct correlation structures, systematically varied data conditions, established performance thresholds, and recommended evaluation protocols; (2) empirical validation of correlation structure preservation showing moderate distortion from downsampling and minimal effects from distribution shifts and sparsification; and (3) an extensible data generation framework enabling structure-first clustering evaluation. A case study demonstrates CSTS's practical utility by identifying an algorithm's previously undocumented sensitivity to non-normal distributions, illustrating how the benchmark enables precise diagnosis of methodological limitations. CSTS advances rigorous evaluation standards for correlation-based time series clustering.
Partial Correlations in Compositional Data Analysis
Partial correlations quantify linear association between two variables adjusting for the influence of the remaining variables. They form the backbone for graphical models and are readily obtained from the inverse of the covariance matrix. For compositional data, the covariance structure is specified from log ratios of variables, so unless we try to "open" the data via a normalization, this implies changes in the definition and interpretation of partial correlations. In the present work, we elucidate how results derived by Aitchison (1986) lead to a natural definition of partial correlation that has a number of advantages over current measures of association. For this, we show that the residuals of log-ratios between a variable with a reference, when adjusting for all remaining variables including the reference, are reference-independent. Since the reference itself can be controlled for, correlations between residuals are defined for the variables directly without the necessity to recur to ratios except when specifying which variables are partialled out. Thus, perhaps surprisingly, partial correlations do not have the problems commonly found with measures of pairwise association on compositional data. They are well-defined between two variables, are properly scaled, and allow for negative association. By design, they are subcompositionally incoherent, but they share this property with conventional partial correlations (where results change when adjusting for the influence of fewer variables). We discuss the equivalence with normalization-based approaches whenever the normalizing variables are controlled for. We also discuss the partial variances and correlations we obtain from a previously studied data set of Roman glass cups.
Empirical analysis of Binding Precedent efficiency in the Brazilian Supreme Court via Similar Case Retrieval
Binding precedents (S\'umulas Vinculantes) constitute a juridical instrument unique to the Brazilian legal system and whose objectives include the protection of the Federal Supreme Court against repetitive demands. Studies of the effectiveness of these instruments in decreasing the Court's exposure to similar cases, however, indicate that they tend to fail in such a direction, with some of the binding precedents seemingly creating new demands. We empirically assess the legal impact of five binding precedents, 11, 14, 17, 26 and 37, at the highest court level through their effects on the legal subjects they address. This analysis is only possible through the comparison of the Court's ruling about the precedents' themes before they are created, which means that these decisions should be detected through techniques of Similar Case Retrieval. The contributions of this article are therefore twofold: on the mathematical side, we compare the uses of different methods of Natural Language Processing -- TF-IDF, LSTM, BERT, and regex -- for Similar Case Retrieval, whereas on the legal side, we contrast the inefficiency of these binding precedents with a set of hypotheses that may justify their repeated usage. We observe that the deep learning models performed significantly worse in the specific Similar Case Retrieval task and that the reasons for binding precedents to fail in responding to repetitive demand are heterogeneous and case-dependent, making it impossible to single out a specific cause.
Questioning the Survey Responses of Large Language Models
As large language models increase in capability, researchers have started to conduct surveys of all kinds on these models with varying scientific motivations. In this work, we examine what we can learn from a model's survey responses on the basis of the well-established American Community Survey (ACS) by the U.S. Census Bureau. Evaluating more than a dozen different models, varying in size from a few hundred million to ten billion parameters, hundreds of thousands of times each on questions from the ACS, we systematically establish two dominant patterns. First, smaller models have a significant position and labeling bias, for example, towards survey responses labeled with the letter "A". This A-bias diminishes, albeit slowly, as model size increases. Second, when adjusting for this labeling bias through randomized answer ordering, models still do not trend toward US population statistics or those of any cognizable population. Rather, models across the board trend toward uniformly random aggregate statistics over survey responses. This pattern is robust to various different ways of prompting the model, including what is the de-facto standard. Our findings demonstrate that aggregate statistics of a language model's survey responses lack the signals found in human populations. This absence of statistical signal cautions about the use of survey responses from large language models at present time.
Feature Shift Detection: Localizing Which Features Have Shifted via Conditional Distribution Tests
While previous distribution shift detection approaches can identify if a shift has occurred, these approaches cannot localize which specific features have caused a distribution shift -- a critical step in diagnosing or fixing any underlying issue. For example, in military sensor networks, users will want to detect when one or more of the sensors has been compromised, and critically, they will want to know which specific sensors might be compromised. Thus, we first define a formalization of this problem as multiple conditional distribution hypothesis tests and propose both non-parametric and parametric statistical tests. For both efficiency and flexibility, we then propose to use a test statistic based on the density model score function (i.e. gradient with respect to the input) -- which can easily compute test statistics for all dimensions in a single forward and backward pass. Any density model could be used for computing the necessary statistics including deep density models such as normalizing flows or autoregressive models. We additionally develop methods for identifying when and where a shift occurs in multivariate time-series data and show results for multiple scenarios using realistic attack models on both simulated and real world data.
Blackbox Model Provenance via Palimpsestic Membership Inference
Suppose Alice trains an open-weight language model and Bob uses a blackbox derivative of Alice's model to produce text. Can Alice prove that Bob is using her model, either by querying Bob's derivative model (query setting) or from the text alone (observational setting)? We formulate this question as an independence testing problem--in which the null hypothesis is that Bob's model or text is independent of Alice's randomized training run--and investigate it through the lens of palimpsestic memorization in language models: models are more likely to memorize data seen later in training, so we can test whether Bob is using Alice's model using test statistics that capture correlation between Bob's model or text and the ordering of training examples in Alice's training run. If Alice has randomly shuffled her training data, then any significant correlation amounts to exactly quantifiable statistical evidence against the null hypothesis, regardless of the composition of Alice's training data. In the query setting, we directly estimate (via prompting) the likelihood Bob's model gives to Alice's training examples and order; we correlate the likelihoods of over 40 fine-tunes of various Pythia and OLMo base models ranging from 1B to 12B parameters with the base model's training data order, achieving a p-value on the order of at most 1e-8 in all but six cases. In the observational setting, we try two approaches based on estimating 1) the likelihood of Bob's text overlapping with spans of Alice's training examples and 2) the likelihood of Bob's text with respect to different versions of Alice's model we obtain by repeating the last phase (e.g., 1%) of her training run on reshuffled data. The second approach can reliably distinguish Bob's text from as little as a few hundred tokens; the first does not involve any retraining but requires many more tokens (several hundred thousand) to achieve high power.
Linking Datasets on Organizations Using Half A Billion Open Collaborated Records
Scholars studying organizations often work with multiple datasets lacking shared unique identifiers or covariates. In such situations, researchers may turn to approximate string matching methods to combine datasets. String matching, although useful, faces fundamental challenges. Even when two strings appear similar to humans, fuzzy matching often does not work because it fails to adapt to the informativeness of the character combinations presented. Worse, many entities have multiple names that are dissimilar (e.g., "Fannie Mae" and "Federal National Mortgage Association"), a case where string matching has little hope of succeeding. This paper introduces data from a prominent employment-related networking site (LinkedIn) as a tool to address these problems. We propose interconnected approaches to leveraging the massive amount of information from LinkedIn regarding organizational name-to-name links. The first approach builds a machine learning model for predicting matches from character strings, treating the trillions of user-contributed organizational name pairs as a training corpus: this approach constructs a string matching metric that explicitly maximizes match probabilities. A second approach identifies relationships between organization names using network representations of the LinkedIn data. A third approach combines the first and second. We document substantial improvements over fuzzy matching in applications, making all methods accessible in open-source software ("LinkOrgs").
Benchmarking Clinical Decision Support Search
Finding relevant literature underpins the practice of evidence-based medicine. From 2014 to 2016, TREC conducted a clinical decision support track, wherein participants were tasked with finding articles relevant to clinical questions posed by physicians. In total, 87 teams have participated over the past three years, generating 395 runs. During this period, each team has trialled a variety of methods. While there was significant overlap in the methods employed by different teams, the results were varied. Due to the diversity of the platforms used, the results arising from the different techniques are not directly comparable, reducing the ability to build on previous work. By using a stable platform, we have been able to compare different document and query processing techniques, allowing us to experiment with different search parameters. We have used our system to reproduce leading teams runs, and compare the results obtained. By benchmarking our indexing and search techniques, we can statistically test a variety of hypotheses, paving the way for further research.
A Comparative Study of Sentence Embedding Models for Assessing Semantic Variation
Analyzing the pattern of semantic variation in long real-world texts such as books or transcripts is interesting from the stylistic, cognitive, and linguistic perspectives. It is also useful for applications such as text segmentation, document summarization, and detection of semantic novelty. The recent emergence of several vector-space methods for sentence embedding has made such analysis feasible. However, this raises the issue of how consistent and meaningful the semantic representations produced by various methods are in themselves. In this paper, we compare several recent sentence embedding methods via time-series of semantic similarity between successive sentences and matrices of pairwise sentence similarity for multiple books of literature. In contrast to previous work using target tasks and curated datasets to compare sentence embedding methods, our approach provides an evaluation of the methods 'in the wild'. We find that most of the sentence embedding methods considered do infer highly correlated patterns of semantic similarity in a given document, but show interesting differences.
Modelling Major Disease Outbreaks in the 21st Century: A Causal Approach
Epidemiologists aiming to model the dynamics of global events face a significant challenge in identifying the factors linked with anomalies such as disease outbreaks. In this paper, we present a novel method for identifying the most important development sectors sensitive to disease outbreaks by using global development indicators as markers. We use statistical methods to assess the causative linkages between these indicators and disease outbreaks, as well as to find the most often ranked indicators. We used data imputation techniques in addition to statistical analysis to convert raw real-world data sets into meaningful data for causal inference. The application of various algorithms for the detection of causal linkages between the indicators is the subject of this research. Despite the fact that disparities in governmental policies between countries account for differences in causal linkages, several indicators emerge as important determinants sensitive to disease outbreaks over the world in the 21st Century.
Pattern Discovery in Time Series with Byte Pair Encoding
The growing popularity of wearable sensors has generated large quantities of temporal physiological and activity data. Ability to analyze this data offers new opportunities for real-time health monitoring and forecasting. However, temporal physiological data presents many analytic challenges: the data is noisy, contains many missing values, and each series has a different length. Most methods proposed for time series analysis and classification do not handle datasets with these characteristics nor do they offer interpretability and explainability, a critical requirement in the health domain. We propose an unsupervised method for learning representations of time series based on common patterns identified within them. The patterns are, interpretable, variable in length, and extracted using Byte Pair Encoding compression technique. In this way the method can capture both long-term and short-term dependencies present in the data. We show that this method applies to both univariate and multivariate time series and beats state-of-the-art approaches on a real world dataset collected from wearable sensors.
A Large-Scale Dataset of Search Interests Related to Disease X Originating from Different Geographic Regions
The World Health Organization added Disease X to their shortlist of blueprint priority diseases to represent a hypothetical, unknown pathogen that could cause a future epidemic. During different virus outbreaks of the past, such as COVID-19, Influenza, Lyme Disease, and Zika virus, researchers from various disciplines utilized Google Trends to mine multimodal components of web behavior to study, investigate, and analyze the global awareness, preparedness, and response associated with these respective virus outbreaks. As the world prepares for Disease X, a dataset on web behavior related to Disease X would be crucial to contribute towards the timely advancement of research in this field. Furthermore, none of the prior works in this field have focused on the development of a dataset to compile relevant web behavior data, which would help to prepare for Disease X. To address these research challenges, this work presents a dataset of web behavior related to Disease X, which emerged from different geographic regions of the world, between February 2018 and August 2023. Specifically, this dataset presents the search interests related to Disease X from 94 geographic regions. The dataset was developed by collecting data using Google Trends. The relevant search interests for all these regions for each month in this time range are available in this dataset. This paper also discusses the compliance of this dataset with the FAIR principles of scientific data management. Finally, an analysis of this dataset is presented to uphold the applicability, relevance, and usefulness of this dataset for the investigation of different research questions in the interrelated fields of Big Data, Data Mining, Healthcare, Epidemiology, and Data Analysis with a specific focus on Disease X.
Testing the Cosmological Principle: Astrometric Limits on Systemic Motion of Quasars at Different Cosmological Epochs
A sample of 60,410 bona fide optical quasars with astrometric proper motions in Gaia EDR3 and spectroscopic redshifts above 0.5 in an oval 8400 square degree area of the sky is constructed. Using orthogonal Zernike functions of polar coordinates, the proper motion fields are fitted in a weighted least-squares adjustment of the entire sample and of six equal bins of sorted redshifts. The overall fit with 37 Zernike functions reveals a statistically significant pattern, which is likely to be of instrumental origin. The main feature of this pattern is a chain of peaks and dips mostly in the R.A. component with an amplitude of 25~muas yr^{-1}. This field is subtracted from each of the six analogous fits for quasars grouped by redshifts covering the range 0.5 through 7.03, with median values 0.72, 1.00, 1.25, 1.52, 1.83, 2.34. The resulting residual patterns are noisier, with formal uncertainties up to 8~muas yr^{-1} in the central part of the area. We detect a single high-confidence Zernike term for R.A. proper motion components of quasars with redshifts around 1.52 representing a general gradient of 30 muas yr^{-1} over 150degr on the sky. We do not find any small- or medium-scale systemic variations of the residual proper motion field as functions of redshift above the 2.5,sigma significance level.
Forensic Activity Classification Using Digital Traces from iPhones: A Machine Learning-based Approach
Smartphones and smartwatches are ever-present in daily life, and provide a rich source of information on their users' behaviour. In particular, digital traces derived from the phone's embedded movement sensors present an opportunity for a forensic investigator to gain insight into a person's physical activities. In this work, we present a machine learning-based approach to translate digital traces into likelihood ratios (LRs) for different types of physical activities. Evaluating on a new dataset, NFI\_FARED, which contains digital traces from four different types of iPhones labelled with 19 activities, it was found that our approach could produce useful LR systems to distinguish 167 out of a possible 171 activity pairings. The same approach was extended to analyse likelihoods for multiple activities (or groups of activities) simultaneously and create activity timelines to aid in both the early and latter stages of forensic investigations. The dataset and all code required to replicate the results have also been made public to encourage further research on this topic.
A Pipeline for Business Intelligence and Data-Driven Root Cause Analysis on Categorical Data
Business intelligence (BI) is any knowledge derived from existing data that may be strategically applied within a business. Data mining is a technique or method for extracting BI from data using statistical data modeling. Finding relationships or correlations between the various data items that have been collected can be used to boost business performance or at the very least better comprehend what is going on. Root cause analysis (RCA) is discovering the root causes of problems or events to identify appropriate solutions. RCA can show why an event occurred and this can help in avoiding occurrences of an issue in the future. This paper proposes a new clustering + association rule mining pipeline for getting business insights from data. The results of this pipeline are in the form of association rules having consequents, antecedents, and various metrics to evaluate these rules. The results of this pipeline can help in anchoring important business decisions and can also be used by data scientists for updating existing models or while developing new ones. The occurrence of any event is explained by its antecedents in the generated rules. Hence this output can also help in data-driven root cause analysis.
A Solvable Model of Neural Scaling Laws
Large language models with a huge number of parameters, when trained on near internet-sized number of tokens, have been empirically shown to obey neural scaling laws: specifically, their performance behaves predictably as a power law in either parameters or dataset size until bottlenecked by the other resource. To understand this better, we first identify the necessary properties allowing such scaling laws to arise and then propose a statistical model -- a joint generative data model and random feature model -- that captures this neural scaling phenomenology. By solving this model in the dual limit of large training set size and large number of parameters, we gain insight into (i) the statistical structure of datasets and tasks that lead to scaling laws, (ii) the way nonlinear feature maps, such as those provided by neural networks, enable scaling laws when trained on these datasets, (iii) the optimality of the equiparameterization scaling of training sets and parameters, and (iv) whether such scaling laws can break down and how they behave when they do. Key findings are the manner in which the power laws that occur in the statistics of natural datasets are extended by nonlinear random feature maps and then translated into power-law scalings of the test loss and how the finite extent of the data's spectral power law causes the model's performance to plateau.
Fair Densities via Boosting the Sufficient Statistics of Exponential Families
We introduce a boosting algorithm to pre-process data for fairness. Starting from an initial fair but inaccurate distribution, our approach shifts towards better data fitting while still ensuring a minimal fairness guarantee. To do so, it learns the sufficient statistics of an exponential family with boosting-compliant convergence. Importantly, we are able to theoretically prove that the learned distribution will have a representation rate and statistical rate data fairness guarantee. Unlike recent optimization based pre-processing methods, our approach can be easily adapted for continuous domain features. Furthermore, when the weak learners are specified to be decision trees, the sufficient statistics of the learned distribution can be examined to provide clues on sources of (un)fairness. Empirical results are present to display the quality of result on real-world data.
Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board
This paper presents a dataset with over 3.3M threads and 134.5M posts from the Politically Incorrect board (/pol/) of the imageboard forum 4chan, posted over a period of almost 3.5 years (June 2016-November 2019). To the best of our knowledge, this represents the largest publicly available 4chan dataset, providing the community with an archive of posts that have been permanently deleted from 4chan and are otherwise inaccessible. We augment the data with a set of additional labels, including toxicity scores and the named entities mentioned in each post. We also present a statistical analysis of the dataset, providing an overview of what researchers interested in using it can expect, as well as a simple content analysis, shedding light on the most prominent discussion topics, the most popular entities mentioned, and the toxicity level of each post. Overall, we are confident that our work will motivate and assist researchers in studying and understanding 4chan, as well as its role on the greater Web. For instance, we hope this dataset may be used for cross-platform studies of social media, as well as being useful for other types of research like natural language processing. Finally, our dataset can assist qualitative work focusing on in-depth case studies of specific narratives, events, or social theories.
MusPy: A Toolkit for Symbolic Music Generation
In this paper, we present MusPy, an open source Python library for symbolic music generation. MusPy provides easy-to-use tools for essential components in a music generation system, including dataset management, data I/O, data preprocessing and model evaluation. In order to showcase its potential, we present statistical analysis of the eleven datasets currently supported by MusPy. Moreover, we conduct a cross-dataset generalizability experiment by training an autoregressive model on each dataset and measuring held-out likelihood on the others---a process which is made easier by MusPy's dataset management system. The results provide a map of domain overlap between various commonly used datasets and show that some datasets contain more representative cross-genre samples than others. Along with the dataset analysis, these results might serve as a guide for choosing datasets in future research. Source code and documentation are available at https://github.com/salu133445/muspy .
On the Power of the Weisfeiler-Leman Test for Graph Motif Parameters
Seminal research in the field of graph neural networks (GNNs) has revealed a direct correspondence between the expressive capabilities of GNNs and the k-dimensional Weisfeiler-Leman (kWL) test, a widely-recognized method for verifying graph isomorphism. This connection has reignited interest in comprehending the specific graph properties effectively distinguishable by the kWL test. A central focus of research in this field revolves around determining the least dimensionality k, for which kWL can discern graphs with different number of occurrences of a pattern graph P. We refer to such a least k as the WL-dimension of this pattern counting problem. This inquiry traditionally delves into two distinct counting problems related to patterns: subgraph counting and induced subgraph counting. Intriguingly, despite their initial appearance as separate challenges with seemingly divergent approaches, both of these problems are interconnected components of a more comprehensive problem: "graph motif parameters". In this paper, we provide a precise characterization of the WL-dimension of labeled graph motif parameters. As specific instances of this result, we obtain characterizations of the WL-dimension of the subgraph counting and induced subgraph counting problem for every labeled pattern P. We additionally demonstrate that in cases where the kWL test distinguishes between graphs with varying occurrences of a pattern P, the exact number of occurrences of P can be computed uniformly using only local information of the last layer of a corresponding GNN. We finally delve into the challenge of recognizing the WL-dimension of various graph parameters. We give a polynomial time algorithm for determining the WL-dimension of the subgraph counting problem for given pattern P, answering an open question from previous work.
Fractal Patterns May Unravel the Intelligence in Next-Token Prediction
We study the fractal structure of language, aiming to provide a precise formalism for quantifying properties that may have been previously suspected but not formally shown. We establish that language is: (1) self-similar, exhibiting complexities at all levels of granularity, with no particular characteristic context length, and (2) long-range dependent (LRD), with a Hurst parameter of approximately H=0.70. Based on these findings, we argue that short-term patterns/dependencies in language, such as in paragraphs, mirror the patterns/dependencies over larger scopes, like entire documents. This may shed some light on how next-token prediction can lead to a comprehension of the structure of text at multiple levels of granularity, from words and clauses to broader contexts and intents. We also demonstrate that fractal parameters improve upon perplexity-based bits-per-byte (BPB) in predicting downstream performance. We hope these findings offer a fresh perspective on language and the mechanisms underlying the success of LLMs.
Categorical Stochastic Processes and Likelihood
In this work we take a Category Theoretic perspective on the relationship between probabilistic modeling and function approximation. We begin by defining two extensions of function composition to stochastic process subordination: one based on the co-Kleisli category under the comonad (Omega x -) and one based on the parameterization of a category with a Lawvere theory. We show how these extensions relate to the category Stoch and other Markov Categories. Next, we apply the Para construction to extend stochastic processes to parameterized statistical models and we define a way to compose the likelihood functions of these models. We conclude with a demonstration of how the Maximum Likelihood Estimation procedure defines an identity-on-objects functor from the category of statistical models to the category of Learners. Code to accompany this paper can be found at https://github.com/dshieble/Categorical_Stochastic_Processes_and_Likelihood
An Introduction to Conditional Random Fields
Often we wish to predict a large number of variables that depend on each other as well as on other observed variables. Structured prediction methods are essentially a combination of classification and graphical modeling, combining the ability of graphical models to compactly model multivariate data with the ability of classification methods to perform prediction using large sets of input features. This tutorial describes conditional random fields, a popular probabilistic method for structured prediction. CRFs have seen wide application in natural language processing, computer vision, and bioinformatics. We describe methods for inference and parameter estimation for CRFs, including practical issues for implementing large scale CRFs. We do not assume previous knowledge of graphical modeling, so this tutorial is intended to be useful to practitioners in a wide variety of fields.
The order in speech disorder: a scoping review of state of the art machine learning methods for clinical speech classification
Background:Speech patterns have emerged as potential diagnostic markers for conditions with varying etiologies. Machine learning (ML) presents an opportunity to harness these patterns for accurate disease diagnosis. Objective: This review synthesized findings from studies exploring ML's capability in leveraging speech for the diagnosis of neurological, laryngeal and mental disorders. Methods: A systematic examination of 564 articles was conducted with 91 articles included in the study, which encompassed a wide spectrum of conditions, ranging from voice pathologies to mental and neurological disorders. Methods for speech classifications were assessed based on the relevant studies and scored between 0-10 based on the reported diagnostic accuracy of their ML models. Results: High diagnostic accuracies were consistently observed for laryngeal disorders, dysarthria, and changes related to speech in Parkinsons disease. These findings indicate the robust potential of speech as a diagnostic tool. Disorders like depression, schizophrenia, mild cognitive impairment and Alzheimers dementia also demonstrated high accuracies, albeit with some variability across studies. Meanwhile, disorders like OCD and autism highlighted the need for more extensive research to ascertain the relationship between speech patterns and the respective conditions. Conclusion: ML models utilizing speech patterns demonstrate promising potential in diagnosing a range of mental, laryngeal, and neurological disorders. However, the efficacy varies across conditions, and further research is needed. The integration of these models into clinical practice could potentially revolutionize the evaluation and diagnosis of a number of different medical conditions.
Probabilistic Partitive Partitioning (PPP)
Clustering is a NP-hard problem. Thus, no optimal algorithm exists, heuristics are applied to cluster the data. Heuristics can be very resource-intensive, if not applied properly. For substantially large data sets computational efficiencies can be achieved by reducing the input space if a minimal loss of information can be achieved. Clustering algorithms, in general, face two common problems: 1) these converge to different settings with different initial conditions and; 2) the number of clusters has to be arbitrarily decided beforehand. This problem has become critical in the realm of big data. Recently, clustering algorithms have emerged which can speedup computations using parallel processing over the grid but face the aforementioned problems. Goals: Our goals are to find methods to cluster data which: 1) guarantee convergence to the same settings irrespective of the initial conditions; 2) eliminate the need to establish the number of clusters beforehand, and 3) can be applied to cluster large datasets. Methods: We introduce a method that combines probabilistic and combinatorial clustering methods to produce repeatable and compact clusters that are not sensitive to initial conditions. This method harnesses the power of k-means (a combinatorial clustering method) to cluster/partition very large dimensional datasets and uses the Gaussian Mixture Model (a probabilistic clustering method) to validate the k-means partitions. Results: We show that this method produces very compact clusters that are not sensitive to initial conditions. This method can be used to identify the most 'separable' set in a dataset which increases the 'clusterability' of a dataset. This method also eliminates the need to specify the number of clusters in advance.
An Alternative Framework for Time Series Decomposition and Forecasting and its Relevance for Portfolio Choice: A Comparative Study of the Indian Consumer Durable and Small Cap Sectors
One of the challenging research problems in the domain of time series analysis and forecasting is making efficient and robust prediction of stock market prices. With rapid development and evolution of sophisticated algorithms and with the availability of extremely fast computing platforms, it has now become possible to effectively extract, store, process and analyze high volume stock market time series data. Complex algorithms for forecasting are now available for speedy execution over parallel architecture leading to fairly accurate results. In this paper, we have used time series data of the two sectors of the Indian economy: Consumer Durables sector and the Small Cap sector for the period January 2010 to December 2015 and proposed a decomposition approach for better understanding of the behavior of each of the time series. Our contention is that various sectors reveal different time series patterns and understanding them is essential for portfolio formation. Further, based on this structural analysis, we have also proposed several robust forecasting techniques and analyzed their accuracy in prediction using suitably chosen training and test data sets. Extensive results are presented to demonstrate the effectiveness of our propositions.
A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT
Prompt engineering is an increasingly important skill set needed to converse effectively with large language models (LLMs), such as ChatGPT. Prompts are instructions given to an LLM to enforce rules, automate processes, and ensure specific qualities (and quantities) of generated output. Prompts are also a form of programming that can customize the outputs and interactions with an LLM. This paper describes a catalog of prompt engineering techniques presented in pattern form that have been applied to solve common problems when conversing with LLMs. Prompt patterns are a knowledge transfer method analogous to software patterns since they provide reusable solutions to common problems faced in a particular context, i.e., output generation and interaction when working with LLMs. This paper provides the following contributions to research on prompt engineering that apply LLMs to automate software development tasks. First, it provides a framework for documenting patterns for structuring prompts to solve a range of problems so that they can be adapted to different domains. Second, it presents a catalog of patterns that have been applied successfully to improve the outputs of LLM conversations. Third, it explains how prompts can be built from multiple patterns and illustrates prompt patterns that benefit from combination with other prompt patterns.
deep-significance - Easy and Meaningful Statistical Significance Testing in the Age of Neural Networks
A lot of Machine Learning (ML) and Deep Learning (DL) research is of an empirical nature. Nevertheless, statistical significance testing (SST) is still not widely used. This endangers true progress, as seeming improvements over a baseline might be statistical flukes, leading follow-up research astray while wasting human and computational resources. Here, we provide an easy-to-use package containing different significance tests and utility functions specifically tailored towards research needs and usability.
