There has been an ever-growing interest in tasks targeting Natural Language Understanding and Reasoning. Although deep learning models have achieved human-like performance on many such tasks, it has also been repeatedly shown that they lack the precision, generalization power, reasoning capabilities, and explainability found in more traditional, symbolic approaches. Current research has therefore begun to employ hybrid methods that combine the strengths of both traditions while mitigating their weaknesses. This workshop aims to promote this research direction and to foster fruitful dialog between the two disciplines by bringing together researchers working on hybrid methods in any subfield of Natural Language Understanding and Reasoning.
NALOMA began by focusing on bridging the gap between machine learning and natural logic but has since broadened to include work integrating symbolic methods and machine learning. Even so, combining deep learning with natural logic remains a highly promising direction. The remarkable capabilities of LLMs, which operate directly on natural language expressions, make them a natural fit for natural logic, a family of logics whose formulas resemble natural language expressions.
The NALOMA workshop is endorsed by SIGSEM.

Workshop Program

Talk Abstracts

Keynote · Slides
Understanding Complex Situation Descriptions

Aaron Steven White

We use natural language to convey information about situations: things that happen or stuff that is true. This ability is supported by systematic relationships between the way we conceptualize situations and the way we describe them. These systematic relationships in turn underwrite inferences that go beyond what one strictly says in describing a situation. The question that motivates this talk is how to design systems that correctly capture the inferences we draw about situations on the basis of their descriptions.
Classical approaches to this question, exemplified in their modern form by graph-based representations such as Uniform Meaning Representation, attempt to capture the situation conceptualization associated with a description using a symbolic situation ontology and to draw inferences on the basis of rules stated over that ontology. An increasingly popular alternative to such ontology-factored approaches is the family of ontology-free approaches, which attempt to represent inferences about a situation directly as natural language strings associated with a situation description, thereby bypassing the problem of engineering a situation ontology entirely.
I discuss the benefits and drawbacks of these two approaches and present case studies in synthesizing them, focusing specifically on how best to capture inferences about complex situations: situations, like building a house, that may themselves be composed of subsituations, like laying the house’s foundations, framing the house, etc. I argue that we should ultimately strive for ontology-free representations, but that the challenges inherent in reasoning about complex situations highlight the persistent benefits of situation ontologies in providing representational scaffolding for the construction and evaluation of such representations.

Keynote · Slides
How can large language models become more human?

Mehrnoosh Sadrzadeh (Joint work with Daphne Wang, Miloš Stanojević, Wing-Yee Chow, and Richard Breheny)

Psycholinguistic experiments reveal that the efficiency of human language use is founded on predictions at both the syntactic and lexical levels. Previous models of human prediction that exploit LLMs have used an information-theoretic measure called surprisal, with success on naturalistic text in a wide variety of languages, but under-performance on challenging text such as garden path sentences. This paper introduces a novel framework that combines the lexical predictions of an LLM with the syntactic structures provided by a dependency parser. The framework gives rise to an Incompatibility Fraction. When tested on two garden path datasets, it correlated well with human reading times, distinguished between easy and hard garden path sentences, and outperformed surprisal.

Keynote · Slides
Understanding the Logic of Generative AI through Logic

Kyle Richardson

Symbolic logic has long served as the de facto language for expressing complex knowledge throughout computer science, owing to its clean semantics. Symbolic approaches to reasoning that are driven by declarative knowledge, in sharp contrast to purely machine learning-based approaches, have the advantage of allowing us to reason transparently about the behavior and correctness of the resulting systems. In this talk, we focus on the broad question: Can the declarative approach be leveraged to better understand and formally specify algorithms for large language models (LLMs)? We focus on formalizing recent direct preference alignment (DPA) loss functions, such as DPO, that are currently at the forefront of LLM alignment. Specifically, we ask: Given an existing DPA loss, can we systematically derive a symbolic expression that characterizes its semantics? We outline the details of a novel formalism we developed for these purposes. We also discuss how this formal view of preference learning sheds new light on both the size and structure of the DPA loss landscape and makes it possible to derive new alignment algorithms from first principles. Our framework and approach aim not only to provide guidance for the AI alignment community, but also to open up new opportunities for researchers in formal semantics to engage more directly with the development and analysis of LLM algorithms.

Paper · Slides
Implementing a Logical Inference System for Japanese Comparatives

Yosuke Mikami, Daiki Matsuoka, Hitomi Yanaka

Natural Language Inference (NLI) involving comparatives is challenging because it requires understanding quantities and comparative relations expressed by sentences. While some approaches leverage Large Language Models (LLMs), we focus on logic-based approaches grounded in compositional semantics, which are promising for robust handling of numerical and logical expressions. Previous studies along these lines have proposed logical inference systems for English comparatives. However, it has been pointed out that there are several morphological and semantic differences between Japanese and English comparatives. These differences make it difficult to apply such systems directly to Japanese comparatives. To address this gap, this study proposes ccg-jcomp, a logical inference system for Japanese comparatives based on compositional semantics. We evaluate the proposed system on a Japanese NLI dataset containing comparative expressions. We demonstrate the effectiveness of our system by comparing its accuracy with that of existing LLMs.

Paper · Slides
Unpacking Legal Reasoning in LLMs: Chain-of-Thought as a Key to Human-Machine Alignment in Essay-Based NLU Tasks

Ying-Chu Yu, Sieh-Chuen Huang, Hsuan-Lei Shao

This study evaluates how Large Language Models (LLMs) perform deep legal reasoning on Taiwanese Status Law questions and investigates how Chain-of-Thought (CoT) prompting affects interpretability, alignment, and generalization. Using a two-stage evaluation framework, we first decomposed six real legal essay questions into 68 sub-questions covering issue spotting, statutory application, and inheritance computation. In Stage Two, full-length answers were collected under baseline and CoT-prompted conditions. Four LLMs—ChatGPT-4o, Gemini, Grok3, and Copilot—were tested. Results show CoT prompting significantly improved accuracy for Gemini (from 83.2% to 94.5%, p < 0.05) and Grok3, with moderate but consistent gains for ChatGPT and Copilot. Human evaluation of full-length responses revealed CoT answers received notably higher scores in issue coverage and reasoning clarity, with ChatGPT and Gemini gaining +2.67 and +1.92 points respectively. Despite these gains, legal misclassifications persist, highlighting alignment gaps between surface-level fluency and expert legal reasoning. This work opens the black box of legal NLU by tracing LLM reasoning chains, quantifying performance shifts under structured prompting, and providing a diagnostic benchmark for complex, open-ended legal tasks beyond multiple-choice settings.

Paper · Slides
Dataset Creation for Visual Entailment using Generative AI

Rob Reijtenbach, Suzan Verberne, Gijs Wijnholds

In this paper we present and validate a new synthetic dataset for training visual entailment models. Existing datasets for visual entailment are small and sparse compared to datasets for textual entailment, and manually creating datasets is labor-intensive. We base our synthetic dataset on the SNLI dataset for textual entailment. We take the premise text from SNLI as input prompts to a generative image model, Stable Diffusion, creating an image to replace each textual premise. We evaluate our dataset both intrinsically and extrinsically. For extrinsic evaluation, we evaluate the validity of the generated images by using them as training data for a visual entailment classifier based on CLIP feature vectors. We find that synthetic training data leads to only a slight drop in quality on SNLI-VE, with an F-score of 0.686 compared to 0.703 when trained on real data. We also compare the quality of our generated training data to the original training data on another dataset, SICK-VTE. Again, there is only a slight drop in F-score: from 0.400 to 0.384. These results indicate that in settings with data sparsity, synthetic data can be a promising solution for training visual entailment models.

Paper · Slides
In the Mood for Inference: Logic-Based Natural Language Inference with Large Language Models

Bill Noble, Rasmus Blanck, Gijs Wijnholds

In this paper we explore challenging Natural Language Inference datasets with a logic-based approach and Large Language Models (LLMs), in order to assess the validity of this hybrid strategy. We report on an experiment which combines an LLM meta-prompting strategy, eliciting logical representations, and Prover9, a first-order logic theorem prover. In addition, we experiment with the inclusion of (logical) world knowledge. Our findings suggest that (i) broad performance is sometimes on par, (ii) formula generation is rather brittle, and (iii) world knowledge aids performance relative to data annotation. We argue that these results explicate the weaknesses of both approaches. As such, we consider this study a source of inspiration for future work in the field of neuro-symbolic reasoning.

Paper · Slides
Building a Compact Math Corpus

Andrea Ferreira

This paper introduces the Compact Math Corpus (CMC), a preliminary resource for natural language processing in the mathematics domain. We process three open-access undergraduate textbooks from distinct mathematical areas and annotate them in the CoNLL-U format using a lightweight pipeline based on the spaCy Small model. The structured output enables the extraction of syntactic bigrams and TF-IDF scores, supporting a syntactic-semantic analysis of mathematical sentences. From the annotated data, we construct a classification dataset comprising bigrams potentially representing mathematical concepts, along with representative example sentences. We combine CMC with the conversational corpus UD English EWT and train a logistic regression model with K-fold cross-validation, achieving a minimum macro-F1 score of 0.989. These results indicate the feasibility of automatic concept identification in mathematical texts. The study is designed for easy replication in low-resource settings and to promote sustainable research practices. Our approach offers a viable path to tasks such as parser adaptation, terminology extraction, multiword expression modeling, and improved analysis of mathematical language structures.

Abstract · Slides
MERGE: Minimal Expression-Replacement GEneralization Test for Natural Language Inference

Mădălina Zgreabăn, Tejaswini Deoskar, Lasha Abzianidze

In recent years, many NLI benchmarks have shown that language models struggle to generalize well. However, these benchmarks are costly to build without automation and can result in unfair comparisons if they are not similar to the models’ training data. To bridge these gaps, we propose the Minimal Expression-Replacement GEneralization (MERGE) test, where we show how to create multiple variants of NLI problems without changing their sentence length, word overlap size, or entailment label. We evaluate NLI models with these variants, counting a prediction for an NLI problem as correct only when all its variants are classified correctly. Our results suggest that models do not generalize well, as they fail to correctly predict variants of almost 10% of NLI problems, despite the variants sharing the same underlying reasoning.

Contributed talk · Slides
Automatic Evaluation of Linguistic Validity in Japanese CCG Treebanks

Asa Tomita, Hitomi Yanaka, Daisuke Bekki

In Natural Language Inference, the accuracy of systems based on compositional semantics depends on the quality of syntactic analysis, which in turn relies on linguistically valid training and evaluation data, typically provided by treebanks. However, conventional treebank evaluation metrics focus on data coverage and fail to assess the linguistic validity of syntactic structures. This paper proposes novel evaluation methods to enable automatic and multi-faceted assessment of linguistic validity. We apply these methods to a Japanese treebank based on Combinatory Categorial Grammar and report the evaluation results.

Contributed talk · Slides
How Often Does Natural Logic Actually Meet Machine Learning?

Lasha Abzianidze

Reflecting on the workshop title raises the question of how frequently, and to what extent, natural logic has been coupled with machine learning in recent years. Unlike other formal logics, natural logic can be seen as more language-friendly: its formulas typically closely follow the structure of natural language expressions, and it supports a degree of quick reasoning directly over natural language. But does this generally make natural logic LLM-friendly, i.e., easier to pair successfully with LLMs? In this talk, we survey several works that have combined a version of natural logic with deep learning, with varying degrees of success.

Call for Papers

The NALOMA workshop invites submissions on any (theoretical or computational) aspect of hybrid methods concerning Natural Language Understanding and Reasoning (NLU&R). The topics include but are not limited to:
  • Hybrid NLU&R systems that integrate logic-based/symbolic methods with neural networks
  • Explainable NLU&R (with structured explanations)
  • Opening the black-box of deep learning in NLU&R
  • Downstream applications of hybrid NLU&R systems
  • Probabilistic semantics for NLU&R
  • Comparison and contrast between symbolic and deep learning work on NLU&R
  • Creation, criticism, refinement, and augmentation of NLU&R datasets
  • (Dis)Alignment of humans and machines on NLU&R tasks
  • Addressing inherent human disagreements in NLU&R tasks
  • Generalization of NLU&R systems
  • Fine-grained evaluation of NLU&R systems
We invite two types of submissions:
  • Archival (long or short) papers should report on complete, original and unpublished research. Accepted papers will be published in the workshop proceedings and appear in the ACL anthology. Short and long papers may consist of up to 4 and 8 pages of content, respectively, plus unlimited references. Camera-ready versions of papers will be given one additional page of content so that reviewers' comments can be taken into account.
  • Extended abstracts may report on work in progress or work that was recently published/accepted at a different venue. Extended abstracts will not be included in the workshop proceedings. Thus, the unpublished work will retain its status and can be submitted to another venue. This webpage will link to the accepted extended abstracts. The extended abstracts should not contain an abstract section and may consist of up to 2 pages of content, plus unlimited references.
Both accepted papers and extended abstracts are expected to be presented at the workshop. Extended abstracts will be presented as talks or posters at the discretion of the program committee.
Submissions will be reviewed double-blind, and all long/short papers and extended abstracts must be anonymous, i.e., they must not reveal the author(s) on the title page or through self-references. Both extended abstracts and papers must be formatted according to the ACL style files or the ACL Overleaf template. All submissions must adhere to the ARR guidelines on Anonymized Review, Authorship, Citation and Comparison, and the Ethics Policy (without requiring completion of the responsible NLP research checklist).
Both papers and extended abstracts should be submitted via openreview.
The workshop participants must register for ESSLLI 2025.

Important Dates

  • Deadline for papers & extended abstracts: 6 May (extended from 25 April)
  • Notification: 27 May
  • ESSLLI registration dates: 31 May (early)
  • Camera-ready due: 20 June
  • Workshop: 4-8 August
  • All dates are AoE

Keynotes

Aaron Steven White
University of Rochester
Kyle Richardson
Allen Institute for AI
Mehrnoosh Sadrzadeh
University College London

Program Committee

  • Lasha Abzianidze (co-chair), Utrecht University
  • Valeria de Paiva (co-chair), Topos Institute
  • Stergios Chatzikyriakidis, University of Crete
  • Aikaterini-Lida Kalouli, Bundesdruckerei GmbH
  • Katrin Erk, University of Texas at Austin
  • Hai Hu, Shanghai Jiao Tong University
  • Thomas Icard, Stanford University
  • Lawrence S. Moss, Indiana University
  • Hitomi Yanaka, University of Tokyo and Riken Institute