From multiword expressions to partially schematic constructions

Chadi Ben Youssef’s work brings together three complementary perspectives (computational linguistics, corpus linguistics, and discourse analysis), unified by a shared reliance on multifactorial statistical modeling to uncover patterns of language use. After earning his Ph.D. from the University of California, Santa Barbara, he moved to Switzerland, where he is currently a postdoctoral researcher at the University of Neuchâtel. His project, “Detecting connectivity changes inductively in a network of constructions”, seeks to advance Diachronic Construction Grammar (Noël 2007; Traugott and Trousdale 2013; Hilpert 2021) as a general framework for understanding language change.
Research on Multiword Expressions (MWEs) has gained increasing significance over the past decade due to its relevance for fields such as lexicography, language learning and acquisition, and, more broadly, all aspects of tokenization and parsing. While recent studies have explored the use of distributional models for the discovery and identification of MWEs, much of the existing work still relies on co-occurrence frequencies and various association measures.
In this talk, I present my dissertation project, mMERGE, a corpus-driven algorithm for discovering Multiword Expressions. Building on the work of Stefan Th. Gries (2022), mMERGE is a recursive discovery algorithm, implemented in Julia, that integrates five well-studied corpus metrics: token frequency, dispersion, type frequency, bidirectional entropy, and bidirectional association. The algorithm proceeds bottom-up, iteratively merging strongly associated sequences into increasingly complex units, which allows it to identify MWEs of varying lengths without relying on predefined lexicons.
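The bottom-up merging idea can be illustrated with a minimal sketch. It is written in Python for brevity rather than Julia, and it is not the mMERGE implementation: it scores adjacent pairs with a single measure, pointwise mutual information, plus a token-frequency floor, where mMERGE combines five metrics. The function names (`best_pair`, `merge_pair`, `discover`) and the parameters `min_freq`, `min_pmi`, and `max_iter` are hypothetical and chosen only for this example.

```python
import math
from collections import Counter


def best_pair(tokens, min_freq=2, min_pmi=0.0):
    """Return the adjacent pair with the highest PMI, or None if none qualifies.

    Assumption: PMI stands in for the combined scoring used by mMERGE.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    best, best_score = None, min_pmi
    for (a, b), freq in bigrams.items():
        if freq < min_freq:
            continue  # token-frequency floor
        pmi = math.log2(freq * n / (unigrams[a] * unigrams[b]))
        if pmi > best_score:
            best, best_score = (a, b), pmi
    return best


def merge_pair(tokens, pair):
    """Rewrite the token list so every occurrence of `pair` becomes one unit."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + " " + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def discover(tokens, max_iter=10, min_freq=2, min_pmi=0.0):
    """Greedily merge the best-scoring pair until nothing qualifies."""
    merged = []
    for _ in range(max_iter):
        pair = best_pair(tokens, min_freq, min_pmi)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
        merged.append(" ".join(pair))
    return tokens, merged
```

Because merged units re-enter the token stream as single items, a later iteration can combine an already merged unit with its neighbor. This is how a sequence such as "in spite of" can emerge from two successive pairwise merges without any preset length limit.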
The talk then extends this framework by introducing a construction discovery layer that moves beyond strictly contiguous expressions. Starting from high-confidence merged sequences, the algorithm is augmented to detect recurrent patterns with internal variability, identifying constructions with stable anchors and variable slots (e.g., take [NP] into account). This extension relies on the systematic exploration of interrupted realizations, the evaluation of slot coherence and boundedness, and the preservation of association strength across variants. As a result, mMERGE is reconceptualized not only as an MWE discovery tool but also as a method for uncovering partially schematic constructions, bridging the gap between corpus-driven phrase extraction and usage-based approaches to Construction Grammar.
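The search for interrupted realizations can be sketched in the same hedged spirit. The Python example below is an assumption-laden illustration, not the procedure presented in the talk: it collects the material found between two fixed anchors, caps the gap length as a crude stand-in for boundedness, and reports the Shannon entropy of the fillers as one possible ingredient of a slot-coherence measure. The names `slot_fillers`, `filler_entropy`, and `max_gap` are hypothetical.

```python
import math
from collections import Counter


def slot_fillers(tokens, left, right, max_gap=3):
    """Count the token sequences occurring between the anchors `left` and `right`.

    `left` and `right` are lists of tokens. Only gaps of 1..max_gap tokens are
    considered, so contiguous realizations are deliberately excluded.
    """
    fillers = Counter()
    n_left, n_right = len(left), len(right)
    for i in range(len(tokens) - n_left + 1):
        if tokens[i:i + n_left] != left:
            continue
        start = i + n_left
        for gap in range(1, max_gap + 1):
            j = start + gap
            if tokens[j:j + n_right] == right:
                fillers[tuple(tokens[start:j])] += 1
                break  # keep the shortest match for this left anchor
    return fillers


def filler_entropy(fillers):
    """Shannon entropy (in bits) of the filler distribution."""
    total = sum(fillers.values())
    return -sum((c / total) * math.log2(c / total) for c in fillers.values())
```

For the anchors `["take"]` and `["into", "account"]`, the function returns the observed fillers with their counts. The number of distinct fillers corresponds to the type frequency of the slot, and the entropy indicates how evenly the slot is filled, which are the kinds of signals that distinguish an open slot from a fixed sequence.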