Cross-linguistic Semantic and Syntactic Representation



This is one of two partner projects researching the semantics and syntax of different languages. The Hebrew University of Jerusalem is the home institution for this project. View the Melbourne-based partner project.

The goals of this project are to:

  • Study cross-linguistic alignment and divergence patterns through parallel corpora
  • Develop more effective mappings of distributional spaces across languages

The details

The technological and theoretical importance of cross-linguistic applicability in semantic and syntactic representation has long been recognized, but achieving this goal has proved extremely difficult. The project will make progress towards a definition of a semantic and syntactic scheme that can be applied consistently across languages, by building on two major bodies of work:

  1. At the lexical level, we will build on the expanding body of work on mapping the semantic spaces of different languages [17, 18, 19]. Despite the considerable success of these methods, the strong interest they have attracted in the research community [20], and their value for downstream applications such as the automatic compilation of multilingual dictionaries, existing approaches make simplistic assumptions about the nature of the mapping between the semantic spaces of different languages [21].
  2. At the sentence level, we will build on the Universal Dependencies scheme [22] for syntactic representation and the UCCA scheme [23] for semantic representation. Both schemes build on work in linguistic typology and have been applied to a number of languages. However, their categories remain coarse-grained, and the relation between the sentence level and the lexical level remains largely unexplored.
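To make the sentence-level representations above concrete, here is a minimal sketch of reading a Universal Dependencies annotation in the CoNLL-U format and listing its head-dependent relations. The sentence and its annotations are a toy illustration, not data from the project.

```python
# Minimal sketch: parse a toy CoNLL-U fragment (the tab-separated format used
# by Universal Dependencies) and print each dependency relation.
# Columns used: 0 = token id, 1 = word form, 6 = head id, 7 = relation label.

conllu = """\
1\tthe\tthe\tDET\t_\t_\t2\tdet\t_\t_
2\tcat\tcat\tNOUN\t_\t_\t3\tnsubj\t_\t_
3\tsat\tsit\tVERB\t_\t_\t0\troot\t_\t_
"""

def parse_conllu(block):
    """Return a list of (id, form, head, deprel) tuples for one sentence."""
    rows = []
    for line in block.strip().splitlines():
        cols = line.split("\t")
        rows.append((int(cols[0]), cols[1], int(cols[6]), cols[7]))
    return rows

tokens = parse_conllu(conllu)
for tid, form, head, deprel in tokens:
    head_form = "ROOT" if head == 0 else tokens[head - 1][1]
    print(f"{deprel}({head_form}, {form})")  # e.g. det(cat, the)
```

Annotations in this form, available for many languages, are what makes the cross-linguistic comparisons described below possible.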

Studying Cross-linguistic Alignment and Divergence Patterns through Parallel Corpora. The development of the Universal Dependencies (UD) and UCCA annotation schemes provides a basis for in-depth statistical studies of cross-linguistic syntactic divergences based on data from parallel corpora. This constitutes an improvement over traditional feature-based studies that treat languages as vectors of categorical features (as languages are represented, e.g., in databases such as WALS or AutoTyp). However, existing studies are mostly based on summary statistics over parallel corpora, such as relative frequencies of different word-order patterns, and do not reflect the fine-grained cross-linguistic mappings that are important both for linguistic typology and for practical NLP applications. For example, this methodology cannot directly detect that English nominal compounds and nominal-modification constructions are often translated with Russian adjectival-modification constructions, or that English adjectival-modification and nominal-modification constructions routinely give rise to Korean relative clauses.
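The kind of fine-grained mapping described above can be sketched as follows: given relation labels on source and target tokens plus a word alignment, tally how source-language constructions are rendered in the target language. The data structures, labels, and example below are toy illustrations under simplifying assumptions (one relation label per token), not the project's actual methodology.

```python
from collections import Counter

# Hedged sketch: count how source-side UD relations map onto target-side
# relations across a word alignment. Toy data, not the project's format.

def divergence_counts(src_deprels, tgt_deprels, alignment):
    """alignment: iterable of (src_index, tgt_index) pairs."""
    counts = Counter()
    for s, t in alignment:
        counts[(src_deprels[s], tgt_deprels[t])] += 1
    return counts

# English "stone wall" (nominal compound) rendered with an adjectival
# modifier, as in the Russian pattern mentioned above.
src = ["compound", "root"]   # stone wall
tgt = ["amod", "root"]       # adjective + noun in the target language
counts = divergence_counts(src, tgt, [(0, 0), (1, 1)])
print(counts[("compound", "amod")])  # 1
```

Aggregating such counts over a word-aligned parallel treebank surfaces systematic divergence patterns rather than isolated translation choices.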

Preliminary work in Omri’s lab has manually word-aligned a subset of the Parallel Universal Dependencies corpus collection and conducted a quantitative and qualitative study based on it. The proposed project will not only extend the analysis to additional language pairs and to the use of UCCA categories, but also refine the representation with finer-grained distinctions, based on other sentence-level schemes such as AMR [24]. Moreover, the project will extend the analysis to include differences in the lexical semantics of the two languages, using an induced mapping between their distributional spaces.

Richer Mappings of Distributional Spaces across Languages. A complementary effort to studying semantic mappings across languages through parallel corpora is aligning the vector-space representations induced from monolingual data in each language. We will go beyond current approaches, which attempt to find a global mapping between distributional spaces, mostly in the form of orthogonal linear transformations. Instead, we will adopt a non-linear approach based on topological data analysis.
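The global linear baseline that the project aims to go beyond can be sketched in a few lines: given paired word vectors from two languages, the best orthogonal map between them has a closed-form solution via the orthogonal Procrustes problem. The code below uses synthetic data (a randomly rotated space) purely for illustration.

```python
import numpy as np

# Hedged sketch of the baseline approach: a single orthogonal linear map W
# between two embedding spaces, solved in closed form via SVD (the
# orthogonal Procrustes problem). Toy random data, not real embeddings.

def procrustes_map(X, Y):
    """Return orthogonal W minimizing ||X W - Y||_F for paired rows of X, Y."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))            # "source-language" vectors
R, _ = np.linalg.qr(rng.standard_normal((50, 50)))
Y = X @ R                                     # target space = rotated source
W = procrustes_map(X, Y)
print(np.allclose(X @ W, Y, atol=1e-6))       # True: the rotation is recovered
```

The limitation motivating the project is visible in the setup itself: a single global W assumes the two spaces differ by one rigid transformation, which real cross-linguistic semantic spaces need not satisfy.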

The project also studies the relation between syntactic and lexical differences between languages, with the goal of understanding how both types of differences shape the geometry and topology of the embedding spaces of different languages.

Supervision team

Hebrew University of Jerusalem supervisor:
Dr Omri Abend

University of Melbourne supervisor:
Dr Lea Frermann

First published on 31 August 2022.
