Improving the adaptability, reliability and robustness of LLMs.

Vision

The development of large language models (LLMs) has led to a significant shift in natural language processing (NLP) and artificial intelligence, owing to their unprecedented performance across a wide range of applications. The success of these models stems largely from their ability to encode large-scale data in billions of parameters. However, this progress has brought a set of significant new challenges that Regroupement 3 (R3) will tackle.

Objectives

Design solutions, both algorithmic and resource-based, that:

  • Facilitate adaptation of LLMs, especially at the intersection of interdisciplinary domains.
  • Allow modular architectures to embrace a diversity of solutions to solve a problem at hand.
  • Allow integration of diverse and expert knowledge into the models.
  • Increase interpretability and trustworthiness of the models.
  • Preserve and document low-resource and Indigenous languages, and foster language understanding in the context of applications relevant to the Indigenous communities.

Challenges

Finding a methodology to enhance LLMs’ robustness to tasks, domains, languages, cultures and other variables. Although LLMs are generalist agents capable of solving many tasks, they are often deployed in new domains for new tasks, and fine-tuning on a new domain is not always feasible. This notion of new domains extends to related shifts such as new languages, new cultures and out-of-distribution data.

Creating modular designs for LLMs that can exploit task-specific agents, APIs, modules, data points and other models, and creating reusable and interpretable architectures. LLMs and vision-language models (VLMs) can be regarded as generalist experts or as task-oriented experts specialized in particular modalities (language, images, videos). However, they are often built as monolithic models that are hard to adapt or integrate and do not allow for an easy decomposition of reasoning processes.
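As a toy illustration of the modularity this challenge targets, the sketch below (all names are hypothetical) registers task-oriented experts behind a common interface and routes requests to them, so a new expert can be plugged in without modifying a monolithic model:

```python
# Hypothetical sketch: a registry of interchangeable task-oriented experts
# and a router that dispatches to them. Each "expert" stands in for a
# specialized model or API; real systems would wrap actual LLM/VLM calls.
from typing import Callable, Dict

EXPERTS: Dict[str, Callable[[str], str]] = {}

def register(task: str):
    """Decorator that adds an expert module to the registry under a task name."""
    def wrap(fn: Callable[[str], str]) -> Callable[[str], str]:
        EXPERTS[task] = fn
        return fn
    return wrap

@register("summarize")
def summarizer(text: str) -> str:
    # Stand-in for a specialized summarization model.
    return text[:40] + "..."

@register("translate")
def translator(text: str) -> str:
    # Stand-in for a translation expert.
    return f"[fr] {text}"

def route(task: str, text: str) -> str:
    """Dispatch to the registered expert; new experts plug in without retraining."""
    return EXPERTS[task](text)
```

Because composition happens at the registry level rather than inside one model's weights, each module can be inspected, replaced or reused independently, which is the reusability and interpretability property the paragraph above argues for.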

Being able to decompose the knowledge required for a task, to reuse trustworthy knowledge from external knowledge bases, to design non-parametric memories as external modules, and to devise methods for choosing between parametric and non-parametric knowledge. A significant component of LLMs’ success is the knowledge encoded in their weights. However, this parametric knowledge also contributes to the adaptation challenges of LLMs and to their non-modularity: it leads to hallucinations, inaccurate and outdated knowledge, results that are hard to interpret, and the need to periodically retrain models on new domains and tasks.
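A minimal sketch of the parametric versus non-parametric choice described above (the function names and the toy knowledge base are hypothetical): the system prefers an editable external memory and falls back to the model’s weight-encoded answer only when the memory has no entry:

```python
# Hypothetical sketch: route a query to non-parametric knowledge (an external,
# updatable store) when available, else fall back to parametric knowledge
# (whatever the model encodes in its weights).

KNOWLEDGE_BASE = {  # stands in for an external KB: editable and auditable
    "capital of France": "Paris",
}

def parametric_answer(query: str) -> str:
    # Placeholder for the LLM's weight-encoded answer; may be stale or wrong.
    return f"<model guess for: {query}>"

def answer(query: str) -> str:
    """Prefer up-to-date KB facts; fall back to parametric knowledge."""
    if query in KNOWLEDGE_BASE:
        return KNOWLEDGE_BASE[query]   # non-parametric: can be edited in place
    return parametric_answer(query)    # parametric: requires retraining to update
```

Updating the KB entry immediately changes the system’s answer, whereas fixing a parametric answer would require retraining — which is why non-parametric memories help with the outdated-knowledge and hallucination issues mentioned above.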

Investigating methods to enhance the trustworthiness of the responses generated by LLMs, and exploring the design of interpretable and controllable embeddings. One of the main reasons for creating modular multi-agent architectures with external memories is to improve the interpretability of results and the safety of LLMs, and to provide a human-computer interface that increases the general public’s confidence in AI-based models.

Developing techniques and methodologies for Indigenous, endangered and lower-resource languages and language varieties, preventing the erasure of language varieties, and promoting accessible, collaborative computational tools for organizing, searching and managing language data for these languages.

Research Axes

Theme 1: LLM Adaptation and Robustness 

This theme will investigate retrieval-augmented domain adaptation and cultural adaptation for LLMs and VLMs.

  • Axis 1: Domain Adaptation Methods
  • Axis 2: Cultural Adaptation
  • Axis 3: Evaluation Benchmark

Theme 2: Modular LLMs 

This theme will focus on composition, inference, and learning methods for exploiting modularity in LLMs.

  • Axis 1: Integrating Experts
  • Axis 2: Learning Task Decomposition
  • Axis 3: Diversity through LLM Ensembles

Theme 3: Non-parametric LLMs/VLMs 

This theme will explore how non-parametric memories based on knowledge bases (KBs) can be designed and leveraged to provide LLMs with up-to-date knowledge; capabilities for searching, editing and caching knowledge; knowledge aggregation for tasks that require combining knowledge from several sources; and interpretability mechanisms.

  • Axis 1: Non-parametric Memory Structures for LLMs
  • Axis 2: Task Decomposition and Knowledge Integration Techniques
  • Axis 3: Development of a New NLP Benchmark

Theme 4: LLM Interpretability and Safety 

This theme will improve the interpretability and trustworthiness of machine learning models and design novel interpretability methods for them.

  • Axis 1: Text Embeddings
  • Axis 2: Making LLMs more trustworthy

Theme 5: Indigenous and Low-Resource Languages

This theme will explore low-resource techniques and methodologies for processing language varieties, and develop collaborative generative databases for understudied languages to promote linguistic diversity and prevent language varieties from being erased.

  • Axis 1: Adaptation between Language Varieties
  • Axis 2: Cultural and Social Contexts in LLMs
  • Axis 3: Encoding Prior Knowledge for Linguistic Discovery

Anticipated Impact

  • LLMs/VLMs are widely used in many research fields. R3 aims to give these groups of researchers the agency to determine what it means for an LLM to perform well, or be useful, for their set of problems. Success in this proposal would lead to wider adoption of LLMs across research fields.
  • Modularity enables those with limited resources (including small companies and start-ups) to make significant contributions to state-of-the-art systems by focusing on building specific tools or modules, and enabling the exploitation of existing pre-trained LLMs. It also has the potential to improve training efficiency and reduce environmental impact.
  • This proposal directly addresses knowledge hallucination issues by designing methods that constrain both the knowledge usable by an LLM and the generation methods of the LLMs based on non-parametric memories, thus leading to non-parametric LLMs.
  • Improving the interpretability and explainability of models and LLMs contributes to improving their quality (e.g., by understanding their current failures), usefulness (e.g., making them easier to use in high-stakes domains), and transparency.
  • Support communities trying to preserve and revive endangered languages by providing them with resources not only to document and record their languages, but also to design lessons for future generations.

Research Team

Co-leaders

Siva Reddy
McGill University
Amal Zouaq
Polytechnique Montréal

Researchers

Research Advisor

Danielle Maia de Souza: danielle.maia.de.souza@ivado.ca