Smart Language Understanding for Better Access to Knowledge and Communication

This dedicated page presents one of the strategic themes/areas currently under discussion as part of our Strategic Research Funding Program. All themes/areas under discussion are listed on the program page. Each dedicated page (including this one) may be at a varying level of detail and maturity. To take part in the discussion on this theme or to propose new ones, please use this form. If you would like to be kept informed of developments around this theme/area, sign up below.

Description and rationale of the area

Natural language processing (NLP) is an ever-growing field at the intersection of linguistics, artificial intelligence and machine learning. It covers problems that are both scientifically challenging and commercially important, such as machine translation, question answering, text summarization, information retrieval (e.g., search engines) and dialogue agents, commonly known as chatbots. Although impressive progress has been made in NLP, general language understanding is widely considered to be among the most difficult tasks in AI. For example, it is still difficult for a system to capture the semantic information described in a text and to reason about it using both common and domain-specific knowledge. The capability for language understanding is key to smart information access, knowledge extraction and human-system interactions.

The project will cover foundational aspects of NLP related to language representation, language generation, knowledge extraction and reasoning, question answering, and dialogue systems. We will develop new models to address fundamental challenges faced by our industrial partners regarding how language is represented and encoded in their NLP pipelines. In particular, we will strive to enable faster adaptation to the needs of a specific domain, and to better support multilingual input (including Canadian French) and rare words. Our industrial partners also require dialogue systems that are capable of leveraging and manipulating structured knowledge bases. New deep learning models that can reason jointly over text and knowledge bases will be developed. This is the basis for the democratization of access to the enormous amount of knowledge stored on the web and in our partners’ resources.

(Added 22/07) This project is strategic for Canadian NLP companies, because these technologies are becoming increasingly important in many services. To remain competitive, Canadian companies must therefore be able to develop and apply advanced NLP technologies. Because the field is currently booming, because companies lack specialized personnel, and so that Canada can benefit from these new technologies, this type of collaborative research project is essential to ensure innovation and effective technology transfer between universities and companies.

Canada has established itself as a leader in AI, machine learning, and deep learning, thanks in part to previous investments in time and effort in these areas by our government and researchers. In Montréal, in particular, AI-based companies are fuelling a local boom in the economy, with academic researchers at its core. Recent investments have increased the capacity of fundamental research in AI as measured by the number of faculty members. Montréal is also a well-known hub in Canada for research in NLP. World-class researchers are conducting innovative work in various areas of NLP and contributing to industrial projects through their individual collaborations with industry. Now is the time to make a sustained effort and to invest in order to successfully translate this academic strength into an enduring AI/NLP ecosystem that will include industry. This proposal, aimed in particular at the NLP applications of advances in deep learning, is precisely focused on accelerating the development of algorithms and techniques that would immediately benefit local companies whose focus and products rely on natural language interaction. In turn, this research direction will expedite the implementation of these novel algorithms into viable, commercial products.

The industrial benefits of this project are undeniable for the partners working in this field. Indeed, our proposal allows direct contact with a diversified team of outstanding researchers who can address industry’s concerns and help partners keep their business goals in line with the rapidly evolving field of NLP. These companies will thus be able to improve the quality of their products and services and contribute to Canada’s economic development. In some cases, it will help them to stay relevant in a global information world. We believe that this research project will lead to the development of improved or wholly new products that will in time open up new markets, thus increasing the number of high-value employment opportunities in Canada.

Montreal has one of the largest academic research communities in the world, and the proposed project will complement and augment ongoing NLP initiatives in Montreal and in Quebec. Most prominently, Montreal is home to the Mila AI Institute of Quebec, headed by Yoshua Bengio and one of the largest academic deep learning labs in the world, as well as IVADO, which aims to bring together industry professionals and academic researchers to develop cutting-edge expertise in machine learning. For example, IVADO’s bank of expertise lists 73 professors in Quebec with publications in Natural Language Processing and/or Information Retrieval. Researchers affiliated with both of these institutes have expertise and significant interest in developing next-generation NLP technologies. Until now, they have been solicited individually for various projects with industry. This project will provide a cohesive unit to organize these NLP-related efforts and develop fundamental techniques that can be shared, as well as critical funds for training the next generation of NLP researchers. This project will also provide a concrete channel to coordinate knowledge transfer to industry, in collaboration with Mila and IVADO. We have already laid the groundwork for these collaboration and transfer activities by organizing monthly meet-ups that present current research advances on specific NLP topics to the community. We are also creating an NLP research consortium that brings researchers and industry partners together in an informal organisation to share and influence research activities in this important area.

LEGAL/CYBER-JUSTICE: One important benefit of improved access to knowledge and its dissemination through NLP is increased access to justice, including social justice. For instance, language understanding of legal documents can help citizens understand the law and predict their chances of winning a civil case (e.g., consumer law, housing law, employment, discrimination). The group will work closely with the Cyberjustice lab of the Law Faculty at the Université de Montréal. Research axes will include the extraction of pertinent information from judicial texts and decisions, improvement of information retrieval, text anonymisation, and multi-label classification for legal data (with Lexum), as well as answering citizens’ legal questions.

FINANCE/INSURANCE: Language understanding also allows the analysis of financial documents to detect fraudulent activities (in collaboration with the Autorité des Marchés Financiers, the organization responsible for financial regulation in Québec). Other fruitful collaborations include analysis of public companies’ financial statements in order to evaluate their conformity with ESG (Environment, Social, Governance) principles, collaboration with OBVIA and Fin-ML (with Manuel Morales), analysis of insurance contracts and regulations (with LexRock AI), and chatbots applied to insurance (with Koïos Intelligence).

(Added 22/07) The unstructured data held by the AMF (forms, reports and other filings) contains a great deal of information that could be further exploited through natural language processing, which could lead to significant productivity gains.

HEALTH & MEDICAL: Medical knowledge is one of the most socially important repositories of information, deserving special care for faithful retrieval and dissemination. Our proposal includes the development of telemedicine chatbots with Dialogue, the #1 provider of virtual care in Canada. Other applications include nutrition science, information access and recommendation for mental health problems (with Myelin Solutions), access to knowledge on COVID-19 (MIMS), and using information in medical records to identify a cohort of patients with given medical conditions for medical research (Imagia Cybernetics).

(Added 22/07) In the medical field, there have been excellent developments on ontologies for reports written in English, but very few for French. Most health projects therefore have an interest in understanding the written records that document every diagnosis, treatment and follow-up. This represents more than 50% of Quebec’s GDP. Without a good understanding of these reports, it is difficult to use AI models to optimize care, spending and patient outcomes.

TRAINING & HUMAN-COMPUTER INTERACTION: NLP and language understanding are invaluable when imparting knowledge to students, an obvious application of knowledge dissemination. Our proposal would benefit from collaboration with Technologies Korbit, an online learning platform, and with CAE, for technical training pertaining to aircraft and healthcare.

MARKETING AND E-COMMERCE: NLP is often used to gain a qualitative understanding of the “why” and “what” of a situation, and enables users to make more insightful decisions. In marketing analytics, NLP can be used to understand the audience’s intentions so that marketers can create smarter, more efficient marketing strategies. NLP also plays an important role in e-commerce and recommendation: understanding users’ behavior and preferences from text data is critical for recommending the right products to users (with Coveo, Keatext).

SEARCH ENGINE: Search engine techniques rely heavily on language understanding, which remains shallow in current systems. Deeper understanding of documents and queries allows us to find information that is more semantically relevant (with Coveo).

ETHICS AND SOCIETY: Detection and correction of bias in NLP models is a more recent and important field of research, essential for an inclusive and equitable way of using AI (OBVIA). We also propose the detection of disinformation (with Société Radio-Canada) and fake news more generally.

Keywords:

NLP, Natural Language Processing, Information Retrieval, Knowledge Extraction, Knowledge Modeling, Reasoning, Text Summarization, Question Answering, Dialogue Agents (Chatbot), Language Generation, Deep Learning.
Added 22/07: chatbots, conversational AI, rule-mining, sentiment analysis, aspect mining, LM-based feature extraction, knowledge modelling, explainable AI, NLP, Dialogue, Chatbot, Question Answering
Added 22/07: health, biomarkers, medications
Added 22/07: knowledge graph, healthcare, dialogue systems, question-answering, chatbot
Added 22/07: semantic triple, semantic map, knowledge engineering, concepts, decision-making, unsupervised learning
Added 22/07: societal impacts; ethics; diversity, inclusion, equity, human/machine relationship

Relevant organizations:

Industry: Airudi, Autorité des Marchés Financiers (AMF), Bciti, Bombardier, CAE, Coveo, Desjardins, Dialogue, Druide informatique, Innodirect, Irosoft, Keatext, Koïos Intelligence, LexRockAI, Lexum, Makila, Myelin, National Bank of Canada, Nu Echo, Société Radio-Canada, Technologies Korbit.

International collaborators: Center for Intelligent Information Retrieval (CIIR) at University of Massachusetts, MLIA (Machine Learning and deep learning for Information Access) at LIP6.

(Added 22/07) DHDP, Autorité des marchés financiers, Intact Corporation Financière, SOQUIJ, (framework agreement already in place with OBVIA) Ministry of Justice, university hospitals, Institut d’intelligence et données (https://iid.ulaval.ca/) and possible international partners who collaborate with OBVIA.

Relevant people suggested during the consultation:

The following names were proposed by the community, and the people listed below have agreed to have their names displayed publicly. Note, however, that the names of all professors (whether or not they are displayed publicly on our website) will be forwarded to the advisory committee for the step of identifying and selecting strategic themes. Note also that people identified during the consultation step are not guaranteed to receive a share of the funding. This step is primarily intended to present an overview of the field, including the relevant people, and not to assemble teams for the framework programs.

Jackie C. K. Cheung
Jian Tang
Laurent Charlin
Sarath Chandar
Simona Maria Brambati
Yacine Benahmed
Yoshua Bengio
Amal Zouaq
Hyunhwan Aiden Lee
Laurette Dubé
Manuel Morales
Martin Vallières
Jian-Yun Nie
Siva Reddy
Abdoulaye Baniré Diallo
Roxane de la Sablonnière
Éric Lacourse

Potential framework programs

The ultimate goal of the proposed project is to better understand and use human languages to improve language-related applications involving information access (retrieval) and extraction, question answering and dialogue. Natural language is the main means by which human beings describe and transmit information and knowledge, so a smart system that deals with language data and interacts with human users must have a capability similar to that of humans to understand the meaning of language and to reason about it. The proposed project will address various aspects of NLP, from fundamental problems of language and knowledge modeling, information matching and reasoning, to practical applications in information retrieval, question answering and dialogue. A figure illustrating how these aspects are interconnected is available in the PDF version of this proposal. (See section “Remarques additionnelles confidentielles” at the end for the link to the PDF.)

The whole project is organized around 4 interconnected themes:

Knowledge extraction
Knowledge modeling
Question answering
Dialogue and chatbots

The first two themes aim at understanding the knowledge described in texts and formalizing it in a model. The last two focus on general application scenarios where a user interacts with the system for different purposes, typically to satisfy some form of information need. While these themes are strongly connected, we will form different sub-groups of researchers focusing on each of the themes. Each theme will be led by 2 co-leads, who drive the research on the theme and synchronize the research work. Jian-Yun Nie will act as the lead for the whole project.

Language is a common medium used to describe knowledge. In many cases, much of the knowledge found in a text is of general interest and can be used in many NLP tasks. For example, the knowledge that “Montreal is a city in Quebec” can help answer questions about cities in Quebec. Knowledge extraction aims to detect new pieces of knowledge in texts and to express them in a form usable in different applications. This can be done either in a specific domain or in an open-domain setting. The extracted knowledge will complement existing knowledge graphs such as Freebase, YAGO, ConceptNet or UMLS in medicine. We plan to extract knowledge from various sources: general texts (e.g., webpages), specialized documents (e.g., in medicine) or user-generated content on social media.
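
As an illustration only, the “Montreal is a city in Quebec” example can be turned into machine-usable triples with a hand-written pattern. This is a minimal sketch and not the approach proposed in the project: the single regular expression, the relation names `instance_of` and `located_in`, and the function `extract_triples` are all hypothetical choices made for this example.

```python
import re
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

# Hypothetical pattern for sentences of the form "<Name> is a <type> in <Name>".
# A name is one or more capitalized words; the type is a single lowercase word.
NAME = r"[A-Z][\w-]*(?: [A-Z][\w-]*)*"
PATTERN = re.compile(rf"(?P<subj>{NAME}) is an? (?P<type>[a-z]+) in (?P<obj>{NAME})")


def extract_triples(text: str) -> List[Triple]:
    """Return two triples per matched sentence: a type fact and a location fact."""
    triples: List[Triple] = []
    for match in PATTERN.finditer(text):
        subj, kind, obj = match.group("subj"), match.group("type"), match.group("obj")
        triples.append((subj, "instance_of", kind))   # assumed relation name
        triples.append((subj, "located_in", obj))     # assumed relation name
    return triples


if __name__ == "__main__":
    for triple in extract_triples("Montreal is a city in Quebec."):
        print(triple)
```

A single pattern like this only covers one sentence shape; it is meant to show the target representation (triples that can be added to a knowledge graph), not a realistic extractor.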

During knowledge extraction, it is important to assess the validity of each piece of information and knowledge, so that false information is not propagated through inference. Therefore, special attention will be paid to the detection of misinformation and to measuring the trustworthiness of the knowledge.

Note (added 22/07): The temporal aspect is important in several fields, including health. It is therefore important to be able to identify the order and evolution of certain concepts in reports.

Sub-themes:

Knowledge extraction from texts
Knowledge extraction from social media
Fake news detection and trustworthiness
Discourse analysis

To be usable, knowledge must be formalized. This means not only describing knowledge in a standard format, but also creating an appropriate representation for it. The project will investigate the common formalism of knowledge graphs (interconnected entities), and try to push it further toward a better organized semantic web. The elements included in such a knowledge graph or semantic web will be encoded in a model. Graph neural networks (GNNs) are commonly used for this task. In this project, we aim to create GNNs that are suitable for different NLP tasks. In particular, we will incorporate GNNs in NLP tasks such as QA to enable multi-hop inference based on knowledge. Beyond knowledge graphs, large pre-trained language models such as BERT and GPT also encode some general knowledge in addition to language regularities. We will study inference in NLP tasks that combine knowledge graphs and pre-trained language models.
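
To make the idea of multi-hop inference over a knowledge graph concrete, here is a minimal sketch that uses plain breadth-first search over triples. It is a symbolic stand-in chosen for readability, not a graph neural network and not the project’s method; the toy graph, the relation name `located_in` and the helpers `build_index` and `find_path` are assumptions made for this example.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def build_index(triples: List[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Map each head entity to its outgoing (relation, tail) edges."""
    index: Dict[str, List[Tuple[str, str]]] = {}
    for head, relation, tail in triples:
        index.setdefault(head, []).append((relation, tail))
    return index


def find_path(index: Dict[str, List[Tuple[str, str]]], start: str, goal: str,
              max_hops: int = 3) -> Optional[List[Triple]]:
    """Breadth-first search for a chain of at most max_hops triples from start to goal."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) == max_hops:
            continue  # hop budget used up on this branch
        for relation, tail in index.get(node, []):
            if tail not in visited:
                visited.add(tail)
                queue.append((tail, path + [(node, relation, tail)]))
    return None


# Toy graph (illustrative facts only).
KG = [
    ("Mila", "located_in", "Montreal"),
    ("Montreal", "located_in", "Quebec"),
    ("Quebec", "located_in", "Canada"),
]

if __name__ == "__main__":
    # Answering "Is Mila in Canada?" requires chaining three facts.
    print(find_path(build_index(KG), "Mila", "Canada"))
```

Each returned triple is one hop of the inference chain, which also makes the answer easy to explain to a user.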

Sub-themes:

Knowledge modeling
Semantic web
Neural and formal reasoning
Using knowledge in different tasks (text reasoning, QA, dialogue, etc.)

Search engines (or information retrieval, IR) are part of our everyday life, allowing us to find information quickly from a large repository of texts (e.g., the Web). However, they are limited to locating the documents that may contain the relevant information. In many cases, users are interested in finding the precise answer to a specific question. Question answering (QA) goes a step further to fulfill this need. Compared to IR, QA requires a deeper understanding of the user’s question and document contents, as well as a more intelligent match between the question and the potential answers. As in human question answering, understanding the question and document contents requires capturing fine-grained semantics and exploiting all the available contextual information, while a more intelligent matching requires inferences based on the available knowledge. These requirements raise several key questions which have not yet been well answered in the literature: how to represent questions and documents, and how to perform inference to determine the answers?

QA can be implemented on different sources: a knowledge graph, a large collection of texts, or a repository of answered questions (e.g., community QA). While the form of inference may differ, these settings share a common need for deeper, fine-grained representations and enhanced reasoning capability in question-answer matching.
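
For the community QA setting, the simplest baseline is to match a new question against previously answered ones by word overlap. The sketch below uses bag-of-words cosine similarity; it is exactly the kind of shallow matching the project aims to go beyond, and it is shown only to make the question-answer matching step concrete. The sample question-answer pairs and the names `tokenize`, `cosine` and `best_answer` are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def best_answer(question: str, qa_pairs: List[Tuple[str, str]]) -> Tuple[float, str]:
    """Return (score, answer) for the stored question most similar to the new one.

    Assumes qa_pairs is non-empty.
    """
    q_vec = Counter(tokenize(question))
    scored = [(cosine(q_vec, Counter(tokenize(q))), a) for q, a in qa_pairs]
    return max(scored, key=lambda item: item[0])


# Hypothetical repository of already answered questions.
QA_PAIRS = [
    ("Which province is Montreal in?", "Quebec"),
    ("What does NLP stand for?", "Natural language processing"),
]

if __name__ == "__main__":
    score, answer = best_answer("In which province is Montreal located?", QA_PAIRS)
    print(f"{answer} (similarity {score:.2f})")
```

A paraphrase with no shared words (for example “Where is Montreal?” phrased with synonyms) would score zero here, which illustrates why deeper representations are needed.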

Sub-themes:

Question answering from texts and from knowledge graphs
Passage retrieval and machine reading comprehension
Community QA
Question understanding, intent detection
Multilingual and Multimodal QA

Humans interact naturally through multi-turn dialogues. Most current systems for information and knowledge access, such as search engines, are designed for one-shot interactions. This limited form of interaction is not fully natural, and limits the ability of users to express precisely what is needed. In contrast, in human interactions, questions become more and more precise over the turns of a dialogue. Under this theme, we will study conversation-based human-system interactions in both task-oriented and open-domain application contexts. Beyond merely generating a response that seems natural, as in chitchat, we will perform document- and knowledge-grounded dialogue, so that the dialogue contains useful substance for the user and follows the general or domain-specific knowledge that the user may expect. Natural dialogues are also rich and heterogeneous in terms of intents: one may mix chitchat with a search intent, a buying intent or a need for mental health support. Detecting the underlying intent will be crucial in such a context, allowing us to trigger the appropriate process to build the response. Such a dialogue system can be used in multiple application contexts: conversational search, multi-turn question answering, conversational recommendation, and chatbots with emotional goals.
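
The step of detecting an intent and then triggering the matching response process can be sketched as a small router. The keyword lists below are a deliberately naive stand-in for a learned intent classifier, and the intent names, keywords and handler functions are all hypothetical, chosen only to show the control flow.

```python
import re
from typing import Callable, Dict, Set

# Hypothetical keyword lists; a real system would use a trained classifier.
INTENT_KEYWORDS: Dict[str, Set[str]] = {
    "search": {"find", "search", "where", "what"},
    "purchase": {"buy", "order", "price"},
}


def detect_intent(utterance: str) -> str:
    """Pick the intent with the most keyword hits; fall back to chitchat."""
    words = set(re.findall(r"[a-z']+", utterance.lower()))
    best_intent, best_hits = "chitchat", 0
    for intent, keywords in INTENT_KEYWORDS.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best_intent, best_hits = intent, hits
    return best_intent


# Placeholder handlers: each stands for a separate response-building process.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "search": lambda u: f"[search] looking up documents for: {u}",
    "purchase": lambda u: f"[purchase] starting a recommendation flow for: {u}",
    "chitchat": lambda u: "[chitchat] generating an open-domain reply",
}


def respond(utterance: str) -> str:
    """Route the utterance to the handler registered for its detected intent."""
    return HANDLERS[detect_intent(utterance)](utterance)


if __name__ == "__main__":
    for turn in ["Hello there!", "Where can I find the report?", "I want to buy a laptop"]:
        print(respond(turn))
```

Keeping detection and handling separate means the keyword rule can later be swapped for a learned model without touching the handlers.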

Sub-themes:

Retrieval-based and generation-based dialogue
Task-oriented, chitchat or open-domain dialogue
Document- and knowledge-grounded dialogue
Dialogue based on persona, goal and emotion
Conversational IR and recommendation
Evaluation methods

(no additional documentation at this time)

Interested? Enter your email to receive updates related to this page:

History

July 13, 2021: First version

July 15, 2021: Relevant people added

July 22, 2021: Additional information added to the “Description”, “Fields of application” and Context sections

August 17, 2021: Relevant people added
