Technology

Developing intelligent and high-performing natural language AI in multiple languages

In recent years, advances in deep learning have driven a leap in natural language processing, enabling computers to solve sophisticated tasks that were previously out of reach. For example, the leaderboards of SQuAD (a dataset that measures an AI's ability to comprehend text) and GLUE (a suite of tasks designed to measure an AI's ability to understand text across various genres) show AI models outperforming humans.

At the center of such deep-learning-based natural language processing are language understanding models trained on vast amounts of text data. It is fair to say that the performance of the language understanding model defines the performance of the deep learning model as a whole.

At Studio Ousia, we have continuously developed advanced language understanding models using the vast amount of real-world knowledge stored in Wikipedia. In 2018, as a result of this research, we published Wikipedia2Vec, a multilingual model available in 12 languages including English, Japanese, and Chinese. Wikipedia2Vec is trained on both the text and the links in Wikipedia entries, so it encodes the knowledge in Wikipedia efficiently. Because Wikipedia is available in many languages around the world, we have been able to build high-performance models in many languages.
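
Wikipedia2Vec is also released as an open-source Python package, so the model can be tried directly. The following is a minimal sketch, assuming one of the pretrained model files distributed on the Wikipedia2Vec website (the file name below is a placeholder) and illustrative entity names:

    from wikipedia2vec import Wikipedia2Vec

    # Load a pretrained model (placeholder file name; pretrained models
    # in 12 languages are distributed on the Wikipedia2Vec website).
    wiki2vec = Wikipedia2Vec.load('enwiki_20180420_300d.pkl')

    # Words and Wikipedia entities are embedded in the same vector space.
    word_vector = wiki2vec.get_word_vector('tokyo')
    entity_vector = wiki2vec.get_entity_vector('Tokyo')

    # List the items most similar to a given entity.
    for item, score in wiki2vec.most_similar(wiki2vec.get_entity('Tokyo'), 5):
        print(item, score)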

As well as being a core component of our own products, Wikipedia2Vec has been widely adopted by industry and academia around the world, and the academic paper we published in 2016 has been cited over 170 times. For example, the world-renowned investment firm BlackRock announced that it used Wikipedia2Vec to build an entity detection system that is a crucial part of its finance platform. In 2019, Ludwig-Maximilians-Universität München and Siemens showed that a model combining Wikipedia2Vec with BERT, the language understanding model developed by Google, outperformed other models. Wikipedia2Vec has also been used by state-of-the-art systems in a variety of fields: analyzing movie stories, text analysis, extracting information from healthcare documents, and completing missing information in knowledge bases.

The following is a visualization of the entity vectors learned by Wikipedia2Vec, projected into a low-dimensional space.
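
On the original page this appears as an image. As a rough sketch of how such a plot can be reproduced, assuming the scikit-learn and matplotlib packages, a handful of illustrative entities, and the same placeholder model file as above:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE
    from wikipedia2vec import Wikipedia2Vec

    wiki2vec = Wikipedia2Vec.load('enwiki_20180420_300d.pkl')  # placeholder

    # Illustrative entities to project; any Wikipedia entries would work.
    entities = ['Tokyo', 'Kyoto', 'Paris', 'London', 'Berlin', 'Mount Fuji']
    vectors = np.array([wiki2vec.get_entity_vector(e) for e in entities])

    # t-SNE projects the high-dimensional embeddings into 2D for plotting.
    points = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(vectors)

    plt.scatter(points[:, 0], points[:, 1])
    for (x, y), name in zip(points, entities):
        plt.annotate(name, (x, y))
    plt.show()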

We also actively put our research to the test in competitions and on leaderboards, winning 1st place four times and 2nd place once in competitions held at world-renowned academic conferences. In 2017, we entered the Quiz Bowl competition held at NIPS, the world's largest academic AI conference, where our system outperformed the other AI models and went on to beat a human team of all-American quiz champions in a landslide, 465 to 200.

At Studio Ousia, we will continue to develop advanced language understanding models in various languages and challenge ourselves to solve real-world business problems with these high-performing models.

Open Source Software

  • Wikipedia2Vec: a tool for obtaining embeddings of words and entities from Wikipedia.
  • mprpc: a lightweight MessagePack RPC library.
  • mojimoji: a fast converter between Japanese hankaku (half-width) and zenkaku (full-width) characters; see the usage sketch after this list.
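
As a quick illustration of the last item, a minimal mojimoji sketch (the sample strings are illustrative):

    import mojimoji

    # Convert full-width (zenkaku) characters to half-width (hankaku).
    print(mojimoji.zen_to_han('アイウａｂｃ１２３'))  # -> ｱｲｳabc123

    # Convert back; keyword flags restrict which character classes change.
    print(mojimoji.han_to_zen('ｱｲｳabc123', digit=False))  # digits stay half-width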

Papers

EASE: Entity-Aware Contrastive Learning of Sentence Embedding

Sosuke Nishikawa (Studio Ousia/The University of Tokyo), Ryokan Ri (Studio Ousia/The University of Tokyo), Ikuya Yamada, Yoshimasa Tsuruoka (The University of Tokyo), Isao Echizen (National Institute of Informatics)
Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2022

Global Entity Disambiguation with BERT

Ikuya Yamada, Koki Washio (Megagon Labs), Hiroyuki Shindo (NAIST, RIKEN), Yuji Matsumoto (RIKEN)
Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2022

mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models

Ryokan Ri (Studio Ousia/The University of Tokyo), Ikuya Yamada, Yoshimasa Tsuruoka (The University of Tokyo)
Annual Meeting of the Association for Computational Linguistics (ACL), 2022

Efficient Passage Retrieval with Hashing for Open-domain Question Answering

Ikuya Yamada, Akari Asai (University of Washington), Hannaneh Hajishirzi (University of Washington, Allen Institute for AI)
The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP), 2021

LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention

Ikuya Yamada, Akari Asai (University of Washington), Hiroyuki Shindo (NAIST), Hideaki Takeda (NII), Yuji Matsumoto (RIKEN AIP)
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020

Wikipedia2Vec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from Wikipedia

Ikuya Yamada, Akari Asai (University of Washington), Jin Sakuma (The University of Tokyo), Hiroyuki Shindo (NAIST), Hideaki Takeda (NII), Yoshiyasu Takefuji (Keio University), Yuji Matsumoto (RIKEN AIP)
Conference on Empirical Methods in Natural Language Processing (EMNLP), system demonstrations, 2020

Neural Attentive Bag-of-Entities Model for Text Classification

Ikuya Yamada, Hiroyuki Shindo (NAIST)
The SIGNLL Conference on Computational Natural Language Learning (CoNLL), 2019

Trick Me If You Can: Human-in-the-loop Generation of Adversarial Examples for Question Answering

Eric Wallace (U. Maryland), Pedro Rodriguez (U. Maryland), Shi Feng (U. Maryland), Ikuya Yamada, Jordan Boyd-Graber (U. Maryland)
Transactions of the Association for Computational Linguistics (TACL), 2019

Representation Learning of Entities and Documents from Knowledge Base Descriptions

Ikuya Yamada, Hiroyuki Shindo (NAIST), Yoshiyasu Takefuji (Keio)
International Conference on Computational Linguistics (COLING), 2018

Studio Ousia's Quiz Bowl Question Answering System

Ikuya Yamada, Ryuji Tamaki, Hiroyuki Shindo (NAIST), Yoshiyasu Takefuji (Keio)
First NIPS ’17 Competition, The Springer Series on Challenges in Machine Learning, 2018

Learning Distributed Representations of Texts and Entities from Knowledge Base

Ikuya Yamada, Hiroyuki Shindo (NAIST), Hideaki Takeda (NII), Yoshiyasu Takefuji (Keio)
Transactions of the Association for Computational Linguistics (TACL), 2017

Segment-Level Neural Conditional Random Fields for Named Entity Recognition

Motoki Sato (NAIST), Hiroyuki Shindo (NAIST), Ikuya Yamada, Yuji Matsumoto (NAIST)
International Joint Conference on Natural Language Processing (IJCNLP), 2017

Named Entity Disambiguation for Noisy Text

Yotam Eshel (Technion), Noam Cohen (Technion), Kira Radinsky (Technion, eBay), Shaul Markovitch (Technion), Ikuya Yamada, Omer Levy (University of Washington)
The SIGNLL Conference on Computational Natural Language Learning (CoNLL), 2017

Ensemble of Neural Classifiers for Scoring Knowledge Base Triples

Ikuya Yamada, Motoki Sato (NAIST), Hiroyuki Shindo (NAIST)
WSDM Cup (Cambridge, UK), 2017

Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation

Ikuya Yamada, Hiroyuki Shindo (NAIST), Hideaki Takeda (NII), Yoshiyasu Takefuji (Keio)
The SIGNLL Conference on Computational Natural Language Learning (CoNLL) (Berlin, Germany), 2016, pp.250-259

Enhancing Named Entity Recognition in Twitter Messages Using Entity Linking

Ikuya Yamada, Hideaki Takeda (NII), Yoshiyasu Takefuji (Keio)
ACL 2015 Workshop on Noisy User-generated Text (Beijing, China), 2015, pp.136-140
(Shared task winner)

An End-to-End Entity Linking Approach for Tweets

Ikuya Yamada, Hideaki Takeda (NII), Yoshiyasu Takefuji (Keio)
WWW 2015 Workshop on Making Sense of Microposts (Florence, Italy), 2015, pp.55-56
(Competition winner)

Evaluating the Helpfulness of Linked Entities to Readers

Ikuya Yamada, Tomotaka Ito, Shinsuke Takagi, Shinnosuke Usami, Hideaki Takeda (NII), Yoshiyasu Takefuji (Keio)
25th ACM Conference on Hypertext and Social Media (Santiago, Chile), 2014, pp.169-178