Developing intelligent and high-performing natural language AI in multiple languages

In recent years, advances in deep learning techniques have led to a jump in natural language processing capability, enabling computers to solve sophisticated tasks that were previously out of reach. For example, the leaderboards of SQuAD (a dataset that measures an AI’s ability to comprehend text) and GLUE (a set of tasks designed to measure an AI’s ability to understand text from various genres) show AI systems scoring above the human baselines.

At the center of natural language processing based on deep learning are language understanding models trained on vast amounts of text data. The performance of the language understanding model largely determines the performance of the deep learning system built on top of it.

At Studio Ousia, we have continuously developed advanced language understanding models using the vast amount of real-world data stored in Wikipedia. As a result of this research, in 2018 we published Wikipedia2Vec, a multilingual language model available in 12 languages, including English, Japanese and Chinese. Wikipedia2Vec is trained on the text and links in Wikipedia’s entries, so it can encode the knowledge in Wikipedia efficiently. Because Wikipedia is available in many languages around the world, we have been able to build high-performance models in many languages.
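To make "trained on the text and links" more concrete, the following is a minimal sketch of how link-annotated text could be turned into skip-gram-style training pairs. It is an illustration under stated assumptions, not Wikipedia2Vec's actual code: the function name `extract_pairs`, the `window` parameter, whitespace tokenization, and the MediaWiki-style `[[Title|anchor]]` input format are all hypothetical choices made for this example.

```python
import re

# Matches MediaWiki-style links: [[Target]] or [[Target|anchor text]].
LINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def extract_pairs(wikitext, window=2):
    """Return (word_pairs, entity_pairs) for one link-annotated sentence.

    Hypothetical helper for illustration only; tokenization is a plain
    whitespace split, so punctuation is not handled.
    """
    tokens = []   # lowercased words, including anchor-text words
    anchors = []  # (entity title, start token index, end token index)
    pos = 0
    for m in LINK_RE.finditer(wikitext):
        tokens.extend(wikitext[pos:m.start()].lower().split())
        title = m.group(1).strip()
        anchor_words = (m.group(2) or m.group(1)).lower().split()
        anchors.append((title, len(tokens), len(tokens) + len(anchor_words)))
        tokens.extend(anchor_words)
        pos = m.end()
    tokens.extend(wikitext[pos:].lower().split())

    # Word-context pairs: each word is paired with its neighbors.
    word_pairs = [
        (w, tokens[j])
        for i, w in enumerate(tokens)
        for j in range(max(0, i - window), min(len(tokens), i + window + 1))
        if j != i
    ]

    # Entity-context pairs: a linked entity is paired with the words
    # surrounding its anchor text.
    entity_pairs = []
    for title, start, end in anchors:
        left = tokens[max(0, start - window):start]
        right = tokens[end:end + window]
        entity_pairs.extend((title, w) for w in left + right)
    return word_pairs, entity_pairs


word_pairs, entity_pairs = extract_pairs("[[Tokyo]] is the capital of [[Japan]]")
print(entity_pairs)
```

Because words and entities are paired with the same context words, a model trained on both kinds of pairs would place them in one shared vector space, which is the general idea behind learning word and entity representations jointly.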

Wikipedia2Vec is a core component of our own products. The related academic paper, released in 2016, has been cited over 170 times, and the model has been used widely by industry and academia around the world. For example, the world-renowned investment firm BlackRock announced that it had developed an entity detection system using Wikipedia2Vec as a crucial part of its finance system. In 2019, Ludwig-Maximilians-Universität München and Siemens showed that a model combining Wikipedia2Vec with BERT, the language understanding model developed by Google, outperformed other models. Wikipedia2Vec has also been used in cutting-edge systems across several fields: analyzing movie stories, text analysis, extracting information from healthcare documents, and completing missing information in knowledge bases.

[Figure: visualization of entity vectors learned by Wikipedia2Vec, projected into a low-dimensional space.]

We also actively enter our research in competitions and leaderboards. We have won first place four times and second place once in competitions held at world-renowned academic conferences. In 2017, we entered the Quiz Bowl competition held at NIPS, the world’s largest academic AI conference, where our system outperformed the other AI models and went on to beat a human team of all-American quiz champions in a landslide, 465 to 200.

At Studio Ousia, we will continue to develop advanced language understanding models in various languages, and challenge ourselves to solve real-world business problems using our high-performing models.


Open Source Software

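  • Wikipedia2Vec: A tool for learning vector representations of words and entities from Wikipedia
  • mprpc: A fast remote procedure call (RPC) library for Python
  • mojimoji: A fast Python library for converting Japanese text between half-width and full-width characters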
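
As an illustration of what half-width/full-width conversion involves, here is a minimal sketch that uses only the Python standard library. It is not mojimoji's implementation or API: the names `to_half_width` and `to_full_width` are hypothetical, and the sketch covers only ASCII letters, digits, symbols and the space character, leaving kana untouched.

```python
# Full-width ASCII variants occupy U+FF01..U+FF5E, exactly 0xFEE0 above
# their half-width counterparts U+0021..U+007E.
OFFSET = 0xFEE0

# Translation tables keyed by code point, as expected by str.translate.
ZEN_TO_HAN = {cp + OFFSET: cp for cp in range(0x21, 0x7F)}
ZEN_TO_HAN[0x3000] = 0x20  # ideographic space -> ASCII space
HAN_TO_ZEN = {han: zen for zen, han in ZEN_TO_HAN.items()}


def to_half_width(text):
    """Convert full-width ASCII-range characters to half-width (hypothetical helper)."""
    return text.translate(ZEN_TO_HAN)


def to_full_width(text):
    """Convert half-width ASCII characters to full-width (hypothetical helper)."""
    return text.translate(HAN_TO_ZEN)


print(to_half_width("\uff21\uff22\uff23\uff11\uff12\uff13"))
```

A table-driven `str.translate` keeps the conversion to a single pass over the string. Converting half-width katakana would need additional handling, for example for voiced sound marks that combine with the preceding character, which this sketch deliberately leaves out.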


Neural Attentive Bag-of-Entities Model for Text Classification

Ikuya Yamada, Hiroyuki Shindo (NAIST)
The SIGNLL Conference on Computational Natural Language Learning (CoNLL), 2019 (to appear)

Trick Me If You Can: Human-in-the-loop Generation of Adversarial Examples for Question Answering

Eric Wallace (U. Maryland), Pedro Rodriguez (U. Maryland), Shi Feng (U. Maryland), Ikuya Yamada, Jordan Boyd-Graber (U. Maryland)
Transactions of the Association for Computational Linguistics (TACL), 2019

Representation Learning of Entities and Documents from Knowledge Base Descriptions

Ikuya Yamada, Hiroyuki Shindo (NAIST), Yoshiyasu Takefuji (Keio)
International Conference on Computational Linguistics (COLING), 2018

Studio Ousia's Quiz Bowl Question Answering System

Ikuya Yamada, Ryuji Tamaki, Hiroyuki Shindo (NAIST), Yoshiyasu Takefuji (Keio)
First NIPS ’17 Competition, The Springer Series on Challenges in Machine Learning, 2018

Learning Distributed Representations of Texts and Entities from Knowledge Base

Ikuya Yamada, Hiroyuki Shindo (NAIST), Hideaki Takeda (NII), Yoshiyasu Takefuji (Keio)
Transactions of the Association for Computational Linguistics (TACL), 2017

Segment-Level Neural Conditional Random Fields for Named Entity Recognition

Motoki Sato (NAIST), Hiroyuki Shindo (NAIST), Ikuya Yamada, Yuji Matsumoto (NAIST)
International Joint Conference on Natural Language Processing (IJCNLP), 2017

Named Entity Disambiguation for Noisy Text

Yotam Eshel (Technion), Noam Cohen (Technion), Kira Radinsky (Technion, eBay), Shaul Markovitch (Technion), Ikuya Yamada, Omer Levy (University of Washington)
The SIGNLL Conference on Computational Natural Language Learning (CoNLL), 2017

Ensemble of Neural Classifiers for Scoring Knowledge Base Triples

Ikuya Yamada, Motoki Sato (NAIST), Hiroyuki Shindo (NAIST)
WSDM Cup (Cambridge, UK), 2017

Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation

Ikuya Yamada, Hiroyuki Shindo (NAIST), Hideaki Takeda (NII), Yoshiyasu Takefuji (Keio)
The SIGNLL Conference on Computational Natural Language Learning (CoNLL), (Berlin, Germany), 2016, pp.250-259

Enhancing Named Entity Recognition in Twitter Messages Using Entity Linking

Ikuya Yamada, Hideaki Takeda (NII), Yoshiyasu Takefuji (Keio)
ACL 2015 Workshop on Noisy User-generated Text (Beijing, China), 2015, pp.136-140
(Shared task winner)

An End-to-End Entity Linking Approach for Tweets

Ikuya Yamada, Hideaki Takeda (NII), Yoshiyasu Takefuji (Keio)
WWW 2015 Workshop on Making Sense of Microposts (Florence, Italy), 2015, pp.55-56
(Competition winner)

Evaluating the Helpfulness of Linked Entities to Readers

Ikuya Yamada, Tomotaka Ito, Shinsuke Takagi, Shinnosuke Usami, Hideaki Takeda (NII), Yoshiyasu Takefuji (Keio)
25th ACM Conference on Hypertext and Social Media (Santiago, Chile), 2014, pp.169-178