In recent years, advances in deep learning have led to a leap in natural language processing capabilities, enabling computers to solve sophisticated tasks that were previously out of reach. For example, the leaderboards of SQuAD (a benchmark that measures an AI's ability to comprehend text) and GLUE (a suite of tasks designed to measure an AI's ability to understand text across diverse genres) show AI systems outperforming humans.
At the center of such deep-learning-based natural language processing are language understanding models trained on vast amounts of text data. It is fair to say that the performance of the language understanding model largely determines the performance of the overall deep learning system.
At Studio Ousia, we have continuously developed advanced language understanding models using the vast amount of real-world knowledge stored in Wikipedia. As a result of this research, in 2018 we released Wikipedia2Vec, a multilingual model available in 12 languages including English, Japanese, and Chinese. Wikipedia2Vec is trained on the text and link structure of Wikipedia entries, allowing it to efficiently encode the knowledge contained in Wikipedia. Because Wikipedia is available in many languages around the world, we have been able to build high-performance models across those languages.
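For readers who want to try it, here is a minimal sketch of querying the embeddings with the open-source wikipedia2vec package (installable via pip). The model file name is an assumption: pretrained models can be downloaded from the project site, and the exact file name may differ.

```python
# Minimal sketch: load a pretrained Wikipedia2Vec model and query it.
# The file name below is a hypothetical local path to a downloaded model.
from wikipedia2vec import Wikipedia2Vec

model = Wikipedia2Vec.load("enwiki_20180420_300d.pkl")

# Words and Wikipedia entities are embedded in a single vector space.
word_vec = model.get_word_vector("tokyo")       # vector for the word "tokyo"
entity_vec = model.get_entity_vector("Tokyo")   # vector for the entity <Tokyo>

# Nearest neighbors of an entity in the joint word/entity space.
for item, score in model.most_similar(model.get_entity("Tokyo"), 5):
    print(item, score)
```

Because words and entities share one vector space, an entity's neighbors can include both related entities and related words, which is what makes the embeddings useful for tasks such as entity detection and knowledge-base completion.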
As well as being a core component of our own products, Wikipedia2Vec has been widely adopted in industry and academia around the world: the academic paper we released in 2016 has been cited over 170 times. For example, the world-renowned investment firm BlackRock announced that it developed an entity detection system using Wikipedia2Vec as a crucial part of its finance system. In 2019, Ludwig-Maximilians-Universität München and Siemens showed that a model combining Wikipedia2Vec with BERT, the language understanding model developed by Google, outperformed other models. Wikipedia2Vec has also been used in various ways by cutting-edge systems in their fields: analyzing movie stories, text analysis, extracting information from healthcare documents, and completing missing information in knowledge bases.
The following is a visualization of entity vectors learned by Wikipedia2Vec, projected into a low-dimensional space.
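A plot like this can be produced by reducing the high-dimensional entity vectors to two dimensions. The sketch below uses t-SNE from scikit-learn; the choice of t-SNE and the sample entity titles are illustrative assumptions, not necessarily the exact method behind the original figure.

```python
# Sketch: project a handful of entity vectors to 2D with t-SNE and plot them.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from wikipedia2vec import Wikipedia2Vec

model = Wikipedia2Vec.load("enwiki_20180420_300d.pkl")  # hypothetical path

titles = ["Tokyo", "Kyoto", "Paris", "London", "Python (programming language)"]
vectors = np.array([model.get_entity_vector(t) for t in titles])

# scikit-learn requires perplexity < number of samples.
points = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(vectors)

plt.scatter(points[:, 0], points[:, 1])
for (x, y), title in zip(points, titles):
    plt.annotate(title, (x, y))
plt.show()
```

In such a projection, semantically related entities (for example, cities) tend to form visible clusters, which gives an intuitive view of the knowledge the model has encoded.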
We also actively put our research to the test in competitions and on leaderboards, winning first place four times and second place once in competitions held at world-renowned academic conferences. In 2017, we entered the Quiz Bowl competition held at NIPS, the world's largest academic AI conference, where our system outperformed the other AI models and went on to beat a human team of all-American quiz champions in a landslide, 465 to 200.
At Studio Ousia, we will continue to develop advanced language understanding models in various languages, and challenge ourselves to solve real-world business problems using our high-performing models.