This page provides access to wordnets in a variety of languages, all linked to the Princeton WordNet of English (PWN). The individual wordnets were built by many different projects and vary greatly in size and accuracy. We have (i) extracted and normalized the data, (ii) linked it to Princeton WordNet 3.0 and (iii) put it in one place. Only wordnets whose license allows redistribution are included (Bond and Paik, 2012).
Search in OMW 1.3 (human curated)
We extended this with automatically extracted data for over 150 languages from Wiktionary and the Unicode Common Locale Data Repository (Bond and Foster, 2013).
Search in Extended OMW 1.3 (with Wiktionary and UCLDR data)
    import nltk
    nltk.download("wordnet")
    nltk.download("omw-1.4")
    # if you want the wiktionary data
    nltk.download("extended_omw")
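Once the data is downloaded, the wordnets can be queried through NLTK's WordNet reader. The following is a minimal sketch rather than an official recipe: the helper names `load_wordnet` and `lemmas_in_language` are hypothetical, while `synsets`, `name`, `lemma_names` and `langs` are methods of NLTK's reader. It assumes the requested language code (here `jpn`) is present in the downloaded data.

```python
def load_wordnet():
    """Return NLTK's WordNet reader (requires the downloads above)."""
    # Imported lazily so the helper below can be defined without NLTK installed.
    from nltk.corpus import wordnet as wn
    return wn


def lemmas_in_language(wn, word, lang):
    """Map each synset of an English `word` to its lemmas in `lang`."""
    # `word` is looked up in English; `lang` is a three-letter code such as "jpn".
    return {s.name(): s.lemma_names(lang) for s in wn.synsets(word)}


# Example use (commented out because it needs the downloaded data):
# wn = load_wordnet()
# print(sorted(wn.langs()))                    # language codes that are available
# print(lemmas_in_language(wn, "dog", "jpn"))  # synset name -> Japanese lemmas
```

Passing the reader in as an argument keeps the helper independent of how the data was loaded.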
Wordnet | Lang | Synsets | Words | Senses | Core | Licence |
---|---|---|---|---|---|---|
Albanet | als | 4,675 | 5,988 | 9,599 | 31% | CC BY 3.0 |
Arabic WordNet (AWN v2) | arb | 9,916 | 17,785 | 37,335 | 47% | CC BY SA 3.0 |
BulTreeBank Wordnet (BTB-WN) | bul | 4,959 | 6,720 | 8,936 | 99% | CC BY 3.0 |
Chinese Open Wordnet | cmn | 42,312 | 61,533 | 79,809 | 100% | wordnet |
Chinese Wordnet (Taiwan) | cmn | 4,913 | 3,206 | 8,069 | 28% | wordnet |
DanNet | dan | 4,476 | 4,468 | 5,859 | 81% | wordnet |
Greek Wordnet | ell | 18,049 | 18,227 | 24,106 | 57% | Apache 2.0 |
Princeton WordNet | eng | 117,659 | 148,730 | 206,978 | 100% | wordnet |
Persian Wordnet | fas | 17,759 | 17,560 | 30,461 | 41% | Free to use |
FinnWordNet | fin | 116,763 | 129,839 | 189,227 | 100% | CC BY 3.0 |
WOLF (Wordnet Libre du Français) | fra | 59,091 | 55,373 | 102,671 | 92% | CeCILL-C |
Hebrew Wordnet | heb | 5,448 | 5,325 | 6,872 | 27% | wordnet |
Croatian Wordnet | hrv | 23,120 | 29,008 | 47,900 | 100% | CC BY 3.0 |
MultiWordNet | ita | 35,001 | 41,855 | 63,133 | 83% | CC BY 3.0 |
Japanese Wordnet | jpn | 57,184 | 91,964 | 158,069 | 95% | wordnet |
Multilingual Central Repository | cat | 45,826 | 46,531 | 70,622 | 81% | CC BY 3.0 |
Multilingual Central Repository | eus | 29,413 | 26,240 | 48,934 | 71% | CC BY 3.0 |
Multilingual Central Repository | glg | 19,312 | 23,124 | 27,138 | 36% | CC BY 3.0 |
Multilingual Central Repository | spa | 38,512 | 36,681 | 57,764 | 76% | CC BY 3.0 |
Wordnet Bahasa | ind | 38,085 | 36,954 | 106,688 | 94% | MIT |
Wordnet Bahasa | zsm | 36,911 | 33,932 | 105,028 | 96% | MIT |
Norwegian Wordnet | nno | 3,671 | 3,387 | 4,762 | 66% | wordnet |
Norwegian Wordnet | nob | 4,455 | 4,186 | 5,586 | 81% | wordnet |
plWordNet | pol | 33,826 | 45,387 | 52,378 | 54% | wordnet |
OpenWN-PT | por | 43,895 | 54,071 | 74,012 | 84% | CC BY SA |
sloWNet | slv | 42,583 | 40,233 | 70,947 | 86% | CC BY SA 3.0 |
Swedish (SALDO) | swe | 6,796 | 5,824 | 6,904 | 99% | CC BY 3.0 |
Thai Wordnet | tha | 73,350 | 82,504 | 95,517 | 81% | wordnet |
Language codes are linked to the English Wikipedia.
Synsets marked with ✪ in the interface belong to the semi-automatically compiled list of 5,000 core word senses in Princeton WordNet (approximately the 5,000 most frequently used word senses). The original list is available at http://wordnetcode.princeton.edu/standoff-files/core-wordnet.txt (Boyd-Graber et al., 2008). Our version has been converted to use the Collaborative Interlingual Index.
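The Core column in the table above presumably reports how much of this core list each wordnet covers. The page does not spell out the calculation, so the following sketch is an assumption. It takes two sets of `offset-pos` synset identifiers and returns the share of core synsets that also occur in the wordnet; the function name `core_coverage` and the sample identifiers are hypothetical.

```python
def core_coverage(wordnet_synsets, core_synsets):
    """Percentage of core synsets that also occur in the wordnet."""
    core = set(core_synsets)
    if not core:
        return 0.0
    covered = core & set(wordnet_synsets)
    return 100.0 * len(covered) / len(core)


# Illustrative identifiers only, not taken from the real core list.
core = {"00000001-n", "00000002-n", "00000003-v", "00000004-a"}
wordnet = {"00000001-n", "00000003-v", "00000009-r"}
print(f"{core_coverage(wordnet, core):.0f}%")
```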
The wordnets are linked to the Suggested Upper Merged Ontology (SUMO: Niles and Pease, 2001; Pease, 2011); TempoWordNet (Dias et al., 2014); the multilingual, layered sentiment lexicons (ML-SentiCon: Cruz et al., 2014); and SentiWordNet 3.0 (Baccianella et al., 2010).
Wordnet versions were mapped to one another using the mappings from TALP at UPC (Daudé et al., 2000).
The wn-data-*.tab files are tab-separated. After a header line, each row is either a synset-lemma pair or a synset, sub-id and definition or example:

    # name␉lang␉url␉license
    offset-pos␉lang:lemma␉word
    offset-pos␉lang:def␉sid␉definition
    offset-pos␉lang:exe␉sid␉example
    ...
Field | Description |
---|---|
name | the name of the project |
lang | the ISO 639-3 three-letter code for the language |
url | the URL of the project |
license | a short name for the license |
offset | the 8-digit Princeton WordNet 3.0 offset |
pos | one of [a,v,n,r] (we treat 's' as 'a') |
lemma | the lemma (word separator normalized to ' ') |
sid | the sub-id of the definition/example (starting from 0) |
    # Wordnet Bahasa	ind	http://wn-msa.sourceforge.net/	MIT
    00019613-n	ind:def	0	masalah fisik yang nyata
    00019613-n	ind:lemma	inti
    00019613-n	ind:lemma	unsur
    11407591-n	ind:def	0	Novelis dan kritikus Perancis
    11407591-n	ind:def	1	pembela Dreyfus
    11407591-n	ind:lemma	Emile Zola
    11407591-n	ind:lemma	Zola
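A file in this format can be read with a few lines of Python. The parser below is a minimal sketch based only on the field description above: the function name `parse_wn_tab` and the choice of dictionaries as the return type are assumptions, and the sample reuses the Wordnet Bahasa rows shown here with the tabs written out explicitly.

```python
from collections import defaultdict


def parse_wn_tab(lines):
    """Parse wn-data-*.tab content into (header, lemmas, definitions, examples)."""
    header = None
    lemmas = defaultdict(list)       # "offset-pos" -> [lemma, ...]
    definitions = defaultdict(dict)  # "offset-pos" -> {sid: definition}
    examples = defaultdict(dict)     # "offset-pos" -> {sid: example}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        if line.startswith("#"):
            # The first comment line carries name, lang, url and license.
            if header is None:
                fields = line.lstrip("#").strip().split("\t")
                header = dict(zip(("name", "lang", "url", "license"), fields))
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # not a data row
        synset, kind = fields[0], fields[1].split(":", 1)[-1]
        # Defensive: fold satellite adjectives ('s') into 'a', as the page does.
        if synset.endswith("-s"):
            synset = synset[:-1] + "a"
        if kind == "lemma":
            lemmas[synset].append(fields[2])
        elif kind in ("def", "exe") and len(fields) >= 4:
            target = definitions if kind == "def" else examples
            target[synset][int(fields[2])] = fields[3]
    return header, dict(lemmas), dict(definitions), dict(examples)


sample = (
    "# Wordnet Bahasa\tind\thttp://wn-msa.sourceforge.net/\tMIT\n"
    "00019613-n\tind:def\t0\tmasalah fisik yang nyata\n"
    "00019613-n\tind:lemma\tinti\n"
    "00019613-n\tind:lemma\tunsur\n"
    "11407591-n\tind:def\t0\tNovelis dan kritikus Perancis\n"
    "11407591-n\tind:def\t1\tpembela Dreyfus\n"
    "11407591-n\tind:lemma\tEmile Zola\n"
    "11407591-n\tind:lemma\tZola\n"
)
header, lemmas, definitions, examples = parse_wn_tab(sample.splitlines())
print(header["name"], header["lang"])
print(lemmas["11407591-n"])
print(definitions["11407591-n"])
```

Keying everything on the `offset-pos` string keeps the result easy to join with other resources that use Princeton WordNet 3.0 offsets.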
For this data to be really useful you need to combine it with the synset relations from the Princeton wordnet.
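One way to do this is through NLTK, whose bundled `wordnet` corpus is WordNet 3.0, so its offsets are assumed to match the `offset-pos` keys used here. The sketch below is an illustration, not this project's own tooling: `synset_key` and `hypernym_lemmas` are hypothetical helpers, `lemmas` stands for a dictionary from `offset-pos` keys to lemma lists built from a wn-data-*.tab file, and `synset_from_pos_and_offset`, `hypernyms`, `offset` and `pos` are NLTK methods.

```python
def synset_key(synset):
    """Format an NLTK synset as an 'offset-pos' key, treating 's' as 'a'."""
    pos = "a" if synset.pos() == "s" else synset.pos()
    return f"{synset.offset():08d}-{pos}"


def hypernym_lemmas(wn, key, lemmas):
    """Return {hypernym key: lemmas} for the synset identified by `key`."""
    offset, pos = key.split("-")
    synset = wn.synset_from_pos_and_offset(pos, int(offset))
    return {
        synset_key(h): lemmas.get(synset_key(h), [])
        for h in synset.hypernyms()
    }


# Example use (commented out because it needs NLTK and its WordNet data):
# from nltk.corpus import wordnet as wn
# print(hypernym_lemmas(wn, "00019613-n", lemmas))
```

The same pattern applies to other relations exposed by the reader, such as hyponyms.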
There are some places where I made changes to harmonize different wordnets:
Source code hosted at https://github.com/omwn/omwn.github.io.