OMW Version 1

This page provides access to wordnets in a variety of languages, all linked to the Princeton WordNet of English (PWN). The individual wordnets were built by many different projects and vary greatly in size and accuracy. We have (i) extracted and normalized the data, (ii) linked it to Princeton WordNet 3.0 and (iii) put it in one place. Only wordnets with a license that allows redistribution are included (Bond and Paik, 2012).

Search in OMW 1.3 (human curated)

We extended this with automatically extracted data for over 150 languages from Wiktionary and the Unicode Common Locale Data Repository (Bond and Foster, 2013).

Search in Extended OMW 1.3 (with Wiktionary and UCLDR data)

Data and Code

The data files are available in the OMW Data Repository
Disambiguated Swadesh lists are available here
Code for the wordnet browser is available here

You can use the OMW data through the NLTK wordnet interface; you first have to download the data, e.g.

	import nltk

	nltk.download("wordnet")
	nltk.download("omw-1.4")
	# if you want the Wiktionary data
	nltk.download("extended_omw")

Summary of Wordnets

28 Available Wordnets
Wordnet	Lang	Synsets	Words	Senses	Core	Licence
Albanet	als	4,675	5,988	9,599	31%	CC BY 3.0
Arabic WordNet (AWN v2)	arb	9,916	17,785	37,335	47%	CC BY SA 3.0
BulTreeBank Wordnet (BTB-WN)	bul	4,959	6,720	8,936	99%	CC BY 3.0
Chinese Open Wordnet	cmn	42,312	61,533	79,809	100%	wordnet
Chinese Wordnet (Taiwan)	cmn	4,913	3,206	8,069	28%	wordnet
DanNet	dan	4,476	4,468	5,859	81%	wordnet
Greek Wordnet	ell	18,049	18,227	24,106	57%	Apache 2.0
Princeton WordNet	eng	117,659	148,730	206,978	100%	wordnet
Persian Wordnet	fas	17,759	17,560	30,461	41%	Free to use
FinnWordNet	fin	116,763	129,839	189,227	100%	CC BY 3.0
WOLF (Wordnet Libre du Français)	fra	59,091	55,373	102,671	92%	CeCILL-C
Hebrew Wordnet	heb	5,448	5,325	6,872	27%	wordnet
Croatian Wordnet	hrv	23,120	29,008	47,900	100%	CC BY 3.0
MultiWordNet	ita	35,001	41,855	63,133	83%	CC BY 3.0
Japanese Wordnet	jpn	57,184	91,964	158,069	95%	wordnet
Multilingual Central Repository	cat	45,826	46,531	70,622	81%	CC BY 3.0
Multilingual Central Repository	eus	29,413	26,240	48,934	71%	CC BY 3.0
Multilingual Central Repository	glg	19,312	23,124	27,138	36%	CC BY 3.0
Multilingual Central Repository	spa	38,512	36,681	57,764	76%	CC BY 3.0
Wordnet Bahasa	ind	38,085	36,954	106,688	94%	MIT
Wordnet Bahasa	zsm	36,911	33,932	105,028	96%	MIT
Norwegian Wordnet	nno	3,671	3,387	4,762	66%	wordnet
Norwegian Wordnet	nob	4,455	4,186	5,586	81%	wordnet
plWordNet	pol	33,826	45,387	52,378	54%	wordnet
OpenWN-PT	por	43,895	54,071	74,012	84%	CC BY SA
sloWNet	slv	42,583	40,233	70,947	86%	CC BY SA 3.0
Swedish (SALDO)	swe	6,796	5,824	6,904	99%	CC BY 3.0
Thai Wordnet	tha	73,350	82,504	95,517	81%	wordnet

Language codes are linked to the English Wikipedia.

Documentation and Notes

Core

Synsets in the semi-automatically compiled list of 5000 core word senses in Princeton WordNet (approximately the 5000 most frequently used word senses) are marked with ✪ in the interface. The original list is available at http://wordnetcode.princeton.edu/standoff-files/core-wordnet.txt (Boyd-Graber et al., 2006). We also provide our own version, converted to use the Collaborative Interlingual Index.
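As a rough illustration of how a Core percentage like those in the summary table could be computed, here is a minimal sketch. It assumes the figure is the share of core synsets for which a wordnet has at least one lemma; the function name and the sample synset identifiers are hypothetical and are not taken from the OMW code.

```python
def core_coverage(core_synsets, wordnet_synsets):
    """Percentage of core synsets that also occur in a wordnet.

    Both arguments are iterables of "offset-pos" strings. This is an
    assumed definition of the Core column, not the official one.
    """
    core = set(core_synsets)
    if not core:
        return 0.0
    covered = core & set(wordnet_synsets)
    return 100.0 * len(covered) / len(core)


# Hypothetical identifiers, for illustration only.
core = {"00000001-n", "00000002-n", "00000003-v", "00000004-a"}
found = {"00000001-n", "00000003-v", "00000009-r"}
print(f"{core_coverage(core, found):.0f}%")
```

Synsets outside the core list (the last identifier in `found`) do not affect the result, so a small wordnet can still score highly if it concentrates on frequent senses.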

The wordnets are linked to the Suggested Upper Merged Ontology (SUMO: Niles and Pease, 2001; Pease, 2011); TempoWordNet (Dias et al., 2014); the multilingual, layered sentiment lexicons (ML-SentiCon: Cruz et al., 2014); and SentiWordNet 3.0 (Baccianella et al., 2010).

Mapping between wordnet versions was done using the mappings from TALP at UPC (Daudé et al., 2000).

Formats

Tab files

The wn-data-*.tab files are tab-separated files of synset-lemma pairs, or of synset-subid-definition/example triples:

# name␉lang␉url␉license
offset-pos␉lang:lemma␉word
offset-pos␉lang:def␉sid␉definition
offset-pos␉lang:exe␉sid␉example
...

name	the name of the project
lang	the ISO 639-3 three-letter code for the language
url	the url of the project
license	a short name for the license
offset	the 8-digit Princeton WordNet 3.0 offset
pos	one of [a,v,n,r] (we treat 's' as 'a')
lemma	the lemma (word separator normalized to ' ')
sid	the sub id of the definition/example (starting from 0)

Example:

# Wordnet Bahasa	ind	http://wn-msa.sourceforge.net/	MIT
00019613-n	ind:def	0	masalah fisik yang nyata
00019613-n	ind:lemma	inti
00019613-n	ind:lemma	unsur
11407591-n	ind:def	0	Novelis dan kritikus Perancis
11407591-n	ind:def	1	pembela Dreyfus
11407591-n	ind:lemma	Emile Zola
11407591-n	ind:lemma	Zola
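To make the layout concrete, here is a minimal sketch of a reader for this format. It is not the code used to build OMW; the function name `parse_omw_tab` and the returned structure are assumptions chosen for illustration, and malformed rows are simply skipped.

```python
from collections import defaultdict


def parse_omw_tab(lines):
    """Return (header, lemmas, definitions, examples) from wn-data-*.tab lines."""
    header = None
    lemmas = defaultdict(list)       # "offset-pos" -> [lemma, ...]
    definitions = defaultdict(dict)  # "offset-pos" -> {sid: text}
    examples = defaultdict(dict)     # "offset-pos" -> {sid: text}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue
        if line.startswith("#"):
            # First comment line: name, lang, url, license (tab separated).
            parts = [p.strip() for p in line.lstrip("#").split("\t")]
            if header is None and len(parts) >= 4:
                header = dict(zip(("name", "lang", "url", "license"), parts))
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip malformed rows rather than guess at their meaning
        synset, kind = fields[0], fields[1].partition(":")[2]
        if kind == "lemma":
            lemmas[synset].append(fields[2])
        elif kind in ("def", "exe") and len(fields) >= 4:
            target = definitions if kind == "def" else examples
            target[synset][int(fields[2])] = fields[3]
    return header, lemmas, definitions, examples


# A few rows from the example above.
sample = [
    "# Wordnet Bahasa\tind\thttp://wn-msa.sourceforge.net/\tMIT",
    "11407591-n\tind:def\t0\tNovelis dan kritikus Perancis",
    "11407591-n\tind:def\t1\tpembela Dreyfus",
    "11407591-n\tind:lemma\tEmile Zola",
    "11407591-n\tind:lemma\tZola",
]
header, lemmas, definitions, examples = parse_omw_tab(sample)
print(header["lang"], lemmas["11407591-n"], definitions["11407591-n"][1])
```

Keying definitions and examples by sub id keeps multi-part definitions, such as the two for 11407591-n, in their original order. For a real file, the same function could be given an open file object (opened as UTF-8) instead of a list.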

For this data to be really useful you need to combine it with the synset relations from the Princeton wordnet.
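One way to make that connection is to split the offset-pos key into its two parts, since PWN tools usually look synsets up by part of speech and numeric offset. The helper below is a hypothetical sketch (the name `split_synset_key` is not part of OMW or NLTK).

```python
VALID_POS = {"a", "v", "n", "r"}


def split_synset_key(key):
    """Split an OMW key such as '00019613-n' into (pos, offset)."""
    offset, _, pos = key.partition("-")
    if len(offset) != 8 or not offset.isdigit() or pos not in VALID_POS:
        raise ValueError(f"not an offset-pos key: {key!r}")
    # The offset is zero-padded to 8 digits in the tab files.
    return pos, int(offset)


pos, offset = split_synset_key("00019613-n")
print(pos, offset)
```

With the NLTK WordNet 3.0 data installed, the resulting pair could be passed to `wn.synset_from_pos_and_offset(pos, offset)` to reach hypernyms and the other PWN relations; that step is left out here because it needs the downloaded corpus.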

Known Problems

We discard any synsets not linked to PWN (such as new synsets in the Arabic wordnet). Use OMW version 2 for these.
If a wordnet has a different structure, we only show those concepts with synonymous or near-synonymous links to PWN. So for Danish, Polish and Norwegian, we only have a small subset of the entire wordnet.
We currently only make use of the synset-level sentiment analysis from ML-SentiCon (Cruz et al., 2014); we do not show the language-specific lemma-level analysis.
We currently can't add wordnets that don't link to PWN (such as Gaelic).
We are focused on adding lemmas; we do not yet have all the extra information from other projects, such as:
- Definitions and examples from wordnets such as Spanish
- Orthographic variation and pronunciation in the Hebrew Wordnet
We plan to add this information as time permits.
We should strip diacritics from the Arabic wordnet to make it easier for lookup.
We may yet be missing some available wordnets: please help us add more. Any wordnet with an open license that links to the Princeton Wordnet is welcome.
The interface is not very multilingual.

Notes

There are some places where I made changes to harmonize different wordnets:

Inter-word separator changed to space ' ' (some wordnets had '_'). Note that non-segmenting languages (e.g. Japanese and Thai) have no inter-word separator.
Ignore the focal/satellite adjective distinction, as not all wordnets make it and we don't show it in the interface. That is, map all 's' to 'a'.
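The two changes above can be sketched as follows. This is an illustration only, not the conversion code that was actually used; the function names are hypothetical.

```python
def normalize_lemma(lemma):
    """Use a plain space as the inter-word separator."""
    return lemma.replace("_", " ")


def normalize_pos(pos):
    """Fold satellite adjectives ('s') into plain adjectives ('a')."""
    return "a" if pos == "s" else pos


def normalize_key(key):
    """Apply the part-of-speech folding to an 'offset-pos' key."""
    offset, _, pos = key.rpartition("-")
    return f"{offset}-{normalize_pos(pos)}"


print(normalize_lemma("Emile_Zola"))
print(normalize_key("00001740-s"))
```

Lemmas from non-segmenting languages pass through `normalize_lemma` unchanged, since they contain no separator to replace.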

References

Francis Bond and Kyonghee Paik (2012): A survey of wordnets and their licenses. In Proceedings of the 6th Global WordNet Conference (GWC 2012). Matsue. 64–71
Francis Bond and Ryan Foster (2013): Linking and extending an open multilingual wordnet. In 51st Annual Meeting of the Association for Computational Linguistics: ACL-2013. Sofia. 1352–1362
J. Boyd-Graber, C. Fellbaum, D. Osherson and R. Schapire (2006): Adding dense, weighted connections to WordNet. In Proceedings of the Third Global WordNet Meeting. Jeju Island, Korea
Stefano Baccianella, Andrea Esuli and Fabrizio Sebastiani (2010): SentiWordNet 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10). Valletta, Malta
Fermín L. Cruz, José A. Troyano, Beatriz Pontes and F. Javier Ortega (2014): Building layered, multilingual sentiment lexicons at synset and lemma levels. Expert Systems with Applications
Jordi Daudé, Lluís Padró and German Rigau (2000): Mapping WordNets Using Structural Information. In 38th Annual Meeting of the Association for Computational Linguistics (ACL'2000). Hong Kong
Adam Pease (2011): Ontology: A Practical Guide. Articulate Software Press, Angwin, CA. ISBN 978-1-889455-10-5
Ian Niles and Adam Pease (2001): Toward a Standard Upper Ontology. In Proceedings of the 2nd International Conference on Formal Ontology in Information Systems (FOIS-2001), Chris Welty and Barry Smith, eds.
Gaël Dias, Mohammed Hasanuzzaman, Stéphane Ferrari and Yann Mathet (2014): TempoWordNet for Sentence Time Tagging. In Proceedings of the Companion Publication of the 23rd International Conference on World Wide Web Companion. Switzerland. 833–838

Maintainer: Francis Bond

Contributors: Francis Bond, Luís Morgado da Costa, Michael Goodman and all the wordnet projects.

Source code hosted at https://github.com/omwn/omwn.github.io.