
katerysh/wikokit

 
 


Language is a city to the building of which every human being brought a stone.

Ralph Waldo Emerson

Wikokit - Machine-readable Wiktionary

Stone I. Parser wikokit. This program parses Wiktionaries, then constructs and fills machine-readable Wiktionary databases.

Stone II. PHP API (piwidict project) for working with the machine-readable Wiktionary.

Stone III. Dictionary kiwidict. A visual interface to the parsed English Wiktionary and Russian Wiktionary databases.

The goal of this project is to extract semi-structured information from Wiktionary and construct a machine-readable dictionary (database + API + GUI).

Download new Wiktionary parsed databases from this page.

Stone III: Dictionary kiwidict - Android applications

  • kiwidict: offline multilingual dictionary and thesaurus based on the English Wiktionary.
  • kiwidict-ru: offline multilingual dictionary and thesaurus based on the Russian Wiktionary.
  • magnetowordik: word game based on data extracted from the English Wiktionary.

The graphical user interface (kiwidict and kiwidict-ru) supports the following (see release_notes.txt):

  • word filtering by language code (e.g. de, fr);
  • wildcard characters: the percent sign (%) matches zero or more characters, and the underscore (_) matches a single character;
  • todo: list only words that have meanings and/or semantic relations (use checkboxes).
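
The wildcard rules above are the same as those of the SQL LIKE operator. The following is a minimal sketch of those semantics in plain Java, translating a wildcard pattern into a regular expression. The class `WildcardFilter` and its methods are hypothetical and are not part of wikokit; how kiwidict evaluates the pattern internally is not stated here, and this sketch is case-sensitive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper (not wikokit code): emulates the % and _ wildcards. */
public class WildcardFilter {

    /** Converts a wildcard pattern into an equivalent regular expression. */
    public static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '%' || c == '_') {
                // Flush pending literal text, quoted so regex metacharacters stay literal.
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(c == '%' ? ".*" : "."); // % = zero or more, _ = exactly one
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    /** Returns true if the whole word matches the wildcard pattern. */
    public static boolean matches(String wildcard, String word) {
        return toRegex(wildcard).matcher(word).matches();
    }

    /** Keeps only the words that match the wildcard pattern. */
    public static List<String> filter(String wildcard, List<String> words) {
        Pattern pattern = toRegex(wildcard);
        List<String> result = new ArrayList<>();
        for (String word : words) {
            if (pattern.matcher(word).matches()) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("stone", "store", "story", "stories");
        System.out.println(filter("sto_e", words)); // prints [stone, store]
        System.out.println(filter("stor%", words)); // prints [store, story, stories]
    }
}
```

In this sketch `sto_e` matches only five-letter words, because the underscore stands for exactly one character, while `stor%` also matches longer words such as "stories".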

After installation you can find the parsed Wiktionary database in SQLite format on your phone in the folder SD card/kiwidict/.

Stone I: Parser and dictionary description

I) The ultimate goal (in the distant future) is to extract all information (i.e. all sections of each entry) from all Wiktionaries and convert the data to a machine-readable format.

II) Today's result. The machine-readable Wiktionary currently contains the following information extracted from the Russian Wiktionary and the English Wiktionary:

  1. word's language and part of speech;
  2. meanings / definitions;
  3. semantic relations;
  4. translations;
  5. (^) context labels (from definitions);
  6. (^) quotations (text + bibliographic data).

(^) Context labels and quotations were extracted only from the Russian Wiktionary.
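
To make the list above concrete, here is a minimal sketch of how one parsed entry could be modelled in Java. All class and field names (`EntrySketch`, `Meaning`, `Quotation`) are hypothetical and do not reflect the actual wikokit database schema or API. Attaching relations and translations to individual meanings is also an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of one parsed entry; names are illustrative, not wikokit's schema. */
public class EntrySketch {
    public final String word;
    public final String languageCode;  // 1. word's language, e.g. "en"
    public final String partOfSpeech;  // 1. part of speech, e.g. "noun"
    public final List<Meaning> meanings = new ArrayList<>(); // 2. meanings / definitions

    public EntrySketch(String word, String languageCode, String partOfSpeech) {
        this.word = word;
        this.languageCode = languageCode;
        this.partOfSpeech = partOfSpeech;
    }

    public static class Meaning {
        public final String definition;
        // 3. semantic relations: relation type -> related words (assumed layout)
        public final Map<String, List<String>> relations = new LinkedHashMap<>();
        // 4. translations: target language code -> translated words (assumed layout)
        public final Map<String, List<String>> translations = new LinkedHashMap<>();
        // 5. context labels taken from the definition
        public final List<String> contextLabels = new ArrayList<>();
        // 6. quotations (text + bibliographic data)
        public final List<Quotation> quotations = new ArrayList<>();

        public Meaning(String definition) {
            this.definition = definition;
        }
    }

    public static class Quotation {
        public final String text;
        public final String source; // bibliographic data, kept as one string here

        public Quotation(String text, String source) {
            this.text = text;
            this.source = source;
        }
    }

    /** Builds a small made-up entry for demonstration. */
    public static EntrySketch example() {
        EntrySketch entry = new EntrySketch("stone", "en", "noun");
        Meaning meaning = new Meaning("a small piece of rock");
        meaning.relations.put("synonyms", Arrays.asList("rock"));
        meaning.translations.put("de", Arrays.asList("Stein"));
        entry.meanings.add(meaning);
        return entry;
    }

    public static void main(String[] args) {
        EntrySketch entry = example();
        System.out.println(entry.word + " (" + entry.languageCode + ", "
                + entry.partOfSpeech + "): " + entry.meanings.get(0).translations);
    }
}
```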

Figure: Machine-readable Wiktionary framework

I would like all two hundred Wiktionaries to be parsed by this parser, but I know only Russian and English :)

If you are a developer and are interested in adding modules to parse "your Wiktionary", then

Statistics

The machine-readable dictionary database statistics:

Project structure

The wiki tool kit (wikokit) contains several wiki-related projects:

./common_wiki — common (low-level) functions to handle data of Wikipedia and Wiktionary in MySQL database,

./common_wiki_jdbc — functions to handle data of Wiktionary in MySQL and SQLite databases (JDBC, Java SE) (depends on common_wiki.jar).

./android/common_wiki_alink — Eclipse copy (source link) of ./common_wiki (!NetBeans)

./android/common_wiki_android — functions for access to Wiktionary in Android SQLite version of database (depends on common_wiki.jar).

./android/magnetowordik — Android word game (Wiktionary thesaurus).

./hits_wiki — API for access to Wikipedia in MySQL database, algorithms to search synonyms in Wikipedia (depends on jcfd.jar, common_wiki.jar).

./TGWikiBrowser — visual browser to search for synonyms in local or remote Wikipedia (depends on hits_wiki.jar and common_wiki.jar)

./wikidf — Wiki Index Database (list of lemmas and links to wiki pages, which contain these lemmas).

./wikt_parser — Wiktionary parser that creates a MySQL database (like WordNet) from a Wiktionary MySQL dump file. The project goal is to convert Wiktionary articles to a machine-readable format. (It depends on common_wiki and common_wiki_jdbc.)

./wiwordik — Visualization of parsed Wiktionary database. wiki + word = wiwordik.

Code from the earlier Synarcher project is used in wikokit.

Further reading

In English

In Russian

See also

License

This program is multi-licensed and may be used under the terms of any of the following licenses:

See documentation.
