Turkish NLP Resources

Tools, libraries, models, datasets, and other resources for Turkish NLP (Türkçe Doğal Dil İşleme). For the Turkish version: turkce.ml/tr

Contents:

Tools/Libraries | Models | Datasets | Other Resources


Tools/Libraries

  • ITU Turkish NLP (Web Based & API) : Tools of Istanbul Technical University, Natural Language Processing Group.
  • VNLP (Python) : State-of-the-art, lightweight NLP tools for the Turkish language.
  • TDD - Tools (Web based) : Online tools provided by Turkish Data Depository (TDD) project.
  • Zemberek-NLP (Java) : Zemberek-NLP provides Natural Language Processing tools for Turkish.
  • Zemberek-Python (Python) : Python implementation of Zemberek.
  • Zemberek-Server (Docker) : Dockerized REST server built on the Zemberek Turkish NLP Java library.
  • Mukayese (Python) : A benchmarking platform for various Turkish NLP tools and tasks, ranging from spell-checking to NLU tasks.
  • SadedeGel (Python) : Initially designed as a library for unsupervised extraction-based news summarization using several old and new NLP techniques.
  • Turkish Stemmer (Python) : Stemming algorithm for the Turkish language.
  • sinKAF (Python) : An ML library for profanity detection in Turkish sentences.
  • TrTokenizer (Python) : Sentence and word tokenizers for the Turkish language.
  • Tools for Turkish NLP provided by Starlang (Multi/Python) : Morphological Analysis, Spell Checker, Dependency Parser, Deasciifier, NER.
  • snnclsr/NER (Python) : Named Entity Recognition system for the Turkish Language.

Models

Word Embeddings

Datasets

  • TDD - Türkçe Dil Deposu (Turkish Language Repository) : The Turkish Natural Language Processing Project, one of the main projects of the Turkey Open Source Platform, aims to prepare the datasets needed for the processing of Turkish texts.
  • ITU NLP Group - Datasets : Datasets of Istanbul Technical University, Natural Language Processing Group.
  • Boğaziçi University TABI - NLI-TR : Natural Language Inference in Turkish (NLI-TR) is a set of two large-scale datasets obtained by translating the foundational NLI corpora (SNLI and MultiNLI) using Amazon Translate.

Multilingual Datasets:

  • Amazon MASSIVE : MASSIVE is a parallel dataset of 1M utterances across 51 languages with annotations for the NLU tasks of intent prediction and slot annotation.
  • OPUS: en-tr : OPUS is a growing collection of translated texts from the web. In the OPUS project we try to convert and align free online data, to add linguistic annotation, and to provide the community with a publicly available parallel corpus.
  • CC-100 : Monolingual datasets from web crawl data. This corpus comprises monolingual data for 100+ languages.
  • OSCAR : A huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the Ungoliant architecture.

Treebank:

  • Universal Dependencies : An international cooperative project to create treebanks of the world's languages. The project seeks to develop cross-linguistically consistent treebank annotation of morphology and syntax for multiple languages.
  • UD Turkish Kenet : The Turkish-Kenet UD Treebank consists of 18,700 manually annotated sentences and 178,700 tokens. Its corpus consists of dictionary examples from TDK.
  • UD Turkish BOUN : The BOUN Treebank was created by TABILAB and supported by TÜBİTAK. This corpus contains 9,761 sentences and 121,214 tokens.

Other Data:

Other Sources:

Other Resources

Books:

Videos:

Articles:

Sample Notebooks/Snippets:

Blog Posts:

Other Lists:

Contributing

Your contributions are welcome. If you want to contribute to this list, send a pull request or open a new issue.