Turkish NLP Resources
Tools, libraries, models, datasets, and other resources related to Turkish NLP (Türkçe Doğal Dil İşleme). Turkish version: turkce.ml/tr
Tools/Libraries
- ITU Turkish NLP (Web Based & API) : Tools of Istanbul Technical University, Natural Language Processing Group.
- VNLP (Python) : State-of-the-art, lightweight NLP tools for Turkish.
- TDD - Tools (Web based) : Online tools provided by Turkish Data Depository (TDD) project.
- Zemberek-NLP (Java) : Zemberek-NLP provides Natural Language Processing tools for Turkish.
- Zemberek-Python (Python) : Python implementation of Zemberek.
- Zemberek-Server (Docker) : Dockerized REST server built on the Zemberek Turkish NLP Java library.
- Mukayese (Python) : A benchmarking platform for Turkish NLP tools and tasks, ranging from spell-checking to NLU.
- SadedeGel (Python) : Initially designed as a library for unsupervised, extraction-based news summarization that combines older and newer NLP techniques.
- Turkish Stemmer (Python) : Stemmer algorithm for Turkish language.
- sinKAF (Python) : An ML library for profanity detection in Turkish sentences.
- TrTokenizer (Python) : Sentence and word tokenizers for the Turkish language.
- Tools for Turkish NLP provided by Starlang (Multi/Python) : Morphological Analysis, Spell Checker, Dependency Parser, Deasciifier, NER.
- snnclsr/NER (Python) : Named Entity Recognition system for the Turkish Language.
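A practical note that applies before using any tokenizer or stemmer: default Unicode case mapping does not follow Turkish rules for the dotted/dotless i pair (İ/i and I/ı). The sketch below shows one common workaround in plain Python. The helper names `turkish_lower` and `turkish_upper` are hypothetical and are not taken from any library listed above, which may handle casing differently.

```python
def turkish_lower(text: str) -> str:
    """Lowercase text with Turkish rules for the dotted/dotless i pair."""
    # Handle the two letters that differ from default casing first:
    # "İ" -> "i" and "I" -> "ı". Everything else uses str.lower().
    return text.replace("İ", "i").replace("I", "ı").lower()


def turkish_upper(text: str) -> str:
    """Uppercase text with Turkish rules for the dotted/dotless i pair."""
    # "i" -> "İ" and "ı" -> "I", then default uppercasing for the rest.
    return text.replace("i", "İ").replace("ı", "I").upper()


# Default casing maps "I" to "i", which is wrong for Turkish words.
default_result = "ISPARTA".lower()
turkish_result = turkish_lower("ISPARTA")
```

Here `default_result` contains a dotted "i", while `turkish_result` keeps the dotless "ı" that Turkish orthography expects. Apply this kind of normalization consistently to both the corpus and any lookup lists (stop words, lexicons).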
Models
- BERTurk : Turkish BERT/DistilBERT, ELECTRA and ConvBERT models.
- ELMO For ManyLangs : Pre-trained ELMo Representations for Many Languages.
- Fasttext - Word Vector : Pre-trained word vectors for 157 languages, trained on Common Crawl and Wikipedia using fastText.
- Loodos/Turkish Language Models : Transformer-based Turkish language models and related tools.
- Hugging Face - Models/Turkish
Word Embeddings
- VNLP Word Embeddings : Word2Vec Turkish word embeddings.
- TurkishGloVe : Turkish GloVe word embeddings.
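Pre-trained word vectors are often distributed as plain text: one word per line followed by its vector components, optionally preceded by a `<count> <dim>` header line (fastText `.vec` and word2vec text exports have the header, GloVe text files do not). The sketch below assumes that layout. The function names and the tiny two-dimensional sample are made up for illustration, and the resources above may ship in other formats (for example binary files), so check each project's documentation first.

```python
import io
import math
from typing import Dict, List, TextIO


def load_text_vectors(handle: TextIO) -> Dict[str, List[float]]:
    """Read word vectors from a whitespace-separated text stream."""
    vectors: Dict[str, List[float]] = {}
    for line_no, line in enumerate(handle):
        parts = line.rstrip().split(" ")
        # Assumption: a two-field first line is a "<count> <dim>" header.
        if line_no == 0 and len(parts) == 2:
            continue
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy data, not real embeddings: three words with 2-dimensional vectors.
SAMPLE = "3 2\nkedi 1.0 0.0\nköpek 0.8 0.6\nmasa 0.0 1.0\n"
vectors = load_text_vectors(io.StringIO(SAMPLE))
similarity = cosine(vectors["kedi"], vectors["köpek"])
```

For real files, open them with `encoding="utf-8"` so that Turkish characters in the vocabulary are read correctly.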
Datasets
- TDD - Türkçe Dil Deposu (Turkish Language Repository) : The Turkish Natural Language Processing Project, one of the main projects of the Turkey Open Source Platform, aims to prepare the datasets needed for the processing of Turkish texts.
- ITU NLP Group - Datasets : Datasets of Istanbul Technical University, Natural Language Processing Group.
- Boğaziçi University TABI - NLI-TR : Natural Language Inference in Turkish (NLI-TR) is a set of two large-scale datasets obtained by translating the foundational NLI corpora (SNLI and MultiNLI) with Amazon Translate.
Multilingual Datasets:
- Amazon MASSIVE : MASSIVE is a parallel dataset of 1M utterances across 51 languages with annotations for the NLU tasks of intent prediction and slot annotation.
- OPUS: en-tr : OPUS is a growing collection of translated texts from the web. In the OPUS project we try to convert and align free online data, to add linguistic annotation, and to provide the community with a publicly available parallel corpus.
- CC-100 : Monolingual datasets from web crawl data. The corpus comprises monolingual data for 100+ languages.
- OSCAR : A huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the Ungoliant architecture.
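The MASSIVE entry mentions slot annotation. As a rough sketch, assuming the annotated utterances use an inline `[slot : value]` bracket notation (verify this against the dataset's own documentation), the slots can be pulled out with a regular expression. The function names and the Turkish sample sentence are invented for illustration and are not taken from the dataset.

```python
import re
from typing import List, Tuple

# Assumed notation: "[slot_name : slot value]" embedded in the utterance.
SLOT_PATTERN = re.compile(r"\[\s*([^\]:]+?)\s*:\s*([^\]]+?)\s*\]")


def extract_slots(annotated: str) -> List[Tuple[str, str]]:
    """Return (slot_name, value) pairs in order of appearance."""
    return SLOT_PATTERN.findall(annotated)


def strip_annotations(annotated: str) -> str:
    """Recover the plain utterance by keeping only the slot values."""
    return SLOT_PATTERN.sub(r"\2", annotated)


# Hand-written example, not a record from MASSIVE.
example = "beni [date : cuma] günü [time : dokuzda] uyandır"
slots = extract_slots(example)
plain = strip_annotations(example)
```

With the example above, `slots` holds the `date` and `time` pairs and `plain` is the utterance without brackets.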
Treebank:
- Universal Dependencies : An international cooperative project to create treebanks of the world's languages. The project seeks to develop cross-linguistically consistent treebank annotation of morphology and syntax for multiple languages.
- UD Turkish Kenet : The Turkish-Kenet UD Treebank consists of 18,700 manually annotated sentences and 178,700 tokens. Its corpus consists of dictionary examples from TDK.
- UD Turkish BOUN : The BOUN Treebank was created by TABILAB and is supported by TÜBİTAK. The corpus contains 9,761 sentences and 121,214 tokens.
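Universal Dependencies treebanks are distributed in the CoNLL-U format: one token per line with ten tab-separated fields, `#` comment lines, and a blank line between sentences. The following is a minimal reader sketch that counts sentences and tokens, similar to the figures quoted above. The function name `read_conllu` and the two sample sentences are illustrative and hand-written, not taken from any treebank; for real work a dedicated CoNLL-U library is likely a better choice.

```python
from typing import Dict, Iterable, Iterator, List

FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        # Keep plain word lines only; skip multiword ranges ("1-2")
        # and empty nodes ("1.1"), whose IDs are not plain integers.
        if len(cols) != len(FIELDS) or not cols[0].isdigit():
            continue
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        yield sentence


# Hand-written sample in CoNLL-U layout (illustrative annotations).
SAMPLE = (
    "# text = Kedi uyudu.\n"
    "1\tKedi\tkedi\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tuyudu\tuyu\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
    "# text = Geldim.\n"
    "1\tGeldim\tgel\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "2\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_\n"
)
sentences = list(read_conllu(SAMPLE.splitlines()))
token_count = sum(len(s) for s in sentences)
```

The same generator works on an open file handle, since it only needs an iterable of lines.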
Other Data:
- hermitdave/Frequency Word List
- Fırat University - Veri Setleri (Datasets)
- Bilkent Turkish Writings Dataset
- 170k Turkish Sentences from Wikipedia
- Wiktionary:Frequency Lists - Turkish
- ooguz/Bad Word Blacklist for Turkish
- ahmetax/Turkish Stop Words List
- NLTK - Stop Words
- Tatoeba: Multilingual Sentences.
- 466k English Words.
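Frequency lists and stop-word lists are often used together, for example to find the most frequent content words. The sketch below assumes a frequency file with one `word count` pair per line separated by whitespace, and a stop-word list with one word per line. Check the actual files for their layout. The function names, words, and counts here are made up for illustration.

```python
from typing import Dict, Iterable, List, Set


def parse_frequency_list(lines: Iterable[str]) -> Dict[str, int]:
    """Parse "word count" lines into a dictionary, skipping bad lines."""
    freqs: Dict[str, int] = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # not a "word count" pair
        freqs[parts[0]] = int(parts[1])
    return freqs


def top_content_words(freqs: Dict[str, int],
                      stop_words: Set[str], n: int) -> List[str]:
    """Return the n most frequent words that are not stop words."""
    ranked = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked if word not in stop_words][:n]


# Invented sample data, not taken from any of the lists above.
SAMPLE_LINES = ["bir 500", "ve 450", "bu 300", "kitap 120", "ev 90"]
STOP_WORDS = {"bir", "ve", "bu"}

freqs = parse_frequency_list(SAMPLE_LINES)
top = top_content_words(freqs, STOP_WORDS, 2)
```

In this toy example `top` contains the two remaining content words in descending frequency order.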
Other Sources:
Other Resources
Books:
Videos:
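- BOUN - Yapay Öğrenmeye Giriş - İsmail Arı Yaz Okulu 2018 (Introduction to Machine Learning, İsmail Arı Summer School 2018)
- BOUN - Doğal Dil İşleme - İsmail Arı Yaz Okulu 2018 (Natural Language Processing, İsmail Arı Summer School 2018)
- BOUN - Konuşma / İşleme - İsmail Arı Yaz Okulu 2018 (Speech / Processing, İsmail Arı Summer School 2018)
- BOUN - Yapay Öğrenme Yaz Okulu 2020 (Machine Learning Summer School 2020)
- Açık Seminer - NLP 101 Doğal Dil İşlemeye Giriş ve Uygulamalı Metin Madenciliği (Open Seminar - NLP 101: Introduction to Natural Language Processing and Applied Text Mining)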
- Starlang Yazılım Channel
Articles:
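- Türkçe ve Doğal Dil İşleme (Turkish and Natural Language Processing)
- Türkçe Tweetler Üzerinde Otomatik Soru Tespiti (Automatic Question Detection in Turkish Tweets)
- Classification of News according to Age Groups Using NLP
- Açık Kaynak Doğal Dil İşleme Kütüphaneleri (Open-Source Natural Language Processing Libraries)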
Sample Notebooks/Snippets:
- kodiks/Turkish News Category Classification Tutorial
- ezgisubasi/Turkish Tweets Sentiment Analysis
- merveenoyan/NLP için Derlediğim Fonksiyonlar (Functions I Compiled for NLP)
Blog Posts:
Other Lists:
- ITU NLP Group - Tools and Resources : List of various tools and resources for Turkish and Turkic languages.
- Açık Veri Kaynakları (Open Data Sources) : List of Turkey's open data sources, covering official institutions, municipalities, universities, and international institutions.
- Awesome Turkish NLP : Yet another Turkish NLP list.
- Türkçe Yapay Zeka Kaynakları (Turkish AI Resources) : List of AI resources in Turkish.
Contributing
Your contributions are welcome. If you want to contribute to this list, send a pull request or open a new issue.