Farsi Dataset
- FarsTail: this dataset includes 10,367 samples, provided both in Persian and in an indexed format so that it is also useful for non-Persian researchers. If you use the dataset, please refer to the dataset's article.
- To add an entry, simply add a row to the corresponding table.
- Farsi ASR Dataset: the largest open-source Persian Automatic Speech Recognition (ASR) dataset, collected from various sources. The code used to collect the dataset is also available in the Farsi ASR Dataset GitHub repository.
- Dividing the datasets into training, validation, and test sets, and using k-fold cross-validation.
- Persian News Dataset: ideal for NLP tasks, sentiment analysis, topic modeling, and more.
- The ZabanZad.ai project seeks to fill a significant gap in the realm of Persian speech recognition, specifically the lack of production-grade text-to-speech (TTS) models designed for Persian.
- A Large-scale Dataset of Farsi License Plate Characters: this repository contains a large-scale dataset with more than 83,000 images of Farsi numbers and letters, collected from real-world license plate images captured by various cameras.
- DadmaTools dataset imports: from dadmatools.datasets import PersianTweets; from dadmatools.datasets import SnappfoodSentiment
- We also built a large-scale synthetic Persian text dataset that can be used for training and evaluating Persian scene text recognition models.
- Top picks for Persian language datasets: CC100-Persian. Created in 2020, the CC100-Persian dataset is one of the 100 corpora of monolingual data processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository.
- This repository provides access to a high-quality parallel dataset for English-to-Persian translation.
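The train/validation split with k-fold cross-validation mentioned above can be sketched in plain Python. This is only an illustration: the helper name `k_fold_indices`, the choice of k = 5, and the fixed seed are assumptions, and the FarsTail sample count is used simply as a convenient dataset size.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) index lists for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Spread the remainder so that fold sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


N_SAMPLES = 10_367  # sample count quoted for FarsTail; any size works
folds = list(k_fold_indices(N_SAMPLES, k=5))
print([len(val) for _, val in folds])
```

Each sample lands in exactly one validation fold, so every fold's model is evaluated on data it never saw during training. A separate held-out test set would be set aside before calling the helper.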
- To prepare this dataset, we captured a diverse collection of in-the-wild images tailored to the unique features of Persian script.
- Persian Handwritten Digits, GAN-Generated (28x28): contains 150,000 images. Note that Arabic and Persian numerals are similar.
- Each task has a .md file in the GitHub repository, which you can edit in Markdown.
- So we created our own data crawler for collecting data.
- Here we have compiled a collection of high-quality resources for Persian machine translation.
- hezarai/hezar: the all-in-one AI library for Persian, supporting a wide variety of tasks and modalities.
- Persian (Farsi) Wikipedia Dataset: contains all Persian articles up to 12 Mordad 1399 (2 August 2020).
- Awesome Iranian Datasets: a collective list of Iranian/Persian datasets.
- karim23657/Persian-tts-coqui: Persian/Farsi text-to-speech (TTS) training using Coqui TTS.
- This repository contains IDPL-PFOD, an image dataset of printed Farsi text for Farsi optical character recognition research.
- News categories: Economic, International, Political, Science Technology, Cultural Art, Sport, Medical.
- Despite the many applications of digit, letter, and word recognition, only a few studies have dealt with Persian scripts.
- Proposing PESTD, a complete and large-scale Persian-English dataset at the word level.
- In this study, we introduce ManaTTS, the most extensive publicly accessible single-speaker Persian corpus, and a comprehensive framework for collecting transcribed speech datasets for the Persian language.
- This paper aims to present a novel large-scale dataset, IDPL-PFOD2, tailored for Farsi printed text recognition.
- DadmaTools dataset import: from dadmatools.datasets import Peyma
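Although Arabic and Persian numerals look alike, Unicode encodes them as separate ranges (Arabic-Indic digits at U+0660 to U+0669, Extended Arabic-Indic digits used for Persian at U+06F0 to U+06F9). For text corpora, a common preprocessing step is to map one range onto the other. The following is a minimal sketch of that idea; the function name `normalize_digits` is hypothetical and none of the datasets listed here is claimed to use it.

```python
ARABIC_INDIC_ZERO = 0x0660  # first Arabic-Indic digit
PERSIAN_ZERO = 0x06F0       # first Extended Arabic-Indic (Persian) digit

# Translation table: Arabic-Indic digit code point -> Persian digit character.
_TO_PERSIAN = {
    ARABIC_INDIC_ZERO + i: chr(PERSIAN_ZERO + i) for i in range(10)
}


def normalize_digits(text):
    """Replace Arabic-Indic digits with their Persian counterparts."""
    return text.translate(_TO_PERSIAN)


arabic_45 = chr(0x0664) + chr(0x0665)
persian_45 = chr(0x06F4) + chr(0x06F5)
print(normalize_digits(arabic_45) == persian_45)
```

All other characters, including ASCII digits, pass through unchanged.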
- A comprehensive dataset for determining gender based on Persian names, enriched with English representations.
- ParsFEVER: this repository accompanies our academic paper, which describes the process of building the dataset in detail (our Persian guidelines), and provides the ParsFEVER dataset and the related tool.
- Each form contains 4 ArUco markers and a number of cells that should be filled with Persian handwritten numbers and letters.
- The size of this corpus is 20 GB, exclusively in the Persian language.
- A collection of Farsi (Persian) datasets.
- Overview: this dataset is a comprehensive collection of Persian speech data, merged from several high-quality sources to facilitate text-to-speech (TTS) research and development for the Persian language.
- A dataset of informal Persian audio and text chunks, along with a fully open processing pipeline, suitable for ASR and TTS tasks.
- ManaTTS, released under the open CC-0 license, comprises approximately 86 hours of audio with a sampling rate of 44.1 kHz.
- Diverse datasets for training and evaluating Persian models; members can share their own datasets or draw on existing resources.
- MEgooneh/awesome-Iran-datasets: Persian and Iranian datasets.
- The Persian Question Answering (PersianQA) dataset is a reading comprehension dataset built on Persian Wikipedia.
- To the best of my knowledge, this is the first publicly available Persian speech dataset with ~30 hours of clear female voice.
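To give a sense of scale for the ManaTTS figures above (about 86 hours at 44.1 kHz), the arithmetic can be worked through as follows. The bit depth and channel count are assumptions made only for the size estimate; the source states neither.

```python
HOURS = 86                # approximate duration quoted above
SAMPLE_RATE_HZ = 44_100   # 44.1 kHz, as quoted above
BYTES_PER_SAMPLE = 2      # assumption: 16-bit PCM
CHANNELS = 1              # assumption: mono audio

seconds = HOURS * 3600
total_samples = seconds * SAMPLE_RATE_HZ * CHANNELS
raw_bytes = total_samples * BYTES_PER_SAMPLE

print(f"{total_samples:,} samples, ~{raw_bytes / 1e9:.1f} GB uncompressed")
# prints "13,653,360,000 samples, ~27.3 GB uncompressed"
```

Compressed formats would take considerably less space, so the figure is an upper-bound style estimate under the stated assumptions.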
- FarsTail samples are generated from 3,539 multiple-choice questions with minimal annotator intervention, in a way similar to the SciTail dataset.
- PersianQA: the crowd-sourced dataset consists of more than 9,000 entries. Each entry is either an impossible-to-answer question or a question with one or more answers that are spans of the passage (the context) from which the questioner proposed the question.
- It combines audio recordings with their corresponding transcriptions, providing a rich resource for training and evaluating Persian TTS systems.
- naab: a ready-to-use plug-and-play corpus in Farsi. naab is the biggest cleaned and ready-to-use open-source textual corpus in Farsi.
- Unfortunately, most evaluation for this task is limited to a few domains or datasets.
- Iranian/Persian Datasets: this is the largest collection of Persian datasets available on Hugging Face.
- This is the largest Persian dataset in this field, and it is provided freely so that future researchers can use it.
- The dataset has been curated for research purposes and is suitable for training and evaluating Neural Machine Translation (NMT) models.
- This dataset is curated from a substantial real-world sample of more than 10 million records, ensuring reliable and representative data for various applications.
- Also, Arabic letters are similar to 28 of the 32 Persian letters.
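The PersianQA entry structure described above (either unanswerable, or one or more answer spans inside the context) resembles the SQuAD 2.0 layout. The sketch below assumes that layout: the field names `context`, `question`, `answers`, `text`, and `answer_start` are illustrative assumptions, not confirmed PersianQA field names, and the toy entries are made up for the example.

```python
def classify_entry(entry):
    """Return 'impossible' or 'answerable', checking that spans match the context."""
    answers = entry.get("answers", [])
    if not answers:
        return "impossible"  # no answer span exists in the passage
    context = entry["context"]
    for answer in answers:
        start = answer["answer_start"]
        end = start + len(answer["text"])
        # Every answer must be an exact span of the context.
        if context[start:end] != answer["text"]:
            raise ValueError(f"answer {answer['text']!r} is not a span at {start}")
    return "answerable"


# Toy entries (hypothetical, for illustration only).
answerable = {
    "context": "Tehran is the capital of Iran.",
    "question": "What is the capital of Iran?",
    "answers": [{"text": "Tehran", "answer_start": 0}],
}
impossible = {
    "context": "Tehran is the capital of Iran.",
    "question": "What is the population of Shiraz?",
    "answers": [],
}
print(classify_entry(answerable), classify_entry(impossible))
```

Checking spans against the context at load time catches offset errors early, which matters for right-to-left text where character offsets are easy to get wrong by eye.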
- Persian TTS dataset (female): designed for the Persian text-to-speech task.
- Datasets for Farsi (Persian) Natural Language Processing (NLP): this website aims to list datasets and tools for research and development in Farsi NLP. If you would like to add a new dataset (or edit an existing one), click the small edit button in the top-right corner of the corresponding .md file for the task.
- ParsynthOCR is a synthetic dataset for Persian OCR. This version is a preview of the original 4-million-sample dataset (ParsynthOCR-4M).
- Persian (Farsi) Handwritten Dataset.
- We develop language resources for Persian.
- By incorporating a subset of the WebGLM dataset, we gave our Persian language model the ability to provide context-based answers, further enriching its capabilities.
- mhbashari/awesome-persian-nlp-ir: a curated list of Persian natural language processing and information retrieval tools and resources.
- GPTInformal Persian is a freely licensed Persian dataset of audio and text pairs designed for speech synthesis and other speech-related tasks.
- DadmaTools usage: from dadmatools.datasets import PnSummary; farstail = FarsTail(); print(len(farstail.train))  # length of the dataset
- The datasets are supported by a fully transparent, MIT-licensed pipeline.
- Alongside ManaTTS, we also generated the VirgoolInformal dataset to evaluate Persian speech recognition models used for forced alignment, extending over 5 hours of audio.
- Words that are placed next to each other are interdependent and represent one subject.
- The ultimate dataset for training language models on classic Persian poetry.
- Arshasb Persian OCR dataset: this repository hosts the Arshasb (an ancient Iranian name, اَرشاسب) Persian OCR dataset. It contains 33,000 pages of Persian text, of which 7,000 pages have been published for free.
- The total number of articles is 16,438, spread over eight different classes.
- Repository: myousefnezhad/persianhandwriting.
- This dataset contains nearly 22k utterances in 15 different domains and 1,061 dialogues.
- Persian datasets including Wikipedia, Twitter, Hamshahri, Hellokish, NSURL'19, Peyma, and Text_mining.
- This dataset is an extension of the previously introduced IDPL-PFOD dataset, offering a substantial increase in both volume and diversity. It comprises 2,003,541 images featuring a wide variety of fonts, styles, and sizes.
- Overview: this project, titled “Open-Source Datasets and Models for Persian Text-to-Speech,” is spearheaded by ZabanZad.ai.
- The dataset has been collected, processed, and annotated as part of the Mana-TTS project.
- It consists of audio files generated using Microsoft Edge's online text-to-speech service, with text extracted from the naab textual corpus.
- In this paper, deep neural networks based on different DenseNet and Xception architectures are used, further boosted by data augmentation and test-time augmentation.
- For each image in the dataset, ground-truth information is also available, including the corresponding image file, image size, number of sub-words, and character dots.
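Test-time augmentation, mentioned above alongside DenseNet and Xception, usually means averaging a model's predictions over several slightly perturbed copies of the same input. The sketch below shows only that averaging step in plain Python. The one-pixel shifts and the toy stand-in model are assumptions chosen for illustration; they are not the augmentations or networks used in the paper.

```python
def identity(image):
    return image


def shift_right(image):
    """Shift every row one pixel to the right, padding with 0."""
    return [[0] + row[:-1] for row in image]


def shift_left(image):
    """Shift every row one pixel to the left, padding with 0."""
    return [row[1:] + [0] for row in image]


def tta_predict(model, image, augmentations):
    """Average the model's class probabilities over augmented views."""
    preds = [model(aug(image)) for aug in augmentations]
    n_classes = len(preds[0])
    return [sum(p[c] for p in preds) / len(preds) for c in range(n_classes)]


def toy_model(image):
    """Stand-in classifier: probability mass follows ink in the left half."""
    total = sum(sum(row) for row in image) or 1
    left = sum(sum(row[: len(row) // 2]) for row in image)
    p_left = left / total
    return [p_left, 1.0 - p_left]


image = [[1, 0, 0, 0],
         [0, 1, 0, 1]]
probs = tta_predict(toy_model, image, [identity, shift_right, shift_left])
print(probs)
```

Small shifts are used here rather than flips because mirroring a character image can change which character it shows.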
- It identified that, despite its cultural richness and wide usage, Persian language support ...
- Persian Last Names Dataset: a comprehensive collection of over 100,000 Persian surnames accompanied by their respective frequencies.
- Dataset description: this dataset includes Farsi content from a variety of video genres, spanning from older productions up to mid-2024 (all content with Persian audio and Persian subtitles available on Filimo up to that date), such as movies, TV series, shows, and documentaries. Utterances and sentences are extracted based on the timing of the subtitles.
- It contains about 130 GB of data, 250 million paragraphs, and 15 billion words.
- Dataset we used: there are many public datasets for English, and Common Voice is a rich free dataset, but for Persian there are not enough free STT datasets.
- DadmaTools dataset imports: from dadmatools.datasets import FarsTail; from dadmatools.datasets import PerUDT
- Repository: mallahyari/Farsi-datasets.
- The dataset contains a set of images taken from forms that can be found in assets/dataset_form_a5.
- PESTD includes 5,832 instances, including letters, digits, and symbols, in three categories: Persian, English, and Persian-English.
- A curated list of Iranian and Persian datasets. Categories: Persian Locations, Literature, Health care, Social media, Governments, News, Sports, Articles, Finance, Politics, Environmental, Photos, Movies, Musics, Culture.
- Persian News: a dataset of various news articles scraped from different online news agencies' websites.
- Machine Translation: Persian/English machine translation is one of the few tasks that has received more work in the past few years.
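Extracting utterances from subtitle timing, as described above for the video-based dataset, comes down to converting each cue's start and end time into an audio sample range. The sketch below assumes SRT-style timing lines (`HH:MM:SS,mmm --> HH:MM:SS,mmm`) and a 16 kHz sample rate; both are assumptions, since the source does not name the subtitle format or the sampling rate.

```python
import re

_TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def to_milliseconds(timestamp):
    """Convert 'HH:MM:SS,mmm' to integer milliseconds."""
    match = _TIMESTAMP.fullmatch(timestamp.strip())
    if match is None:
        raise ValueError(f"bad timestamp: {timestamp!r}")
    hours, minutes, seconds, millis = map(int, match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def cue_to_sample_range(timing_line, sample_rate):
    """Map 'start --> end' to (first_sample, end_sample) in the audio track."""
    start_text, end_text = timing_line.split("-->")
    start_ms = to_milliseconds(start_text)
    end_ms = to_milliseconds(end_text)
    if end_ms <= start_ms:
        raise ValueError("cue must end after it starts")
    # Integer arithmetic avoids floating-point rounding at the boundaries.
    return start_ms * sample_rate // 1000, end_ms * sample_rate // 1000


SAMPLE_RATE = 16_000  # assumption for the example
print(cue_to_sample_range("00:01:02,500 --> 00:01:05,000", SAMPLE_RATE))
# prints (1000000, 1040000)
```

The returned pair can be used to slice a decoded audio array, with the subtitle text serving as the transcript for that slice.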
- The SRU/PHN dataset contains 22,000 handwritten Farsi names, with samples extracted from 280 forms filled in by 280 writers of different ages, genders, and education levels.
- Created from crawled content on virgool.
- ali619/corpus-dataset-normalized-for-persian-farsi.
- Common Voice by Mozilla offers open datasets for voice recognition research and development, aiming to make technology accessible to diverse global communities.