Corpus Crawler

Corpus Crawler is a tool for Corpus Linguistics.

Modern linguistic research works on language corpora, which are large samples of “real world” text. This crawler helps to build such corpora: it follows links to publicly accessible web pages known to be written in a certain language; it removes boilerplate and HTML markup; finally, it writes its output into plaintext files. The crawler implements the Robots Exclusion Standard, and it is intentionally slow so it does not cause much load on the crawled web sites.
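
The polite-crawling behaviour described above (honouring the Robots Exclusion Standard and pausing between requests) can be sketched roughly as follows. This is a minimal illustration built on Python's standard `urllib.robotparser`, not the crawler's actual code; the user-agent name `ExampleCorpusBot`, the two-second default delay, and the helper names are assumptions made up for this example.

```python
import time
import urllib.robotparser

USER_AGENT = "ExampleCorpusBot"   # hypothetical user-agent name
DEFAULT_DELAY_SECONDS = 2.0       # hypothetical pause when robots.txt sets none


def make_robots_checker(robots_txt_lines):
    """Build a parser from the lines of a site's robots.txt."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt_lines)
    return parser


def polite_urls(parser, urls, sleep=time.sleep):
    """Yield only the URLs robots.txt allows, pausing between them."""
    # Prefer the site's own Crawl-delay, falling back to our default.
    delay = parser.crawl_delay(USER_AGENT) or DEFAULT_DELAY_SECONDS
    first = True
    for url in urls:
        if not parser.can_fetch(USER_AGENT, url):
            continue  # disallowed by the Robots Exclusion Standard
        if not first:
            sleep(delay)  # throttle so the site sees little load
        first = False
        yield url
```

A caller would fetch each yielded URL in turn; passing a custom `sleep` function makes the throttling easy to observe without actually waiting.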

This is not an official Google product. But if you’re a linguistic researcher, or if you’re writing a spell checker (or similar language-processing software) for an “exotic” language, you might find Corpus Crawler useful.

To build corpora for not-yet-supported languages, please read the contribution guidelines and send us GitHub pull requests.

The crawled corpora have been used to compute word frequencies in Unicode’s Unilex project.

Supported Languages

IETF BCP47 Code Language Tokens¹
aai Arifama-Miniafia 181K 💾
aak Ankave 194K 💾
aau Abau 313K 💾
aaz Amarasi 308K 💾
abt Ambulas 297K 💾
aby Aneme Wake 233K 💾
acd Gikyode 323K 💾
ace Aceh/Acehnese 817K 💾
acf Saint Lucian Creole French 236K 💾
ach Acoli 178K 💾
acn Achang 232K 💾
acr Achi 239K 💾
acu Achuar-Shiwiar 174K 💾
ade Adele 267K 💾
adh Adhola 166K 💾
adj Adioukrou 233K 💾
ae Avestan 129K 💾
ae-Latn Avestan (Latin) 141K 💾
aey Amele 218K 💾
agd Agarabi 256K 💾
agg Angor 214K 💾
agm Angaataha 238K 💾
agn Agutaynen 234K 💾
agr Aguaruna 149K 💾
ahk Akha 367K 💾
aia Arosi 223K 💾
akb Batak Angkola 220K 💾
ake Akawaio 190K 💾
akh Akha 408K 💾
akp Siwu 191K 💾
alj Alangan 185K 💾
alp Alune 225K 💾
alt Southern Altai 121K 💾
alz Alur 160K 💾
am Amharic 2,170K 💾
ame Yanesha' 221K 💾
amf Hamer-Banna 152K 💾
amk Ambai 229K 💾
amm Ama (Papua New Guinea) 246K 💾
amn Amanab 207K 💾
amp Alamblak 241K 💾
amr Amarakaeri 151K 💾
amu Guerrero Amuzgo 202K 💾
ann Obolo 236K 💾
anv Denya 214K 💾
aoj Mufian 217K 💾
aom Ömie 231K 💾
aon Bumbita Arapesh 294K 💾
aoz Uab Meto 197K 💾
ape Bukiyip 294K 💾
apr Arop-Lokep 373K 💾
apz Safeyoka 235K 💾
ar Arabic 19,593K 💾
arl Arabela 206K 💾
asg Cishingini 270K 💾
aso Dano 290K 💾
ata Pele-Ata 248K 💾
atb Zaiwa 291K 💾
atg Ivbie North-Okpela-Arhe 229K 💾
atq Aralle-Tabulahan 202K 💾
auy Awiyaana 164K 💾
av Avaric 111K 💾
avn Avatime 229K 💾
avt Au 263K 💾
avu Avokaya 391K 💾
awa Awadhi 211K 💾
awb Awa (Papua New Guinea) 179K 💾
ay Aymara 482K 💾
ayo Ayoreo 264K 💾
az Azerbaijani 3,413K 💾
azg San Pedro Amuzgos Amuzgo 271K 💾
azz Highland Puebla Nahuatl 265K 💾
ba Bashkir 666K 💾
ban Balinese 211K 💾
bao Waimaha 232K 💾
bav Vengo 250K 💾
bba Baatonum 792K 💾
bbb Barai 289K 💾
bbo Northern Bobo Madaré 211K 💾
bbr Girawa 245K 💾
bch Bariai 248K 💾
bcw Bana 304K 💾
bdd Bunama 171K 💾
be Belarusian 1,441K 💾
be-tarask Belarusian (Taraškievica) 108,431K 💾
bef Benabena 239K 💾
bep Besoa 204K 💾
bex Jur Modo 254K 💾
bfd Bafut 276K 💾
bfo Malba Birifor 260K 💾
bg Bulgarian 10,597K 💾
bgr Bawm Chin 213K 💾
bgz Banggai 186K 💾
bhl Bimin 324K 💾
bhw Biak 164K 💾
bi Bislama 315K 💾
bib Bissa 243K 💾
big Biangai 229K 💾
bik Central Bikol 183K 💾
bim Bimoba 215K 💾
biv Southern Birifor 221K 💾
bjr Binumarien 226K 💾
bjv Bedjond 268K 💾
bkl Berik 306K 💾
bku Buhid 204K 💾
bkv Bekwarra 244K 💾
blh Kuwaa 259K 💾
blt-Latn Tai Dam (Latin) 262K 💾
blz Balantak 199K 💾
bm Bambara 30K 💾
bmh Kein 253K 💾
bmq Bomu 207K 💾
bmr Muinane 122K 💾
bmu Somba-Siawari 234K 💾
bmv Bum 258K 💾
bn Bangla 7,258K 💾
bnj Eastern Tawbuid 239K 💾
bnp Bola 263K 💾
bo Tibetan 5,642K 💾
boa Bora 133K 💾
boj Anjam 255K 💾
bon Bine 244K 💾
bov Tuwuli 203K 💾
box Buamu 274K 💾
bpr Koronadal Blaan 204K 💾
bps Sarangani Blaan 214K 💾
bqc Boko 567K 💾
bqj Bandial 175K 💾
bqp Busa 162K 💾
bru Eastern Bru 261K 💾
bs Bosnian 8,993K 💾
bsn Barasana-Eduria 225K 💾
bss Akoose 199K 💾
btd Batak Dairi 192K 💾
bts Batak Simalungun 175K 💾
btt Bete-Bendi 266K 💾
btx Batak Karo 189K 💾
bua Buriat 143K 💾
bud Ntcham 207K 💾
buk Bugawac 264K 💾
bus Bokobaru 159K 💾
bvc Baelelea 308K 💾
bvz Bauzi 509K 💾
bwq Southern Bobo Madaré 214K 💾
bwu Buli 285K 💾
byr Baruya 182K 💾
byx Qaqet 387K 💾
bzh Mapos Buang 251K 💾
bzi Bisu 381K 💾
bzj Belize Kriol English 240K 💾
ca-valencia Valencian 24,295K 💾
caa Chortí 307K 💾
cab Garifuna 154K 💾
cac Chuj 244K 💾
cak Kaqchikel 259K 💾
cap Chipaya 154K 💾
car Galibi Carib 160K 💾
cax Chiquitano 149K 💾
cbc Carapana 256K 💾
cbi Chachi 187K 💾
cbl Bualkhaw Chin 210K 💾
cbr Cashibo-Cacataibo 236K 💾
cbs Cashinahua 198K 💾
cbt Chayahuita 150K 💾
cbv Cacua 265K 💾
cce Chopi 204K 💾
ccp Chakma 79K 💾
cdf Chiru 193K 💾
ce Chechen 669K 💾
ceb Cebuano 1,067K 💾
ceg Chamacoco 232K 💾
cfm Falam Chin 438K 💾
cgc Kagayanen 299K 💾
chj Ojitlán Chinantec 305K 💾
chm Mari 132K 💾
chr Cherokee 119K 💾
chz Ozumacín Chinantec 205K 💾
cjo Ashéninka Pajonal 141K 💾
cjp Cabécar 199K 💾
cjv Chuave 286K 💾
cko Anufo 272K 💾
cle Lealao Chinantec 313K 💾
cme Cerma 230K 💾
cmr Mro-Khimi Chin 275K 💾
cnh Hakha Chin 934K 💾
cni Asháninka 122K 💾
cnk Khumi Chin 237K 💾
cnl Lalana Chinantec 308K 💾
cnt Tepetotutla Chinantec 261K 💾
coe Koreguaje 181K 💾
cof Colorado 183K 💾
cok Santa Teresa Cora 230K 💾
con Cofán 151K 💾
cot Caquinte 128K 💾
crh Crimean Tatar 505K 💾
cs Czech 3,141K 💾
csk Jola-Kasa 177K 💾
cso Sochiapam Chinantec 328K 💾
ctd-Latn Tedim Chin (Latin) 852K 💾
ctu Chol 203K 💾
cub Cubeo 220K 💾
cuc Usila Chinantec 278K 💾
cui Cuiba 292K 💾
cuk San Blas Kuna 187K 💾
cul Culina 221K 💾
cv Chuvash 111K 💾
cwe Kwere 144K 💾
cwt Kuwaataay 168K 💾
cy Welsh 11,519K 💾
cya Nopala Chatino 245K 💾
czt Zotung Chin 227K 💾
da Danish 655K 💾
daa Dangaléat 208K 💾
dad Marik 197K 💾
dah Gwahatike 274K 💾
ddn Dendi 210K 💾
de German 46,431K 💾
ded Dedua 146K 💾
des Desano 210K 💾
dga Southern Dagaare 458K 💾
dgi Northern Dagara 257K 💾
dgz Daga 219K 💾
din Southwestern Dinka 196K 💾
dip Northeastern Dinka 193K 💾
djk Eastern Maroon Creole 307K 💾
dln Darlong 776K 💾
dnw Western Dani 254K 💾
dob Dobu 179K 💾
dop Lukpa 226K 💾
dsh Daasanach 211K 💾
dtb Labuk-Kinabatangan Kadazan 248K 💾
dtp Kadazan Dusun 1,038K 💾
dts Toro So Dogon 202K 💾
due Umiray Dumaget Agta 247K 💾
dug Duruma 172K 💾
duo Dupaninan Agta 266K 💾
dwr Dawro 254K 💾
dww Dawawa 208K 💾
dyi Djimini Senoufo 268K 💾
dyo Jola-Fonyi 158K 💾
dyu Dyula 1,156K 💾
dz Dzongkha 61K 💾
ee Ewe 421K 💾
eka Ekajuk 213K 💾
el Greek 5,470K 💾
emi Mussau-Emira 176K 💾
emp Northern Emberá 158K 💾
enb Markweeta 147K 💾
enq Enga 217K 💾
enx Enxet 772K 💾
eri Ogea 269K 💾
es Spanish 32,670K 💾
ese Ese Ejja 226K 💾
et Estonian 3,658K 💾
eu Basque 130K 💾
ewo Ewondo 158K 💾
eza Ezaa 963K 💾
fa Persian 9,114K 💾
fa-AF Dari 7,363K 💾
faa Fasu 238K 💾
fai Faiwol 256K 💾
fal South Fali 198K 💾
far Fataleka 286K 💾
fi Finnish 4,837K 💾
fil Tagalog 184K 💾
fip Fipa 134K 💾
fit Tornedalen Finnish 292K 💾
fj Fijian 257K 💾
fo Faroese 851K 💾
fon Fon 266K 💾
for Fore 169K 💾
fr French 5,488K 💾
fue Borgu Fulfulde 148K 💾
fuf Pular 174K 💾
fuq Central-Eastern Niger Fulfulde 156K 💾
fuv Nigerian Fulfulde 13K 💾
ga Irish 7,587K 💾
gag Gagauz 245K 💾
gah Alekano 210K 💾
gam Kandawo 250K 💾
gaw Nobonob 246K 💾
gbi Galela 288K 💾
gd Scottish Gaelic 17,105K 💾
gde Gude 217K 💾
gdn Umanakaina 306K 💾
gdr Wipi 271K 💾
gej Gen 236K 💾
gfk Patpatar 294K 💾
ghs Guhu-Samane 186K 💾
gil Gilbertese 228K 💾
gkn Gokana 267K 💾
gmv-Latn Gamo (Latin) 127K 💾
gn Guarani 142K 💾
gnd Zulgo-Gemzek 364K 💾
gng Ngangam 219K 💾
gnw Western Bolivian Guaraní 263K 💾
gof Gofa 124K 💾
gog Gogo 173K 💾
gor Gorontalo 211K 💾
gqr Gor 218K 💾
grb Northern Grebo 270K 💾
grt Garo 141K 💾
gso Southwest Gbaya 228K 💾
gsw-u-sd-chag Swiss German (Aargau) 99K 💾
gsw-u-sd-chbe Swiss German (Bern) 73K 💾
gsw-u-sd-chfr Swiss German (Fribourg) 42K 💾
gu Gujarati 702K 💾
gub Guajajára 997K 💾
guc Wayuu 211K 💾
gud Yocoboué Dida 216K 💾
guh Guahibo 204K 💾
gui Eastern Bolivian Guaraní 197K 💾
gum Guambiano 186K 💾
gun Mbyá Guaraní 176K 💾
guo Guayabero 203K 💾
guq Aché 184K 💾
gur Farefare 240K 💾
gux Gourmanchéma 215K 💾
gv Manx Gaelic 152K 💾
gvc Guanano 241K 💾
gvf Golin 276K 💾
gvl Gulay 270K 💾
gwr Gwere 157K 💾
gym Ngäbere 294K 💾
gyr Guarayu 176K 💾
ha Hausa 1,775K 💾
hae Eastern Oromo 163K 💾
hag Hanga 202K 💾
haw Hawaiian 2,221K 💾
hay Haya 112K 💾
heh Hehe 136K 💾
hi Hindi 10,004K 💾
hif Fiji Hindi 204K 💾
hig Kamwe 261K 💾
hil Hiligaynon 208K 💾
hla Halia 273K 💾
hne Chhattisgarhi 207K 💾
hnn Hanunoo 212K 💾
hns Caribbean Hindustani 312K 💾
ho Hiri Motu 240K 💾
hot Hote 222K 💾
hr Croatian 8,188K 💾
ht Haitian 1,101K 💾
hto Minica Huitoto 182K 💾
hu Hungarian 600K 💾
hub Huambisa 160K 💾
hui Huli 232K 💾
hus Huastec 236K 💾
huu Murui Huitoto 165K 💾
huv San Mateo Del Mar Huave 197K 💾
hvn Sabu 312K 💾
hy Armenian 25,972K 💾
ian Iatmul 224K 💾
iba Iban 179K 💾
icr Islander Creole English 248K 💾
id Indonesian 6,634K 💾
ifa Amganad Ifugao 810K 💾
ifb Batad Ifugao 835K 💾
ife Ifè 300K 💾
ifk Tuwali Ifugao 214K 💾
ifu Mayoyao Ifugao 258K 💾
ify Keley-I Kallahan 863K 💾
ig Igbo 13K 💾
ign Ignaciano 161K 💾
ik Inupiaq 96K 💾
ilo Iloko 169K 💾
imo Imbongu 280K 💾
inb Inga 151K 💾
ino Inoke-Yate 236K 💾
iou Tuma-Irumu 225K 💾
ipi Ipili 312K 💾
iri Irigwe 243K 💾
irk Iraqw 184K 💾
iry Iraya 205K 💾
it Italian 13,569K 💾
itv Itawit 242K 💾
iu Inuktitut 98K 💾
iws Sepik Iwam 307K 💾
izr Izere 216K 💾
izz Izii 908K 💾
ja Japanese 2,116K 💾
jac Popti' 221K 💾
jae Yabem 186K 💾
jam Jamaican Creole English 254K 💾
jbu Jukun Takum 264K 💾
jic Tol 285K 💾
jiv Shuar 134K 💾
jmc Machame 150K 💾
jun Juang 178K 💾
jv Javanese 177K 💾
jvn Caribbean Javanese 211K 💾
ka Georgian 4,978K 💾
kaa Kara-Kalpak 135K 💾
kab-Arab Kabyle (Arabic) 715K 💾
kab-Tfng Kabyle (Tifinagh) 1,338K 💾
kab Kabyle 66K 💾
kac Kachin 1,057K 💾
kao Xaasongaxango 205K 💾
kaq Capanahua 164K 💾
kbh Camsá 193K 💾
kbm Iwal 298K 💾
kbp Kabiyè 571K 💾
kbq Kamano 156K 💾
kbr Kafa 147K 💾
kcg Tyap 279K 💾
kdc Kutu 140K 💾
kdi Kumam 195K 💾
kdj Karamojong 163K 💾
kdn Kunda 144K 💾
kek Kekchí 406K 💾
ken Kenyang 200K 💾
keo Kakwa 215K 💾
ker Kera 267K 💾
kew West Kewa 247K 💾
kez Kukele 173K 💾
kgf Kube 175K 💾
kgr Abun 356K 💾
khz Keapara 196K 💾
kia Kim 525K 💾
kij Kilivila 155K 💾
kj Kuanyama 1,474K 💾
kjb Q'anjob'al 263K 💾
kje Kisar 235K 💾
kjh Khakas 128K 💾
kjs East Kewa 251K 💾
kk Kazakh 642K 💾
kki Kagulu 125K 💾
kkj Kako 263K 💾
kln Kalenjin 149K 💾
km Khmer 29,110K 💾
kma Konni 230K 💾
kmg Kâte 127K 💾
kmo Kwoma 213K 💾
kms Kamasau 293K 💾
kmu Kanite 214K 💾
kn Kannada 126K 💾
kne Kankanaey 230K 💾
knf Mankanya 164K 💾
knj Western Kanjobal 1,350K 💾
knk Kuranko 228K 💾
kno Kono 360K 💾
knv Tabo 243K 💾
kog Cogui 189K 💾
kpf Komba 174K 💾
kpg Kapingamarangi 967K 💾
kpr Korafe-Yegha 262K 💾
kpw Kobon 288K 💾
kpx Mountain Koiali 190K 💾
kpz Kupsabiny 166K 💾
kqc Doromu-Koki 209K 💾
kqe Kalagan 241K 💾
kqp Kimré 254K 💾
kqw Kandas 201K 💾
kqy Koorete 156K 💾
krc Karachay-Balkar 132K 💾
kri Krio 256K 💾
krj Kinaray-A 228K 💾
kru Kurukh 182K 💾
ksd Kuanua 228K 💾
ksr Borong 233K 💾
ktb Kambaata 113K 💾
ktj Plapo Krumen 356K 💾
kto Kuot 286K 💾
ku Kurdish 2,479K 💾
kub Kutep 281K 💾
kud ‘Auhelawa 167K 💾
kue Kuman (Papua New Guinea) 230K 💾
kum Kumyk 142K 💾
kup Kunimaipa 279K 💾
kus Kusaal 200K 💾
kv Komi 122K 💾
kvn Border Kuna 212K 💾
kwf Kwara'ae 296K 💾
kwi Awa-Cuaiquer 165K 💾
kwj Kwanga 290K 💾
kxc Konso 148K 💾
kxm Northern Khmer 257K 💾
ky Kyrgyz 18,597K 💾
kyc Kyaka 220K 💾
kyf Kouya 215K 💾
kyg Keyagana 190K 💾
kyq Kenga 250K 💾
kyu Western Kayah 466K 💾
kyz Kayabí 324K 💾
kze Kosena 164K 💾
kzf Da'a Kaili 213K 💾
kzj Coastal Kadazan 215K 💾
la Latin 48K 💾
laj Lango 175K 💾
las Lama 235K 💾
law Lauje 262K 💾
lb Luxembourgish 5,173K 💾
lcm Tungag 239K 💾
lee Lyélé 257K 💾
lef Lelemi 211K 💾
lem Nomaande 249K 💾
leu Kara (Papua New Guinea) 255K 💾
lew Ledo Kaili 198K 💾
lex Luang 271K 💾
lgg Lugbara 188K 💾
lhu Lahu 352K 💾
lia West-Central Limba 247K 💾
lid Nyindrou 308K 💾
lif Limbu 138K 💾
lip Sekpele 214K 💾
lis Lisu 304K 💾
ljp Lampung Api 188K 💾
lln Lele 291K 💾
lme Pévé 245K 💾
lmk Lamkang 217K 💾
lnd Lundayeh 670K 💾
lo Lao 4,384K 💾
lob Lobi 192K 💾
loe Saluan 220K 💾
lok Loko 264K 💾
lon Malawi Lomwe 137K 💾
lsi Lashi 1,077K 💾
lsm Saamia 156K 💾
lt Lithuanian 39,575K 💾
luc Aringa 242K 💾
lus Lushai 204K 💾
lv Latvian 1,020K 💾
lwo Luwo 255K 💾
maa San Jerónimo Tecóatl Mazatec 487K 💾
mad Madurese 706K 💾
mag Magahi 193K 💾
mai Maithili 211K 💾
maj Jalapa De Díaz Mazatec 188K 💾
mak Makasar 179K 💾
mam Mam 834K 💾
maw Mampruli 251K 💾
maz Central Mazahua 286K 💾
mbb Western Bukidnon Manobo 278K 💾
mbc Macushi 221K 💾
mbh Mangseng 321K 💾
mbt Matigsalug Manobo 226K 💾
mca Maca 208K 💾
mcb Machiguenga 132K 💾
mcd Sharanahua 200K 💾
mco Coatlán Mixe 217K 💾
mcp Makaa 237K 💾
mcq Ese 158K 💾
mcu Cameroon Mambila 260K 💾
mda Mada 312K 💾
mdy Male 589K 💾
med Melpa 283K 💾
mee Mengen 301K 💾
mej Meyah 323K 💾
mek Mekeo 234K 💾
men Mende 210K 💾
meq Merey 291K 💾
meu Motu 175K 💾
mfe Morisyen 172K 💾
mfh Matal 238K 💾
mfi Wandala 265K 💾
mfk North Mofu 248K 💾
mfq Moba 232K 💾
mfy Mayo 167K 💾
mfz Mabaan 237K 💾
mg Malagasy 1,623K 💾
mgd Moru 192K 💾
mgh Makhuwa-Meetto 150K 💾
mgo Meta' 251K 💾
mh Marshallese 750K 💾
mhi Ma'di 192K 💾
mhl Mauwake 235K 💾
mhx Maru 291K 💾
mhy Ma'anyan 190K 💾
mi Maori 1,504K 💾
mib Atatláhuca Mixtec 263K 💾
mif Mofu-Gudur 283K 💾
mil Peñoles Mixtec 365K 💾
min Minangkabau 242K 💾
mio Pinotepa Nacional Mixtec 288K 💾
miq Mískito 214K 💾
mit Southern Puebla Mixtec 273K 💾
mk Macedonian 10,422K 💾
mkl Mokole 230K 💾
ml Malayalam 118K 💾
mlh Mape 235K 💾
mlp Bargam 297K 💾
mmo Mangga Buang 269K 💾
mmx Madak 271K 💾
mna Mbula 257K 💾
mnb Muna 151K 💾
mnf Mundani 241K 💾
mnw Mon 1,836K 💾
moa Mwan 308K 💾
mog Mongondow 220K 💾
mop Mopán Maya 296K 💾
mor Moro 152K 💾
mox Molima 222K 💾
mpg Marba 210K 💾
mpm Yosondúa Mixtec 336K 💾
mps Dadibi 1,270K 💾
mpt Mian 256K 💾
mpx Misima-Panaeati 227K 💾
mqb Mbuko 302K 💾
mqj Mamasa 164K 💾
mqn Moronene 164K 💾
mr Marathi 16,594K 💾
mrw Maranao 912K 💾
ms Malay 659K 💾
msm Agusan Manobo 225K 💾
msy Aruamu 229K 💾
mt Maltese 3,331K 💾
mta Cotabato Manobo 262K 💾
mti Maiwa (Papua New Guinea) 166K 💾
mtj Moskona 321K 💾
mto Totontepec Mixe 233K 💾
mtp Wichí Lhamtés Nocten 183K 💾
muh Mündü 392K 💾
mur Murle 210K 💾
mux Bo-Ung 363K 💾
muy Muyang 265K 💾
mva Manam 231K 💾
mvp Duri 174K 💾
mwv Mentawai 141K 💾
mxb Tezoatlán Mixtec 281K 💾
mxt Jamiltepec Mixtec 267K 💾
my Burmese 1,007K 💾
my-t-d0-zawgyi Burmese (Zawgyi encoding) 593K 💾
myb Mbay 192K 💾
myk Mamara Senoufo 272K 💾
myv Erzya 143K 💾
myw Muyuw 150K 💾
myx Masaaba 164K 💾
myy Macuna 245K 💾
mza Santa María Zacatepec Mixtec 316K 💾
mzi Ixcatlán Mazatec 190K 💾
mzk Nigeria Mambila 283K 💾
mzm Mumuye 265K 💾
naf Nabak 220K 💾
nak Nakanai 333K 💾
nan-Latn Min Nan Chinese (Latin) 231K 💾
nas Naasioi 168K 💾
nca Iyo 203K 💾
nch Central Huasteca Nahuatl 195K 💾
ncj Northern Puebla Nahuatl 164K 💾
ncu Chumburung 312K 💾
ndj Ndamba 141K 💾
ndy Lutos 216K 💾
ndz Ndogo 350K 💾
neb Toura 326K 💾
new Newari 150K 💾
nfr Nafaanra 233K 💾
ngp Ngulu 149K 💾
nho Takuu 309K 💾
nhu Noone 270K 💾
nhw Western Huasteca Nahuatl 194K 💾
nhy Northern Oaxaca Nahuatl 185K 💾
nia Nias 182K 💾
nii Nii 316K 💾
nij Ngaju 194K 💾
nim Nilamba 117K 💾
nin Ninzo 267K 💾
nkf Inpui Naga 197K 💾
nko Nkonya 168K 💾
nl Dutch 58,357K 💾
nlc Nalca 241K 💾
nmz Nawdm 209K 💾
nnb Nande 127K 💾
nnq Ngindo 137K 💾
nnw Southern Nuni 291K 💾
noa Woun Meu 275K 💾
nog Nogai 104K 💾
nop Numanggang 183K 💾
not Nomatsiguenga 141K 💾
nou Ewage-Notu 266K 💾
npl Southeastern Puebla Nahuatl 148K 💾
npy Napu 192K 💾
nsn Nehan 248K 💾
nsu Sierra Negra Nahuatl 170K 💾
ntm Nateni 229K 💾
ntp Northern Tepehuan 173K 💾
ntr Delo 272K 💾
nuj Nyole 151K 💾
nus Nuer 195K 💾
nvm Namiae 290K 💾
nwb Nyabwa 316K 💾
nwi Southwest Tanna 230K 💾
ny Nyanja 356K 💾
nyf Giryama 169K 💾
nyn Nyankole 120K 💾
nyo Nyoro 120K 💾
nyy Nyakyusa-Ngonde 138K 💾
nzi Nzima 201K 💾
obo Obo Manobo 266K 💾
oc Occitan 2,706K 💾
oku Oku 239K 💾
okv Orokaiva 212K 💾
old Mochi 151K 💾
ong Olo 284K 💾
opm Oksapmin 332K 💾
or Oriya 175K 💾
os Ossetic 135K 💾
osa Osage 3K 💾
otd Ot Danum 187K 💾
ote Mezquital Otomi 251K 💾
ozm Koonzime 267K 💾
pa Punjabi 59,990K 💾
pab Parecís 156K 💾
pad Paumarí 242K 💾
pag Pangasinan 177K 💾
pah Tenharim 268K 💾
pam Pampanga 196K 💾
pau Palauan 255K 💾
pbc Patamona 181K 💾
pbi Parkwa 272K 💾
pck Paite Chin 770K 💾
pcm Nigerian Pidgin 315K 💾
pez Eastern Penan 235K 💾
pib Yine 114K 💾
pir Piratapuyo 229K 💾
pis Pijin 263K 💾
pjt Pitjantjatjara 237K 💾
pkb Pokomo 166K 💾
pl Polish 7,148K 💾
plw Brooke's Point Palawano 203K 💾
pmf Pamona 307K 💾
pny Pinyin 247K 💾
poh Poqomchi' 266K 💾
poi Highland Popoluca 179K 💾
poy Pogolo 147K 💾
ppk Uma 220K 💾
ppo Folopa 258K 💾
prf Paranan 203K 💾
prk Parauk 1,026K 💾
ps Pashto 7,343K 💾
pss Kaulong 326K 💾
pt Portuguese 20,891K 💾
pt-PT Portuguese (Portugal) 666K 💾
ptp Patep 294K 💾
ptu Bambam 194K 💾
pwg Gapapaiwa 208K 💾
pww Pwo Northern Karen 345K 💾
pxm Quetzaltepec Mixé 720K 💾
qu Quechua 580K 💾
qub Huallaga Huánuco Quechua 122K 💾
quc K'iche' 207K 💾
quf Lambayeque Quechua 161K 💾
quh South Bolivian Quechua 623K 💾
qul North Bolivian Quechua 140K 💾
qup Southern Pastaza Quechua 177K 💾
quw Tena Lowland Quichua 116K 💾
quy Ayacucho Quechua 106K 💾
qvc Cajamarca Quechua 166K 💾
qve Eastern Apurímac Quechua 168K 💾
qvi Imbabura Highland Quichua 146K 💾
qvm Margos-Yarowilca-Lauricocha Quechua 132K 💾
qvn North Junín Quechua 139K 💾
qvo Napo Lowland Quechua 117K 💾
qvs San Martín Quechua 153K 💾
qvw Huaylla Wanca Quechua 111K 💾
qvz Northern Pastaza Quichua 157K 💾
qwh Huaylas Ancash Quechua 128K 💾
qxh Panao Huánuco Quechua 123K 💾
qxl Salasaca Highland Quichua 127K 💾
qxn Northern Conchucos Ancash Quechua 150K 💾
qxo Southern Conchucos Ancash Quechua 136K 💾
qxr Cañar Highland Quichua 509K 💾
rai Ramoaaina 273K 💾
raj Malvi 198K 💾
rav Sampang 138K 💾
rej Rejang 178K 💾
rim Nyaturu 151K 💾
rm-puter Romansh (Puter) 1,068K 💾
rm-rumgr Romansh (Grischun) 4,794K 💾
rm-surmiran Romansh (Surmiran) 2,540K 💾
rm-sursilv Romansh (Sursilvan) 11,678K 💾
rm-sutsilv Romansh (Sutsilvan) 1,007K 💾
rm-vallader Romansh (Vallader) 5,560K 💾
rmc Carpathian Romani 170K 💾
rmo Sinte Romani 228K 💾
rn Rundi 120K 💾
rnl Ranglong 221K 💾
ro Romanian 13,962K 💾
ro-MD Moldavian 2,694K 💾
rom Vlax Romani 186K 💾
roo Rotokas 292K 💾
rro Waima 177K 💾
ru Russian 40,987K 💾
ruf Luguru 135K 💾
rug Roviana 956K 💾
rw Kinyarwanda 605K 💾
rwo Rawa 261K 💾
sab Buglere 405K 💾
sah Sakha 2,457K 💾
sas Sasak 196K 💾
sat Santali 149K 💾
sba Ngambay 246K 💾
sbl Botolan Sambal 251K 💾
sck Sadri 189K 💾
sda Toraja-Sa'dan 154K 💾
seh Sena 155K 💾
sey Secoya 163K 💾
sg Sango 265K 💾
sgb Mag-antsi Ayta 233K 💾
sgw Sebat Bet Gurage 116K 💾
sgz Sursurunga 327K 💾
shk Shilluk 189K 💾
shn Shan 1,435K 💾
shp Shipibo-Conibo 169K 💾
si Sinhala 1,046K 💾
sig Paasaal 277K 💾
sil Tumulung Sisaala 256K 💾
sim Mende (Papua New Guinea) 273K 💾
sja Epena 194K 💾
sk Slovak 70,933K 💾
sl Slovenian 10,975K 💾
sld Sissala 206K 💾
sll Salt-Yui 264K 💾
sm Samoan 248K 💾
smt Simte 177K 💾
sn Shona 2,542K 💾
snc Sinaugoro 216K 💾
snn Siona 222K 💾
snp Siane 237K 💾
snw Selee 212K 💾
sny Saniyo-Hiyewe 348K 💾
so Somali 874K 💾
soq Kanasi 213K 💾
soy Miyobe 205K 💾
spl Selepet 244K 💾
spp Supyire Senoufo 251K 💾
sps Saposa 324K 💾
sq Albanian 10,104K 💾
sr Serbian 4,785K 💾
sr-Latn Serbian (Latin) 10,143K 💾
sri Siriano 166K 💾
srm Saramaccan 369K 💾
srn Sranan Tongo 232K 💾
ssd Siroi 210K 💾
ssg Seimat 221K 💾
ssx Samberigi 233K 💾
stn Owa 263K 💾
su Sundanese 172K 💾
sua Sulka 458K 💾
sue Suena 227K 💾
sur Mwaghavul 261K 💾
sus Susu 205K 💾
suz Sunwar 732K 💾
sv Swedish 33,633K 💾
sw Swahili 8,817K 💾
swp Suau 175K 💾
sxn Sangir 209K 💾
ta Tamil 1,413K 💾
tab Tabassaran 132K 💾
taj Eastern Tamang 169K 💾
tap Taabwa 145K 💾
taq Tamasheq 218K 💾
tav Tatuyo 256K 💾
taw Tai 268K 💾
tbc Takia 278K 💾
tbg North Tairora 235K 💾
tbo Tawala 198K 💾
tby Tabaru 226K 💾
tbz Ditammari 692K 💾
tca Ticuna 251K 💾
tcc Datooga 135K 💾
te Telugu 574K 💾
ted Tepo Krumen 346K 💾
tem Timne 190K 💾
teo Teso 118K 💾
ter Tereno 187K 💾
tfr Teribe 228K 💾
tgo Sudest 216K 💾
tgp Tangoa 228K 💾
thk Tharaka 150K 💾
ti Tigrinya 803K 💾
tif Tifal 413K 💾
tih Timugon Murut 879K 💾
tik Tikar 264K 💾
tim Timbe 206K 💾
tk Turkmen 516K 💾
tlb Tobelo 209K 💾
tlf Telefol 422K 💾
tlj Talinga-Bwisi 159K 💾
tmc Tumak 245K 💾
tna Tacana 216K 💾
tnr Ménik 254K 💾
to Tonga 1,214K 💾
tob Toba 229K 💾
toc Coyutla Totonac 218K 💾
toh Gitonga 194K 💾
top Papantla Totonac 168K 💾
tos Highland Totonac 224K 💾
tpi Tok Pisin 8,049K 💾
tpm Tampulma 892K 💾
tpp Pisaflores Tepehua 162K 💾
tpt Tlachichilco Tepehua 173K 💾
tpz Tinputz 370K 💾
tqo Toaripi 215K 💾
tr Turkish 13,846K 💾
trs Chicahuaxtla Triqui 287K 💾
tsz Purepecha 129K 💾
tt Tatar 1,356K 💾
ttc Tektiteko 231K 💾
tte Bwanabwana 198K 💾
tue Tuyuca 141K 💾
tuf Central Tunebo 237K 💾
twb Western Tawbuid 198K 💾
twu Termanu 242K 💾
txa Tombonuo 224K 💾
txu Kayapó 354K 💾
tyv Tuvinian 614K 💾
tyz Tày 260K 💾
tzh Tzeltal 901K 💾
tzj Tz'utujil 245K 💾
ubr Ubir 222K 💾
ubu Umbu-Ungu 308K 💾
udm Udmurt 135K 💾
udu Uduk 287K 💾
ug Uyghur 9,493K 💾
uk Ukrainian 12,921K 💾
ur Urdu 3,622K 💾
ura Urarina 193K 💾
urb Urubú-Kaapor 347K 💾
urk Urak Lawoi' 368K 💾
ury Orya 301K 💾
usa Usarufa 171K 💾
usp Uspanteco 228K 💾
uvl Lote 277K 💾
uz Uzbek 131K 💾
vag Vagla 221K 💾
vec Venetian 2K 💾
vec-u-sd-itpd Venetian (Padua) 813K 💾
vec-u-sd-itts Venetian (Trieste) 12K 💾
vec-u-sd-itvr Venetian (Verona) 16K 💾
vid Vidunda 151K 💾
viv Iduna 220K 💾
vmw Makhuwa 130K 💾
vun Vunjo 141K 💾
vut Vute 206K 💾
waj Waffa 236K 💾
wap Wapishana 193K 💾
war Waray 208K 💾
way Wayana 143K 💾
wer Weri 209K 💾
wiu Wiru 232K 💾
wlx Wali 847K 💾
wmw Mwani 139K 💾
wnc Wantoat 238K 💾
wnu Usan 234K 💾
wob Wè Northern 270K 💾
wos Hanga Hundi 264K 💾
wrs Waris 213K 💾
wsk Waskia 239K 💾
wuv Wuvulu-Aua 187K 💾
wwa Waama 239K 💾
xal Kalmyk 135K 💾
xav Xavánte 440K 💾
xed Hdi 229K 💾
xla Kamula 230K 💾
xog Soga 127K 💾
xrb Eastern Karaboro 286K 💾
xsb Sambal 244K 💾
xsi Sio 319K 💾
xsm Kasem 604K 💾
xsr Sherpa 184K 💾
xsu Sanumá 408K 💾
xtd Diuxi-Tilantongo Mixtec 277K 💾
xtm Magdalena Peñasco Mixtec 335K 💾
xuo Kuo 306K 💾
yaa Yaminahua 204K 💾
yad Yagua 142K 💾
yal Yalunka 203K 💾
yam Yamba 277K 💾
yaz Lokaa 222K 💾
yby Yaweyuha 219K 💾
ycn Yucuna 202K 💾
yle Yele 298K 💾
yli Angguruk Yali 221K 💾
yml Iamalele 245K 💾
yo Yoruba 270K 💾
yon Yongkom 202K 💾
yrb Yareba 184K 💾
yre Yaouré 285K 💾
yss Yessan-Mayo 227K 💾
yua Yucateco 813K 💾
yuj Karkar-Yuri 258K 💾
yut Yopno 227K 💾
yuw Yau (Morobe Province) 243K 💾
yva Yawa 250K 💾
zaa Sierra de Juárez Zapotec 265K 💾
zad Cajonos Zapotec 180K 💾
zae Yareni Zapotec 248K 💾
zap Zapotec 194K 💾
zas Santo Domingo Albarradas Zapotec 184K 💾
zaw Mitla Zapotec 157K 💾
zca Coatecas Altas Zapotec 236K 💾
zia Zia 242K 💾
ziw Zigula 140K 💾
zlm Malay 664K 💾
zne Zande 253K 💾
zpc Choapan Zapotec 208K 💾
zpi Santa María Quiegolani Zapotec 209K 💾
zpq Zoogocho Zapotec 208K 💾
zpt San Vicente Coatlán Zapotec 229K 💾
zpz Texmelucan Zapotec 281K 💾
zyp Zyphe Chin 230K 💾

¹ Downloadable files include counts for each token; to get raw text, run the crawler yourself. For breaking text into words, we use an ICU word break iterator and count all tokens whose break status is one of UBRK_WORD_LETTER, UBRK_WORD_KANA, or UBRK_WORD_IDEO.
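
To give a feel for what "counts for each token" means, here is a rough sketch of a per-token frequency count. It is only an approximation: instead of an ICU word break iterator it uses a standard-library regular expression that keeps runs of letters and skips numbers and punctuation, which loosely mirrors counting only letter-like break statuses. It will not segment Kana or ideographic text the way ICU does, and it splits tokens at combining marks that NFC normalization cannot compose. The function name `count_tokens` is made up for this example.

```python
import re
import unicodedata
from collections import Counter

# Stand-in for ICU word breaking (an assumption of this sketch):
# a token is a maximal run of word characters that are neither
# decimal digits nor underscores, i.e. mostly letters.
LETTER_RUN = re.compile(r"[^\W\d_]+")


def count_tokens(text):
    """Return a Counter mapping each letter token to its frequency."""
    # Compose base letters with their accents where Unicode allows it,
    # so that an accented letter is not split off from its word.
    normalized = unicodedata.normalize("NFC", text)
    # Case is preserved; the source does not say whether tokens are folded.
    return Counter(LETTER_RUN.findall(normalized))
```

For example, `count_tokens("the cat and the hat, 42 cats")` counts `the` twice and ignores `42`.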

Running the Crawler

./corpuscrawler --language=yo --output=./corpus