Wikidata:Lexicographical data/Documentation/Languages/sk
Slovak (Q9058) is the national language of Slovakia (Q214).
This page describes in detail how the Slovak lexicon maps to Wikidata lexemes. It is both a guide for Wikidata editors and documentation for data consumers.
Sources
Standard Slovak (spisovná slovenčina = literary Slovak) is codified in four documents published by the Ľudovít Štúr Institute of Linguistics (Q2368451), the language regulator (Q2093358) for the Slovak language, and Matica slovenská (Q763567):
- Pravidlá slovenského pravopisu (4th edition) (Q107406453) (PDF, dictionary)
- Krátky slovník slovenského jazyka (5th edition) (Q107406568) (2003 edition available online: introduction, dictionary)
- Pravidlá slovenskej výslovnosti (2nd edition) (Q107409937)
- Morfológia slovenského jazyka (Q107406442) (online)
Note that Wikidata also carries lexemes for non-standard Slovak, including regional dialects. JÚĽŠ SAV itself publishes a number of dictionaries that go beyond the scope of standard Slovak.
Information in English about the Slovak language is available from slovake.eu, coauthored by the Ľudovít Štúr Institute of Linguistics (Q2368451), and from Wikipedia's Slovak language entry.
Notability
There is a proposal for common lexeme notability rules. This section focuses on the specifics of the Slovak language. It is usually obvious what is and what is not a Slovak word or phrase. Some words and phrases are, however, better represented by items than by lexemes. The rules below clarify ambiguous cases.
Included
- Slovak geographic names (e.g. Bratislava (L245474)) are included as proper noun (Q147276), an instance of toponym (Q7884789), with item for this sense (P5137) pointing to the relevant item.
- Slovak given names (e.g. Peter (L245710)) are included as proper noun (Q147276) with sense linking to corresponding item (e.g. Peter (Q2793400)) that is an instance of male given name (Q12308941) or female given name (Q11879590).
- Letters (e.g. č (L401045)) are included. Although there is a Slovak alphabet (Q1187758) item, letters as language-specific lexemes make it easier to add pronunciation (e.g. kv and kvé for q) and perhaps other properties. See discussion about letters.
- Forms implied by inflection pattern are included as long as the inflection pattern is clear from attestable forms of the lexeme. These forms don't have to be attested on their own.
- A word from another language is included (1) if its spelling changes in Slovak (e.g. softvér (L403023)), (2) if it is inflected in Slovak (e.g. drink (L449254)), or (3) if it is commonly used in Slovak sentences (e.g. country (L412308) music).
Excluded
- Full names of organizations and notable persons are added as Slovak label or AKA to the relevant item (e.g. Slovakia (Q214), Ľudovít Štúr (Q315222)).
- Abbreviations of organization names (e.g. OSN) are added as short name (P1813) to the relevant item (e.g. United Nations (Q1065)).
- Surnames (e.g. Dzurinda (Q69994040)) are currently represented by items that are instances of family name (Q101352). This is common practice on Wikidata across languages, although Slovak language might need an exception in the future due to complex inflection of surnames.
- Numbers (e.g. 3 (Q201)) and Roman numerals (e.g. Ⅸ (Q3594836)) are represented by items.
- If a well-known paradigm defect (e.g. plurale tantum (Q138246), absolute adjective (Q332375)) explains absence of some forms (in a large enough corpus), these forms are excluded.
- Reflexive verbs with no change in meaning (umyť si ruky = umyť (L402796) sebe ruky) are excluded.
Lexeme granularity
Lexeme granularity is based on the distinction between inflection and derivation on the morphological level and between homonymy and polysemy on the semantic level. Derivation generates lexemes while inflection generates forms. Homonymy generates lexemes while polysemy generates senses. The literature is clear about which morphological phenomena constitute inflection or derivation, but the definition of homonym varies between authors and applications. Since most definitions of homonym take into account relative similarity of meaning, there is always a gray zone of ambiguous cases.
Borderline derivation methods that generate lexemes (obvious cases are not listed):
- gender inflection (Q1124523) (e.g. doktor (L249558) vs. doktorka (L403085))
- diminutive (Q108709) (e.g. vozík (L403679)) and augmentative (Q1358239) (e.g. chlapisko (L525262))
- reflexive verb (Q13475484) (e.g. hrať sa (L401977) vs. hrať (L245558)).
- grammatical aspect (Q208084) (e.g. prísť (L245478) vs. prichádzať (L247543))
Semantic phenomena that generate lexemes:
- different lexical category (e.g. naj (L402152), naj (L402151), naj (L402177))
- clearly different etymology (e.g. džin (L472227) vs. džin (L492801))
- participle (Q814722) or verbal noun (Q1350145) that acquires non-trivial meaning (e.g. veliaci (L525141), pečený (L402846), čítanie (L250034))
- different referent of pronoun (e.g. ty (L238338) vs. on (L238291), tvoj (L245536) vs. jeho (L238307))
Morphological and semantic phenomena that generate forms or senses:
- the obvious cases: grammatical number (Q104083), case (Q128234), grammatical person (Q690940), grammatical tense (Q177691), grammatical mood (Q184932)
- comparison (Q577714) of adjectives and adverbs
- participle (Q814722) and verbal noun (Q1350145) of verbs (unless they acquire secondary senses)
- negated verbs (e.g. nehrať for hrať (L245558))
- spelling variations (e.g. hm (L477122)).
- uninflected variation of inflected lexeme (e.g. foto (L400900))
- irregular inflection, especially of pronouns (e.g. mnou and mne of ja (L238305))
- ambiguous grammatical aspect (Q208084) (e.g. absolvovať (L249542))
- ambiguous grammatical gender (Q162378) of nouns if it applies to all senses (e.g. džínsy (L460043))
- ambiguous countability (Q107063735) (e.g. cibuľa (L401090))
Handling gray zone between homonymy and polysemy:
- Sense-dependent noun gender (e.g. kura (L481695) vs. kura (L404168)), including fine-grained masculine gender (e.g. pamätník (L402333) vs. pamätník (L404436)), is a strong indicator that these senses are actually homonyms, but similarity of meaning can weigh against it (e.g. netvor (L410253)).
- Homonyms are assumed to exist if they are listed in an authoritative source (usually one of the JÚĽŠ dictionaries).
- When not sure about homonymy/polysemy, keep the suspected homonyms as senses of one lexeme.
Temporary lexemes that exist as workarounds for technical limitations:
- Sense-dependent inflection is currently represented by two lexemes (e.g. uchá in ucho (L249402) vs. uši in ucho (L299083), bez desiaty in desiata (L460083) vs. bez desiatej in desiata (L490903)).
- Different spellings of inflected words are in separate lexemes (e.g. mol (L404214) vs. mól (L490909)) by default unless there's an easy way to differentiate forms of the two spelling variants (e.g. using gender).
- Adapted loanword (e.g. gej (L487720)) is a separate lexeme from corresponding unadapted loanword (in this case gay (L405562)).
Lexical category
The Slovak language has 10 basic lexical categories:
- noun (Q1084) (includes proper noun (Q147276))
- adjective (Q34698)
- pronoun (Q36224)
- numeral (Q63116)
- verb (Q24905)
- adverb (Q380057)
- preposition (Q4833830)
- conjunction (Q36484)
- grammatical particle (Q184943)
- interjection (Q83034)
These 10 categories (plus proper noun (Q147276)) should be used to classify all regular Slovak lexemes. More fine-grained categorization can be added via instance of (P31) statements. The advantage of instance of (P31) is that it is neither exclusive nor exhaustive, allowing categorization along multiple axes as well as partial categorization.
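As an illustration of how a data consumer might use the category list above, the following Python sketch stores the ten category items and builds a SPARQL query that lists Slovak lexemes of one category. The helper name `lexemes_by_category_query` is hypothetical, and the query assumes the prefixes (`wd:`, `dct:`, `ontolex:`, `wikibase:`) that the Wikidata Query Service predefines. It is a starting point, not necessarily the query used elsewhere on this page.

```python
# Lexical category items exactly as listed in this section.
SLOVAK_CATEGORIES = {
    "noun": "Q1084",
    "adjective": "Q34698",
    "pronoun": "Q36224",
    "numeral": "Q63116",
    "verb": "Q24905",
    "adverb": "Q380057",
    "preposition": "Q4833830",
    "conjunction": "Q36484",
    "grammatical particle": "Q184943",
    "interjection": "Q83034",
}
PROPER_NOUN = "Q147276"  # used alongside the ten basic categories
SLOVAK = "Q9058"


def lexemes_by_category_query(category_qid, limit=100):
    """Return SPARQL text listing Slovak lexemes with the given lexical category.

    Hypothetical helper; relies on prefixes predefined by the query service.
    """
    return f"""SELECT ?lexeme ?lemma WHERE {{
  ?lexeme a ontolex:LexicalEntry ;
          dct:language wd:{SLOVAK} ;
          wikibase:lexicalCategory wd:{category_qid} ;
          wikibase:lemma ?lemma .
}}
LIMIT {limit}"""
```

The returned text can be pasted into the query service, for example `lexemes_by_category_query(SLOVAK_CATEGORIES["verb"], limit=20)`.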
Pronoun category
Slovak pronouns include many more words than the definition of pronoun (Q36224) would lead you to believe. Pronoun scope varies by language. Here is a comparison of several Slovak and English words for illustration:
Type of Slovak pronoun | English example | Slovak example | Lexical category in Wikidata (English) | Lexical category in Wikidata (Slovak) |
---|---|---|---|---|
substantive pronoun (substantívne zámeno) | nothing (L4317) | nič (L245513) | pronoun (Q36224) | pronoun (Q36224) |
adjective pronoun (adjektívne zámeno) | such (L248802) | taký (L245463) | determiner (Q576271) | pronoun (Q36224) |
adverbial pronoun (príslovkové zámeno) | everywhere (L8978) | všade (L249269) | adverb (Q380057) | pronoun (Q36224) |
numeral pronoun (číslovkové zámeno) | much (L4212) | veľa (L245539) | determiner (Q576271) | pronoun (Q36224) |
Slovak pronouns approach pro-form (Q2006180) in scope. There are however several reasons to avoid pro-form and its subclasses as lexical categories:
- Slovak zámeno literally translates as pronoun. This suggests it was historically the same concept that just evolved to fit the language.
- Slovak language historically used narrower definition of pronoun that was later expanded (per Morfológia slovenského jazyka p. 233).
- People readily translate zámeno as pronoun. The term pro-form is unfamiliar to most people.
- Term pronoun is used in pronoun section of slovake.eu for all Slovak pronouns including adverbial and numeral ones. Website slovake.eu is coauthored by Slovak language regulator Ľudovít Štúr Institute of Linguistics (Q2368451).
- Several Slavic languages have similarly broad pronoun definitions and not one of them uses pro-form (Q2006180) for classification here on Wikidata.
Even though pronoun (Q36224) is used as lexical category, it may still be useful to differentiate pronouns by role (substantive, adjective, adverbial, numeral) using instance of (P31) statements.
Non-word categories
In addition to the parts of speech listed above, Slovak lexemes can belong in one of the special lexical categories:
Lemma
For inflected lexemes other than verbs, lemma is identical to the form with the following grammatical features (if available):
For verbs, lemma is the infinitive (Q179230) form.
For lexical categories that are not inflected, lemma is the simplest, shortest form (e.g. zas (L250046)).
Statements
All applicable properties defined in Wikidata can be used in Slovak lexemes, forms, and senses. This section merely provides an overview of the most commonly used properties and classes.
Related lexemes
- homograph lexeme (P5402) (true homographs only)
Classes
Lexical categories can be further subdivided using instance of (P31) statements.
noun (Q1084) (includes proper noun (Q147276))
- singulare tantum (Q604984), plurale tantum (Q138246)
- count noun (Q1520033), mass noun (Q489168) (may be used together if the noun is ambiguous)
Noun gender
Slovak nouns have one of the following grammatical genders assigned via grammatical gender (P5185):
- masculine (Q499327)
  - masculine animate (Q54020116)
    - masculine personal (Q27918551)
    - masculine animate non-personal (Q52943193) (also called animal masculine)
  - masculine inanimate (Q52943434)
- feminine (Q1775415)
- neuter (Q1775461)
Sources usually mention only the coarse-grained gender (masculine, feminine, neuter) and treat the animate/inanimate and personal/impersonal distinctions as additional traits of masculine nouns that influence the inflection of adjectives as well as of the noun itself. The end result is nevertheless the same as accepting fine-grained gender as defined above. The two views are semantically equivalent in Wikidata. For example, masculine animate non-personal (Q52943193) is equivalent to the combination of masculine (Q499327), animate (Q51927507), and impersonal (Q67372837) via subclass of (P279) relationships. Using masculine animate non-personal (Q52943193) instead of its three superclasses is therefore just a matter of convenience and brevity.
Special cases:
- Some nouns have ambiguous gender that is shared by all senses (e.g. džínsy (L460043)). In that case, there are multiple grammatical gender (P5185) statements.
- Some masculine nouns have fine-grained masculine gender specified on sense level (e.g. baran (L465326)).
- Grammatical gender of lexemes denoting persons is usually identical to natural gender, but there are exceptions (e.g. gorila (L525324)). Exceptions also exist for fine-grained masculine gender (e.g. ježko (L469905)).
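The equivalence between fine-grained genders and combinations of coarser traits can be sketched in Python as a lookup table. Only the decomposition of masculine animate non-personal (Q52943193) is stated in the text above. The other two entries are assumptions made by analogy, using the trait items mentioned elsewhere on this page, and `expand_genders` is a hypothetical helper name.

```python
# Fine-grained gender item -> set of coarser trait items it stands for.
FINE_GRAINED_TRAITS = {
    # Stated in the text: masculine + animate + impersonal.
    "Q52943193": {"Q499327", "Q51927507", "Q67372837"},
    # ASSUMPTION by analogy: masculine + animate + personal.
    "Q27918551": {"Q499327", "Q51927507", "Q67372736"},
    # ASSUMPTION by analogy: masculine + inanimate.
    "Q52943434": {"Q499327", "Q51927539"},
}


def expand_genders(gender_qids):
    """Replace fine-grained genders with their coarse traits; keep others as-is."""
    expanded = set()
    for qid in gender_qids:
        expanded |= FINE_GRAINED_TRAITS.get(qid, {qid})
    return expanded
```

With this table, a lexeme tagged with the single fine-grained item and a lexeme tagged with its three superclasses expand to the same set, which is the sense in which the two views carry the same information.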
Verb aspect
Slovak verbs on Wikidata are classified via the grammatical aspect (P7486) property as having one of two aspects: perfective or imperfective.
Most Slovak verbs form perfective-imperfective pairs (e.g. písať (L245830) - napísať (L245926)). Some sources use more fine-grained classification.
Exceptions:
- Some verbs are both perfective and imperfective (absorbovať (L470100)).
Forms
Noun forms
Slovak nouns have 12 forms, one for every combination of grammatical features:
- grammatical number (Q104083): singular (Q110786), plural (Q146786)
- case (Q128234): see has grammatical case (P2989) on Slovak (Q9058) (query)
Exceptions:
- Some nouns have only plural forms (e.g. dvere (L245874)) or only singular forms (e.g. Švédsko (L250434)). These nouns can be marked as instances of plurale tantum (Q138246) and singulare tantum (Q604984), respectively.
- Sometimes there are multiple forms with the same number and case, for example číslo (L245861) has forms čísel and čísiel in the genitive plural.
- Some nouns have only a small subset of forms, for example filip (L299439).
- Some nouns have two genders with the same senses, for example plus (L250182). If forms differ depending on gender, then gender is included as grammatical feature in all forms.
- Masculine nouns with sense-dependent fine-grained gender (e.g. baran (L465326)) must differentiate between forms specific to the fine-grained genders.
  - Affected forms have additional grammatical feature animate (Q51927507) / inanimate (Q51927539) in singular and personal (Q67372736) / impersonal (Q67372837) in plural.
  - If such grammatical features are not precise enough, then subject form (P5830) can be used to cherry-pick only some forms matching the gender (e.g. baran (L465326)).
- Some nouns, especially loanwords, are not inflected, for example finále (L248985). In this case, there is only one form without any grammatical features.
  - If the uninflected form is used only in singular (e.g. karate (L405806)) or plural (e.g. (L298832)), it still has no grammatical features, but the lexeme is marked as an instance of singulare tantum (Q604984) or plurale tantum (Q138246), respectively.
  - If the lexeme is also used as inflected (e.g. foto (L400900)), the special uninflected form without grammatical features is listed alongside inflected forms with grammatical features.
- Vocative case is used only rarely (e.g. človek (L238365)).
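The regular noun paradigm described above (12 forms, one per number and case combination) can be enumerated with a short Python sketch. The six case names are an assumption, since this page refers to the case list on the Slovak (Q9058) item rather than spelling it out, and the rarely used vocative is left out. `noun_form_slots` is a hypothetical helper name.

```python
from itertools import product

NUMBERS = ["singular", "plural"]
# ASSUMPTION: the six regular Slovak cases, vocative excluded as rare.
CASES = ["nominative", "genitive", "dative",
         "accusative", "locative", "instrumental"]


def noun_form_slots(plurale_tantum=False, singulare_tantum=False):
    """List the (number, case) combinations expected for a regular noun."""
    numbers = list(NUMBERS)
    if plurale_tantum:
        numbers.remove("singular")   # only plural forms exist
    if singulare_tantum:
        numbers.remove("plural")     # only singular forms exist
    return list(product(numbers, CASES))
```

A regular noun yields 12 slots, while a plurale tantum such as dvere (L245874) yields only the 6 plural ones. The other exceptions listed above (multiple forms per slot, uninflected nouns, sense-dependent gender) are not modelled here.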
Adjective forms
If we defined one form for every combination of 6 cases, 2 numbers, 5 genders, and 3 comparison degrees, we would end up with 180 forms per adjective. That would be hard to edit and hard to display in Wiktionaries. We instead try to minimize duplication by carefully choosing sensible combinations of grammatical features.
- comparison (Q577714): positive (Q3482678), comparative (Q14169499), superlative (Q1817208)
- grammatical number (Q104083): singular (Q110786), plural (Q146786)
- case (Q128234): see has grammatical case (P2989) on Slovak (Q9058) (query), except vocative case (Q185077)
- grammatical gender (Q162378)
  - if singular:
  - if plural:
    - nominative, accusative: masculine personal (Q27918551), not masculine personal (Q54152717)
    - other cases: unspecified = all genders
This adds up to 81 forms per adjective. These forms are largely free of duplication. Although item not masculine personal (Q54152717) is not a gender, it's a commonly used gender group that is allowed as a grammatical feature. We could similarly unify 4 singular cases for masculine and neuter genders to reduce form count down to 69, but such masculine/neuter gender group is not commonly used anywhere. Creating it as a new concept just for Wikidata would make the data harder to use.
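The form counts quoted above (180, 81, and 69) can be checked with a small Python sketch. The plural gender split follows the list above. The singular split is an assumption reconstructed so that the totals match: three genders per case, with masculine divided into animate and inanimate only in the accusative. All names in the code are hypothetical.

```python
CASES = ["nominative", "genitive", "dative",
         "accusative", "locative", "instrumental"]
DEGREES = ["positive", "comparative", "superlative"]
# The 4 singular cases where masculine and neuter could be unified.
MERGEABLE_CASES = {"genitive", "dative", "locative", "instrumental"}


def singular_genders(case, merge_masculine_neuter=False):
    # ASSUMPTION: reconstructed split, chosen to reproduce the stated counts.
    if case == "accusative":
        return ["masculine animate", "masculine inanimate", "feminine", "neuter"]
    if merge_masculine_neuter and case in MERGEABLE_CASES:
        return ["masculine/neuter", "feminine"]
    return ["masculine", "feminine", "neuter"]


def plural_genders(case):
    # From the list above: gender matters only in nominative and accusative.
    if case in ("nominative", "accusative"):
        return ["masculine personal", "not masculine personal"]
    return [None]  # unspecified = all genders


def adjective_form_slots(merge_masculine_neuter=False):
    """Enumerate (degree, number, case, gender) slots for one adjective."""
    slots = []
    for degree in DEGREES:
        for case in CASES:
            for gender in singular_genders(case, merge_masculine_neuter):
                slots.append((degree, "singular", case, gender))
            for gender in plural_genders(case):
                slots.append((degree, "plural", case, gender))
    return slots


# Full cross product: 6 cases x 2 numbers x 5 genders x 3 degrees.
NAIVE_COUNT = len(CASES) * 2 * 5 * len(DEGREES)
```

Per comparison degree this gives 19 singular and 8 plural slots, so 27 × 3 = 81. Unifying masculine and neuter in the four mergeable cases removes 4 slots per degree, giving 23 × 3 = 69.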
Exceptions:
- If the adjective is an absolute adjective (Q332375), it has only positive comparison degree.
- Some adjectives have only a small subset of forms, e.g. rád (L245631).
- Some adjectives, especially loanwords, are not inflected, e.g. nanič (L403539) or gay (L405463). In this case, there is only one form with grammatical feature positive (Q3482678). If the adjective is comparable (e.g. sexi (L410041), naj (L402151)), there is one form per comparison degree.
- Adjective naj (L402151) does not have positive degree.
Adverb forms
The only grammatical feature of Slovak adverbs is comparison degree (comparison (Q577714)): positive (Q3482678), comparative (Q14169499), or superlative (Q1817208).
Exceptions:
- Many adverbs are not comparable (podvečer (L402432)).
- A few rare adverbs do not have positive degree (prv (L252020)).
- Some adverbs have comparative but no superlative (potichu (L402264)). This has nothing to do with attestation. Some superlatives are just inherently meaningless.
- Adverb naj (L402152) does not have positive degree.
Senses
Sense granularity
What constitutes a separate sense is a difficult theoretical question. This document builds on two core principles:
- Truly different meanings (with different etymology) are isolated in homograph lexemes. Senses within a single homograph lexeme are all related.
- Sense is unique if it has (or should have) unique set of statements, especially item for this sense (P5137), translation (P5972), and synonym (P5973).
While this seems straightforward, the complexity and fluidity of Wikidata complicate matters considerably.
The following rules are observed in senses of Slovak lexemes:
- In Slovak, masculine gender doubles as default gender. Nouns denoting persons (e.g. volič (L249216)) have secondary sense denoting male person unless there is no corresponding feminine noun (e.g. predok (L252106)).
Glosses
There is no Wikidata-wide consensus on the content of sense glosses. Glosses are intended for sense disambiguation, following the same rules as item descriptions (see Help:Description). Most Slovak glosses follow some basic rules:
- Gloss is not a definition. See below for information about definitions.
- Gloss has only two hard requirements:
  - Gloss is unique within the lexeme. If the lexeme has homographs, glosses should be unique across all homographs.
  - Gloss is informative enough to allow skilled speakers of Slovak to pair glosses with definitions in external dictionaries.
- Of glosses that meet the above two requirements, the best gloss is usually the shortest one. The ideal gloss length is a single word.
- Glosses in other languages translate Slovak gloss, not the sense itself. For example, if a sense for bažant (L484448) has gloss vták, add English gloss bird, not pheasant. To add translations, use translation (P5972) property or rely on indirect translations via item for this sense (P5137).
- While glosses in other languages have some use in downstream dictionaries, adding translation (P5972) statements is more important.
Hint: A good way to choose gloss is to reference the class of the corresponding item, e.g. animal for any lexeme naming some animal. If that is not precise enough, combine several such classes, e.g. male animal.
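The uniqueness requirement for glosses lends itself to a simple automated check. The Python sketch below reports glosses that repeat within a list, which could hold the glosses of one lexeme or of a lexeme together with its homographs. Comparing glosses after trimming whitespace and ignoring case is an assumption of this sketch, not a rule from this page, and `duplicate_glosses` is a hypothetical helper name.

```python
from collections import Counter


def duplicate_glosses(glosses):
    """Return the glosses that occur more than once, in sorted order.

    ASSUMPTION: glosses are compared after stripping whitespace and
    case-folding, so "Vták " and "vták" count as the same gloss.
    """
    normalized = [gloss.strip().casefold() for gloss in glosses]
    counts = Counter(normalized)
    return sorted(gloss for gloss, n in counts.items() if n > 1)
```

An empty result means the glosses satisfy the first hard requirement. The second requirement (being informative enough for a skilled speaker) still needs human judgment.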
Definitions
Wikidata does not store textual definitions. Sense is primarily defined via item for this sense (P5137) while other statements add nuance. Glosses should not be abused to store definitions. Nor should they be abused to add clarifications on top of structured data. Glosses should, however, be clear enough to identify the definition in one of the external sources listed below.
- JÚĽŠ dictionaries (almost exhaustive)
- Slovak Wikipedia (rich source)
- Slovak Wiktionary (poor source)
- English Wiktionary (poor source)
- Wikidata items with Slovak label (poor source)
Translations
Please do add English translations even for senses that have item for this sense (P5137). Combination of the two will enable accurate triangulation of translations. Translations to other languages are useful only if they can outperform accuracy of automated triangulation.
Translations can be used to semi-automatically inherit information from corresponding senses in other languages: item for this sense (P5137), foreign language glosses, context labels (language style (P6191), field of usage (P9488), etc.), and translations to other languages. Translations, especially English translations, are therefore more valuable than other statements. Doing them first saves time. Addition of symmetric translations can be also mostly automated.
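One possible reading of the triangulation described above is sketched below in Python: a sense in another language is accepted as a translation candidate only if it shares both an item for this sense (P5137) value and an English translation with the Slovak sense. The data layout (dictionaries with `id`, `items`, and `english` keys) and the function name are hypothetical, and no particular tool is claimed to work this way.

```python
def triangulate(source_sense, candidate_senses):
    """Return ids of candidates confirmed by both signals.

    Each sense is a dict with:
      "id":      sense identifier (any string)
      "items":   set of item ids linked via item for this sense (P5137)
      "english": set of English sense ids linked via translation (P5972)
    """
    matches = []
    for candidate in candidate_senses:
        shares_item = bool(source_sense["items"] & candidate["items"])
        shares_english = bool(source_sense["english"] & candidate["english"])
        if shares_item and shares_english:
            matches.append(candidate["id"])
    return matches
```

Requiring both signals trades recall for precision: a candidate linked only through a broad item, or only through a polysemous English word, is rejected.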
Queries
Statistics
- language comparison: Grafana, WP coverage, lexemes, forms, senses, derivations, sense links, more stats
- all lexemes by: lexical category
- all forms by: lexical category
- nouns by: gender, countability, paradigm defect
- proper nouns by: gender, paradigm defect
- verbs by: aspect
Available data
- all lexemes
- recent changes via: related changes, query
- homographs
- nouns
  - genders: masculine personal, masculine animate non-personal, masculine inanimate, feminine, neuter, sense-dependent, ambiguous
  - countability: countable, uncountable, ambiguous
  - declension exceptions: singulare tantum, plurale tantum, indeclinable, optionally declinable, vocative
- proper nouns
  - genders: masculine personal, masculine inanimate, feminine, neuter
  - declension exceptions: singulare tantum, plurale tantum, indeclinable, optionally declinable
- adjectives
  - declension exceptions: indeclinable, indeclinable comparable
- verbs
  - aspects: imperfective, perfective, ambiguous
- pronouns, numerals, adverbs, prepositions, conjunctions, particles, interjections, abbreviations, letters, suffixes
Missing data
- missing forms: nouns, proper nouns, adjectives, verbs, adverbs, uninflected categories
- nouns missing: gender, massness
- proper nouns missing: gender
- verbs missing: aspect
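A check such as "nouns missing: gender" could be expressed roughly as follows. The Python sketch builds SPARQL text that selects Slovak nouns without any grammatical gender (P5185) statement on the lexeme. It assumes the prefixes predefined by the Wikidata Query Service and is not necessarily the query behind the list entry above. The function name is hypothetical.

```python
def nouns_missing_gender_query(limit=100):
    """Return SPARQL text listing Slovak nouns with no P5185 statement."""
    return """SELECT ?lexeme ?lemma WHERE {
  ?lexeme a ontolex:LexicalEntry ;
          dct:language wd:Q9058 ;
          wikibase:lexicalCategory wd:Q1084 ;
          wikibase:lemma ?lemma .
  # Keep only lexemes that carry no grammatical gender statement.
  FILTER NOT EXISTS { ?lexeme wdt:P5185 ?gender . }
}
LIMIT %d""" % limit
```

The filter looks only at lexeme-level statements, so a noun whose gender is recorded solely on its senses would also show up in the result.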
Invalid data
- lexemes with: duplicates (by lemma), unknown lexical category, uninflected forms
- forms with: duplicates (by representation, features)
- forms without grammatical features: adjectives, verbs
- nouns with: unknown gender, underspecified gender
- unknown grammatical features: nouns, adjectives, adverbs
Resources
Language information
- Ľudovít Štúr Institute of Linguistics (Q2368451) (website), language regulator
- Slovak language on Wikipedia
- slovake.eu, Slovak as second language, by JÚĽŠ
- Pravidlá slovenského pravopisu (dictionary), prescriptive, by JÚĽŠ
- Krátky slovník slovenského jazyka (introduction), prescriptive, by JÚĽŠ
- Pravidlá slovenskej výslovnosti, prescriptive
- Morfológia slovenského jazyka, prescriptive, by JÚĽŠ
- Assorted eBooks from JÚĽŠ
Corpora
- Slovak Wikipedia, CC-BY-SA or GFDL
- Slovenský národný korpus (Slovak national corpus), by JÚĽŠ, unspecified license
- Common Crawl, multilingual, custom license
- Tatoeba, sentences, some with audio, multilingual, 14K+ in SK, CC-BY 2.0 FR, audio under CC-BY/CC-BY-SA/CC-BY-NC
- Common Voice, audio, multilingual, SK not yet included, CC-0
- Common Voice sentences, multilingual, SK sentences are few and of low quality, CC-0
- Universal Dependencies, treebank, multilingual, 100K+ words in SK, CC-BY-SA
- List of text corpora on Wikipedia (TODO: pick the ones relevant to SK)
Frequency lists
- Slovenský národný korpus (Slovak national corpus), by JÚĽŠ, lemmas, words, N-grams, unspecified license
Dictionaries
- JÚĽŠ dictionaries, prescriptive, unspecified license
- Slovak Wiktionary, CC-BY-SA or GFDL
- Slovak opensource spellchecker project (GitHub), MPL
- Slovak hyphenation tables, GPL or LGPL or MPL
- Malý anglicko-slovenský slovník (Small English-Slovak dictionary), 14K+ entries, GFDL or GPL or CC-BY-NC-SA
- Otvorený slovenský synonymický slovník (OpenThesaurus-SK), 15K+ words, permissive license
- Slovník do vrecka (Pocket dictionary), 100K+ entries, GFDL
- Slovník cudzích slov (Dictionary of foreign words), GFDL
- Veľký slovník cudzích slov (Large dictionary of foreign words), inactive, GFDL
- Slovenská terminologická databáza (Slovak terminology database), by JÚĽŠ, 15K+ entries, GFDL
- WordNet, multilingual, 30K in SK, CC-BY-SA or AGPL or ODbL
Tools
- MachtSinn - Use items with Slovak label to populate lexeme senses.
- Wikidata Senses - Add first sense to every lexeme.
Contact
Please ping Robert Važan when discussing Slovak lexemes on Wikidata.