A Nomenclature System for the Tree of Human Y-Chromosomal Binary Haplogroups

Abstract

The Y chromosome contains the largest nonrecombining block in the human genome. By virtue of its many polymorphisms, it is now the most informative haplotyping system, with applications in evolutionary studies, forensics, medical genetics, and genealogical reconstruction. However, the emergence of several unrelated and nonsystematic nomenclatures for Y-chromosomal binary haplogroups is an increasing source of confusion. To resolve this issue, 245 markers were genotyped in a globally representative set of samples, 74 of which were males from the Y Chromosome Consortium cell line repository. A single most parsimonious phylogeny was constructed for the 153 binary haplogroups observed. A simple set of rules was developed to unambiguously label the different clades nested within this tree. This hierarchical nomenclature system supersedes and unifies past nomenclatures and allows the inclusion of additional mutations and haplogroups yet to be discovered.

[Supplementary Table 1, available as an online supplement at www.genome.org, lists all published markers included in this survey and primer information.]

In recent years, data from the nonrecombining portion of the Y chromosome (NRY) in human populations have grown explosively, driven in part by the many recently discovered polymorphisms on the NRY. Interest in using polymorphisms on the NRY to examine questions about paternal genetic relationships among human populations dates to the mid-1980s (Casanova et al. 1985). More recently, these polymorphisms have found uses in DNA forensics (Jobling et al. 1997), genealogical reconstruction (Jobling 2001), medical genetics (Jobling and Tyler-Smith 2000), and human evolutionary studies (Hammer and Zegura 1996). The low level of polymorphism on the NRY hindered research for many years. By the end of 1996, there were fewer than 60 known polymorphisms on the NRY. Most of these (∼80%) were long-range polymorphisms (detectable by pulsed-field electrophoresis), conventional restriction fragment length polymorphisms (RFLPs), or short tandem repeats (STRs). Until 1997, there were only 11 known binary polymorphisms that could be genotyped by PCR-based methods (Hammer 1994; Seielstad et al. 1994; Hammer and Horai 1995; Whitfield et al. 1995; Santos et al. 1995; Jobling et al. 1996; Underhill et al. 1996). These included single nucleotide polymorphisms (SNPs), an Alu insertion polymorphism, and a deletion. Then, in 1997, Underhill et al. (1997) published 19 new PCR-based binary polymorphisms that were discovered by a novel and efficient mutation detection method known as denaturing high-performance liquid chromatography (DHPLC). This method has since been used to discover more than 200 SNPs and small insertions/deletions (indels) on the NRY (Shen et al. 2000; Underhill et al. 2000; Hammer et al. 2001).
These polymorphisms are particularly useful because of their low rate of parallel and back mutation, which makes them suitable for identifying stable paternal lineages that can be traced back in time over thousands of years. As the number of known binary polymorphisms increased, so did the number of publications and the number of different systems used to name these binary haplogroups. Currently, there are at least seven different nomenclature systems in use, making it very difficult to compare results from one publication to the next. Our purpose here is twofold: (1) to construct a highly resolved tree of NRY binary haplogroups by genotyping most published PCR-based markers on a common set of samples, and (2) to describe a new nomenclature system that is flexible enough to allow the inevitable changes that will result from the discovery of new mutations and NRY lineages. We hope that the nomenclature presented here will be adopted by the community at large and will improve communication in this highly interdisciplinary field.

RESULTS AND DISCUSSION

NRY Haplogroup Tree and Haplogroup Nomenclature

We constructed a comprehensive haplogroup tree for the human NRY by genotyping most of the known polymorphisms on the NRY in a single set of samples (74 male Y Chromosome Consortium [YCC] cell lines). Some polymorphisms known to be variable in other DNAs showed no variation in the YCC panel; therefore, additional samples were included to improve the resolution of the phylogeny. This served to increase the number of polymorphic sites mapped onto the haplogroup tree to 237. Two mutational events occurred at each of eight sites. However, these recurrent mutations were found on different haplogroup backgrounds and thus were distinguishable events. The 245 mutational events gave rise to 153 NRY haplogroups. The single most parsimonious tree for these 153 NRY haplogroups is shown in Figure 1, with mutational events shown along the branches.
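The count of mutational events follows directly from the figures just given. As a quick arithmetic check (the variable names are ours and purely illustrative):

```python
# Figures taken from the text; variable names are illustrative only.
polymorphic_sites = 237   # sites mapped onto the haplogroup tree
recurrent_sites = 8       # sites at which two mutational events occurred

# Each non-recurrent site contributes one event; each recurrent site, two.
single_hit_sites = polymorphic_sites - recurrent_sites
mutational_events = single_hit_sites * 1 + recurrent_sites * 2

print(single_hit_sites, mutational_events)
```

In other words, 229 sites with a single event plus 8 sites with two events each account for the 245 events placed on the tree.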

Figure 1.

The single most parsimonious tree of 153 haplogroups (left) showing correspondences with prior nomenclatures (right). The root of the tree is denoted with an arrow. Haplogroup names and Y Chromosome Consortium (YCC) sample numbers are given at the tips of the tree, and major clades are labeled with large capital letters and shaded in color (the entire cladogram is designated haplogroup Y). The “*” symbol indicates an internal node on the tree or paragroup (see text). For space reasons, subclade labels are entered to the left of the corresponding links. Mutation names are given along the branches; major clades are labeled with a larger font than are their subclades. The length of each branch is not proportional to the number of mutations or the age of the mutation; each subclade is given a unit of depth in the tree. Some of the branches were elongated artificially to make room for a number of phylogenetically equivalent markers on a single branch. The order of phylogenetically equivalent markers shown on each branch is arbitrary. Prior nomenclatures are named according to author and are taken from the following publications: (α) Jobling and Tyler-Smith (2000) and Kaladjieva et al. (2001); (β) Underhill et al. (2000); (γ) Hammer et al. (2001); (δ) Karafet et al. (2001); (ε) Semino et al. (2000); (ζ) Su et al. (1999); and (η) Capelli et al. (2001). Noncontiguous naming systems in prior nomenclatures result either from the use of non-PCR markers that have not been typed on the YCC panel or unpublished lineage definitions. Prior haplogroup names shown in red are found in more than one position in the phylogeny. Cross-hatching within the “Semino” nomenclature indicates lineages that cannot be named according to their system. Mutations M104 and P22 on lineage M2 are independent discoveries of the same polymorphic marker.

The tree was drawn as asymmetrically as possible by sorting the descendants of each interior node so that the bottom-most descendant had the greatest number of immediate descendants. The position of the root in Figure 1 (indicated by an arrow) was determined by outgroup comparisons; that is, whenever possible, homologous regions on the NRY of closely related species (e.g., chimpanzees, gorillas, and orangutans) were sequenced to determine the ancestral states at human polymorphic sites (Underhill et al. 2000; Hammer et al. 2001). The root of the tree falls between a clade defined by M91 and a clade defined by a set of markers: SRY10831a, M42, M94, and M139. The NRY tree in Figure 1 can be seen as a series of nested monophyletic clades (i.e., sets of lineages related by a shared, derived state at a single site or a set of sites). To devise a nomenclature system at a reasonable scale, we assigned a capital letter to several of the major clades, beginning with the letter A (for the haplogroup above the position of the root in Fig. 1) and continuing through the alphabet to the letter R. The letter Y was assigned to the most inclusive haplogroup comprising haplogroups A–R. Deciding which clades receive the highest labeling level is necessarily, to some extent, arbitrary. Here, we label with single capital letters those clades that seem to us to represent the major divisions of human NRY diversity. Only 19 letters have been assigned to clades (A–R and Y), leaving room for the possible expansion and further resolution of this phylogeny (the implications of which are discussed below).
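The drawing convention described above can be sketched as follows. This is only one way to realize it, under our own assumptions: the tree is held as a hypothetical dictionary mapping each node to its list of children, the node names are invented, and children are sorted in ascending order of their number of immediate descendants so that the last (bottom-most) child has the most.

```python
def ladderize(children, node):
    """Return (node, [subtrees]) with children ordered top to bottom.

    `children` maps a node name to the list of its immediate descendants;
    tips may be absent from the mapping. The sort is stable, so ties keep
    their input order.
    """
    kids = sorted(children.get(node, []),
                  key=lambda k: len(children.get(k, [])))
    return (node, [ladderize(children, k) for k in kids])


# Toy tree with invented node names (not the topology of Figure 1).
toy = {
    "root": ["X", "Y", "Z"],
    "X": ["X1", "X2", "X3"],
    "Z": ["Z1"],
}

drawn = ladderize(toy, "root")
print([subtree[0] for subtree in drawn[1]])
```

Here the child with three immediate descendants ("X") is placed last, after the tip "Y" and the single-child node "Z".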

We propose here two complementary nomenclatures. The first is hierarchical and uses selected aspects of set theory to enable clades at all levels to be named unambiguously. The capital letters (A–R) used to identify the major clades constitute the front symbols of all subsequent subclades (Fig. 1). Unlabeled clades can be named as the “join” of two subclades; for example, clade CR includes all chromosomes that share the derived state of the M168 and P9 polymorphisms. Note that this is distinct from the set-theoretic “union,” which, in the above example, would not define a monophyletic clade. Lineages that are not defined on the basis of a derived character represent interior nodes of the haplogroup tree and are potentially paraphyletic (i.e., they comprise basal lineages and monophyletic subclades). Thus, we suggest the term “paragroup” rather than haplogroup to describe these lineages. Paragroups are distinguished from haplogroups (i.e., monophyletic groupings) by using the * (star) symbol, which represents chromosomes belonging to a clade but not its subclades. For example, paragroup B* belongs to the B clade; however, it does not fall into haplogroup B1 or B2. As illustrated in Figure 2, internal nodes are highly sensitive to changes in tree topology. Thus, the * symbol cautions that a given paragroup name may refer to different sets of chromosomes in succeeding versions of the phylogeny.
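The difference between the “join” and the “union” can be made concrete with a small sketch. The parent map below is a deliberately simplified toy fragment that we assume for illustration (intermediate clades are omitted and the placement of DE is our assumption); it is not a transcription of Figure 1. The join is computed as the smallest clade containing both subclades.

```python
# Toy fragment: child -> parent. Simplified and assumed for illustration.
PARENT = {
    "A": "Y", "BR": "Y",
    "B": "BR", "CR": "BR",
    "B1": "B", "B2": "B",
    "C": "CR", "DE": "CR", "R": "CR",
}
ALL_NODES = set(PARENT) | set(PARENT.values())


def ancestors(node):
    """Path from a node up to the root, starting with the node itself."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def join(a, b):
    """Smallest clade that contains both a and b."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError("nodes do not share a root")


def clade(name):
    """All nodes belonging to the clade rooted at `name`."""
    return {n for n in ALL_NODES if name in ancestors(n)}


print(join("C", "R"))                          # the join clade
print(sorted(clade("C") | clade("R")))         # the union of the two clades
print(sorted(clade(join("C", "R"))))           # everything in the join clade
```

In this toy fragment the union of C and R leaves out DE and so is not monophyletic, whereas the join CR includes every lineage descending from the shared node.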

Figure 2.

Potential examples of revisions in topology necessitated by the discovery of new mutations and new samples with intermediate haplogroups. Haplogroup nomenclature systems are shown to the right of the tree. (A) The G and H haplogroups are as shown in Figure 1. (B) Case of a newly discovered marker that joins haplogroups within haplogroup G. (C) Newly discovered mutation (μ) that splits clades within haplogroup G. (D) Case of a newly discovered sample with the derived state at M52 and the ancestral state at M69. Names shown in boxes indicate haplogroup names that require changes from those shown in A. Dotted lines indicate newly created lineages.

Subclades nested within each major haplogroup defined by a capital letter are named using an alternating alphanumeric system. For example, within haplogroup E, there are three basal haplogroups that are named E1, E2, and E3, and the underived paragroup becomes E*. Nested clades within each of these haplogroups are named in a similar way, except that lower-case letters are used instead of numerals. Again, paragroups are labeled with an * symbol, and the remaining haplogroups are labeled with an “a,” “b,” “c,” etc. This naming system continues to alternate between numerals and lower-case letters until the most terminal branches are labeled (tip haplogroups). Therefore, the name of each haplogroup contains the information needed to find its location on the tree.
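Because each name encodes its own position, the chain of nested clades can be recovered from the name alone. The following is a minimal sketch under our own assumptions: the function name is hypothetical, only single-letter major clades are handled (join names such as CR are not), and the name "E3b1" is used simply as an example of the alternating pattern.

```python
import re

# One capital letter, then alternating numerals and lower-case letters,
# optionally followed by "*" for a paragroup.
_NAME = re.compile(r"^([A-Z])((?:\d+|[a-z])*)(\*?)$")


def nested_clades(name):
    """Return (path_from_major_clade, is_paragroup) for a lineage-based name."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"not a lineage-based name: {name!r}")
    major, rest, star = match.groups()
    tokens = re.findall(r"\d+|[a-z]", rest)
    # Levels must alternate, starting with a numeral below the capital letter.
    for depth, token in enumerate(tokens):
        if token.isdigit() != (depth % 2 == 0):
            raise ValueError(f"levels do not alternate in {name!r}")
    path = [major]
    for token in tokens:
        path.append(path[-1] + token)
    return path, bool(star)


print(nested_clades("E3b1"))
print(nested_clades("H1*"))
```

Reading the first result from left to right gives the route from the major clade E down to the named haplogroup; the second shows that H1* is a paragroup within H1.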

Alternatively, haplogroups can be named by the “mutations” that define lineages rather than by the “lineages” themselves. Thus, we propose a second nomenclature that retains the major haplogroup information (i.e., 19 capital letters) followed by the name of the terminal mutation that defines a given haplogroup. We distinguish haplogroup names identified “by mutation” from those identified “by lineage” by including a dash between the capital letter and the mutation name. For example, haplogroup H1a would be called H-M36 (Fig. 2). When multiple phylogenetically equivalent markers define a haplogroup, the one typed is used. For example, if M39 but not M138 were typed within haplogroup H1, then H1c becomes H-M39. If multiple equivalent markers were typed, this notation system omits some marker information, and a statement of which additional markers were typed should be included in the Methods section. Note that the mutation-based nomenclature has the important property of being more robust to changes in topology (Fig. 2).
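The conversion from a lineage-based to a mutation-based name can be sketched as a lookup. The marker table below contains only the assignments mentioned in the text; the function name, the table layout, and the choice to use the first listed marker when several equivalent markers were typed are our assumptions for illustration.

```python
# Defining markers for a few lineages, as mentioned in the text.
# Markers listed together are phylogenetically equivalent.
DEFINING_MARKERS = {
    "H1": ["M82"],
    "H1a": ["M36"],
    "H1c": ["M39", "M138"],
}


def mutation_based_name(lineage, typed_markers):
    """Return (name, extra_markers) for a lineage given the markers typed.

    `extra_markers` are equivalent markers that were also typed; the name
    omits them, so they should be reported separately (e.g., in Methods).
    """
    typed = [m for m in DEFINING_MARKERS[lineage] if m in typed_markers]
    if not typed:
        raise ValueError(f"no defining marker of {lineage} was typed")
    major = lineage[0]
    return f"{major}-{typed[0]}", typed[1:]


print(mutation_based_name("H1a", {"M36"}))
print(mutation_based_name("H1c", {"M39"}))
print(mutation_based_name("H1c", {"M39", "M138"}))
```

The last call illustrates the caveat above: when both M39 and M138 were typed, the name alone does not record M138, so it is returned separately for reporting.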

While it is straightforward to name monophyletic clades, it is more challenging to devise a simple and flexible system to name underived interior nodes. This is especially important to facilitate the naming of haplogroups in studies where not all markers are typed, and to provide a standard set of names for previously described haplogroups (and paragroups). For instances where not all markers within a clade are typed, we introduce a bracketing system that encloses an “x” (for “excluding”) and the lineages that have been shown to be absent. This system can be applied equally well to the lineage-based and mutation-based nomenclatures. The following examples portray the lineage-based nomenclature first, followed by the mutation-based nomenclature. Lineages (or markers) excluded from a haplogroup are listed within parentheses after the name of the haplogroup (or the last derived marker in the case of the mutation-based nomenclature). For example, if M82-derived chromosomes are typed with all downstream markers, then the underived chromosomes belong to H1* or H-M82* (Fig. 3A). However, if M82-derived chromosomes are typed only with M36, then the underived chromosomes belong to H1*(xH1a) or H-M82*(xM36) (Fig. 3B). If we apply this bracketing method to the naming of Underhill et al.'s (2000) paraphyletic haplogroup VI, then its label becomes F*(xK) or F-M89*(xM9) (Table 1). In the more extreme case of a study genotyping only the YAP and M3 markers, chromosomes ancestral for both markers would be named Y*(xDE,Q3) or Y*(xYAP,M3), where Y refers to the most inclusive haplogroup encompassing the total cladogram. See Table 1 for application of this bracketing system to lineage-based names of previously published haplogroups. When using the mutation-based nomenclature, the adoption of this bracketing system is optional, as long as full lineage-based names of haplogroups have been given elsewhere in the manuscript (e.g., in the form of a table or a tree).
The lineage- and mutation-based nomenclatures each have advantages and disadvantages, and each can be used where most appropriate.
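The bracketing rule can be written as a small string-building routine. This is a sketch under our own assumptions: the function and parameter names are hypothetical, and the caller is expected to know which downstream lineages (or markers) were typed and found to be ancestral.

```python
def bracketed_name(stem, excluded, all_downstream_typed=False):
    """Name chromosomes that are derived for `stem` but ancestral downstream.

    stem                  lineage-based ("H1") or mutation-based ("H-M82") name
    excluded              downstream lineages or markers shown to be absent,
                          in the order they should be listed
    all_downstream_typed  True if every downstream marker was typed
    """
    if all_downstream_typed:
        return f"{stem}*"                      # true paragroup, e.g. H1*
    if excluded:
        return f"{stem}*(x{','.join(excluded)})"
    return stem                                # nothing downstream was typed


# The examples given in the text, in both nomenclatures.
print(bracketed_name("H1", [], all_downstream_typed=True))
print(bracketed_name("H1", ["H1a"]), bracketed_name("H-M82", ["M36"]))
print(bracketed_name("F", ["K"]), bracketed_name("F-M89", ["M9"]))
print(bracketed_name("Y", ["DE", "Q3"]), bracketed_name("Y", ["YAP", "M3"]))
```

The same routine serves both nomenclatures because only the stem and the listed exclusions differ.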

Figure 3.

Examples of haplogroup names for cases in which subsets of markers in Figure 1 are genotyped. Markers that were not genotyped are shown with a strikethrough. The lineage- and mutation-based full nomenclature systems are shown to the right of the tree.

Table 1.

Details of the Markers Incorporated within Six Published Prior Nomenclature Systems, Illustrated in Figure 1

Cross-Referencing to Previous Nomenclatures

A number of investigators have developed nomenclature systems based on overlapping subsets of the markers typed here. To facilitate comparisons among seven previously published nomenclatures and our present proposed nomenclature, Figure 1 and Table 1 illustrate direct comparisons among these different systems. These nomenclature systems are extremely inconsistent (i.e., nonisomorphic) in how they define haplogroups. Moreover, when there is consistency between two systems (e.g., between Underhill et al.'s [2000] haplogroup V and Hammer et al.'s [2001] haplogroup 1F), different names are used for the same haplogroups. All of the major human NRY nomenclature schemes used thus far have included paraphyletic groupings (see Fig. 1), and these paragroups can be misinterpreted as being necessarily ancestral to “downstream” haplogroups containing derived characters. Three major benefits of the proposed system are (1) its ability to distinguish between underived interior nodes (paragroups) and monophyletic clades (haplogroups), (2) its flexibility in naming haplogroups at different levels of the phylogenetic hierarchy, and (3) its ability to accommodate new haplogroups as new mutations are discovered (see below). If broadly accepted and utilized, this system also will serve to standardize the names of NRY haplogroups in the literature.

Caveats and Changes in Nomenclature

In addition to the long-term challenges posed by any attempt to form a stable nomenclature system, there are several caveats that should be raised relating to the way the current tree topology was inferred. First, it is important to point out that not all polymorphisms were genotyped in all individuals. Indeed, continued genotyping of these polymorphisms may result in slight changes in the topology of the tree in Figure 1. It is also possible that some mutational events that were assumed to be unique actually are recurrent on the tree (i.e., there are undetected multiple hits at some additional sites). More importantly, because it is extremely difficult to devise a nomenclature system that is both informative in a phylogenetic sense and impervious to the need for renaming groups as new polymorphisms are discovered, a set of guidelines is needed to minimize the impact of future structural changes in the tree.

To facilitate the evolution of the present nomenclature, we make a number of proposals. Firstly, a nomenclature committee comprising some of the current participants in the YCC will receive requests from investigators who wish new binary markers or haplogroups to be incorporated into the nomenclature, and will decide on the changes to be made to the existing system. At any one time, the current nomenclature and the committee's contact details will be made available at the following URL: http://ycc.biosci.arizona.edu. Secondly, we recommend that if investigators wish to use new markers prior to their incorporation into the nomenclature, they distinguish between consensus and novel parts of the clade labels by use of a forward slash. For example, a new mutation (μ) that divides clade D1 in two creates D1/-μ and D1/-M15*. This makes it clear to the reader which parts of the label are specific to that study and which can be cross-referenced to other publications. This will minimize confusion should two contemporaneous papers introduce novel markers within the same clade. In this manner, information from VNTR and STR haplotypes also can be incorporated; a standard nomenclature for Y-STRs already is available (Gill et al. 2001). Because new versions of the YCC nomenclature will be published annually to reflect changes in the tree topology resulting from newly discovered mutations, we suggest that each paper cite the particular version of the YCC NRY tree that was used (e.g., YCC NRY Tree 2002).
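The forward-slash convention lends itself to a mechanical treatment. The sketch below reproduces the D1 example from the text; the function names are hypothetical, and treating everything after the first slash as the study-specific part is our reading of the rule rather than a stated specification.

```python
def provisional_labels(consensus_clade, new_marker, defining_marker):
    """Labels created when a new marker splits a consensus clade in two.

    Returns the label for chromosomes derived at the new marker and the
    label for the remaining (underived) chromosomes of the clade.
    """
    derived = f"{consensus_clade}/-{new_marker}"
    underived = f"{consensus_clade}/-{defining_marker}*"
    return derived, underived


def split_label(label):
    """Separate the consensus part of a label from the study-specific part."""
    consensus, _, novel = label.partition("/")
    return consensus, novel


derived, underived = provisional_labels("D1", "μ", "M15")
print(derived, underived)
print(split_label(derived))
```

Splitting on the slash recovers the part of the label (here D1) that can be cross-referenced to other publications.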

Summary

The cladistic nomenclature of human mtDNA diversity adopted by many groups some years ago has greatly advanced studies of maternal lineages and the communication of their conclusions (Richards et al. 1998). By contrast, recent dramatic advances in the resolution of paternal lineages have resulted in multiple nomenclature systems that have hampered communication among NRY researchers and the scientific community at large. Here, we introduce a strictly phylogenetic (cladistic) nomenclature for human NRY variation based on the phylogeny of 153 paternal lineages. This system is flexible in its ability to assign haplogroup names at different levels of the phylogenetic hierarchy. The phylogeny of the human NRY lies at the heart of a multidisciplinary enterprise in which unambiguous communication is vital. The nomenclature proposed here, along with guidelines for revisions, represents an important resource for those interested in medical, forensic, and evolutionary genetics alike.

METHODS

YCC Cell Lines

The YCC is a collaborative group involved in an effort to detect and study genetic variation on the human NRY. The YCC was initiated in 1991 by Michael Hammer and Nathan Ellis with the following goals: (1) to establish a repository of lymphoblastoid cell lines (YCC cell line repository) derived from a sample of males representing worldwide populations, (2) to provide DNA isolated from these cell lines to investigators searching for polymorphisms on the NRY, and (3) to establish a common database containing the results of typing DNAs from the Repository cell lines at as many Y-specific polymorphic sites as possible (YCC Newsletter: http://www.ycc.biosci.arizona.edu/ycc1.html). Lymphoblastoid cell lines were established at the New York Blood Center from blood donated by volunteers who gave informed consent. Additional cell lines were donated by Luca Cavalli-Sforza, Trefor Jenkins, Judy Kidd, and Ken Kidd; or were purchased from the Coriell Institute. See Table 2 for a list of the YCC cell lines, as well as associated geographic, ethnic, and linguistic information.

Table 2.

Geographic/Ethnic Origins and Language Affiliations of YCC Cell Line Donors

Other DNA Samples

In constructing the tree, a great deal of phylogenetic information was retained from previous studies. When markers from different laboratories mapped on the same branch of the tree, an attempt was made to determine the order of mutational events. Toward this end, a variety of samples was provided by each of the participating laboratories, all of which were obtained with informed consent. These samples represented known haplogroups that were not present in the YCC cell line DNAs and thus served to map many additional markers on the haplogroup tree.

Genotyping SNPs and Indels

The protocols for genotyping many of the 237 polymorphic sites analyzed have been published (see Underhill et al. 2000, 2001; Hammer et al. 2001, and references therein); some of these assays were converted from conventional RFLPs and DNA sequence data (e.g., Jobling 1994; Hammer et al. 1997; Pandya et al. 1998; Bergen et al. 1999; Shinka et al. 1999; Bao et al. 2000). The remainder will be published in future manuscripts. Recurrent mutations, observed at SRY10831, 12f2, MSY2, M116, M64, M108, P37, and P41, are counted as distinct polymorphisms. Supplementary Table 1 (available as an online supplement at http://www.genome.org) lists all published markers included in this survey and primer information.

Terminology

The terms “haplogroup” and “haplotype” have various, overlapping definitions in the literature. Here, we use the terminology of de Knijff (2000) in which “haplogroup” refers to NRY lineages defined by binary polymorphisms. The term “haplotype” is reserved for all sublineages of haplogroups that are defined by variation at STRs on the NRY (Y-STRs). Mutations labeled with the prefix “M” (standing for “mutation”) were published by Underhill et al. (2000, 2001). Many of the mutations with the prefix “P” (standing for “polymorphism”) were described by Hammer et al. (1998, 2001). The eight recurrent mutational events are indicated by their mutation name followed by “a” or “b” (e.g., SRY10831a and SRY10831b).

Acknowledgments

The YCC wishes to thank the many people involved in this collaborative project. Following is a list of many of the contributors to this project and sources of funding.

YCC Organizers

Nathan Ellis (Memorial Sloan-Kettering Cancer Center), Michael Hammer (University of Arizona).

Genotyping

Michael Hammer (University of Arizona), Matthew E. Hurles (McDonald Institute for Archaeological Research), Mark A. Jobling (University of Leicester), Tatiana Karafet (University of Arizona), Turi E. King (University of Leicester), Peter de Knijff (Leiden University), Arpita Pandya (University of Oxford), Alan Redd (University of Arizona), Fabrício R. Santos (University of Oxford and Universidade Federal de Minas Gerais), Chris Tyler-Smith (University of Oxford), Peter Underhill (Stanford University), and Elizabeth Wood (University of Arizona). Mark Thomas (University College London) provided information on the order of the M17/SRY10831b mutations.

Cell Lines

Luca Cavalli-Sforza (Stanford University), Nathan Ellis (Memorial Sloan-Kettering Cancer Center), Michael Hammer (University of Arizona), Trefor Jenkins (University of Witwatersrand), Judy Kidd (Yale University), Ken Kidd (Yale University).

Nomenclature Committee

Peter Forster (McDonald Institute for Archaeological Research), Michael Hammer (University of Arizona), Matthew E. Hurles (McDonald Institute for Archaeological Research), Mark A. Jobling (University of Leicester), Peter de Knijff (Leiden University), Chris Tyler-Smith (University of Oxford), Peter Underhill (Stanford University).

Helpful Discussions

Stephen Zegura (University of Arizona), Matthew Kaplan (University of Arizona).

This work was supported by grants from the National Science Foundation (OPP-9806759) and the National Institute of General Medical Sciences (GM53566) to MH; from the NIH (GM28428 and GM55273) to PAU and LCS; from the BBSRC to AP; from the Leverhulme Trust to FRS; and from the CRC to CTS. MAJ is a Wellcome Trust Senior Fellow in Basic Biomedical Science (grant number 057559). The Y Chromosome Consortium thank Colin Renfrew and the McDonald Institute for Archaeological Research for running a workshop attended by the members of the nomenclature committee at which many issues were resolved in a collaborative spirit.

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.

Footnotes

  • 1 See Acknowledgments for list of Consortium members.

  • Corresponding author: Michael Hammer, Department EEB, Biosciences West, University of Arizona, Tucson, Arizona 85721, USA.

  • E-MAIL mhammer{at}u.arizona.edu; FAX (520) 626-8050.

  • Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.217602.

    • Received October 4, 2001.
    • Accepted December 4, 2001.

REFERENCES
