The Ras superfamily is a fascinating example of functional diversification in the context of a preserved structural framework and a prototypic GTP binding site. Thanks to the availability of complete genome sequences of species representing important evolutionary branch points, we have analyzed the composition and organization of this superfamily in greater detail than was previously possible. Phylogenetic analysis of gene families at the organism and sequence level revealed complex relationships between the sequence evolution of this protein superfamily and the acquisition of distinct cellular functions. Together with advances in computational methods and structural studies, the sequence information has helped to identify features important for the recognition of molecular partners and the functional specialization of different members of the Ras superfamily.
Both unicellular and multicellular organisms respond to cues expressed by other cells. In metazoans, studies of intercellular signaling during development have revealed the existence of highly conserved signaling pathways. Cellular organization and signaling are heavily influenced by the Ras superfamily of small GTP-binding proteins, which maintain a structurally and mechanistically preserved GTP-binding core despite considerable divergence in sequence and function. These GTP-binding proteins share a common enzymatic activity, producing GDP by the hydrolysis of GTP.
Ras superfamily signaling is dependent on the binding of specific effectors. Thus, minor modifications in sequence, structure, and/or cellular regulation of members of the superfamily will affect binding to regulators and consequently cell signaling. Accordingly, an important goal in studies of Ras superfamily signaling is to identify the determinants of these specific associations. The relationships between Ras superfamily proteins and their effectors have been analyzed using distinct phylogenetic approaches (Li et al., 2004; Jiang and Ramachandran, 2006; Boureux et al., 2007; Langsley et al., 2008; Mackiewicz and Wyroba, 2009; van Dam et al., 2009). To elucidate the influence of sequence evolution on Ras superfamily signaling, we have analyzed complete (or almost complete) genomes representing crucial evolutionary time points, focusing on the phylogenetic inferences gained from both species and protein trees. Using this information, we have generated a representative tree reflecting the evolutionary history of the Ras superfamily, from which we can classify the human Ras sequences. By adopting this approach to study the functional specificity of different superfamily members, we have been able to integrate the mechanistic information derived from these species and protein trees within a structural framework.
To facilitate the reading of this work we have used the following nomenclature: Ras superfamily refers to the highest organizational level that includes different protein families. Ran, Ras, Rab, Rho, and Arf refer to the specific protein families. RAS, RHO, etc. (capitalized) denote specific proteins. The G-domain refers to the structural domain common to proteins of the Ras superfamily.
The Ras superfamily
The Ras superfamily is divided into five major families: Ras, Rho, Arf/Sar, Ran, and Rab. Members of the Ras family function as signaling nodes that are activated by diverse extracellular stimuli and that regulate intracellular signaling. This signaling ultimately controls gene transcription, which in turn influences fundamental processes such as cell growth and differentiation. The human oncogenic members of the Ras family have been reviewed extensively (Karnoub and Weinberg, 2008), and in general they regulate cell proliferation, differentiation, morphology, and apoptosis. The Rho family is involved in signaling networks that regulate actin, cell cycle progression, and gene expression. In addition to cytoskeletal organization (Heasman and Ridley, 2008) and cell polarity (Park and Bi, 2007), members of the Rho family have recently been implicated in hematopoiesis (Mulloy et al., 2010), particularly the RAC protein that is involved in both canonical and noncanonical Wnt signaling (Schlessinger et al., 2009). The Rab family participates in vesicular cargo trafficking and it is by far the largest family of the Ras superfamily. Gene duplication has resulted in a large expansion of this protein family, as witnessed by the presence of duplicates in all vertebrate genomes. Rab family proteins regulate intracellular vesicular transport and the trafficking of proteins between different organelles via endocytotic and secretory pathways (Zerial and McBride, 2001). These proteins facilitate budding from the donor compartment, transport to acceptors, vesicle fusion, and cargo release. A key feature of the Rab family is the distinct intracellular distribution of its different members (Stenmark, 2009). By contrast, only one member of the Ran family is found in all eukaryotic lineages, with the exception of plants, which contain several copies. RAN proteins are the most abundant in the cell and they are involved in nuclear transport.
Finally, the Arf family of proteins comprises the most divergent proteins, which, like Rab family proteins, are involved in vesicle trafficking (Wennerberg et al., 2005). These proteins signal through a wide range of effectors, including coat complexes (COP, AP-1, and AP-3) and lipid-modifying enzymes (PLD1, phosphatidylinositol 4,5-kinase, and phosphatidylinositol 4-kinase).
Phylogeny of Ras superfamily proteins
The most recent phylogenetic reconstruction of the Ras superfamily was based on sequences obtained from the complete draft of the human genome (Wennerberg et al., 2005). The resulting tree confirmed the general organization of five families (Ras, Rab, Rho, Arf/Sar, and Ran) and pointed to the Ras family as the root of the superfamily. Previous comparisons between human, fly, yeast (Garcia-Ranea and Valencia, 1998), and plant species revealed a similar organizational structure (Li et al., 2004; Jiang and Ramachandran, 2006). However, the recent sequencing of the complete genomes of additional species now enables us to reanalyze the Ras superfamily over a broader phylogenetic range, thereby increasing the likelihood of tracing the origin of the superfamily and correctly classifying sequences that are otherwise difficult to handle. Indeed, analyzing large sets of orthologous sequences is commonly recognized as the best strategy to improve the quality of phylogenetic analysis (Nei and Kumar, 2000).
The second important reason to reassess the organization of the Ras superfamily is the recent availability of novel techniques to build multiple sequence alignments (MSAs) and trees (Kemena and Notredame, 2009), key elements in phylogenetic reconstruction (Phillips et al., 2000).
The generation of accurate MSAs is a key step in the tree-building process, which is based on estimating the similarities between sequences. The quality of multiple alignments is highly dependent upon the degree of similarity of the sequences to be aligned. Although standard methods work reasonably well when the sequence similarity is high (over 40%), very divergent sequences are difficult to align and the MSAs often contain errors (Kemena and Notredame, 2009). Thus, we used a well-established method to construct MSAs and corroborated the results using two additional independent approaches. The main alignment of 919 G-domains (Table S1) was built using the most recent version of HMMER (HMMER v3.0; Eddy, 2009) and it underwent detailed manual curation to detect incorrectly aligned regions. A profile was constructed that contained the statistical features of the amino acids occupying each position in a G-domain seed alignment that was derived from several Ras superfamily proteins obtained from the PFAM database (identifier PF00071; Finn et al., 2010). This profile was used as a template to align the other Ras superfamily sequences, optimizing their correspondence through the probability of each amino acid type being located at each position in the profile. The resulting alignment was compared with alignments obtained using two other methods known to perform reliably in cases of poor sequence similarity (MAFFT: Katoh et al., 2009; and T-COFFEE: Notredame, 2010). These methods confirmed the essential aspects of the alignment (unpublished data).
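The idea of aligning sequences against a position-specific profile can be illustrated with a toy example. The sketch below is not HMMER, which uses profile hidden Markov models with insertion and deletion states. It only shows, under simplifying assumptions (an ungapped seed, a uniform background distribution, a pseudocount of 1), how per-position amino acid frequencies can be turned into scores and used to place a segment of a new sequence. The seed strings, the query string, and all function names are illustrative choices and are not taken from the analysis described here.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(seed_alignment, pseudocount=1.0):
    """Return one dict of log-odds scores per column of an ungapped seed."""
    length = len(seed_alignment[0])
    if any(len(seq) != length for seq in seed_alignment):
        raise ValueError("seed sequences must have equal length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    total = len(seed_alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in seed_alignment)
        profile.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile

def score_segment(profile, segment):
    """Sum the column scores of a segment as long as the profile."""
    return sum(column[aa] for column, aa in zip(profile, segment))

def best_window(profile, sequence):
    """Return (score, start) of the best-scoring ungapped window."""
    width = len(profile)
    if len(sequence) < width:
        raise ValueError("sequence shorter than profile")
    return max(
        (score_segment(profile, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )

# Toy P-loop-like seed strings, chosen for illustration only.
SEED = ["GAGGVGKS", "GDGACGKT", "GDSGVGKS", "GLDAAGKT", "GDGGTGKT"]
profile = build_profile(SEED)
score, start = best_window(profile, "MTEYKLVVVGAGGVGKSALTIQ")
```

In this toy case the best window starts at the glycine that opens the P-loop-like motif. A real profile method additionally models gaps and uses empirically derived background frequencies, which is why manual curation and cross-checking with independent aligners remain useful.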
It should be noted that although various members of the Ras superfamily contain additional protein domains (e.g., BTB domains, N-terminal anchoring regions; see Figs. S2 and S5), our phylogenetic analysis corresponded exclusively to the G-domain, which contains the basic functional and historical core that is common to the superfamily.
The MSA was used as the basis for the phylogenetic analysis. Phylogenetic reconstruction based on protein sequences presents significant challenges that remain to be fully solved (Cavalli-Sforza and Edwards, 1967; Page and Holmes, 1999). Currently, the best approach involves the integration of thousands of trees using an appropriate statistical framework capable of handling the associated probabilities (based on Bayesian statistics and inference; Holder and Lewis, 2003; Ronquist and Huelsenbeck, 2003; Lartillot et al., 2007). The final tree represents the consensus of thousands of carefully chosen independent trees obtained after detailed matching of the probabilities of multiple combinations of branches. This final tree is the most probable representation of the evolutionary history from a statistical viewpoint (Holder and Lewis, 2003).
The statistical properties derived from evaluating thousands of alternative trees are reflected in the values associated with the tree branches. These values represent the statistical confidence in the grouping of the sequences under that branch point, i.e., the probability of that particular grouping of sequences being correct.
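The meaning of these branch values can be made concrete with a small sketch. It assumes that each sampled tree has already been reduced to the set of groupings (clades) it contains, with each clade written as a set of tip names. Parsing real MrBayes output is omitted, and the tip names and function names are hypothetical. The support of a clade is then simply the fraction of sampled trees in which it appears.

```python
from collections import Counter

def clade_support(sampled_trees):
    """Map each clade (frozenset of tip names) to the fraction of trees containing it."""
    counts = Counter()
    for clades in sampled_trees:
        counts.update(set(clades))  # count each clade at most once per tree
    n_trees = len(sampled_trees)
    return {clade: count / n_trees for clade, count in counts.items()}

def majority_rule(support, threshold=0.5):
    """Keep only the clades found in more than `threshold` of the trees."""
    return {clade: p for clade, p in support.items() if p > threshold}

# Four hypothetical sampled trees over four tips.
common = {frozenset({"RAS", "RAP"}), frozenset({"RHO", "RAC"})}
odd = {frozenset({"RAS", "RHO"}), frozenset({"RAP", "RAC"})}
support = clade_support([common, common, common, odd])
consensus = majority_rule(support)
```

Here the grouping of RAS with RAP appears in three of the four sampled trees, so its support is 0.75 and it is retained in the majority-rule consensus, whereas the alternative grouping of RAS with RHO (support 0.25) is not.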
The trees presented were generated with MrBAYES3, the most recent implementation of tree-building methods based on Bayesian statistics (Ronquist and Huelsenbeck, 2003), which is considered to be the best in the field (Hall, 2005; Gaucher et al., 2010). The downside of the increased accuracy of these new methods is their high computational demands. In the case of the Ras superfamily in particular, building trees from a starting alignment of more than 900 divergent sequences is not feasible, even for large supercomputers (Wang et al., 2011). Thus, we used an alternative procedure that selects key organisms and representative sequences to build independent trees for each of the five distinct families of the Ras superfamily (Ras, Rho, Rab, Ran, and Arf).
Compiling the Ras superfamily.
The current study is based on Ras superfamily proteins directly related to the human Ras proteins, as compiled by Wennerberg et al. (2005). The updated human Ras superfamily contains 167 human proteins: 39 Ras proteins, 30 Arfs, 22 Rhos, 65 Rabs, and 1 Ran family sequence (Table S2). This list includes 10 “unclassified” sequences and for 5 of these sequences, there is only evidence that they exist at the transcriptional level (Table S2, listed as “Unclassified”).
Orthologous sequences correspond to genes separated by species divergence, as opposed to paralogous sequences that are generated by gene duplication. Using the dedicated InParanoid resource (v.4.0; Ostlund et al., 2010), we identified a total of 766 sequences from 11 organisms (excluding human sequences) that correspond to orthologues of the 167 human proteins in the Ras protein superfamily (orthologues were obtained from various databases; the correspondence between identifiers is given in Tables S3 and S4). These 11 species were selected based on their correspondence to relevant moments in eukaryotic evolution (Fig. 1). Important speciation events were represented by the inclusion of Ras superfamily sequences from the corresponding genomes (Plantae-Animalia and Radiata-Bilateria by A. thaliana and N. vectensis, respectively), or the different Chordata lineages, represented by ascidians (C. intestinalis) and lancelets (B. floridae). Well-annotated genomes are available for some of the species included (e.g., H. sapiens, M. musculus, S. cerevisiae, S. pombe, A. thaliana, D. melanogaster, and C. elegans), whereas for others the genome remains poorly annotated (e.g., P. falciparum) or is only available in a draft format (e.g., N. vectensis, C. intestinalis, B. floridae, and X. tropicalis).
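The distinction between orthologues and paralogues can be illustrated with a reciprocal-best-hit sketch. This is a deliberate simplification: it covers only the seed step of pairwise orthology tools and ignores the subsequent clustering of in-paralogues and the confidence scoring that such tools add. The similarity scores, protein labels, and function names below are invented for illustration.

```python
def best_hits(scores):
    """Map each query to its highest-scoring target.

    scores: dict keyed by (query, target) with a similarity score as value.
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)

# Hypothetical scores between two proteomes (species A vs. species B).
a_vs_b = {
    ("HRAS", "RAS1"): 310, ("HRAS", "Rheb"): 90,
    ("KRAS", "RAS1"): 320, ("KRAS", "Rheb"): 85,
    ("RHEB", "Rheb"): 280, ("RHEB", "RAS1"): 95,
}
b_vs_a = {
    ("RAS1", "KRAS"): 320, ("RAS1", "HRAS"): 310, ("RAS1", "RHEB"): 95,
    ("Rheb", "RHEB"): 280, ("Rheb", "KRAS"): 85,
}
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

With these invented scores, two proteins from species A both point to the same protein in species B, but only one of them is the reciprocal best hit. The other is left unpaired at this step, which is the situation that arises when a gene has duplicated in one lineage after speciation.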
The set of orthologous sequences, along with the estimated point in evolution at which they have diverged, is depicted in Fig. 1. The Ras family is entirely absent from plants (A. thaliana), in which the remaining subfamilies are the signaling members of the superfamily (Yang, 2002), while no Rho family orthologues are found in alveolates (P. falciparum). Because our study of the Ras superfamily uses human sequences to retrieve the corresponding orthologues in other species, sequences from these species that are not present in humans may not have been included. Although ascidians would be expected to have a similar number of RAS proteins as humans, based on phylogenetic estimates, there is a noticeable decrease in the number of Ras superfamily orthologues for this organism, probably due to the loss of ancestral genes (Hughes and Friedman, 2005). Similar findings are obtained in coelomates and cnidarians as fewer orthologues are found than in the sea anemone and N. vectensis, although more orthologues are detected for the cnidarian than in the coelomate species. This finding is not unexpected, as gene content and genomic structure have been preserved between N. vectensis and vertebrates (Putnam et al., 2007), whereas extensive gene loss has occurred in the fruit fly and nematodes (Technau et al., 2005).
In some species additional gene-duplication events have produced an accumulation of paralogous sequences, which results in variation between species in terms of the numbers of each Ras superfamily member, as evident in the corresponding family trees (see Figs. S1–S5). For example, three copies of the Ran family sequences were detected in A. thaliana whereas only one was found in the other species analyzed (Figs. 1 and 2; Fig. S1).
Rho family proteins expanded extraordinarily in plants (Yang, 2002), and although plant RACs are homologues of RAC, RHO, and CDC42 (Fig. S2), the expansion of RAC in plants after speciation has resulted in the generation of a larger number of RAC proteins (RAC1–RAC11) than in other organisms. The ancestral duplications of RAC in fungi/metazoans led to the appearance of CDC42, which controls cell polarity, and RHO, which is implicated in cytokinesis (Jaffe and Hall, 2005). The CDC42 protein, which promotes the formation of actin microspikes and filopodia, is conserved in all the lineages except plants. Interestingly, the absence of Rho genes in alveolates suggests that other proteins fulfill its role in cell polarity and cytokinesis.
The gene duplication that generated the Ras family proteins in vertebrate genomes (H-Ras, K-Ras; group 7a in Fig. S3) is another example of how variation in the number of Ras superfamily proteins arises (Fig. S3). Indeed, although they are present in Xenopus, the evolutionarily more ancestral genomes only contain one copy (LET60 in C. elegans and RAS1 in D. melanogaster) that is involved in embryogenesis (Ezer et al., 1994). Fungal orthologous sequences (Fig. S3) are only found for RHEB (group 12), RAP (group 4), and RRAS (group 8). Further duplications of these sequences after speciation yielded the MRAS and KRAS groups. Together, these data are consistent with major gene duplication in vertebrates (Kondrashov et al., 2002).
The vertebrate branch of the Rab protein family has expanded considerably (Fig. S4; see also Mackiewicz and Wyroba, 2009). We found representatives of each of the different groups of Rab proteins in all lineages, indicating that the appearance of this family was an important evolutionary event. The implication of Rab genes in a complex network of vesicular trafficking events suggests a relationship with multicellularity that merits further investigation. The Arf family of proteins (Kahn et al., 2006) is the most divergent member of the superfamily and it is associated with recurrent duplication events (Fig. S5).
Representative phylogenetic tree of the Ras superfamily.
As it was not feasible to generate a full tree with all the sequences orthologous to the proteins of the human Ras superfamily, we generated independent trees for each of the protein families using the sequences from the 12 proteomes (see Figs. S1–S5) and for each of the species. This information was then used to select the sequences that best represented the diversity of species and branches in the tree.
Specific criteria for sequence inclusion were applied to select stable and representative groups from each tree. Thus, information regarding the function of the sequences selected was necessary, as well as that related to any clear orthologues in each of the species analyzed. The selection aimed to respect the variability of the sequences included in each tree (see Figs. S1–S5, which show the groups selected to build the representative tree). For instance, as the Ras family is not represented in plants and alveolates, we selected groups with at least a yeast homologue. Moreover, because the RHEB proteins (Fig. S3, group 12) constitute the most basal branch present in all the organisms of interest, and RAP1 proteins (Fig. S3, group 4) correspond to the most basal group with a yeast orthologue, these two groups were chosen to represent the Ras family in the representative tree.
Given the large number of sequences in the Rab family, we selected two stable groups that covered the phylogeny of the whole family (Fig. S4, stars). RAB7 was chosen on the basis of its well-characterized involvement in Golgi late endosomal transport. RAB7 lineages emerged before the divergence of plants (Mackiewicz and Wyroba, 2009) and this protein is present in amebozoans and ciliates. Analogously, RAB1 (Fig. S4, stars) was selected due to its position in the tree in all species. All the Ran family sequences were included. For the Arf family (Fig. S5, stars) we included the ARD-1 (also known as TRIM23, an Arf-related protein involved in ubiquitination; Mishima et al., 1993) and SRPRB (signal recognition particle receptor subunit β) proteins. These groups cover the phylogenetic range of this family.
Difficult sequences (i.e., those not classified by Wennerberg et al., 2005) were also included for those species in which orthologues corresponding to human sequences could be identified. In addition, we included sequences from the elongation factor Tu (EFTU) family, a distant relative with a G-domain that was used as a guide to situate the root of the Ras superfamily tree. The inclusion of distant sequences (known as outgroups) to define the tree is an accepted procedure to trace the origin of protein families (Nei and Kumar, 2000).
The superfamily tree (Fig. 2) includes 165 sequences, of which 22 are human (Fig. 2, underlined sequences). The remaining sequences belong to 11 species, and 3 additional non-Ras superfamily sequences are included. The first observation is the clear separation of the five main families, which group together with a high degree of confidence. Indeed, of the thousands of independent trees constructed using this procedure, over 80% exhibited the same general tree structure. The clear separation between families is a distinctive feature of the Ras superfamily that is not commonly observed in other large superfamilies, such as protein kinases.
The inclusion of an outgroup (in our case, the EFTU proteins) appears to clarify the overall topology of the superfamily. This new phylogeny places the SRPRBs (Fig. 2, group 1) as the founder members of the Ras superfamily. There is significant support for this structure, even though it differs from that described in previous analyses based on human Ras superfamily sequences alone (Wennerberg et al., 2005). The SRPRB protein is located at membranes and it is an essential component of the signal recognition receptor, which ensures that nascent secretory proteins are targeted to the endoplasmic reticulum (ER) membrane system. This suggests that the original function of the superfamily is related to specific signaling events involving membrane structures.
The human-only Ras superfamily phylogenetic tree.
Our most up-to-date representation of the human sequences tree is shown in Fig. 3. As discussed in the previous paragraph, generating an accurate tree of the human sequences requires that their relationship with sequences from other species be taken into account. Thus, we used the topology of the complete species tree represented in Fig. 2 to organize the rest of the human sequences. Additionally, SRPRBs were used to root the human tree based on the information from the representative tree.
Of the 167 defined sequences (Table S2), some genes present alternative splice variants with identical sequences in their G-domains. As we only compared the precise G-domain region, these identical sequences were removed before alignment. The remaining 151 human Ras superfamily sequences were aligned and a phylogenetic analysis was performed following the procedure described in the previous section. The main families were divided into stable groups, as evidenced by the high confidence values (Fig. 3, gray dots). It is clear that this superfamily has experienced extensive gene duplication, with the Rab family representing the most abundant family in humans (Fig. 3, red background). Previous phylogenetic analyses (Schwartz et al., 2007) divided the Rab family into eight functional groups (Pereira-Leal and Seabra, 2001), whereas a more recent study proposed nine groups (Stenmark, 2009). Incorporating our data with the corresponding subcellular localization described previously (Stenmark, 2009), we can generate a phylogenetic distribution of the Rab repertoire, as shown in Fig. S4. The colored branches represent the functional family to which each particular protein belongs (Stenmark, 2009), and the numbers in brackets represent the previous groupings (Schwartz et al., 2007), in which 14 divisions were established. We found some discrepancies (Fig. S4, group numbers labeled with an asterisk) with the earlier phylogeny proposed (Schwartz et al., 2007), probably due to the inclusion of additional species in our analysis. For instance, group number 10 (Fig. S4, beige branches) contains RAB18, which is traditionally assigned to an independent family (Stenmark, 2009). However, in our analyses RAB18 was grouped within Stenmark’s “RAB3” group (Fig. S4, red branches). A similar situation occurred with “RAB1” (dark brown) and “RAB28” (dark orange).
A previous phylogenetic analysis of the Arf family described 11 groups: the Arfs, ARL1-6, ARL-8, ARL10/11, ARFRP, and SAR (Li et al., 2004). However, this study included none of the ARD proteins (Arf-like proteins also known as the TRIM23 group; Kahn et al., 2006). We expanded the phylogenetic analysis to include the ARD (TRIM23) proteins, which in our analysis had a high probability of grouping with the ARF proteins (a significant confidence value of the corresponding tree branches). Interestingly, this group contains multidomain proteins with a Ring domain (SMART), a protein interaction domain that shows E3 ubiquitin–protein ligase activity, Zf_boxes that are also protein interaction domains at the N-terminal region (PFAM PF00643), and an ARF domain similarity region in the C-terminal region. The presence of these domains may point to specific functions related to ubiquitination and binding to targets such as DNA, RNA, and proteins.
The analysis of sequences from species not previously included in other studies allowed us to reassess the earlier classifications. For instance, although the NKIRAS (KBRAS) protein (Fig. S3, group 1) was believed to be human specific (Jiang and Ramachandran, 2006), our results indicate that this protein is present in all the eukaryotic lineages except fungi and Plasmodium.
The overall classification of the family including information from divergent species may help to elucidate the role of some of the less well-known sequences in the superfamily. For example, the analysis reveals a putative relationship between the MIRO and RAYL sequences within the Rho family, despite the fact that MIRO proteins are considered to be an independent family as they regulate mitochondrial rather than cytoskeletal dynamics (Colicelli, 2004). The position of these proteins within the Rho family (Fig. 2, group 8) suggests that some functional diversification (sub-functionalization) has occurred, although this may also point to a common original mode of action that was later co-opted to perform distinct cellular roles.
The present findings provide some insight into the potential functions of some of the superfamily proteins, for instance Rab-like proteins (named “like” because they are similar to Rab) that lack the conventional C-terminal modification site. Although the role of the RABL5 protein remains a mystery, RAYL protein is thought to be a cell cycle–related protein (Qin et al., 2007), consistent with its phylogenetic placement with CDC42 proteins (Fig. 2, group 6). RABL2A/B (Kramer et al., 2010) are Rab-like proteins mapping to the subtelomeric region of chromosome 22q13 (Wong et al., 1999), suggesting that in humans at least, duplicate genes actively express proteins of as yet unknown function in telomeric regions. Curiously, in our phylogeny these proteins grouped with the Ran protein family (Fig. 2, groups 10 and 9). Because the RABL2 proteins are metazoan specific, they may have originated from duplications of Ran proteins after speciation events. The function of RABL3 is also unknown, although it was recently implicated in regulating the proliferation and motility of human cancer cells (Li et al., 2010). Future analysis of additional sequences may clarify certain aspects of this classification.
A particularly interesting case is that of RAG proteins (Sekiguchi et al., 2001) that are believed to be part of the Ras superfamily based on the presence of a GTP domain. Although their structure and overall sequence similarity show that they contain a G-domain, we found that the lack of conservation and the unusual composition of the GTP-binding sites, otherwise conserved in the superfamily, cast doubt on their classification as a family of the Ras superfamily. Future studies of this group of proteins using a larger set of sequences from new complete genomes will be required to confirm their classification.
Functional specificity of the proteins in the Ras superfamily
The large functional diversity of the Ras superfamily is perplexing. The conserved 3D structure of the G-domain that is common to the entire superfamily allows these proteins to preserve large structural similarity and common biochemical properties while they recognize their individual binding partners with remarkable affinity and specificity (Colicelli, 2004; Wennerberg et al., 2005). The promiscuity and diversity in this superfamily are illustrated by the multiple upstream (regulators) and downstream proteins to which Ras superfamily proteins bind (Bishop and Hall, 2000; Karnoub and Weinberg, 2008; Table S5). The list of these interacting proteins is continually growing and a plethora of functions have been attributed to both effectors and regulators. Moreover, some GTPases share effectors despite performing distinct functions, leading to another level of regulation within this family (Kiel et al., 2007; Barnekow et al., 2009).
The GTP-binding site is made up of a core of essential residues that also participate in the conformational changes linking GTPase activity to effector binding. The specific distribution of these residues and the differential sequence conservation within families determine the specificity of the association between Ras superfamily proteins and their effectors, interactions that ultimately determine the biological activity of the protein. It is important to note that although structural and sequence-specific features are clearly correlated with function, other factors work together to influence the network of Ras superfamily interactions, such as gene expression of the proteins, and the regulation and acquisition of domains that determine subcellular localization (Rodriguez-Viciana et al., 2004; Goldfinger, 2008).
We compared and contrasted the families in the Ras superfamily to identify the residues in the G-domain and to determine any differences that may underlie their specific interactions with effectors. These positions are considered specificity-determining positions (SDPs), as they provide information regarding the branching of the phylogenetic tree (del Sol et al., 2003), also known as tree determinants, and they are associated with ligand-binding sites and protein interaction regions (Rausell et al., 2010). We analyzed the Ras superfamily to identify family-specific residues, using the complete sequence alignment of 919 G-domains, including representatives from each family of the 12 genomes. This analysis was performed using a recently described unsupervised approach (Rausell et al., 2010) that is based on multiple correspondence analysis. This technique represents the sequences of the alignment as vectors (Fig. S6), and their organization is optimized into groups according to their similarities and differences. The groups of sequences resulting from the association of similar vectors with a k-means algorithm are equivalent to the branches of the phylogenetic tree. The advantage of this procedure is that it allows the characteristic residues (SDPs) that dictate the organization of the sequences into groups to be extracted directly. In this way the MSA is transformed into a catalog of groups of sequences with their associated characteristic amino acid and positions.
For example, the sequences corresponding to the Rho, Rab, and Ran families have a characteristic conserved Asp residue within the G1 motif (position 7 in the alignment, corresponding to residues 13, 16, and 18 for the Rho, Rab, and Ran structures indicated in Table 1, respectively). By contrast, the Arf family has a Leu residue in this motif (position 25 for Arf PDB:1HUR), whereas Ras family sequences are not highly conserved (position 11 for RAS PDB:121P). Completely conserved residues (i.e., those binding GTP) cannot be assigned to any particular group of sequences (for more details on this method, see Rausell et al., 2010).
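The notion of a position that is conserved within families but differs between them can be sketched in a few lines. The example below does not implement the multiple correspondence analysis and k-means clustering described above. It substitutes a much simpler criterion and assumes that the family assignment of each sequence is already known: a column is reported when every family is fully conserved at that column and at least two families carry different residues. The short aligned strings and all names are invented for illustration.

```python
from collections import Counter

def column_consensus(sequences, pos):
    """Return the most frequent residue at a column and its frequency."""
    counts = Counter(seq[pos] for seq in sequences)
    residue, n = counts.most_common(1)[0]
    return residue, n / len(sequences)

def find_sdps(groups, min_conservation=1.0):
    """Return {column: {family: residue}} for family-discriminating columns.

    groups: dict mapping a family name to a list of aligned, equal-length strings.
    """
    length = len(next(iter(groups.values()))[0])
    sdps = {}
    for pos in range(length):
        consensus = {}
        for family, sequences in groups.items():
            residue, fraction = column_consensus(sequences, pos)
            if fraction < min_conservation or residue == "-":
                break  # not conserved (or gapped) within this family
            consensus[family] = residue
        else:
            # Conserved within every family; keep it only if families differ.
            if len(set(consensus.values())) > 1:
                sdps[pos] = consensus
    return sdps

# Toy aligned fragments, two per hypothetical family.
toy_groups = {
    "Rho": ["GDGAC", "GDGAV"],
    "Arf": ["GLDAA", "GLDAS"],
}
toy_sdps = find_sdps(toy_groups)
```

In this toy input, column 0 is identical in both families and is therefore not reported, in line with the observation that completely conserved residues cannot be assigned to a particular group. Columns 1 and 2 are reported, and column 4 is skipped because one family is not conserved there.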
The summary of sequence features is presented in Table 1 and Fig. 4. The analysis of SDPs provided independent evidence for the classification of RABL3, RABL5, RAYL, MIRO, and SRPRB proteins (Tables S7 and S8). Significantly, the SDP pattern of these proteins did not correspond well with the characteristic amino acids in the classical families (Table S8). In particular, the conservation of the RABL3, RABL5, and SRPRB sequences reflects important differences at key positions within the groups to which they are assigned, and this is more consistent with their organization as independent groups and it is in agreement with the rooted phylogenetic tree. By contrast, the low intensity SDP signal in the RAYL and MIRO groups (Fig. S7) suggests that they constitute a peculiar group within the Rho family. Additional genomic information will be required to further study this group.
To further advance our analysis, we focused on the interaction of these SDP residues with the nucleotide-binding pocket, their association with specific binding partners, and their role in communicating between the GTP-binding site and other functional areas of the Ras superfamily (Fig. 5).
SDPs involved in GTP/GDP hydrolysis.
Ras superfamily proteins generally undergo an enzymatic cycle that involves the so-called loaded-spring mechanism, whereby the switch regions relax after release of the hydrolyzed γ-phosphate in the active state, adopting an open-inactive GDP conformation (Vetter and Wittinghofer, 2001). In terms of their 3D structure, the G1–G5 loops form the nucleotide-binding site, with an interface that is responsible for nucleotide specificity and affinity, and that regulates GTP hydrolysis. The SDP residue Val14 reportedly forms hydrogen bonds with the phosphate groups of the nucleotides (Tong et al., 1991). This residue, together with Ala11 and Pro34, is located in regions characterized as hinges that act in the conformational transition between GTP and GDP forms (Díaz et al., 1995, 1997; Futatsugi and Tsuda, 2001), thereby modulating GTP hydrolysis (see Table 1 for the equivalence of these residues in the various families). The residues in switch II (Thr58 and Ala59) may influence nucleotide cycling, in which the movement of neighboring side chains plays a key role, as demonstrated by x-ray diffraction, NMR, spectroscopy, and MD studies (Noé et al., 2005; Gorfe et al., 2008; Lukman et al., 2010). Thus, mutating these residues may alter the structure or dynamics of the system, favoring either active or inactive states, as occurs with oncogenic mutations such as G12V (Gorfe et al., 2008).
SDPs involved in interactions with different binding partners.
Ras superfamily proteins are very specific when transmitting signals to their partners, as illustrated by the extent to which single amino acid changes can alter their individual specificities (Stenmark et al., 1994; Azuma et al., 1999; Bauer et al., 1999; Karnoub et al., 2001). As a general rule, conformational changes induced by nucleotide states are transmitted to the switch I and switch II regions, where they are recognized by the corresponding effectors (Vetter and Wittinghofer, 2001). In addition, the interacting interface typically includes residues outside the canonical switch regions (Heo and Meyer, 2003; Fuentes and Valencia, 2009), as confirmed by the structures of the Ras–effector complexes (Corbett and Alber, 2001).
To define the SDP residues involved in protein–protein interactions, we inspected the interacting surfaces of a set of complexes within the Ras superfamily using a previously described method (Corbett and Alber, 2001; Biou and Cherfils, 2004; Dvorsky and Ahmadian, 2004; see Table S5 and Fig. S8). It should be noted that the information pertaining to binding surfaces is necessarily incomplete, due to the limited structural characterization of Ras superfamily protein-interacting partner complexes. We considered the static features of the structures, as well as the potential plasticity of neighboring residues (Sprang, 2000). This plasticity of interacting and neighboring residues is important for the binding of Ras superfamily proteins to different partners, and in distinguishing related members within a family (Cherfils et al., 1997).
We analyzed the distribution of SDP residues (Table 1) at the interfaces that interact with both GEF and GAP proteins (Fig. S9). In the case of GAPs, most Ras superfamily proteins interact by means of at least one residue at switch I, which occupies position 85 in the full sequence alignment (Table S6). However, the Rab protein family is an exception as it uses a completely different set of interacting residues, possibly related to other interactions in the context of larger complexes (Goldberg, 1998). Both Rho and Ran families selectively use residues located in their G1 box (residues 13 and 18, respectively), in switch I (residue 39 of RHO), and the β-sheet adjacent to switch II (residue 91 of RAN; Fig. S9).
In the case of binding to GEFs, the binding surface forms an extended patch and the interactions differ for each family. Residues at position 94 in the full sequence alignment (Table S6) are located at the C terminus of switch I, and they are common to the Arf and Ras families (residues 51 and 37, respectively). Residues at the N terminus of switch II (residue 64 in RAN and 65 in RAB for the PDB entries 1I2M and 2FOL, respectively) form part of the recognition site. The residue at position 85 in the alignment, which lies in G2 and switch I and corresponds to residues 47, 34, 36, 39, and 41 in the Arf, Ras, Rho, Rab, and Ran structures, respectively (as indicated in Table 1), is found in the sequences of both Ras and Rho families. The residue at position 219 in the alignment, at the C terminus of the switch II region, is shared by proteins of the Ras and Ran families (residues 68 and 76 in the Ras and Ran structures, respectively, as indicated in Table 1).
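The correspondence between alignment positions and residue numbers in the representative structures can be kept as a small lookup table. The following is a minimal sketch, not the authors' procedure: it contains only the values quoted in the text above, and the names ALIGNMENT_TO_STRUCTURE and structure_residue are hypothetical. The full correspondence is given in Table 1 and Table S6.

```python
# Alignment position -> residue number in the representative structure of
# each family. Only values quoted in the text are included; a family that is
# not mentioned for a given position is simply absent (not "no equivalent").
ALIGNMENT_TO_STRUCTURE = {
    85: {"Arf": 47, "Ras": 34, "Rho": 36, "Rab": 39, "Ran": 41},
    94: {"Arf": 51, "Ras": 37},
    219: {"Ras": 68, "Ran": 76},
}


def structure_residue(position, family):
    """Return the structure residue number for an alignment position.

    Returns None when the text does not quote a value for that
    position/family pair.
    """
    return ALIGNMENT_TO_STRUCTURE.get(position, {}).get(family)


if __name__ == "__main__":
    for position in sorted(ALIGNMENT_TO_STRUCTURE):
        for family, residue in ALIGNMENT_TO_STRUCTURE[position].items():
            print(f"alignment position {position}: {family} residue {residue}")
```

A None result means only that no value is quoted here; it should not be read as the absence of an equivalent residue in that family.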
SDPs involved in communication between GTP and membrane-binding regions.
The extension of Arf, Arl, and Sar proteins at the N terminus is required for membrane anchoring (Pasqualato et al., 2002), also serving as a family-specific switch (Goldberg, 1998). Interestingly, recent studies highlighted a more general switch mechanism, involving communication between the membrane-binding domain and the GTP-binding site (Abankwa et al., 2008; Gorfe et al., 2008). Based on sequence variation, two different lobes have been defined in the canonical RAS structure: lobe 1 (comprising the first 86 conserved residues), which lines the GTP/GDP pocket; and lobe 2 (residues 87–171), which exhibits significant sequence variability and is associated with Ras anchoring to the membrane (the lobes are shown in dark gray and pale gray in Fig. 5 A). In the 3D structure, different routes of communication between these two regions have been proposed for different isoforms (Gorfe et al., 2008). To characterize residues that mediate the communication between the two lobes, we mapped the proposed GTP–membrane communication routes for the three RAS isoforms onto the canonical G-domain, based on the specific pairwise contacts for each isoform and the location of the SDP residues (Fig. 5 A). Three specific SDP residues may transfer the conformational changes required for the correct transmission of the biological signal. In the 3D structure of RAS, residues Val81 and Ala83 are located in a buried β-sheet, which is sandwiched between the two lobes that coordinate the conformational changes involved in this back-to-front communication mechanism. By contrast, Ala121 is found in a loop that is thought to weave different spatial motifs into the structure, such as the preceding α3 and α5 loops (Fig. 5 A). Although no data are available for the members of other families, the similarity of the SDP patterns suggests that similar routes connect the two regions (see Table 1 for the equivalent residues in the other families).
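The lobe boundaries given above (residues 1–86 and 87–171 in the canonical RAS structure) make it straightforward to place the three SDP residues by sequence number. The sketch below is only an illustration under that assumption; the function name lobe_of is hypothetical, and sequence position says nothing about the spatial packing described in the text.

```python
# Lobe definitions for the canonical RAS G-domain, as stated in the text.
LOBE_1 = range(1, 87)    # residues 1-86, lining the GTP/GDP pocket
LOBE_2 = range(87, 172)  # residues 87-171, associated with membrane anchoring

# The three SDP residues discussed in the text (RAS numbering).
SDP_RESIDUES = {"Val81": 81, "Ala83": 83, "Ala121": 121}


def lobe_of(residue_number):
    """Return 'lobe 1', 'lobe 2', or None for residues outside both ranges."""
    if residue_number in LOBE_1:
        return "lobe 1"
    if residue_number in LOBE_2:
        return "lobe 2"
    return None


assignment = {name: lobe_of(number) for name, number in SDP_RESIDUES.items()}

if __name__ == "__main__":
    for name, lobe in assignment.items():
        print(f"{name}: {lobe}")
```

By residue number alone, Val81 and Ala83 fall near the end of lobe 1, a few positions from the 86/87 boundary, whereas Ala121 lies within lobe 2.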
Based on the location and biological implications of the SDP residues, we conclude that the specificity of the small G protein structural module is characterized by a “canonical” nucleotide switch, multiple specific interactions, and communication with the nucleotide–membrane region. A precise balance between a rigid, high-affinity conformation and conformational flexibility is required to create such an efficient and stringent molecular switch, which may involve residues specific to individual proteins as well as distinct families.
New genomic information and improvements to phylogenetic tools have further advanced our understanding of the structure and organization of the Ras protein superfamily. These advances are evident when comparing the current superfamily tree with that initially proposed 20 years ago (Valencia et al., 1991). It is now clear that the separation of the five main families (Ras, Rho, Arf/Sar, Ran, and Rab) was an early evolutionary event that predated the expansion of eukaryotes. Although it was believed that the founding members belonged to the Ras family, our comprehensive phylogenetic analyses of selected, well-defined members of each family in representative species, using EFTU sequences as an outgroup, point to the SRPRB proteins and the Arf family as possible founding members. This implies that the original function of these proteins may have been related to the regulation of membrane trafficking in eukaryotic cells (Munro, 2005), a process potentially linked to the emergence of complex intracellular structures. The presence of representative sequences of each family in the selected genomes indicates that divergence occurred before the emergence of eukaryotes, and strongly suggests that this superfamily expanded very early to generate the functionally distinct families. It is tempting to propose that this ancestral diversification is related to the increasing complexity of intracellular eukaryotic structures. This hypothesis merits further investigation and such studies will require additional information from genomes of more distant species, more precise functional data, and possibly new computational methods.
The general phylogenetic reconstruction proposed here fits well with the known functional properties of the individual families. In some cases, early functional divergence may be related to divisions such as that of the Ran family sequences, which represent a sub-functionalization of Rab to properly achieve nuclear transport capabilities. A number of other observations relating to specific groups in some species, such as the absence of the entire Ras family in A. thaliana, or the loss of genes in coelomates (Technau et al., 2005) and their ascidian vertebrate sisters (Hughes and Friedman, 2005), illustrate the complex gene loss and gain in different lineages (Kondrashov et al., 2002). More functional studies will be necessary to marry the details of functional and evolutionary divergence.
Our review of the phylogenetic classification of the Ras superfamily has provided useful data relating to the classification of a set of more divergent members. According to our analyses, some of these members may be placed in the traditional families, as seen in our reassigning of RAYL and MIRO proteins to the Rho family, and the RAB20 and RABL2A/B proteins to the Rab family. Based on the information available, other sequences such as SRPRBs and RABL5/3 can be tentatively classified as independent families.
Phylogenetic approaches such as those described here can further our understanding of the relationships between proteins and protein groups. However, the assignment of function based on orthologous relationships should be approached with caution, and ultimately such facets should be confirmed experimentally.
The protein functions proposed here based on experimental data and phylogenetic inference can be complemented by a combination of structural analyses, and by determining the key residues in functional regions and in the mechanisms adopted by the distinct families (see PyMOL scripts in the Online supplemental material for the mapping of SDPs to structures). Within this new framework, such an approach could identify molecular landmarks that underlie the specific functional differences between families (e.g., the presence of residues such as Val81 and Ala83 in the core of the protein [PDB entry 121P]; or the presence of residues correlated with functional movement, such as Ala121 in the loop regions). These residues may be implicated in signal transmission between GTP-binding sites associated with the GTP state of the protein and the membrane-anchoring regions (N terminus of ARF, ARL, and SAR proteins in the Arf family, and the C terminus in the Ras and Ran families). Similarly, positions 156, 164, and 219 (numbered according to the full sequence alignment and mapping to the switch II region; see Table S6 for correspondence with the representative structures of each family) may be responsible for the recognition of family-specific binding partners (e.g., Arg68 and Arg76 mediate the association of RAS and RAN with GEFs, respectively; Table 1). Positions around the GTP-binding site, such as residues 13 and 18 in Rho and Ran members, respectively, appear to determine the specificity toward GAPs, and to influence the dynamic features of the GTP/GDP-binding pocket. Accordingly, these residues regulate the GTP hydrolysis/exchange mechanisms in each of these families.
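The supplemental PyMOL scripts are not reproduced here. As a rough illustration of how SDP residues might be highlighted on a structure, the sketch below builds plain PyMOL command strings for the RAS residues named above on PDB entry 121P. The function name pymol_commands and the selection names sdp_core and sdp_loop are hypothetical, and the output is an assumption about a convenient layout, not the content of the published scripts.

```python
def pymol_commands(pdb_id, groups):
    """Build PyMOL commands that select and display groups of residues.

    pdb_id: PDB entry code (e.g. "121P").
    groups: mapping of selection name -> list of residue numbers.
    Returns a list of command strings; nothing is executed here.
    """
    obj = pdb_id.lower()
    commands = [f"fetch {obj}"]
    for name, residues in groups.items():
        # PyMOL joins several residue numbers in one 'resi' term with '+'.
        resi = "+".join(str(r) for r in residues)
        commands.append(f"select {name}, {obj} and resi {resi}")
        commands.append(f"show sticks, {name}")
    return commands


# Hypothetical grouping of the SDP residues discussed in the text.
SDP_GROUPS = {
    "sdp_core": [81, 83],  # Val81 and Ala83, buried in the core
    "sdp_loop": [121],     # Ala121, in a loop region
}

if __name__ == "__main__":
    print("\n".join(pymol_commands("121P", SDP_GROUPS)))
```

The printed lines can be pasted into a PyMOL session or saved as a .pml file; equivalent residues for other families would need to be taken from Table 1.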
Our understanding of the Ras superfamily will be enhanced by future developments in this field, including the incorporation of new complete genomes, elucidation of the structures of Ras superfamily protein–effector complexes, and biophysical studies of signal transmission between GTPase and effector binding sites in individual families. Similarly, a new generation of phylogenetic methods that can accurately organize larger numbers of sequences and refined bioinformatics approaches for the study of structure–sequence relationships will advance our understanding of protein evolution and function.
Online supplemental material
Supplemental figures, tables, and a .zip file that contains PyMOL scripts are available at http://www.jcb.org/cgi/content/full/jcb.201103008/DC1.
We are indebted to Fred Wittinghofer and Anne Spang for their critical reading of the manuscript.
This work was supported by two grants from the Spanish Ministerio de Ciencia e Innovación: BIO2007-66855-E.00408 and FIS PS09/02111.