Creating a data file for GeneTree

New users of software often find it confusing to get their data into the program, and GeneTree is no exception. This tutorial explains how to do it using the phytochrome data of Donoghue and Matthews. To create the file you will need a text editor, such as the one that comes with GeneTree (go to the File menu and select New).

Step 1: Get a gene tree

Construct a gene tree from Donoghue and Matthews' data set. The data are available in TreeBase with the accession number S3x10x98c09c42c47. Load the data into PAUP and do a parsimony search. You should obtain a single tree 15344 steps long. For convenience, set the outgroup to taxa 1 and 2 and the rooting option to outgroup monophyly, then save the tree with branch lengths (the branch lengths are merely for display purposes).

Step 2: Get an initial species tree

GeneTree needs a fully resolved species tree as input. If you intend to search for optimal species trees, this starting tree can be arbitrary; otherwise you will usually want a reasonable estimate of the species phylogeny. One source is the NCBI Taxonomy database.

A simple way to create the species tree is to use TreeView. If you list the species names separated by commas and enclose the list in a pair of parentheses, followed by a semicolon, you have a "star tree":

(Physcomitrella_patens,Ceratodon_purpurea,Selaginella_martensii, Adiantum_capillus,Psilotum_nudum,Picea_abies,Pseudotsuga_menziesii, Arabidopsis_thaliana,Cucurbita_sativa,Petroselinum_crispum, Nicotiana_tabacum,Solanum_tuberosum,Glycine_max,Pisum_sativa, Avena_sativa,Oryza_sativa,Zea_mays,Sorghum_bicolor, Pinus_sylvestris,Ipomoea_nil);
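If you have the species names in a list, the star tree can also be generated rather than typed. The following is a minimal Python sketch, not part of GeneTree or TreeView; the function name star_tree is hypothetical, and it assumes that spaces in names should become underscores, as in the labels used in this tutorial.

```python
def star_tree(names):
    """Return a star tree: every name is a child of a single root."""
    if not names:
        raise ValueError("need at least one species name")
    # Assumption: labels use underscores instead of spaces, as in this tutorial.
    labels = [name.strip().replace(" ", "_") for name in names]
    return "(" + ",".join(labels) + ");"


species = ["Physcomitrella patens", "Ceratodon purpurea", "Selaginella martensii"]
print(star_tree(species))
# (Physcomitrella_patens,Ceratodon_purpurea,Selaginella_martensii);
```

The resulting string can be pasted into TreeView in the same way as the hand-written star tree.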

You can paste this into TreeView and use that program's tree editor to construct the species tree. Once you are happy with the topology, you can cut and paste it into the text editor. The species tree belongs in the TREES block:

BEGIN TREES;
	TREE  start = ((Ceratodon_purpurea,Physcomitrella_patens),(Selaginella_martensii,
		(Adiantum_capillus,(Psilotum_nudum,(((Picea_abies,Pseudotsuga_menziesii),
		Pinus_sylvestris),((Petroselinum_crispum,(Ipomoea_nil,(Solanum_tuberosum,
		Nicotiana_tabacum))),(((Oryza_sativa,(Zea_mays,Sorghum_bicolor)),
		Avena_sativa),((Cucurbita_sativa,Arabidopsis_thaliana),(Pisum_sativa,
		Glycine_max)))))))));
ENDBLOCK;

Step 3: Associate each gene with a taxon

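The third piece of information GeneTree needs is a list of which gene came from which species. This information belongs in the DISTRIBUTION block, together with the gene tree from step 1. Note that NTAX in this block is the number of sequences (30), not the number of species, and that a species with several phytochrome genes, such as Arabidopsis_thaliana, is listed once for each of its sequences.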

BEGIN DISTRIBUTION;
NTAX=30; [ 30 sequences ]
RANGE
	[ sequence name ]	      [ taxon ]
	Physcomitrella_patens 	  : Physcomitrella_patens,
	Ceratodon_purpurea 	  : Ceratodon_purpurea,
	Selaginella_martensii 	  : Selaginella_martensii,
	Adiantum_capillus 	  : Adiantum_capillus,
	Psilotum_nudum 		  : Psilotum_nudum,
	Picea_abies 		  : Picea_abies,
	Pinus_sylvestris 	  : Pinus_sylvestris,
	Pseudotsuga_menziesii 	  : Pseudotsuga_menziesii,
	Arabidopsis_thaliana_PHYC : Arabidopsis_thaliana,
	Arabidopsis_thaliana_PHYA : Arabidopsis_thaliana,
	Cucurbita_sativa_PHYA 	  : Cucurbita_sativa,
	Petroselinum_crispum_PHYA : Petroselinum_crispum,
	Glycine_max_PHYA 	  : Glycine_max,
	Pisum_sativa_PHYA 	  : Pisum_sativa,
	Nicotiana_tabacum_PHYA 	  : Nicotiana_tabacum,
	Solanum_tuberosum_PHYA 	  : Solanum_tuberosum,
	Oryza_sativa_PHYA 	  : Oryza_sativa,
	Avena_sativa_PHYA 	  : Avena_sativa,
	Zea_mays_PHYA 		  : Zea_mays,
	Arabidopsis_thaliana_PHYE : Arabidopsis_thaliana,
	Ipomoea_nil_PHYE	  : Ipomoea_nil,
	Arabidopsis_thaliana_PHYB : Arabidopsis_thaliana,
	Arabidopsis_thaliana_PHYD : Arabidopsis_thaliana,
	Nicotiana_tabacum_PHYB	  : Nicotiana_tabacum,
	Oryza_sativa_PHYB	  : Oryza_sativa,
	Solanum_tuberosum_PHYB	  : Solanum_tuberosum,
	Glycine_max_PHYB	  : Glycine_max,
	Sorghum_bicolor_PHYA	  : Sorghum_bicolor,
	Sorghum_bicolor_PHYB	  : Sorghum_bicolor,
	Sorghum_bicolor_PHYC	  : Sorghum_bicolor
	;
[ gene tree ]
TREE parsimony = ((Physcomitrella_patens:266,Ceratodon_purpurea:261):0,(Selaginella_martensii:375,(Adiantum_capillus:518,(Psilotum_nudum:263,(((Picea_abies:126,Pseudotsuga_menziesii:148):279,((Arabidopsis_thaliana_PHYC:634,Sorghum_bicolor_PHYC:568):269,((((Arabidopsis_thaliana_PHYA:413,Cucurbita_sativa_PHYA:395):120,(Petroselinum_crispum_PHYA:421,(Nicotiana_tabacum_PHYA:121,Solanum_tuberosum_PHYA:133):274):135):160,(Glycine_max_PHYA:233,Pisum_sativa_PHYA:253):177):289,(Avena_sativa_PHYA:129,(Oryza_sativa_PHYA:177,(Zea_mays_PHYA:79,Sorghum_bicolor_PHYA:41):182):104):419):308):259):237,(Pinus_sylvestris:341,((Arabidopsis_thaliana_PHYE:554,Ipomoea_nil_PHYE:480):315,((Glycine_max_PHYB:345,((Arabidopsis_thaliana_PHYB:240,Arabidopsis_thaliana_PHYD:345):300,(Nicotiana_tabacum_PHYB:100,Solanum_tuberosum_PHYB:117):273):212):201,(Oryza_sativa_PHYB:181,Sorghum_bicolor_PHYB:210):329):252):289):279):200):272):282):461);
ENDBLOCK;
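Typing thirty sequence-to-taxon pairs by hand invites typos. As a rough sketch, the RANGE entries can be derived from the sequence names with a few lines of Python. This is not a GeneTree feature: the helper names are hypothetical, and the code assumes the naming convention of this data set, in which a sequence name is the species name optionally followed by a gene suffix such as _PHYA.

```python
import re

# Assumption: gene copies are marked by a trailing _PHYA ... _PHYE suffix,
# as in the sequence names of this tutorial. Other data sets will differ.
GENE_SUFFIX = re.compile(r"_PHY[A-E]$")


def taxon_for(sequence_name):
    """Return the species name for a sequence by dropping the gene suffix."""
    return GENE_SUFFIX.sub("", sequence_name)


def distribution_range(sequence_names):
    """Return the NTAX line and RANGE list for a DISTRIBUTION block."""
    entries = [f"\t{name} : {taxon_for(name)}" for name in sequence_names]
    lines = [f"NTAX={len(sequence_names)};", "RANGE"]
    # Entries are separated by commas; a semicolon closes the list.
    lines.append(",\n".join(entries))
    lines.append("\t;")
    return "\n".join(lines)


sequences = ["Picea_abies", "Arabidopsis_thaliana_PHYC", "Arabidopsis_thaliana_PHYA"]
print(distribution_range(sequences))
```

Because NTAX is computed from the list, the count cannot drift out of step with the number of entries.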

Step 4: Putting it all together

The last piece of information is the list of species names in a TAXA block.

BEGIN TAXA;
	DIMENSIONS NTAX = 20;
	TAXLABELS
		Physcomitrella_patens
		Ceratodon_purpurea
		Selaginella_martensii
		Adiantum_capillus
		Psilotum_nudum
		Picea_abies
		Pseudotsuga_menziesii
		Arabidopsis_thaliana
		Cucurbita_sativa
		Petroselinum_crispum
		Nicotiana_tabacum
		Solanum_tuberosum
		Glycine_max
		Pisum_sativa
		Avena_sativa
		Oryza_sativa
		Zea_mays
		Sorghum_bicolor
		Pinus_sylvestris
		Ipomoea_nil
		;
ENDBLOCK;
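Presumably the labels in the TAXA block need to agree with the leaf names of the species tree from step 2, so it is worth comparing the two before opening the file. The sketch below is a simple Python check, not a full tree parser; the function names are hypothetical, and it assumes labels consist only of letters, digits, underscores and periods, as they do in this tutorial.

```python
import re


def leaf_labels(newick):
    """Return the leaf labels of a parenthesised tree, in order of appearance."""
    # A leaf label directly follows "(" or ","; anything after ")" belongs
    # to an internal node and is skipped. Branch lengths (":266") are ignored.
    return re.findall(r"[(,]\s*([A-Za-z_][\w.]*)", newick)


def compare_labels(newick, taxlabels):
    """Return (labels only in the tree, labels only in TAXLABELS)."""
    in_tree, in_taxa = set(leaf_labels(newick)), set(taxlabels)
    return sorted(in_tree - in_taxa), sorted(in_taxa - in_tree)


tree = "((Zea_mays,Sorghum_bicolor),\n\tOryza_sativa);"
taxa = ["Zea_mays", "Sorghum_bicolor", "Oryza_sativa", "Avena_sativa"]
print(compare_labels(tree, taxa))  # ([], ['Avena_sativa'])
```

Two empty lists mean the tree and the TAXA block name exactly the same species.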

Hence, the final file has this format:

#NEXUS

TAXA block (step 4)

DISTRIBUTION block (step 3)

TREES block (step 2)

You can download a copy of the finished file.
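If you keep the three blocks in separate strings or files while editing, they can be joined in the order shown above with a short script. This is only an illustrative Python sketch; the function name assemble_nexus and the output file name are hypothetical, and the only check it makes is that each piece starts with BEGIN.

```python
def assemble_nexus(taxa_block, distribution_block, trees_block):
    """Join the blocks in the order used in this tutorial, after #NEXUS."""
    blocks = [taxa_block.strip(), distribution_block.strip(), trees_block.strip()]
    for block in blocks:
        # Light sanity check only; this does not validate the block contents.
        if not block.upper().startswith("BEGIN "):
            raise ValueError("block does not start with BEGIN: " + block[:30])
    return "#NEXUS\n\n" + "\n\n".join(blocks) + "\n"


taxa = "BEGIN TAXA;\n\tDIMENSIONS NTAX = 2;\n\tTAXLABELS\n\t\tA_a\n\t\tB_b\n\t\t;\nENDBLOCK;"
dist = "BEGIN DISTRIBUTION;\nNTAX=2;\nRANGE\n\tA_a : A_a,\n\tB_b : B_b\n\t;\nTREE g = (A_a,B_b);\nENDBLOCK;"
trees = "BEGIN TREES;\n\tTREE start = (A_a,B_b);\nENDBLOCK;"

# "example.nex" is a placeholder file name.
with open("example.nex", "w") as handle:
    handle.write(assemble_nexus(taxa, dist, trees))
```

The toy blocks here only show the layout; substitute the blocks built in steps 2 to 4.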

Step 5: Open the file in GeneTree

You can now save the file and open it in GeneTree. If all goes well, you will see your gene and species trees displayed in two windows. If GeneTree encounters errors, it will do its best to locate them in the data file for you to fix.