This Rnotebook will guide you to generate the sequence alignment and substitution rates shift pattern for each DNA element, as well as the average probability of being acceleration and conservation state per branch over all input elements. It will need:

  1. A rooted phylogenetic tree in Newick format
  2. Output files from PhyloAcc: xxx_species_names.txt, xxx_elem_lik.txt, xxx_rate_postZ_xxx
  3. (To plot sequence alignment) Concatenated alignment file (*.fasta) of all elements same as input to PhyloAcc
  4. (To plot sequence alignment) A bed file specify the position of each element within the concatenated alignment same as input to PhyloAcc
  5. All the R functions for this notebook are in drawAlign_function.R.

Read in tree data

Inputs to prepare_data are:

  1. tree_path takes a file (.mod) contains a rooted phylogenetic tree in Newick format including the topology and branch lengthes, which could be output from phyloFit (PHAST);

  2. species_name takes a file output by PhyloAcc (xxx_species_names.txt) containing the species names in the same order as the columns in xxx_rate_postZ_xxx files.

  3. (Optional) a file with common name of the species to be shown in the plots: the argument common_name is optional which takes a file with abbreviation of species name appeared in the sequences and tree files in the first column and species common name shown on the plot in the second column, and extra columns (not used in the scripts) such as full species names.

The function prepare_dat will output a list containing the tree and names etc. as input to plotZPost and plotAlign to generate plots.

treeData <- prepare_data(tree_path = "../Data/ratite/neut_ver3_final.named.mod", species_name = "../Data/ratite/species_names.txt", common_name = "../Data/ratite/birdname2.txt")

Generate evolutionary pattern and sequence alignment for one element from PhyloAcc outputs

plotZPost function will show the shift pattern of substitution rates on the phylogengy. It will need:

  1. xxx_elem_lik.txt: output by PhyloAcc containing log-likelihoods under different models for each element.

  2. xxx_rate_postZ_xxx: output by PhyloAcc having the posterior medians of conserved and accelerated substitution rates and the posterior probability of Z on each branch for different models. In the following example, we will use the result from xxx_rate_postZ_M2.txt and plot the conservation states under the full model.

#### read in BF scores as well as marginal log likelihood under null, accelerated and full model #### 
score <- read.table("../example_output/simu_500_200_npc_2-6_elem_lik.txt", header=T)

## order score by BF1
score <- score[order(-score$logBF1),]

#### read in posteriors of substitution rates and latent conservation states ####
postZ <- read.table("../example_output/simu_500_200_npc_2-6_rate_postZ_M2.txt", header=T, check.names = F)  

Select an ratite-accelerated element (e.g. logBF2 > 1 and large BF1) and plot:

sel <- which(score$logBF2 > 1)[1] # select the element with largest BF1 and BF2 > 0
lk = score[sel, ]
k = score[sel, 1] # get the No. of the selected element
targets = c("strCam","rhePen","rheAme","casCas","droNov","aptRow","aptHaa","aptOwe","anoDid") # target species
Z = unlist(postZ[postZ$No. == k, -1]) # get the posteriors of conservation states
tit = paste("logBF1:", round(lk$logBF1), "logBF2:",round(lk$logBF2), "  ") # use BF scores and posterior substitution rates as title
plotZPost(Z, treeData, target_species=targets, tit=tit, offset=5,cex.score = 2) # offset= 6 indicates the posterior of Z start from 7th column

plotAlign function will show the sequence alignment for an element as a heatmap. It will need 1) a bed file and 2) sequence alignments.

In the example bed file for simulation, the 1th column is element name; the 2nd and 3rd columns are start and end positions in the alignment for each element; the 6th and 7th columns are conserved and accelerated rates to generate the DNA sequences for that element.

bed <- read.delim("../Simulation_ratite/simu_500_200_diffr_2-6.bed", header=F)
fasta <- read.alignment(file = "../Simulation_ratite/simu_500_200_diffr_2-6.fasta", format = "fasta")  
align <- as.matrix(fasta)
align <- align[treeData$tree$tip.label,]  # reorder species in the alignment to be the same as tips of the tree. The name of the species in the alignment file has to the same as in the tree!

To plot the substitutions (as well as indels and unknown base pairs ‘N’) of the kth element across species,

plotAlign(k, align, bed, treeData, target_species=targets)