One way that programming helps us understand biology
Rob Lanfear
Ecology & Evolution Australian National University
What is phylogenetics?
actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac
! # $ %
What is partitioning?
actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac
! # $ %
actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac
Partitioning can improve inference
! # $ %
! # $ %
So whats the problem?
! # $ %
! # $ %
! # $ %
! # $ %
! # $ %
! # $ %
! # $ %
! # $ %
! # $ %
! # $ %
! # $ %
! # $ %
! # $ %
! # $ %
! # $ %
! # $ %
! # $ %
! # $ %
! # $ %
! # $ %
Lets solve the problem
A Simple Algorithm
Compare all possible partitioning schemes.
1. Calculate the AIC score of every possible partitioning scheme.
AIC (Akaikes Information Criterion) measures how far a model is from the truth.
2. Use the scheme with the smallest AIC score
https://github.com/brettc/partitionfinder/blob/master/partfinder/submodels.py
Now we find a new problem
! # $ %
! # $ %
! # $ %
! # $ %
! # $ %
! # $ %
! # $ %
! # $ %
! # $ %
! # $ %
How many schemes are there?
Number of genes (n)
Number of possible schemes (Bn)
Our first approach wont work for large datasets.
What can we do?
OPTMISE!
s1 s2 s3 s4 s5 s6 s7
s10 s3 s4 s5 s6 s7 s8 s9
Analysing EVERY scheme EVERY time is inefficient Because we analyse the same subsets over and over
To calculate the AIC of all possible partitioning schemes,
we only need to calculate the AIC of all possible subsets of genes
Then we can just add these together to get the AIC of the schemes
s8 s9
s1 s2 s3 s4 s5 s6 s7
s10 s3 s4 s5 s6 s7 s8 s9
How do we count the number of unique subsets? Combinatorics!
For example, if we start with 20 genes
There are 51,724,158,235,372 possible partitioning schemes but these are made up of just 1,048,575 possible subsets Thats a time saving of 99.99998%!!!!!
s8 s9
How many subsets and schemes?
1E+14 1E+13 1E+12 1E+11 1E+10 1E+09
1000 00000 1000 0000 1000 000 1000 00 1000 0 1000 100 10 1
Number of schemes
Number of subsets (exhaustive search)
0 5 10 15 20 Number of genes
N
Converting schemes to subsets
Using our recursive function from earlier, we can get all the schemes as a list of lists:
s1 s2 s3 s4 s5 s6 s7
s10 s3 s4 s5 s6 s7 s8 s9
all_schemes = [[1, 2, 3, 4, 5, 6, 7, 8, 9], [10, 3, 4, 5, 6, 7, 8, 9]]
s8 s9
Converting schemes to subsets
We want to convert our list of lists into a set of unique subsets, i.e.
all_schemes = [[1, 2, 3, 4, 5, 6, 7, 8, 9], [10, 3, 4, 5, 6, 7, 8, 9]]
all_subsets = set([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
Take 2 minutes and talk to your neighbour. Try to think of ways you would do this.
How do you think I did it?
Converting schemes to subsets Lots of solutions
A simple one:
all_schemes = [[1, 2, 3, 4, 5, 6, 7, 8, 9], [10, 3, 4, 5, 6, 7, 8, 9]]
all_subsets = set([]) #empty set
for i in all_schemes: # loop through lists for j in i: # loop through each list
all_subsets.add(j) # add value if not already there
>all_subsets
set([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
Converting schemes to subsets
Lots of solutions
A better one:
Google stack overflow python list of lists to set of unique values
Real programmers use Google and Stack Overflow. A lot.
You dont need to solve every problem from scratch.
Converting schemes to subsets
Convert our list of lists into a set of unique subsets, i.e.
all_schemes = [[1, 2, 3, 4, 5, 6, 7, 8, 9], [10, 3, 4, 5, 6, 7, 8, 9]]
all_subsets = set([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
Now we calculate the AIC of each subset and store it.
What data structure would you use?
Summary
We started with a naive solution to analyse all partitioning schemes independently
We only optimized this when we knew it wouldnt work
Optimisation takes many forms but the key is to find the biggest inefficiencies and improve them.
With a simple trick, we could speed up the code by millions of fold for large datasets!
Problem solved, right? Wrong!
Todays datasets can have 1000s of gene.
Even with our new algorithm thats at least 1.0710301 subsets to analyse
A Solution Heuristic search
Heuristic search
1. Pick a starting partitioning scheme
2. Get the AIC
3. Try a few similar partitioning schemes
4. Get the best AIC score
5. Go to step 3
6. Stop when you cant improve the AIC anymore
If: AIC8
cal
Sc
ien
ces
, Uni
ver
sity
of S
ydn
e
y, S
ydn
ey,
New
So
tral
AbstraMct any follow up papers and algorithms,
In phyloignencetilcuanadlysiens ofgmoulecpularcseoqumenceidnatag, pawrtitioninrgkinvwolvesiteshtimaDtingri.ndBepeundeintwmohdeles orf meolecular
evolutio
hool
ia
4Department of Statistics, University of Auckland, Auckland, New Zealand
*Corresponding author: E-mail: [email protected].
>3000 citations of the paper
Associate editor: Sudhir Kumar
n
fo
rd
if
fere
nt
s
ets
of si
tes
in
as
equ
ence
a
lig
nme
nt. C
step in most analyses because it can affect the accuracy of phylogenetic reconstruction. Despite this, partitioning schemes
wtheosheavI einatlrgoodruitchemd sto1d0a0y0.s of times faster than
large m
are often chosen without explicit statistical justification. Here, we describe two new objective methods for the combined
selection of best-fit partitioning schemes and nucleotide substitution models. These methods allow millions of partitioning schemes to be compared in realistic time frames and so permit the objective selection of partitioning schemes even for
ul
tilo
D
NA
da
t
as
et
s.
We
d
em
on
stra
te t
ha
tt
hes
cus
including both the ad hoc selection of partitioning schemes (e.g., partitioning by gene or codon position) and a recently proposed hierarchical clustering method. We have implemented these methods in an open-source program, PartitionFinder. This program allows users to select partitioning schemes and substitution models using a range of information-theoretic metrics (e.g., the Bayesian information criterion, akaike information criterion [AIC], and corrected AIC). We hope that PartitionFinder will encourage the objective selection of partitioning schemes and thus lead to improvements in phylogenetic analyses. PartitionFinder is written in Python and runs under Mac OSX 10.4 and above. The
ho
e
me
uth
W
os
ing a
al
es, A
n ap
us
p
ro
pria
te p
a
rtit
ion
ing
sc
thods significantly outperform previous approaches,
hem
e is
an
im
portant
Research article
program, source code, and a detailed manual are freely available from www.robertlanfear.com/partitionfinder.
Take homes
Start with simple, naive solutions
Build something that works
Avoid premature optimisation Use Google and Stack Overflow
Go and look up:
Git and version control
E.g. https://github.com/brettc/partitionfinder
Find a problem youre interested in. Start coding!
Reviews
There are no reviews yet.