[SOLVED] 代写 C data structure algorithm Scheme python statistic software Bayesian Go One way that programming helps us understand biology

30 $

File Name: 代写_C_data_structure_algorithm_Scheme_python_statistic_software_Bayesian_Go_One_way_that_programming_helps_us_understand_biology.zip
File Size: 1234.02 KB

SKU: 3536502696 Category: Tags: , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,

Or Upload Your Assignment Here:


One way that programming helps us understand biology
Rob Lanfear
Ecology & Evolution Australian National University

What is phylogenetics?

actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac

actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac

actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac
!” #” $” %”

What is partitioning?

actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac
!” #” $” %”

actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac

actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac
actgactgactgactgactgactgactgactgactgactgactgac
Partitioning can improve inference
!” #” $” %”
!” #” $” %”

So what’s the problem?

!” #” $” %”
!” #” $” %”
!” #” $” %”
!” #” $” %”
!” #” $” %”
!” #” $” %”
!” #” $” %”
!” #” $” %”
!” #” $” %”
!” # $” %”

!” #” $” %”
!” #” $” %”
!” #” $” %”
!” #” $” %”
!” #” $” %”
!” #” $” %”
!” #” $” %”
!” #” $” %”
!” #” $” %”
!” # $” %”

Let’s solve the problem

A Simple Algorithm
Compare all possible partitioning schemes.
1. Calculate the AIC score of every possible partitioning scheme.
AIC (Akaike’s Information Criterion) measures how far a model is from the truth.
2. Use the scheme with the smallest AIC score

https://github.com/brettc/partitionfinder/blob/master/partfinder/submodels.py

Now we find a new problem…

!” #” $” %”
!” #” $” %”
!” #” $” %”
!” #” $” %”
!” #” $” %”
!” #” $” %”
!” #” $” %”
!” #” $” %”
!” #” $” %”
!” # $” %”

How many schemes are there?
Number of genes (n)
Number of possible schemes (Bn)

Our first approach won’t work for large datasets.
What can we do?
OPTMISE!

s1 s2 s3 s4 s5 s6 s7
s10 s3 s4 s5 s6 s7 s8 s9
Analysing EVERY scheme EVERY time is inefficient Because we analyse the same subsets over and over…
To calculate the AIC of all possible partitioning schemes,
we only need to calculate the AIC of all possible subsets of genes
Then we can just add these together to get the AIC of the schemes…
s8 s9

s1 s2 s3 s4 s5 s6 s7
s10 s3 s4 s5 s6 s7 s8 s9
How do we count the number of unique subsets? Combinatorics!
For example, if we start with 20 genes
There are 51,724,158,235,372 possible partitioning schemes but these are made up of just 1,048,575 possible subsets That’s a time saving of 99.99998%!!!!!
s8 s9

How many subsets and schemes?
1E+14 1E+13 1E+12 1E+11 1E+10 1E+09
1000 00000 1000 0000 1000 000 1000 00 1000 0 1000 100 10 1
Number of schemes
Number of subsets (exhaustive search)
0 5 10 15 20 Number of genes
N

Converting schemes to subsets
• Using our recursive function from earlier, we can get all the schemes as a list of lists:
s1 s2 s3 s4 s5 s6 s7
s10 s3 s4 s5 s6 s7 s8 s9
all_schemes = [[1, 2, 3, 4, 5, 6, 7, 8, 9], [10, 3, 4, 5, 6, 7, 8, 9]]
s8 s9

Converting schemes to subsets
• We want to convert our list of lists into a set of unique subsets, i.e.
all_schemes = [[1, 2, 3, 4, 5, 6, 7, 8, 9], [10, 3, 4, 5, 6, 7, 8, 9]]
all_subsets = set([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
• Take 2 minutes and talk to your neighbour. Try to think of ways you would do this.
• How do you think I did it?

Converting schemes to subsets • Lots of solutions…
• A simple one:
all_schemes = [[1, 2, 3, 4, 5, 6, 7, 8, 9], [10, 3, 4, 5, 6, 7, 8, 9]]
all_subsets = set([]) #empty set
for i in all_schemes: # loop through lists for j in i: # loop through each list
all_subsets.add(j) # add value if not already there
>all_subsets
set([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

Converting schemes to subsets
• Lots of solutions…
• A better one:
– Google “stack overflow python list of lists to set of unique values”

Real programmers use Google and Stack Overflow. A lot.
You don’t need to solve every problem from scratch.

Converting schemes to subsets
• Convert our list of lists into a set of unique subsets, i.e.
all_schemes = [[1, 2, 3, 4, 5, 6, 7, 8, 9], [10, 3, 4, 5, 6, 7, 8, 9]]
all_subsets = set([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
• Now we calculate the AIC of each subset and store it.
• What data structure would you use?

Summary
• We started with a naïve solution to analyse all partitioning schemes independently
• We only optimized this when we knew it wouldn’t work
• Optimisation takes many forms – but the key is to find the biggest inefficiencies and improve them.
• With a simple trick, we could speed up the code by millions of fold for large datasets!

Problem solved, right? Wrong!
Today’s datasets can have 1000’s of gene.
Even with our new algorithm that’s at least 1.07×10301 subsets to analyse…

A Solution Heuristic search

Heuristic search
1. Pick a starting partitioning scheme
2. Get the AIC
3. Try a few similar partitioning schemes
4. Get the best AIC score
5. Go to step 3
6. Stop when you can’t improve the AIC anymore

If: AIC8 100,000 downloads of the software
cal
Sc
ien
ces
, Uni
ver
sity
of S
ydn
e
y, S
ydn
ey,
New
So
tral
Ab•straMct any follow up papers and algorithms,
In phyloignencetilcuanadlysiens ofgmoulecpularcseoqumenceidnatag, pawrtitioninrgkinvwolvesiteshtimaDtingri.ndBepeundeintwmohdeles orf meolecular
evolutio
hool
ia
4Department of Statistics, University of Auckland, Auckland, New Zealand
*Corresponding author: E-mail: [email protected].
• >3000 citations of the paper
Associate editor: Sudhir Kumar
n
fo
rd
if
fere
nt
s
ets
of si
tes
in
as
equ
ence
a
lig
nme
nt. C
step in most analyses because it can affect the accuracy of phylogenetic reconstruction. Despite this, partitioning schemes
wtheosheavI einatlrgoodruitchemd sto1d0a0y0.’s of times faster than
large m
are often chosen without explicit statistical justification. Here, we describe two new objective methods for the combined
selection of best-fit partitioning schemes and nucleotide substitution models. These methods allow millions of partitioning schemes to be compared in realistic time frames and so permit the objective selection of partitioning schemes even for
ul
tilo
D
NA
da
t
as
et
s.
We
d
em
on
stra
te t
ha
tt
hes
cus
including both the ad hoc selection of partitioning schemes (e.g., partitioning by gene or codon position) and a recently proposed hierarchical clustering method. We have implemented these methods in an open-source program, PartitionFinder. This program allows users to select partitioning schemes and substitution models using a range of information-theoretic metrics (e.g., the Bayesian information criterion, akaike information criterion [AIC], and corrected AIC). We hope that PartitionFinder will encourage the objective selection of partitioning schemes and thus lead to improvements in phylogenetic analyses. PartitionFinder is written in Python and runs under Mac OSX 10.4 and above. The
ho
e
me
uth
W
os
ing a
al
es, A
n ap
us
p
ro
pria
te p
a
rtit
ion
ing
sc
thods significantly outperform previous approaches,
hem
e is
an
im
portant
Research article
program, source code, and a detailed manual are freely available from www.robertlanfear.com/partitionfinder.

Take homes
• Start with simple, naïve solutions
– Build something that works
• Avoid premature optimisation • Use Google and Stack Overflow
• Go and look up:
– Git and version control
– E.g. https://github.com/brettc/partitionfinder
• Find a problem you’re interested in. • Start coding!

Reviews

There are no reviews yet.

Only logged in customers who have purchased this product may leave a review.

Shopping Cart
[SOLVED] 代写 C data structure algorithm Scheme python statistic software Bayesian Go One way that programming helps us understand biology
30 $