Improving the Quality of Protein Sequence Alignments by Estimating their Accuracy
New technical advances in next-generation sequencing have provided biologists with massive amounts of DNA and protein data. A non-trivial step in analyzing such data is aligning similar sequences for comparative studies. Each alignment tool offers different strengths and weaknesses, and aligners often have many user-specified parameters that can greatly affect the accuracy of the computed alignment. Researchers are forced to either rely on the default parameter setting or spend considerable time finding a suitable alternative. For a set of input sequences to align, our tool Facet (feature-based accuracy estimator) selects a good aligner and a good parameter setting. Facet does this by combining alignment features into an accuracy estimator; these independent features are informed by our knowledge of how proteins evolve and fold. Using Facet to choose a parameter setting improves alignment accuracy by up to 27% over the best default setting.
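As a rough illustration (not the actual Facet features or coefficients, which are described on the poster), the estimator can be thought of as a weighted combination of normalized feature values, as in the following Python sketch:

    # Illustrative sketch only: feature names and weights are placeholders,
    # not the actual Facet features or learned coefficients.
    def facet_score(features, weights):
        """Estimate accuracy as a weighted combination of feature values in [0, 1]."""
        return sum(weights[name] * value for name, value in features.items())

    # Hypothetical feature values for one computed alignment.
    example_features = {"secondary_structure_agreement": 0.72,
                        "gap_density": 0.35,
                        "amino_acid_identity": 0.48}
    example_weights = {"secondary_structure_agreement": 0.6,
                       "gap_density": 0.1,
                       "amino_acid_identity": 0.3}

    print(facet_score(example_features, example_weights))  # estimated accuracy in [0, 1]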
Diane Cook
Faculty: Project PI
You mention that selecting parameters can be time-consuming. Does Facet select parameter settings in an efficient manner? What is the computational complexity (or an estimate of run time given the number of candidate alignments) of Facet?
Dan DeBlasio
Graduate Student
Those are great questions, Diane.
The selection of the parameter sets is done using an integer linear program. ILPs are exponential in the worst case, but in practice the optimal solution is typically found very quickly. We precompute the ensembles for the end user, so they do not have to worry about this step: on the website we provide ensembles of parameters to use in Opal with as few as two and as many as 15 different parameter sets. The software for finding the ensembles is not readily available right now, but I would be more than willing to provide it; note that it requires an additional software package, CPLEX.
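To give a feel for what that ILP looks like, here is a rough sketch in Python using PuLP (the actual formulation and the CPLEX-based code may differ; the accuracy table and variable names are just for illustration):

    # Rough sketch: choose an ensemble of k parameter sets so that, if an
    # oracle picks the best set in the ensemble for each benchmark, total
    # accuracy is maximized. Uses PuLP's default solver rather than CPLEX.
    import pulp

    def choose_ensemble(accuracy, k):
        """accuracy[b][p] = true accuracy of benchmark b aligned under parameter set p."""
        benchmarks = list(accuracy)
        params = list(next(iter(accuracy.values())))

        prob = pulp.LpProblem("ensemble_selection", pulp.LpMaximize)
        use = {p: pulp.LpVariable(f"use_{p}", cat="Binary") for p in params}
        assign = {(b, p): pulp.LpVariable(f"assign_{b}_{p}", cat="Binary")
                  for b in benchmarks for p in params}

        # Objective: total accuracy of the parameter set each benchmark is assigned to.
        prob += pulp.lpSum(accuracy[b][p] * assign[b, p]
                           for b in benchmarks for p in params)
        # Exactly k parameter sets in the ensemble.
        prob += pulp.lpSum(use.values()) == k
        for b in benchmarks:
            # Each benchmark is assigned to exactly one chosen parameter set.
            prob += pulp.lpSum(assign[b, p] for p in params) == 1
            for p in params:
                prob += assign[b, p] <= use[p]

        prob.solve()
        return [p for p in params if use[p].value() == 1]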
As for the running time of Facet itself, we tried to keep the running time of each feature very low. I do not have exact timings, but each feature is computable in polynomial time, and in practice the time to compute the features has been nominal so far.
As far as running the candidate alignments, it depends on the aligner used. In our experiments we used Opal, and in theory the running time need not increase much since the candidates can be generated in parallel. We plan on including this parallelization in the version of Opal + Facet when we release it.
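To sketch what that would look like, here is a toy version of the advising loop in Python, where run_opal() and facet_score() stand in for wrappers around Opal and Facet (they are not the actual interfaces of either tool):

    # Sketch of the advising loop with candidates generated in parallel.
    from concurrent.futures import ProcessPoolExecutor

    def advise(sequences, parameter_ensemble, run_opal, facet_score):
        """Align under every parameter set in parallel, then return the
        candidate alignment with the highest Facet score."""
        with ProcessPoolExecutor() as pool:
            candidates = list(pool.map(run_opal,
                                       [sequences] * len(parameter_ensemble),
                                       parameter_ensemble))
        return max(candidates, key=facet_score)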
I hope I answered all of your questions. Please let me know if you have any other questions.
Julia Hirschberg
Faculty: Project PI
What sort/amount of training data do you need to use your alignment accuracy estimator?
Dan DeBlasio
Graduate Student
Hi Julia,
We used a subset of alignments from BENCH (which is a combination of several other benchmark suites) and PALI, two commonly used protein multiple sequence alignment benchmarks. They are all alignments for which the "good" reference alignment is known, so we can measure true accuracy. We can add more benchmark alignments that are known to be high quality; our lab has found these two benchmark sets to be the most reliable, but we are always looking out for new ones.
As for how many, we use just under one thousand (861, to be exact) benchmark alignments, split into 3-fold training and testing groups (i.e. training on 2/3 of the benchmarks and testing on the remaining 1/3). We found this to be roughly the point beyond which a larger training set no longer improved the results. So the long-winded answer is that we needed about 575 training benchmarks to get the results we show, but using more didn't hurt.
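For concreteness, the 3-fold setup looks roughly like the following sketch, where train_coefficients() and evaluate() are placeholders for the actual fitting and advising steps:

    # Sketch of the 3-fold split: train on two thirds of the benchmarks and
    # evaluate on the remaining third, rotating which third is held out.
    import random

    def three_fold_evaluation(benchmarks, train_coefficients, evaluate, seed=0):
        benchmarks = list(benchmarks)
        random.Random(seed).shuffle(benchmarks)
        folds = [benchmarks[i::3] for i in range(3)]  # three roughly equal groups

        results = []
        for i in range(3):
            test = folds[i]
            train = [b for j in range(3) if j != i for b in folds[j]]
            coefficients = train_coefficients(train)      # roughly 575 benchmarks
            results.append(evaluate(coefficients, test))  # roughly 285 benchmarks
        return results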
I hope that answered your questions. Don't hesitate to let me know if you have more.
Mostafa Bassiouni
Faculty: Project Co-PI
You mentioned Facet improves alignment accuracy by up to 27% over the best default setting. So the maximum improvement in accuracy is 27%, but what is the average improvement? Also, the best default setting might not be the optimal setting. How does the accuracy obtained by Facet compare to the highest possible accuracy obtained by some optimal setting? Have you attempted to answer this question, and what approach would you use to do so?
Dan DeBlasio
Graduate Student
Hi Mostafa,
It's nice to get a great question from someone at my alma mater.
This question is best answered by the top figure on the poster. The 27% figure is actually the best average increase for any single bin; in some special cases we increase accuracy by more. On average across all bins, we increase accuracy by 9.23% (the right subgraph of the first figure). The average is a bit lower when taken over all alignments because the benchmarks contain a high percentage of "easy" alignments that already have very good accuracy, so there is not much room for improvement. Our main goal was to find better alignments for the inputs that are hard (i.e. those that have poor accuracy under the default setting), and I think that is what the original result points out. This is also the reason we bin alignments: so the majority of easily aligned benchmarks do not prevent us from finding a set of coefficients that improves the quality of the hard alignments.
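To make the two averages concrete, here is a small sketch of the difference between averaging per bin and averaging over all benchmarks (bin_of() and improvement() are placeholders for binning by default accuracy and for the measured accuracy gain):

    # Sketch of the two averages: a per-bin average (each difficulty bin
    # weighted equally) versus a straight average over all benchmarks,
    # which is dominated by the many easy ones.
    from collections import defaultdict

    def average_improvements(benchmarks, bin_of, improvement):
        """improvement(b): advised accuracy minus default accuracy for benchmark b."""
        benchmarks = list(benchmarks)
        by_bin = defaultdict(list)
        for b in benchmarks:
            by_bin[bin_of(b)].append(improvement(b))

        per_bin = {k: sum(v) / len(v) for k, v in by_bin.items()}
        across_bins = sum(per_bin.values()) / len(per_bin)  # each bin counts once
        across_benchmarks = sum(improvement(b) for b in benchmarks) / len(benchmarks)
        return per_bin, across_bins, across_benchmarks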
There are two notions of "optimal" that I think of: one is the parameter set, out of the entire universe of parameter settings, that produces the best alignment for a given input; the other is the parameter set within the ensemble that gives the best alignment. The second is shown in all three figures on the poster as a dashed line, labeled the oracle: an advisor that knows the true accuracy of each alignment and is therefore always able to choose the best parameter set from the ensemble for an input.
If we instead consider the best parameter set from the entire universe for each input, we get an average accuracy of 68% (compared to 64% for the oracle on an ensemble of size 10).
As for how close we get to the optimal alignment accuracy for each input, we show this difference in the figures. We are currently working on a way to find coefficients that directly minimize this difference. On the poster we show a method that optimizes Facet by minimizing the difference between true accuracy and estimated accuracy (see the "Estimator Coefficients" section), but we are now trying to minimize what we call the advising error: the difference between the accuracy of the best possible alignment for an input and the accuracy of the alignment chosen by Facet. Only recently did I develop a way to encode the advising process as a mathematical program, so that it can be simulated within the optimization. This changes only the feature weights used when computing a Facet score, not the process of finding the parameter ensemble or the eventual advising.
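As a rough sketch of the two objectives (least squares is just one way to fit coefficients to true accuracy; it is not necessarily the method described on the poster), while the advising error is computed exactly as defined above:

    import numpy as np

    def fit_coefficients(feature_matrix, true_accuracy):
        """One way to minimize the gap between estimated and true accuracy.
        feature_matrix: (n_alignments, n_features); true_accuracy: (n_alignments,)."""
        coeffs, *_ = np.linalg.lstsq(feature_matrix, true_accuracy, rcond=None)
        return coeffs

    def advising_error(true_accuracy_by_param, facet_score_by_param):
        """Both arguments map parameter set -> value for a single input.
        Error = accuracy of the best possible choice minus accuracy of the
        parameter set that Facet actually chooses."""
        best = max(true_accuracy_by_param.values())
        chosen = max(facet_score_by_param, key=facet_score_by_param.get)
        return best - true_accuracy_by_param[chosen]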
I hope that answered your questions. Please don't hesitate to let me know if you have any others.
Mary Kathryn Cowles
Faculty: Project PI
What exactly does “alignment accuracy” mean? What is the true accuracy that you are trying to estimate?
Dan DeBlasio
Graduate Student
Hi Mary,
Alignment accuracy measures how well a multiple sequence alignment matches a known good reference alignment for the input sequences. We look at pairs of letters inside designated core columns, which are columns in the reference alignment that are very important and in many cases structurally conserved. Accuracy is then the percentage of pairs of letters inside these core columns that are also aligned together in the same column of the computed alignment.
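In code, the measure looks roughly like the following sketch, where the reference core columns and the computed alignment's column placements are represented with simple dictionaries (just one possible representation, not Facet's actual data structures):

    # Sketch of the accuracy measure: each reference core column maps a
    # sequence id to the index of the residue placed in that column, and the
    # computed alignment maps each sequence id to its residue -> column placement.
    from itertools import combinations

    def core_column_accuracy(reference_core_columns, computed_columns):
        """Fraction of residue pairs from reference core columns that the
        computed alignment also places together in a single column."""
        total = recovered = 0
        for column in reference_core_columns:
            for (seq_a, res_a), (seq_b, res_b) in combinations(column.items(), 2):
                total += 1
                if computed_columns[seq_a][res_a] == computed_columns[seq_b][res_b]:
                    recovered += 1
        return recovered / total if total else 0.0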
In practice this good reference alignment does not exist (otherwise we would not need a multiple alignment tool), so we cannot measure accuracy in this way. That is why we are left with the task of accuracy estimation, i.e. trying to guess what this percentage is without being able to see what the alignment should look like.
I hope that provides some motivation for why we are trying to accomplish this task. Please feel free to let me know if you have any other questions.