Protein homology detection using string alignment kernels
[paper]
Hiroto Saigo, Jean-Philippe Vert, Nobuhisa Ueda and Tatsuya Akutsu
Bioinformatics 20(11), 1682-1689, 2004
Abstract
Motivation: Remote homology detection between protein sequences is a central problem in computational biology. Discriminative methods involving support vector machines (SVMs) are currently the most effective methods for the problem of superfamily recognition in the Structural Classification Of Proteins (SCOP) database. The performance of SVMs depends critically on the kernel function used to quantify the similarity between sequences.
Results: We propose new kernels for strings adapted to biological sequences, which we call local alignment kernels. These kernels measure the similarity between two sequences by summing up scores obtained from local alignments with gaps of the sequences. When tested in combination with SVM on their ability to recognize SCOP superfamilies on a benchmark dataset, the new kernels outperform state-of-the-art methods for remote homology detection.
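The "summing up scores" idea in the abstract can be written compactly. The following is a sketch in illustrative notation, not necessarily the paper's exact symbols: \Pi(x, y) is assumed to denote the set of all local alignments with gaps of sequences x and y, s(x, y, \pi) the score of alignment \pi under a substitution matrix and gap penalties, and \beta > 0 a scaling parameter. The Smith-Waterman score keeps only the best alignment, whereas the kernel sums contributions from all of them.

```latex
% Smith-Waterman: score of the single best local alignment
SW(x, y) = \max_{\pi \in \Pi(x, y)} s(x, y, \pi)

% Local alignment kernel: exponentiated scores summed over all local alignments
K_{LA}^{(\beta)}(x, y) = \sum_{\pi \in \Pi(x, y)} \exp\bigl(\beta \, s(x, y, \pi)\bigr)

% For large beta the sum is dominated by the best alignment
\lim_{\beta \to \infty} \frac{1}{\beta} \ln K_{LA}^{(\beta)}(x, y) = SW(x, y)
```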
Below are the software and dataset used in the paper. The dataset was copied from the web page maintained by the author of SVM-PAIRWISE (2018/5/18).
Code LAkernel-0.3.3.tar.gz
Dataset
- Tab-delimited table
specifying the positive and negative training and test sets for each
family. Each row is one sequence, and each column is one family. (0
= not present; 1 = positive train; 2 = negative train; 3 = positive
test; 4 = negative test).
- Names of the SCOP families.
- Sequence file in FASTA format
containing all sequences in SCOP version 1.53 with a pairwise
similarity threshold of 10^-25.
- Gzipped, tab-delimited table
containing Smith-Waterman p-values for all pairs of sequences (31 MB).
The original website of SVM-PAIRWISE is
https://noble.gs.washington.edu/proj/svm-pairwise/
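
The label table described in the dataset list (0 = not present; 1 = positive train; 2 = negative train; 3 = positive test; 4 = negative test) can be turned into per-family train/test splits with a few lines of Python. This is a minimal sketch, not part of the distributed software. The function name `split_by_family` is hypothetical, and the file layout is an assumption: a header row whose first cell labels the sequence-ID column and whose remaining cells are family names, followed by one row per sequence. Check the layout of the actual file before relying on it.

```python
import csv
import io

# Label codes as listed in the dataset description.
LABELS = {1: "pos_train", 2: "neg_train", 3: "pos_test", 4: "neg_test"}


def split_by_family(table_text):
    """Map each family to its four lists of sequence IDs.

    Assumed layout (not confirmed by the source): the first row is a header
    with family names in columns 2..n, and the first column of every other
    row is the sequence ID.
    """
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    header = next(reader)
    families = header[1:]
    splits = {fam: {name: [] for name in LABELS.values()} for fam in families}
    for row in reader:
        if not row:
            continue  # skip blank lines
        seq_id, codes = row[0], row[1:]
        for fam, code in zip(families, codes):
            code = int(code)
            if code in LABELS:  # code 0: sequence not used for this family
                splits[fam][LABELS[code]].append(seq_id)
    return splits


# Toy input with made-up sequence and family names, for illustration only.
toy = "id\tfamA\tfamB\nseq1\t1\t0\nseq2\t2\t3\nseq3\t3\t4\n"
toy_splits = split_by_family(toy)
print(toy_splits["famA"]["pos_train"])
```

For the real file, the text could be read with `open(path).read()` and passed to the same function; each family's four lists then define one binary classification task for the SVM.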