Document Type


Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.


Bioinformatics | Biotechnology | Computational Biology | Genetics and Genomics | Immunology and Infectious Disease

CIT Disciplines

1.6 BIOLOGICAL SCIENCES; Biochemistry and molecular biology; Bioinformatics

Publication Details



This paper reviews recent research on the application of bioinformatics approaches to determining HIV-1 protease specificity, outlines outstanding issues, and presents a new approach to addressing them. Leading machine learning theory for the problem currently suggests that direct encoding of the physicochemical properties of the amino acid substrates is not required for optimal performance. A number of amino acid encoding approaches that incorporate potentially relevant physicochemical properties of the substrate are identified and evaluated using a nonlinear, task-decomposition-based neuroevolution algorithm. The results are compared against a recent benchmark set on a nonlinear classifier using only amino acid sequence and identity information. Ensembles of these nonlinear classifiers using the physicochemical properties of the substrate are demonstrated to consistently outperform the recently published state-of-the-art linear support vector machine-based approach in out-of-sample evaluations.
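
As a point of reference for the encoding approaches mentioned above, the following is a minimal sketch of orthogonal (one-hot) encoding, which represents each residue purely by its identity and position. It assumes fixed-length peptide substrates written in the standard one-letter amino acid alphabet; the function name `orthogonal_encode`, the alphabetical residue ordering, and the example octamer are illustrative choices, not details taken from the paper.

```python
# Standard 20 amino acids in one-letter code.
# The alphabetical ordering is an arbitrary, illustrative choice.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def orthogonal_encode(peptide):
    """Return a flat list of 0/1 values, 20 per residue (hypothetical helper)."""
    vector = []
    for residue in peptide.upper():
        if residue not in INDEX:
            raise ValueError(f"unknown amino acid: {residue!r}")
        one_hot = [0] * len(AMINO_ACIDS)
        one_hot[INDEX[residue]] = 1  # exactly one active bit per residue
        vector.extend(one_hot)
    return vector


# Illustrative octamer; 8 residues x 20 bits gives 160 features.
features = orthogonal_encode("SQNYPIVQ")
print(len(features))  # 160
print(sum(features))  # 8
```

A physicochemical encoding would replace each 20-bit block with a vector of property values for the residue. The particular properties evaluated are those listed in Table 2 and are not reproduced here.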

t0001-10.1080_21655979.2016.1149271.csv (1 kB)
Table 1. Diversity matrix of the different approaches taken to defining HIV-1 specificity, as noted in the literature reviewed. [*] denotes future planned work.

t0002-10.1080_21655979.2016.1149271.csv (1 kB)
Table 2. Selection of amino acid encoding formats identified in the literature review.

t0003-10.1080_21655979.2016.1149271.csv (1 kB)
Table 3. Summary statistics for the evaluated amino acid encoding approaches. Each row represents the performance of classifiers trained on the same 20 samplings of the {746, 1625, Schilling} dataset and evaluated on the Impens dataset. The performance of the LSVM using orthogonal encoding, trained on the same 20 samplings, is included for reference. Results are rounded to 3 decimal places.

t0004-10.1080_21655979.2016.1149271.csv (1 kB)
Table 4. The performance of the classifiers used to generate Table 3 for each amino acid encoding approach, when taken as an ensemble.

t0005-10.1080_21655979.2016.1149271.csv (1 kB)
Table 5. Performance of an ensemble of 100 MFF-NEAT classifiers using {Niu, Physicochemical, Orthogonal} encoding, trained on various samplings of the {746, 1625, Schilling} dataset and evaluated on the Impens dataset. An ensemble of LSVMs trained on the same samplings is included for reference.

t0006-10.1080_21655979.2016.1149271.csv (1 kB)
Table 6. Performance of an LSVM on the Impens dataset when trained on the full {746, 1625, Schilling} dataset, using a value of 1.284 for C.