Genomics & Bioinformatics
Operations Research in Molecular Biology
Three foundational NP-hard problems in computational biology — genome assembly from short reads, multiple sequence alignment across gene families, and protein side-chain packing for structure prediction — each yield to combinatorial optimisation techniques developed in the OR tradition.
Molecular biology generates vast combinatorial search spaces: a human genome comprises 3 billion base pairs read in overlapping fragments, protein families demand alignment of dozens of divergent sequences, and side-chain conformations grow exponentially with chain length. In each case the underlying decision problem is NP-hard, yet practical algorithms — de Bruijn graph traversal, progressive dynamic programming, and dead-end elimination — exploit biological structure to achieve tractable solutions on real data.
Biological Context
DNA sequencers cannot read entire chromosomes in one pass. Illumina platforms produce short reads of 100–300 base pairs; Oxford Nanopore generates longer reads of 10,000–50,000 bp but with higher error rates. Genome assembly reconstructs the original sequence from these overlapping fragments — analogous to reassembling a shredded book where pages contain repeated paragraphs and some words are misspelled.
Computational Problem
The overlap graph approach models each read as a node and each suffix-prefix overlap as a directed edge. Finding a path that visits every node exactly once is the Hamiltonian path problem — NP-hard. The key insight of Pevzner, Tang, and Waterman (2001) was to reformulate assembly using de Bruijn graphs: decompose reads into k-mers, build a graph where (k-1)-mer overlaps form edges, and seek an Eulerian path — solvable in O(|E|) time. This transformation converts an NP-hard problem into a polynomial one by changing the representation.
de Bruijn Graph (Polynomial)
de Bruijn Graph Visualiser
Key References
Biological Context
Aligning multiple DNA or protein sequences reveals evolutionary conservation, identifies gene family members, and anchors phylogenetic tree construction. Conserved positions indicate functional or structural constraints; variable regions suggest adaptive evolution or relaxed selection.
Computational Problem
Pairwise alignment via Needleman-Wunsch dynamic programming runs in O(mn) time and is optimal. However, extending DP to n sequences requires an n-dimensional matrix with O(L^n) cells — Wang and Jiang (1994) proved that multiple sequence alignment (MSA) under the sum-of-pairs score is NP-hard for n ≥ 3. Practical heuristics include progressive alignment (ClustalW, which builds a guide tree then aligns pairs), iterative refinement (MUSCLE), and profile Hidden Markov Models (HMMER).
Multiple Alignment (NP-hard for n ≥ 3)
Needleman-Wunsch Alignment
Key References
Biological Context
Protein function is determined by three-dimensional structure. While the backbone fold defines the overall shape, side-chain conformations (rotamers) determine binding interfaces, catalytic sites, and stability. Each amino acid has a discrete set of preferred rotamer states; steric clashes and van der Waals interactions between neighbouring side chains must be minimised for a stable fold.
Computational Problem
Given a fixed backbone, assign one rotamer per residue position to minimise the total pairwise interaction energy. With r rotamers per position and n residues, the brute-force search space is r^n — Pierce and Winfree (2002) proved this is NP-hard. Dead-End Elimination (DEE), introduced by Desmet et al. (1992), prunes rotamers that provably cannot participate in the global minimum energy conformation, often reducing the search space by orders of magnitude.
DEE Pruning Criterion (Desmet et al. 1992)
Side-Chain Packing: 5-Residue Toy Protein
Key References
Exploring optimisation in computational biology?
From genome assembly pipelines to protein design workflows, combinatorial optimisation techniques are essential tools in modern bioinformatics.