Genomics & Bioinformatics

Operations Research in Molecular Biology

Three foundational NP-hard problems in computational biology — genome assembly from short reads, multiple sequence alignment across gene families, and protein side-chain packing for structure prediction — each yield to combinatorial optimisation techniques developed in the OR tradition.

Molecular biology generates vast combinatorial search spaces: a human genome comprises 3 billion base pairs read in overlapping fragments, protein families demand alignment of dozens of divergent sequences, and side-chain conformations grow exponentially with chain length. In each case the underlying decision problem is NP-hard, yet practical algorithms — de Bruijn graph traversal, progressive dynamic programming, and dead-end elimination — exploit biological structure to achieve tractable solutions on real data.

Biological Context

DNA sequencers cannot read entire chromosomes in one pass. Illumina platforms produce short reads of 100–300 base pairs; Oxford Nanopore generates longer reads of 10,000–50,000 bp but with higher error rates. Genome assembly reconstructs the original sequence from these overlapping fragments — analogous to reassembling a shredded book where pages contain repeated paragraphs and some words are misspelled.

Computational Problem

The overlap graph approach models each read as a node and each suffix-prefix overlap as a directed edge. Finding a path that visits every node exactly once is the Hamiltonian path problem — NP-hard. The key insight of Pevzner, Tang, and Waterman (2001) was to reformulate assembly using de Bruijn graphs: decompose reads into k-mers, build a graph where (k-1)-mer overlaps form edges, and seek an Eulerian path — solvable in O(|E|) time. This transformation converts an NP-hard problem into a polynomial one by changing the representation.

Overlap Graph (NP-hard)

G = (V, E) where V = {reads}, E = {(r_i, r_j) : suffix(r_i) overlaps prefix(r_j)}

Find Hamiltonian path visiting every node exactly once

de Bruijn Graph (Polynomial)

Build k-mers from reads, create nodes for each (k-1)-mer

Edge (u, v) exists iff u is prefix and v is suffix of some k-mer

Find Eulerian path traversing every edge exactly once — O(|E|)

// Eulerian path exists iff graph is connected and at most 2 nodes have odd degree

de Bruijn Graph Visualiser

DNA Sequence

k-mer size (k): 3

Key References

Operational

Pevzner, P.A., Tang, H. & Waterman, M.S. (2001). “An Eulerian path approach to DNA fragment assembly.” Proceedings of the National Academy of Sciences, 98(17), 9748–9753.

Published

Nagarajan, N. & Pop, M. (2009). “Parametric complexity of sequence assembly: Theory and applications to next generation sequencing.” Journal of Computational Biology, 16(7), 897–908.

Published

Medvedev, P., Georgiou, K., Myers, G. & Brudno, M. (2007). “Computability of models for sequence assembly.” Algorithms in Bioinformatics (WABI 2007), Lecture Notes in Computer Science, vol 4645, 289–301.

Biological Context

Aligning multiple DNA or protein sequences reveals evolutionary conservation, identifies gene family members, and anchors phylogenetic tree construction. Conserved positions indicate functional or structural constraints; variable regions suggest adaptive evolution or relaxed selection.

Computational Problem

Pairwise alignment via Needleman-Wunsch dynamic programming runs in O(mn) time and is optimal. However, extending DP to n sequences requires an n-dimensional matrix with O(L^n) cells — Wang and Jiang (1994) proved that multiple sequence alignment (MSA) under the sum-of-pairs score is NP-hard for n ≥ 3. Practical heuristics include progressive alignment (ClustalW, which builds a guide tree then aligns pairs), iterative refinement (MUSCLE), and profile Hidden Markov Models (HMMER).

Pairwise Alignment (Optimal, O(mn))

F(i,j) = max { F(i-1,j-1) + s(a_i, b_j),

F(i-1,j) + gap_penalty,

F(i,j-1) + gap_penalty }

// Traceback from F(m,n) recovers optimal alignment

Multiple Alignment (NP-hard for n ≥ 3)

Score(A) = Σ_{all pairs (i,j)} pairwise_score(A_i, A_j)

Exact DP requires O(L^n) cells — intractable for n > 3

// Proved NP-hard by Wang & Jiang (1994), J Comp Bio 1(4)

Needleman-Wunsch Alignment

Sequence 1

Sequence 2

Match: 1

Mismatch: -1

Gap: -2

Key References

Published

Wang, L. & Jiang, T. (1994). “On the complexity of multiple sequence alignment.” Journal of Computational Biology, 1(4), 337–348.

Operational

Edgar, R.C. (2004). “MUSCLE: multiple sequence alignment with high accuracy and high throughput.” Nucleic Acids Research, 32(5), 1792–1797.

Operational

Eddy, S.R. (1998). “Profile hidden Markov models.” Bioinformatics, 14(9), 755–763.

Biological Context

Protein function is determined by three-dimensional structure. While the backbone fold defines the overall shape, side-chain conformations (rotamers) determine binding interfaces, catalytic sites, and stability. Each amino acid has a discrete set of preferred rotamer states; steric clashes and van der Waals interactions between neighbouring side chains must be minimised for a stable fold.

Computational Problem

Given a fixed backbone, assign one rotamer per residue position to minimise the total pairwise interaction energy. With r rotamers per position and n residues, the brute-force search space is r^n — Pierce and Winfree (2002) proved this is NP-hard. Dead-End Elimination (DEE), introduced by Desmet et al. (1992), prunes rotamers that provably cannot participate in the global minimum energy conformation, often reducing the search space by orders of magnitude.

Objective

min E_total = Σ_i E_self(i, r_i) + Σ_{i<j} E_pair(i, r_i, j, r_j)

// r_i is the rotamer assigned to residue i

DEE Pruning Criterion (Desmet et al. 1992)

Eliminate rotamer r at position i if there exists rotamer s at i such that:

E_self(i,r) - E_self(i,s) + Σ_{j ≠ i} min_{t} [E_pair(i,r,j,t) - E_pair(i,s,j,t)] > 0

// Rotamer r is provably suboptimal — can never be in GMEC

Side-Chain Packing: 5-Residue Toy Protein

Key References

Published

Pierce, N.A. & Winfree, E. (2002). “Protein design is NP-hard.” Protein Engineering, 15(10), 779–782.

Published

Desmet, J., De Maeyer, M., Hazes, B. & Lasters, I. (1992). “The dead-end elimination theorem and its use in protein side-chain positioning.” Nature, 356, 539–542.

Operational

Leaver-Fay, A. et al. (2011). “ROSETTA3: An object-oriented software suite for the simulation and design of macromolecules.” Methods in Enzymology, 487, 545–574.

Exploring optimisation in computational biology?

From genome assembly pipelines to protein design workflows, combinatorial optimisation techniques are essential tools in modern bioinformatics.

Connect on LinkedIn

Disclaimer

All examples are simplified for educational purposes. DNA sequences, alignment scores, and energy values are illustrative and do not represent real experimental data. The solvers demonstrate algorithmic principles rather than production-grade bioinformatics pipelines.