Dynamic μ-PBWT: Dynamic Run-length Compressed PBWT for Biobank Scale Data

PMID: 39975111
Source: bioRxiv
Publication date: 2025-02-20
Year: 2025

Abstract

Durbin's positional Burrows-Wheeler transform (PBWT) supports efficient haplotype matching and queries given a panel of haplotypes. It has been widely used for statistical phasing, imputation and identity-by-descent (IBD) detection. However, the original PBWT panel doesn't support dynamic updates when haplotypes need to be added or deleted from the panel. Dynamic-PBWT (d-PBWT) solved this problem but it is not memory efficient. While the memory constraint problem of the PBWT has been tackled by Syllable-PBWT and mu -PBWT, these are static data structures that do not allow updates. Additionally, Syllable-PBWT only supports long-match query and mu -PBWT only supports set-maximal match query, limiting their functionality in the compressed form. In this paper, we present Dynamic mu -PBWT (which can also be seen as compressed d-PBWT) that is memory efficient and supports dynamic updates. We run-length compress PBWT to achieve better compression rate and store the runs in the self-balancing trees to enable dynamic updates. We show that the number of updates per insertion or deletion in the tree at each site is constant regardless of the number of haplotypes in the panel and the updates can be made without decompressing the index. In addition, we use orders of magnitude less memory than d-PBWT. We also provide a long match query algorithm that can easily be extended back to the original mu -PBWT. Overall, the flexibility and space-efficiency of Dynamic mu -PBWT makes it a potential index data structure for biobank scale genetic data analyses. The source code for Dynamic mu -PBWT is available at https://github.com/ucfcbb/Dynamic-mu-PBWT.