GenomeKit: Making genomic analysis fast, robust, and easy

December 14, 2021

Steve Chan, Senior Software Engineer

What is GenomeKit?

In this post, we’ll be covering how GenomeKit, one of the most widely used proprietary libraries at Deep Genomics, solves many common issues when working with genomic data. 

GenomeKit is our in-house Python solution for fast and easy access to genomic resources such as sequence, data tracks, and annotations. GenomeKit is also designed to work with genome variants, giving users a powerful way to extract features for different genotypes.

The goal is to let machine learning researchers build data sets easily and allow creativity in how those data sets are designed. 

But before we can do that, let’s describe, and then address, some of the previously alluded to issues when trying to reference and retrieve DNA sequences.

Chromosome ID inconsistencies

Different genomic source files may use "chr1", "1", or even "NC_000001.11" for the first chromosome. 

GenomeKit standardizes everything to the "chr1" format so we don’t need to memorize and swap naming schemes when working with multiple datasets.

>>> h38.interval("chr1", ...)

Off-by-one errors

When coordinates are specified for a genome, they can be 0-based or 1-based depending on the standard used. 

An off-by-one error when mutating DNA would be a critical mistake, so we've standardized on 0-based. 

We also validate all mutated sequences from standard 1-based variant strings.

>>> h38.variant("chr13:51961848:A:C")
ValueError: Variant's ref sequence does not match the genome (C != A)

>>> h38.variant("chr13:51961849:A:C")
<Variant chr13:51961849:A:C:h38>

>>> h38.variant("chr13:51961849:A:C").interval
Interval("chr13", "+", 51961848, 51961849, "h38")

More off-by-one errors

Not satisfied with 0-based versus 1-based, the bioinformatics community has also embraced having fully-closed, fully-open, and half-open intervals to specify where to stop.

We picked half-open and it’s been working well for human-readability and computational efficiency.

>>> hg38.dna(h38.interval("chr1", "+", 10_000, 10_000))
''

>>> hg38.dna(h38.interval("chr1", "+", 10_000, 10_001))
'T'

DNA sequence inconsistencies

"chrM:8000-9000" is valid for both mice and humans, but even within a single organism, there are different versions of the reference DNA sequence and feature annotations. For instance, since chrM is mitochondrial, does the human reference source NC_001807.4 or the more standard NC_012920.1? 

To avoid any more confusion, GenomeKit bakes all of these decisions into known reference genome IDs.

>>> rs105 = gk.Genome("ncbi_refseq.v105.20190906")
>>> gk.Genome("h38").dna(rs105.transcripts["NM_000053.4"])
ValueError: genome_dna::get(h37:chr13,52585585,52506804,-) cannot be used on h38.

>>> rs105.dna(rs105.transcripts["NM_000053.4"])
'CTCA...'

Difficult interval manipulations

5-prime/3-prime and how strandedness reverses their positions can be confusing for many people. 

Fortunately, most operations (e.g., intersections, finding overlaps, applying variants, building data tracks, etc) are already written in GenomeKit with strict checks to alert in case of errors.

A variety of extensions are also available to handle things like lifting from and lowering to spliced mRNA coordinates.

>>> nm_53 = rs105.transcripts["NM_000053.4"]
>>> gk.Interval.spanning(nm_53.exons[0].end3, nm_53.exons[-1].end5)
Interval("chr13", "-", 52509165, 52585422, "h37")

>>> _.shift(1).expand(0, 10)
Interval("chr13", "-", 52509154, 52585421, "h37")

>>> _.as_positive_strand().shift(1).expand(0, 10)
Interval("chr13", "+", 52509155, 52585432, "h37")

Runtime performance

GenomeKit encodes memory-mapped data such that loading and searching through all 1.8 million exons and subsetting 3 million DNA nucleotides all happens on a laptop without any noticeable delay. It's actually 5x faster for our workflows than any publicly available alternatives.

Memory requirements

Partly due to performance optimizations, but also to reduce memory requirements, the binary structures and Python objects in GenomeKit are tightly packed and organized to minimize page faults. 

Working with millions of intervals is not a problem, as they use over 2.5x less memory compared to alternative libraries.

All of this sounds super interesting and productivity boosting!

If you’re curious to work on GenomeKit or other software tools developed by our Software Engineers at Deep Genomics, check out our careers page and we'll be happy to have a chat with you. While we have you here, we'd also like to thank Andrew Delong for creating, designing, and implementing the majority of this wonderful library.