tidytcells: Standardize TR/MH/IG data


Cartoon of a T cell receptor binding to a peptide MH complex (left), and a cartoon demonstrating how messy data can be cleaned using tidytcells (right).

a) A diagram of a T cell receptor (TR) interacting with a peptide-Major Histocompatibility (MH) complex. The V, D and J minigenes comprising each TR chain are shown by color. The red dotted lines point out the junction sequences of both TR chains. b) An illustration of how tidytcells can help clean TR data. By using tidytcells, non-standard nomenclature in the “messy data” is corrected, and any invalid values are filtered out.

tidytcells is a lightweight Python package that cleans and standardizes T cell receptor (TR), Major Histocompatibility Complex (MH), and Immunoglobulin (IG) data to be IMGT-compliant (IMGT/GENE-DB, IMGT Repertoire). Its main purpose is to make it easier to parse and collate non-standardized TR datasets. Compiling TR data from multiple sources is often difficult because each dataset encodes TR and MH gene names in a slightly different format or nomenclature, and some datasets are even internally inconsistent. tidytcells addresses this by auto-correcting and auto-standardizing your data!
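To make the problem concrete, here is a minimal, self-contained sketch of what "correct non-standard names, filter out invalid ones" can look like. It is a toy illustration only: the function `toy_standardize` and the tiny `KNOWN_GENES` set are hypothetical, it does not use tidytcells, and it is not how tidytcells works internally. It assumes the only deviations in the input are casing, surrounding whitespace, a `TCR` prefix in place of `TR`, and zero-padded gene numbers.

```python
import re

# Hand-picked subset of IMGT TR gene names, for illustration only.
# A real tool would check against the full IMGT reference instead.
KNOWN_GENES = {"TRAV1-1", "TRBV7-2", "TRBJ2-7"}


def toy_standardize(raw):
    """Return an IMGT-style gene name, or None if it is not recognised."""
    name = raw.strip().upper()
    # Some sources write "TCR..." where IMGT uses "TR...".
    name = re.sub(r"^TCR", "TR", name)
    # Drop zero-padding in gene numbers: "07-02" becomes "7-2".
    name = re.sub(r"(?<!\d)0+(?=\d)", "", name)
    # Anything not in the reference set is treated as invalid.
    return name if name in KNOWN_GENES else None


messy = ["TCRBV07-02", "trav1-1", " TRBJ2-7", "not_a_gene"]
cleaned = [toy_standardize(x) for x in messy]
print(cleaned)
```

The key design point the sketch shares with the description above is that every value is either mapped to a single canonical name or rejected, so datasets from different sources become directly comparable after cleaning.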

Tip

The tidytcells.ig submodule is newly added! It provides functionality for standardizing, querying, and retrieving amino acid sequences for immunoglobulin genes/alleles, similar to the existing TR and MH modules. Thanks to Lonneke for implementing this module!
