Usage

tidytcells’ structure

tidytcells is comprised of several modules, each of which provide a set of functions that help process a particular type of data that bioinformaticians working on T cell receptor (TR) or Major Histocompatibility (MH) data may come accross.

The submodules are:

Submodule

For

tidytcells.aa

General amino acid sequence data (e.g. peptide epitopes)

tidytcells.junction

TR JUNCTION or CDR3-IMGT amino acid sequence data

tidytcells.mh

MH gene/allele data

tidytcells.tr

TR gene/allele data

For ease of use, function APIs are standardized accross modules wherever possible- for example, each module has a function named standardize (see below) which standardizes data from each category to be IMGT-compliant (IMGT/GENE-DB, IMGT Repertoire). Refer to here for a full review of tidytcells’ API.

Standardizing TR/junction/peptide-MH data using tidytcells and pandas

This is tidytcells’ primary usecase.

Since each of tidytcells’ submodules provide a standardize (standardise is a valid alias as well) function that automates data cleaning in their respective data category, these functions can be used in ensemble to clean a whole dataset of TR/MH data. Now, these standardize functions can be used on their own to clean individual pieces of data- that is for example:

>>> import tidytcells as tt
>>> orig = "A1"
>>> cleaned = tt.mh.standardize(orig)
>>> cleaned
'HLA-A*01'

However, in real-life scenarios one would like to clean a whole set of data contained in a table. This can be achieved in a fairly straightforward manner by using tidytcells in conjunction with a data analysis tool like pandas. Pandas provides a nice way to blanket-apply data transformation functions to multiple DataFrame cells through their Series.map and DataFrame.map methods. For example, given a table of TR/junction data (a similar procedure would work for tables with peptide-MH data as well):

>>> import pandas as pd
>>> df = pd.DataFrame(
...     data=[
...         ["TRBV13*01",    "CASSYLPGQGDHYSNQPQHF", "trbj1-5*01"],
...         ["TCRBV28S1*01", "CASSLGQSGANVLTF",      "TRBJ2-6*01"],
...         ["unknown",      "ASSDWGSQNTLY",         "TRBJ2-4*01"]
...     ],
...     columns=["v", "junction", "j"]
... )
>>> df
              v              junction           j
0     TRBV13*01  CASSYLPGQGDHYSNQPQHF  trbj1-5*01
1  TCRBV28S1*01       CASSLGQSGANVLTF  TRBJ2-6*01
2       unknown          ASSDWGSQNTLY  TRBJ2-4*01

One can apply the standardize functions from tidytcells over the whole table at once, like so:

>>> cleaned = df.copy()
>>> cleaned[["v", "j"]] = df[["v", "j"]].map(tt.tr.standardize)
>>> cleaned["junction"] = df["junction"].map(tt.junction.standardize)
>>> cleaned
           v              junction           j
0  TRBV13*01  CASSYLPGQGDHYSNQPQHF  TRBJ1-5*01
1  TRBV28*01       CASSLGQSGANVLTF  TRBJ2-6*01
2       None        CASSDWGSQNTLYF  TRBJ2-4*01

To apply the functions with optional arguments, one can wrap the standardize functions using lambda functions (see below). For use cases that require more flexibility, one could even define a wrapper function explicitly in the code.

>>> cleaned = df.copy()
>>> cleaned[["v", "j"]] = df[["v", "j"]].map(
...     lambda x: tt.tr.standardize(
...         gene=x,
...         species="homosapiens",
...         precision="gene"
...     )
... )
>>> cleaned["junction"] = df["junction"].map(
...     lambda x: tt.junction.standardize(
...         seq=x,
...         strict=True
...     )
... )
>>> cleaned
        v              junction        j
0  TRBV13  CASSYLPGQGDHYSNQPQHF  TRBJ1-5
1  TRBV28       CASSLGQSGANVLTF  TRBJ2-6
2    None                  None  TRBJ2-4

For more complete documentations of the standardize functions, refer to the api reference.

Querying from IMGT TR/MH genes or alleles

tidytcells also provides the nifty functions tidytcells.tr.query() and tidytcells.mh.query() that allows users to obtain a list (actually a FrozenSet) of IMGT gene/allele names from the respective categories. The functions allow the user to provide various constraints relating to the genes/alleles’ functionalities and names to filter the query results as well. The query functions can be useful when checking if a particular dataset covers all the TR or MH genes, or counting how many genes fulfill a particular set of constraints.

Other MH utilities

The mh module provides a couple more extra goodies, including get_chain and get_class, each with self-explanatory names.