Usage

tidytcells’ structure

tidytcells comprises several modules, each of which provides a set of functions for processing a particular type of data that bioinformaticians working on T cell receptor (TR) or Major Histocompatibility (MH) data may come across.

The submodules are:

Submodule            For
tidytcells.aa        General amino acid sequence data (e.g. peptide epitopes)
tidytcells.junction  TR JUNCTION or CDR3-IMGT amino acid sequence data
tidytcells.mh        MH gene/allele data
tidytcells.tr        TR gene/allele data

For ease of use, function APIs are standardized across modules wherever possible. For example, each module has a function named standardize (see below) which standardizes data from its category to be IMGT-compliant (IMGT/GENE-DB, IMGT Repertoire). Refer to the API reference for a full overview of tidytcells’ API.

Standardizing TR/junction/peptide-MH data using tidytcells and pandas

This is tidytcells’ primary use case.

Each of tidytcells’ submodules provides a standardize function (standardise is a valid alias) that automates data cleaning for its data category, so these functions can be used together to clean a whole dataset of TR/MH data. Each standardize function can also be used on its own to clean an individual piece of data, for example:

>>> import tidytcells as tt
>>> orig = "A1"
>>> cleaned = tt.mh.standardize(orig)
>>> cleaned
'HLA-A*01'

However, in real-life scenarios one usually wants to clean a whole set of data contained in a table. This is fairly straightforward when tidytcells is used in conjunction with a data analysis tool like pandas. pandas provides a convenient way to apply a data transformation function to many DataFrame cells at once through its Series.map and DataFrame.map methods. For example, given a table of TR/junction data (a similar procedure works for tables with peptide-MH data as well):

>>> import pandas as pd
>>> df = pd.DataFrame(
...     data=[
...         ["TRBV13*01",    "CASSYLPGQGDHYSNQPQHF", "trbj1-5*01"],
...         ["TCRBV28S1*01", "CASSLGQSGANVLTF",      "TRBJ2-6*01"],
...         ["unknown",      "ASSDWGSQNTLY",         "TRBJ2-4*01"]
...     ],
...     columns=["v", "junction", "j"]
... )
>>> df
              v              junction           j
0     TRBV13*01  CASSYLPGQGDHYSNQPQHF  trbj1-5*01
1  TCRBV28S1*01       CASSLGQSGANVLTF  TRBJ2-6*01
2       unknown          ASSDWGSQNTLY  TRBJ2-4*01

One can apply the standardize functions from tidytcells over the whole table at once, like so:

>>> cleaned = df.copy()
>>> cleaned[["v", "j"]] = df[["v", "j"]].map(tt.tr.standardize)
>>> cleaned["junction"] = df["junction"].map(tt.junction.standardize)
>>> cleaned
           v              junction           j
0  TRBV13*01  CASSYLPGQGDHYSNQPQHF  TRBJ1-5*01
1  TRBV28*01       CASSLGQSGANVLTF  TRBJ2-6*01
2       None        CASSDWGSQNTLYF  TRBJ2-4*01
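Note that DataFrame.map is only available in pandas 2.1 and later; earlier versions expose the same element-wise operation as DataFrame.applymap. For code that has to run on both, a small helper along the following lines may be convenient. The name map_cells is hypothetical and is not part of pandas or tidytcells; this is a minimal sketch rather than a recommended API.

```python
def map_cells(df, func):
    """Apply func to every cell of a DataFrame-like object.

    map_cells is a hypothetical helper, not part of pandas or tidytcells.
    """
    # Look the method up on the class rather than the instance, so that a
    # column that happens to be called "map" is not mistaken for the method
    # on older pandas versions.
    if hasattr(type(df), "map"):
        return df.map(func)       # pandas >= 2.1
    return df.applymap(func)      # older pandas versions
```

With such a helper, the V/J step above could be written as cleaned[["v", "j"]] = map_cells(df[["v", "j"]], tt.tr.standardize).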

To apply the functions with optional arguments, one can wrap the standardize functions using lambda functions (see below). For use cases that require more flexibility, one could even define a wrapper function explicitly in the code.

>>> cleaned = df.copy()
>>> cleaned[["v", "j"]] = df[["v", "j"]].map(
...     lambda x: tt.tr.standardize(
...         symbol=x,
...         species="homosapiens",
...         precision="gene"
...     )
... )
>>> cleaned["junction"] = df["junction"].map(
...     lambda x: tt.junction.standardize(
...         seq=x,
...         strict=True
...     )
... )
>>> cleaned
        v              junction        j
0  TRBV13  CASSYLPGQGDHYSNQPQHF  TRBJ1-5
1  TRBV28       CASSLGQSGANVLTF  TRBJ2-6
2    None                  None  TRBJ2-4
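As an alternative to the lambdas above, an explicitly defined wrapper might look like the following. This is only a sketch: make_cell_cleaner is a hypothetical name that is not part of tidytcells, and the decision to turn non-string cells into None is an assumption of this example rather than documented tidytcells behaviour.

```python
def make_cell_cleaner(standardize, **kwargs):
    """Build a one-argument function suitable for Series.map / DataFrame.map.

    make_cell_cleaner is a hypothetical helper, not part of tidytcells.
    `standardize` is any function that takes the raw value as its first
    positional argument, e.g. tt.tr.standardize or tt.junction.standardize.
    """
    def clean(value):
        # Non-string cells (e.g. NaN for missing data) are returned as None
        # instead of being handed to the standardizer. This is a design
        # choice of this sketch.
        if not isinstance(value, str):
            return None
        # Forward the fixed optional arguments on every call.
        return standardize(value, **kwargs)

    return clean
```

For instance, make_cell_cleaner(tt.tr.standardize, species="homosapiens", precision="gene") could be passed to DataFrame.map in place of the first lambda, and make_cell_cleaner(tt.junction.standardize, strict=True) to Series.map in place of the second.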

For more complete documentation of the standardize functions, refer to the API reference.

Querying from IMGT TR/MH genes or alleles

tidytcells also provides the functions tidytcells.tr.query() and tidytcells.mh.query(), which return a set (a FrozenSet) of IMGT gene/allele names from the respective categories. Both functions accept various constraints on the genes/alleles’ functionalities and names to filter the query results. The query functions can be useful when checking whether a particular dataset covers all the TR or MH genes, or when counting how many genes fulfill a particular set of constraints. Since tidytcells ships with a local copy of all relevant data pulled directly from IMGT’s GENE-DB (updated with every new release), queries are fast and do not require an internet connection.
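The coverage check mentioned above can be sketched with plain set operations. The helper below, gene_coverage, is hypothetical and not part of tidytcells; it deliberately takes the reference set as an argument, so it makes no assumption about which constraints are passed to the query functions (see the API reference for their parameters).

```python
def gene_coverage(observed, reference):
    """Compare gene names seen in a dataset against a reference set.

    gene_coverage is a hypothetical helper, not part of tidytcells.
    `reference` is any collection of names, e.g. the frozenset returned by
    tidytcells.tr.query() or tidytcells.mh.query().
    """
    reference = frozenset(reference)
    # Skip None / NaN entries, such as those left by failed standardization.
    seen = {name for name in observed if isinstance(name, str)}
    return {
        "missing": sorted(reference - seen),     # in reference, not in data
        "unexpected": sorted(seen - reference),  # in data, not in reference
        "n_covered": len(reference & seen),
    }
```

Given a standardized table such as cleaned above, observed could be a column like cleaned["j"], provided that the reference set was queried at the same precision (gene or allele) as the column.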

Querying TR gene amino acid sequence data from IMGT GENE-DB

Sometimes, you have a T cell receptor represented as its V and J gene usages and its junction sequence, but you want to represent it in terms of its amino acid sequence. In such situations, the tidytcells.tr.get_aa_sequence() function can help. This function allows you to query amino acid sequence data for any functional TR gene. It provides sequence data for the whole gene exome, as well as for certain important regions (e.g. CDR1 and CDR2 in the V genes). The data is pulled from IMGT’s GENE-DB and, as is the case with tidytcells.tr.query() and tidytcells.mh.query(), all relevant data exists locally within tidytcells (and is updated with every new release), so the queries are fast and require no internet connection.

Other MH utilities

The mh module provides a couple of additional utilities, including get_chain and get_class, each with a self-explanatory name.