Usage

tidytcells’ structure

tidytcells comprises several modules, each of which provides a set of functions for processing a particular type of data that bioinformaticians working with TCR data may come across.

The submodules are:

Submodule            For
tidytcells.aa        General amino acid sequence data (e.g. peptide epitopes)
tidytcells.junction  TCR junction (CDR3) amino acid data
tidytcells.mhc       MHC gene/allele data
tidytcells.tcr       TCR gene/allele data

For ease of use, function APIs are standardised across modules wherever possible. For example, each module has a function named standardise (see below), which standardises data from its category to be IMGT-compliant. Refer to the API reference for a full review of tidytcells’ API.

Standardising TCR/MHC data using tidytcells and pandas

This is tidytcells’ primary use case.

Each of tidytcells’ submodules provides a standardise function (standardize is a valid alias) that automates data cleaning for its data category, so these functions can be used together to clean a whole dataset of TCR/MHC data. Each standardise function can also be used on its own to clean an individual piece of data, for example:

>>> import tidytcells as tt
>>> orig = "A1"
>>> cleaned = tt.mhc.standardise(orig)
>>> cleaned
'HLA-A*01'

However, in real-life scenarios one usually wants to clean a whole set of data contained in a table. This is fairly straightforward when tidytcells is used in conjunction with a data analysis tool like pandas. pandas provides a convenient way to apply a transformation function to every cell of a Series or DataFrame through its Series.map and DataFrame.applymap methods (in pandas 2.1 and later, DataFrame.applymap is deprecated in favour of the equivalent DataFrame.map). Therefore, given a table of TCR data:

>>> import pandas as pd
>>> df = pd.DataFrame(
...     data=[
...         ["TRBV13*01",    "CASSYLPGQGDHYSNQPQHF", "trbj1-5*01"],
...         ["TCRBV28S1*01", "CASSLGQSGANVLTF",      "TRBJ2-6*01"],
...         ["unknown",      "ASSDWGSQNTLY",         "TRBJ2-4*01"]
...     ],
...     columns=["v", "junction", "j"]
... )
>>> df
              v              junction           j
0     TRBV13*01  CASSYLPGQGDHYSNQPQHF  trbj1-5*01
1  TCRBV28S1*01       CASSLGQSGANVLTF  TRBJ2-6*01
2       unknown          ASSDWGSQNTLY  TRBJ2-4*01

One can apply the standardise functions from tidytcells over the whole table at once, like so:

>>> cleaned = df.copy()
>>> cleaned[["v", "j"]] = df[["v", "j"]].applymap(tt.tcr.standardise)
>>> cleaned["junction"] = df["junction"].map(tt.junction.standardise)
>>> cleaned
           v              junction           j
0  TRBV13*01  CASSYLPGQGDHYSNQPQHF  TRBJ1-5*01
1  TRBV28*01       CASSLGQSGANVLTF  TRBJ2-6*01
2       None        CASSDWGSQNTLYF  TRBJ2-4*01
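In the output above, the value that could not be resolved ("unknown") comes back as None. A small helper can list which inputs were lost in this way. The following is only a sketch: find_unresolved and toy_standardise are hypothetical names, and toy_standardise is a stand-in that does not reproduce tidytcells' logic; it exists so the example runs without any third-party packages.

```python
from typing import Callable, List, Optional


def find_unresolved(
    values: List[str], standardise: Callable[[str], Optional[str]]
) -> List[str]:
    """Return the inputs for which `standardise` yields None."""
    return [value for value in values if standardise(value) is None]


def toy_standardise(value: str) -> Optional[str]:
    # Stand-in only (an assumption for this sketch, NOT tidytcells' logic):
    # upper-case the value and accept it if it starts with "TR".
    upper = value.upper()
    return upper if upper.startswith("TR") else None


unresolved = find_unresolved(["TRBV13*01", "trbj1-5*01", "unknown"], toy_standardise)
print(unresolved)  # ['unknown']
```

In real use, one would presumably pass tt.tcr.standardise in place of toy_standardise and a column converted to a list, e.g. df["v"].tolist().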

To pass optional arguments, one can wrap the standardise functions in lambda functions (see below). For use cases that require more flexibility, one can instead define a wrapper function explicitly.

>>> cleaned = df.copy()
>>> cleaned[["v", "j"]] = df[["v", "j"]].applymap(
...     lambda x: tt.tcr.standardise(
...         gene=x,
...         species="homosapiens",
...         precision="gene"
...     )
... )
>>> cleaned["junction"] = df["junction"].map(
...     lambda x: tt.junction.standardise(
...         seq=x,
...         strict=True
...     )
... )
>>> cleaned
        v              junction        j
0  TRBV13  CASSYLPGQGDHYSNQPQHF  TRBJ1-5
1  TRBV28       CASSLGQSGANVLTF  TRBJ2-6
2    None                  None  TRBJ2-4
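As an illustration of the explicit-wrapper approach, the sketch below builds a reusable cleaner that forwards keyword arguments and skips missing cells before calling the wrapped function. make_cleaner and toy_standardise are hypothetical names introduced here; toy_standardise merely imitates the shape of a standardise function so that the example runs on its own, and whether the real functions need such a guard for missing values is not something this page states.

```python
import math
from typing import Any, Callable, Optional


def make_cleaner(
    standardise: Callable[..., Optional[str]], **kwargs: Any
) -> Callable[[Any], Optional[str]]:
    """Wrap `standardise`, forwarding `kwargs` and skipping missing cells."""

    def clean(value: Any) -> Optional[str]:
        # Treat None and float NaN (how pandas marks empty cells) as missing.
        if value is None or (isinstance(value, float) and math.isnan(value)):
            return None
        return standardise(value, **kwargs)

    return clean


def toy_standardise(value: str, precision: str = "allele") -> str:
    # Stand-in only (an assumption for this sketch, NOT tidytcells' logic):
    # upper-case the name and optionally drop the allele suffix.
    value = value.upper()
    return value.split("*")[0] if precision == "gene" else value


clean_gene = make_cleaner(toy_standardise, precision="gene")
results = [clean_gene(x) for x in ["trbj1-5*01", None, float("nan")]]
print(results)  # ['TRBJ1-5', None, None]
```

With tidytcells, the same factory could presumably be used as df["v"].map(make_cleaner(tt.tcr.standardise, species="homosapiens", precision="gene")), which keeps the optional arguments in one place when several columns are cleaned the same way.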

For more complete documentation of the standardise functions, refer to the API reference.

Querying from IMGT TCR/MHC genes or alleles

tidytcells also provides the functions tidytcells.tcr.query() and tidytcells.mhc.query(), which return a frozenset of IMGT gene/allele names from the respective category. Both functions accept various constraints on the genes’/alleles’ functionality and names to filter the query results. The query functions can be useful when checking whether a particular dataset covers all TCR or MHC genes, or when counting how many genes fulfil a particular set of constraints.
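Because the query functions return a set, a coverage check reduces to set arithmetic. The sketch below assumes a reference set has already been obtained; here it is a short hand-written stand-in rather than a real query result, and coverage_report is a hypothetical helper name. The query arguments themselves are left out on purpose, since they are documented in the API reference.

```python
from typing import Dict, FrozenSet, Iterable, List


def coverage_report(
    observed: Iterable[str], reference: FrozenSet[str]
) -> Dict[str, List[str]]:
    """Compare the names seen in a dataset against a reference set."""
    observed_set = set(observed)
    return {
        # Reference names that never appear in the dataset.
        "missing": sorted(reference - observed_set),
        # Dataset names that are absent from the reference.
        "unexpected": sorted(observed_set - reference),
    }


# Hand-written stand-in for a query result (an assumption for this sketch);
# in real use this would be the frozenset returned by tt.tcr.query(...).
reference = frozenset({"TRBJ1-5", "TRBJ2-4", "TRBJ2-6", "TRBJ2-7"})

# "TRBJ9-9" is a made-up name used to show the "unexpected" case.
report = coverage_report(["TRBJ1-5", "TRBJ2-6", "TRBJ2-4", "TRBJ9-9"], reference)
print(report["missing"])     # ['TRBJ2-7']
print(report["unexpected"])  # ['TRBJ9-9']
```

Counting how many genes satisfy a set of constraints is then simply len() of the returned set.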

Other MHC utilities

The mhc module provides a few additional utilities, including get_chain and get_class, whose names are self-explanatory.