Usage

`tidytcells`’ structure

tidytcells is composed of several modules, each of which provides a set of functions for processing a particular type of data that bioinformaticians working on TCR data may come across.

The submodules are:

Submodule	For
`tidytcells.aa`	General amino acid sequence data (e.g. peptide epitopes)
`tidytcells.junction`	TCR junction (CDR3) amino acid data
`tidytcells.mhc`	MHC gene/allele data
`tidytcells.tcr`	TCR gene/allele data

For ease of use, function APIs are standardized across modules wherever possible. For example, each module has a function named standardize (see below) which standardizes data from its category to be IMGT-compliant. Refer to the API reference for a full review of tidytcells’ API.

Standardizing TCR/MHC data using `tidytcells` and pandas 

This is tidytcells’ primary use case.

Each of tidytcells’ submodules provides a standardize function (standardise is a valid alias) that automates data cleaning for its data category, so these functions can be used together to clean a whole dataset of TCR/MHC data. The standardize functions can also be used on their own to clean individual values, for example:

>>> import tidytcells as tt
>>> orig = "A1"
>>> cleaned = tt.mhc.standardize(orig)
>>> cleaned
'HLA-A*01'

However, in practice one usually wants to clean a whole table of data. This is straightforward when tidytcells is used in conjunction with a data analysis tool like pandas, which can apply a transformation function to many DataFrame cells at once through its Series.map and DataFrame.applymap methods (DataFrame.applymap was renamed to DataFrame.map in pandas 2.1). Therefore, given a table of TCR data:

>>> import pandas as pd
>>> df = pd.DataFrame(
...     data=[
...         ["TRBV13*01",    "CASSYLPGQGDHYSNQPQHF", "trbj1-5*01"],
...         ["TCRBV28S1*01", "CASSLGQSGANVLTF",      "TRBJ2-6*01"],
...         ["unknown",      "ASSDWGSQNTLY",         "TRBJ2-4*01"]
...     ],
...     columns=["v", "junction", "j"]
... )
>>> df
              v              junction           j
0     TRBV13*01  CASSYLPGQGDHYSNQPQHF  trbj1-5*01
1  TCRBV28S1*01       CASSLGQSGANVLTF  TRBJ2-6*01
2       unknown          ASSDWGSQNTLY  TRBJ2-4*01

One can apply the standardize functions from tidytcells over the whole table at once, like so:

>>> cleaned = df.copy()
>>> cleaned[["v", "j"]] = df[["v", "j"]].applymap(tt.tcr.standardize)
>>> cleaned["junction"] = df["junction"].map(tt.junction.standardize)
>>> cleaned
           v              junction           j
0  TRBV13*01  CASSYLPGQGDHYSNQPQHF  TRBJ1-5*01
1  TRBV28*01       CASSLGQSGANVLTF  TRBJ2-6*01
2       None        CASSDWGSQNTLYF  TRBJ2-4*01
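
After cleaning, it can help to see which values were left alone, which were corrected, and which were rejected (standardized to None). The following is a minimal sketch in plain Python, not part of tidytcells: the audit helper is a hypothetical name, and the input lists are simply the v column from the tables above, before and after cleaning.

```python
def audit(original, cleaned):
    """Label each value as 'unchanged', 'corrected', or 'rejected'.

    'rejected' means the cleaned value is None, i.e. the original could
    not be standardized. This helper is illustrative, not a tidytcells API.
    """
    report = []
    for before, after in zip(original, cleaned):
        if after is None:
            status = "rejected"
        elif before == after:
            status = "unchanged"
        else:
            status = "corrected"
        report.append((before, after, status))
    return report


# The "v" column from the example tables, before and after standardization.
v_before = ["TRBV13*01", "TCRBV28S1*01", "unknown"]
v_after = ["TRBV13*01", "TRBV28*01", None]

for row in audit(v_before, v_after):
    print(row)
```

With a DataFrame, the same lists could be obtained with df["v"].tolist() and cleaned["v"].tolist().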

To pass optional arguments, one can wrap the standardize functions in lambda functions, as shown below. For use cases that require more flexibility, one can define a named wrapper function explicitly instead.

>>> cleaned = df.copy()
>>> cleaned[["v", "j"]] = df[["v", "j"]].applymap(
...     lambda x: tt.tcr.standardize(
...         gene=x,
...         species="homosapiens",
...         precision="gene"
...     )
... )
>>> cleaned["junction"] = df["junction"].map(
...     lambda x: tt.junction.standardize(
...         seq=x,
...         strict=True
...     )
... )
>>> cleaned
        v              junction        j
0  TRBV13  CASSYLPGQGDHYSNQPQHF  TRBJ1-5
1  TRBV28       CASSLGQSGANVLTF  TRBJ2-6
2    None                  None  TRBJ2-4
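
As an alternative to the lambda wrappers above, a named wrapper can fix the optional arguments once and add extra behaviour. The sketch below is an assumption-laden illustration rather than a tidytcells feature: skip_missing is a hypothetical helper that passes missing values (None or NaN) through as None instead of handing them to the standardizer, and str.upper stands in for a standardize function so that the block runs without tidytcells installed.

```python
import math


def skip_missing(standardizer, **kwargs):
    """Wrap `standardizer` with fixed keyword arguments.

    Missing values (None or float NaN) are returned as None without
    calling the standardizer. Hypothetical helper, not a tidytcells API.
    """
    def wrapper(value):
        if value is None or (isinstance(value, float) and math.isnan(value)):
            return None
        # The value is passed positionally, so the wrapper does not depend
        # on the name of the standardizer's first parameter.
        return standardizer(value, **kwargs)
    return wrapper


# Stand-in standardizer (str.upper) so the sketch is runnable on its own.
clean = skip_missing(str.upper)
print([clean(x) for x in ["trbj1-5", None, float("nan")]])

# With tidytcells, the equivalent of the lambda example would be roughly:
#   clean_gene = skip_missing(tt.tcr.standardize,
#                             species="homosapiens", precision="gene")
#   cleaned[["v", "j"]] = df[["v", "j"]].applymap(clean_gene)
```

A named wrapper like this is easier to reuse across columns and to test in isolation than an inline lambda.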

For more complete documentation of the standardize functions, refer to the API reference.

Querying from IMGT TCR/MHC genes or alleles

tidytcells also provides the functions tidytcells.tcr.query() and tidytcells.mhc.query(), which return a FrozenSet of IMGT gene/allele names from the respective category. Both functions accept various constraints on the genes’/alleles’ functionalities and names to filter the query results. The query functions can be useful for checking whether a particular dataset covers all the TCR or MHC genes, or for counting how many genes fulfill a particular set of constraints.
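
The coverage check mentioned above comes down to set arithmetic on the query result. The sketch below assumes only that the result is a frozenset of name strings; coverage_report is a hypothetical helper, and the reference set is a small hand-written stand-in rather than real query output, so the block runs without tidytcells installed.

```python
from typing import FrozenSet, Iterable, Optional


def coverage_report(observed: Iterable[Optional[str]],
                    reference: FrozenSet[str]) -> dict:
    """Compare observed gene names against a reference set of names.

    None entries (values that failed standardization) are ignored.
    Hypothetical helper, not a tidytcells API.
    """
    observed_set = {name for name in observed if name is not None}
    return {
        "missing": sorted(reference - observed_set),       # in reference only
        "unrecognized": sorted(observed_set - reference),  # in data only
        "n_covered": len(reference & observed_set),
    }


# Stand-in for a query result; in real use this would come from
# tidytcells.tcr.query() with suitable constraints.
reference = frozenset({"TRBJ1-5", "TRBJ2-4", "TRBJ2-6", "TRBJ2-7"})
observed = ["TRBJ1-5", "TRBJ2-6", "TRBJ2-4", None]

report = coverage_report(observed, reference)
print(report)
```

Because the query result is immutable, it can safely be computed once and reused across many such comparisons.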

Other MHC utilities

The mhc module provides some extra utilities, including get_chain and get_class, each with a self-explanatory name.
