Usage
tidytcells’ structure
tidytcells is composed of several modules, each of which provides a set of functions for processing a particular type of data that bioinformaticians working with T cell receptor (TR) or major histocompatibility (MH) data may come across.
The submodules are:
| Submodule | For |
|---|---|
| aa | General amino acid sequence data (e.g. peptide epitopes) |
| junction | TR JUNCTION or CDR3-IMGT amino acid sequence data |
| mh | MH gene/allele data |
| tr | TR gene/allele data |
For ease of use, function APIs are standardized across modules wherever possible. For example, each module has a function named standardize (see below) that standardizes data from its category to be IMGT-compliant (IMGT/GENE-DB, IMGT Repertoire).
Refer to the API reference for a full review of tidytcells’ API.
Standardizing TR/junction/peptide-MH data using tidytcells and pandas
This is tidytcells’ primary use case.
Since each of tidytcells’ submodules provides a standardize function (standardise is a valid alias) that automates data cleaning for its own data category, these functions can be used together to clean a whole dataset of TR/MH data.
The standardize functions can also be used on their own to clean individual pieces of data. For example:
>>> import tidytcells as tt
>>> orig = "A1"
>>> cleaned = tt.mh.standardize(orig)
>>> cleaned
'HLA-A*01'
In practice, however, one usually wants to clean a whole table of data.
This is straightforward when tidytcells is used in conjunction with a data analysis tool like pandas.
pandas can apply a transformation function to every cell of a column or table through its Series.map and DataFrame.map methods (DataFrame.map requires pandas 2.1 or later; earlier versions call it DataFrame.applymap).
For example, given a table of TR/junction data (a similar procedure would work for tables with peptide-MH data as well):
>>> import pandas as pd
>>> df = pd.DataFrame(
...     data=[
...         ["TRBV13*01", "CASSYLPGQGDHYSNQPQHF", "trbj1-5*01"],
...         ["TCRBV28S1*01", "CASSLGQSGANVLTF", "TRBJ2-6*01"],
...         ["unknown", "ASSDWGSQNTLY", "TRBJ2-4*01"]
...     ],
...     columns=["v", "junction", "j"]
... )
>>> df
              v              junction           j
0     TRBV13*01  CASSYLPGQGDHYSNQPQHF  trbj1-5*01
1  TCRBV28S1*01       CASSLGQSGANVLTF  TRBJ2-6*01
2       unknown          ASSDWGSQNTLY  TRBJ2-4*01
One can apply the standardize functions from tidytcells over the whole table at once, like so:
>>> cleaned = df.copy()
>>> cleaned[["v", "j"]] = df[["v", "j"]].map(tt.tr.standardize)
>>> cleaned["junction"] = df["junction"].map(tt.junction.standardize)
>>> cleaned
           v              junction           j
0  TRBV13*01  CASSYLPGQGDHYSNQPQHF  TRBJ1-5*01
1  TRBV28*01       CASSLGQSGANVLTF  TRBJ2-6*01
2       None        CASSDWGSQNTLYF  TRBJ2-4*01
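In the output above, the value that could not be standardized ("unknown") comes back as None. The following is a minimal sketch of how one might list such entries for manual review. The helper unstandardized_entries is hypothetical and not part of tidytcells; it is shown on plain lists copied from the tables above so that it runs without pandas or tidytcells installed.

```python
def unstandardized_entries(original, cleaned):
    """Pair each position with its original value where cleaning produced None.

    Hypothetical helper (not part of tidytcells). Both arguments are
    equal-length sequences of cell values.
    """
    return [
        (position, raw)
        for position, (raw, tidy) in enumerate(zip(original, cleaned))
        if tidy is None
    ]


# The "v" column before and after standardization, copied from the tables above.
v_original = ["TRBV13*01", "TCRBV28S1*01", "unknown"]
v_cleaned = ["TRBV13*01", "TRBV28*01", None]

failures = unstandardized_entries(v_original, v_cleaned)
print(failures)  # [(2, 'unknown')]
```

The same function should also accept pandas Series such as df["v"] and cleaned["v"], although if your pandas version stores missing values as NaN rather than None, the `tidy is None` check would need to become a pandas.isna test.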
To pass optional arguments, one can wrap the standardize functions in lambda functions, as shown below.
For use cases that require more flexibility, one can instead define a named wrapper function.
>>> cleaned = df.copy()
>>> cleaned[["v", "j"]] = df[["v", "j"]].map(
...     lambda x: tt.tr.standardize(
...         gene=x,
...         species="homosapiens",
...         precision="gene"
...     )
... )
>>> cleaned["junction"] = df["junction"].map(
...     lambda x: tt.junction.standardize(
...         seq=x,
...         strict=True
...     )
... )
>>> cleaned
        v              junction        j
0  TRBV13  CASSYLPGQGDHYSNQPQHF  TRBJ1-5
1  TRBV28       CASSLGQSGANVLTF  TRBJ2-6
2    None                  None  TRBJ2-4
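As an alternative to the lambdas above, here is a sketch of what a named wrapper might look like. The factory make_cell_cleaner is a hypothetical helper, not part of tidytcells, and its handling of non-string cells and surrounding whitespace is an illustrative choice rather than something the library requires. To keep the block runnable without tidytcells, it is demonstrated with a toy stand-in function that does not reflect how tidytcells actually standardizes data.

```python
def make_cell_cleaner(standardize, **options):
    """Build a one-argument function suitable for Series.map / DataFrame.map.

    `standardize` is any standardize-style callable that takes the raw value
    as its first positional argument; `options` are forwarded unchanged.
    """
    def clean_cell(value):
        # Skip cells that are not strings (e.g. NaN from missing data)
        # instead of handing them to the standardizer.
        if not isinstance(value, str):
            return None
        # Strip stray whitespace before standardizing.
        return standardize(value.strip(), **options)

    return clean_cell


# Toy stand-in used only to keep this sketch self-contained.
# It is NOT how tidytcells standardizes anything.
def toy_standardize(symbol, precision="allele"):
    symbol = symbol.upper()
    return symbol.split("*")[0] if precision == "gene" else symbol


clean_gene = make_cell_cleaner(toy_standardize, precision="gene")
print(clean_gene(" trbj1-5*01 "))  # TRBJ1-5
print(clean_gene(float("nan")))    # None
```

With tidytcells installed, one would presumably pass the real function instead, for example df[["v", "j"]].map(make_cell_cleaner(tt.tr.standardize, species="homosapiens", precision="gene")), which mirrors the optional arguments used in the lambda example.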
For more complete documentation of the standardize functions, refer to the API reference.
Querying from IMGT TR/MH genes or alleles
tidytcells also provides the functions tidytcells.tr.query() and tidytcells.mh.query(), which return a set (specifically, a FrozenSet) of IMGT gene/allele names from the respective category.
Users can filter the query results by supplying constraints on the genes’/alleles’ functionality and names.
The query functions can be useful when checking if a particular dataset covers all the TR or MH genes, or counting how many genes fulfill a particular set of constraints.
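Because the query result is a set, a coverage check of this kind reduces to ordinary set arithmetic. The sketch below assumes the reference set has already been obtained. Here it is a small hand-written frozenset standing in for a query result, not the complete IMGT list, and summarize_coverage is a hypothetical helper rather than a tidytcells function. See the API reference for the arguments the query functions actually accept.

```python
def summarize_coverage(observed, reference):
    """Compare observed gene names against a reference set.

    Returns (missing, unexpected): reference genes absent from the data,
    and observed genes that are not in the reference.
    """
    # Ignore cells that failed standardization and were set to None.
    observed_genes = {gene for gene in observed if gene is not None}
    missing = sorted(reference - observed_genes)
    unexpected = sorted(observed_genes - reference)
    return missing, unexpected


# Toy stand-in for a query result: a few human TRBJ genes only.
reference = frozenset({"TRBJ1-5", "TRBJ2-4", "TRBJ2-6", "TRBJ2-7"})

# The gene-level "j" column from the cleaned table above.
observed = ["TRBJ1-5", "TRBJ2-6", "TRBJ2-4"]

missing, unexpected = summarize_coverage(observed, reference)
print(len(reference))  # 4
print(missing)         # ['TRBJ2-7']
print(unexpected)      # []
```

Counting how many genes satisfy a set of constraints is then simply len() of the returned set.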
Other MH utilities
The mh module provides a few additional utilities, including get_chain and get_class, which report the chain and the class of an MH gene, respectively.