complexity
Compute the complexity of a pangenome graph.
The complexity command outputs a file with complexity metrics for an entire graph or for a specified set of regions from a graph.
If a GFA file is provided, the whole graph is processed.
If a GBZ file is provided, you must specify a region or list of regions (as a BED file).
Formulas
The complexity of a region is computed according to the following.
For a node \(n\) in the region, \(|n|\) represents the length (in base pairs) of the sequence. So a node with sequence ‘ATGAC’ would have \(|n|=5\), for example.
The percent of sample haplotypes (aka “walks”) that visit node \(n\) is represented by \(p_n\).
\(L\) can be computed in one of two ways:
If
--metrics sequniq-normwalkis specified, \(L\) is computed as the average length of all walks in the regionIf
--metrics sequniq-normnodeis specified, \(L\) instead represents the average length of all nodes in the region
Usage
panct complexity \
--region REGION or PATH \
--metrics sequniq-normwalk,sequniq-normnode \
--reference REFERENCE_ID \
--out PATH \
--verbosity [CRITICAL|ERROR|WARNING|INFO|DEBUG|NOTSET] \
GFAFILE
Warning
You need an index for the GBZ files, if working with them, or you must have gbz-base installed.
conda install -c conda-forge aryarm::gbz-base
Output
The output is a tab-separated file with the following columns:
numnodes: The number of nodes in the region
total_length: The total length of all nodes in the region
numwalks: The number of walks in the region
The complexity metrics requested by
--metrics. Refer to the formulas section.
If the --region option is specified, there will be one line in the output for every region. Each line will also be prefixed by the following columns:
chrom: The chromosome of the region
start: The start position of the region
end: The end position of the region
Examples
By default, tab-separated output is written to standard out.
panct complexity tests/data/basic.gfa
If your input graph is in the GBZ format, you may also use the --region option to select a specific region of the graph in the coordinates of the reference genome. Internally, this uses the gbz-base library to first subset the GBZ to a smaller GFA file.
panct complexity --region chrTest:0-1 tests/data/basic.gbz
You may also specify a list of regions as a BED file, instead. In this case, it might also be helpful to write output to a file.
panct complexity --out basic.tsv --region tests/data/basic.bed tests/data/basic.gbz
All files used in these examples are described here.
Additional examples
Below are additional examples based on the HPRC .gbz format graph (not included in this repo but available here).
# Run on a single region
panct complexity \
--region chr11:119077050-119178859 \
--metrics sequniq-normwalk,sequniq-normnode \
hprc-v1.1-mc-grch38.gbz
# Run on a file with a list of regions
panct complexity \
--region regions.bed --out test.tsv \
--metrics sequniq-normwalk,sequniq-normnode \
hprc-v1.1-mc-grch38.gbz
Detailed Usage
panct
panct: A collection of tools for working with pangenomes
panct [OPTIONS] COMMAND [ARGS]...
Options
- -v, --version
Show the application’s version and exit.
- Default:
False
- --install-completion
Install completion for the current shell.
- --show-completion
Show completion for the current shell, to copy it or customize the installation.
complexity
Compute complexity scores
panct complexity [OPTIONS] GRAPH
Options
- --region <region>
A region in which to compute complexity, or a BED file of regions
- Default:
''
- --metrics <metrics>
Comma-separated list of which complexity metrics to compute. Options: sequniq-normwalk,sequniq-normnode
- Default:
'sequniq-normwalk'
- -r, --reference <reference>
The ID of the reference sequence in the GFA file
- Default:
'GRCh38'
- -o, --out <output_file>
Name of output file
- Default:
PosixPath('/dev/stdout')
- -v, --verbosity <verbosity>
The level of verbosity desired
- Default:
<Verbosity.info: 'INFO'>- Options:
CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET
Arguments
- GRAPH
Required argument
Path to the .gfa or .gbz file of a pangenome graph