complexity

Compute the complexity of a pangenome graph.

The complexity command outputs a file with complexity metrics for an entire graph or for a specified set of regions from a graph.

If a GFA file is provided, the whole graph is processed.

If a GBZ file is provided, you must specify a region or list of regions (as a BED file).

Formulas

The complexity of a region is computed according to the following.

\[\sum_n \frac{|n|*p_n*(1-p_n)}L\]

For a node \(n\) in the region, \(|n|\) represents the length (in base pairs) of the sequence. So a node with sequence ‘ATGAC’ would have \(|n|=5\), for example.

The percent of sample haplotypes (aka “walks”) that visit node \(n\) is represented by \(p_n\).

\(L\) can be computed in one of two ways:

If --metrics sequniq-normwalk is specified, \(L\) is computed as the average length of all walks in the region
If --metrics sequniq-normnode is specified, \(L\) instead represents the average length of all nodes in the region

Usage

panct complexity \
  --region REGION or PATH \
  --metrics sequniq-normwalk,sequniq-normnode \
  --reference REFERENCE_ID \
  --out PATH \
  --verbosity [CRITICAL|ERROR|WARNING|INFO|DEBUG|NOTSET] \
  GFAFILE

Warning

You need an index for the GBZ files, if working with them, or you must have gbz-base installed.

conda install -c conda-forge aryarm::gbz-base

Output

The output is a tab-separated file with the following columns:

numnodes: The number of nodes in the region
total_length: The total length of all nodes in the region
numwalks: The number of walks in the region
The complexity metrics requested by --metrics. Refer to the formulas section.

If the --region option is specified, there will be one line in the output for every region. Each line will also be prefixed by the following columns:

chrom: The chromosome of the region
start: The start position of the region
end: The end position of the region

Examples

By default, tab-separated output is written to standard out.

panct complexity tests/data/basic.gfa

If your input graph is in the GBZ format, you may also use the --region option to select a specific region of the graph in the coordinates of the reference genome. Internally, this uses the gbz-base library to first subset the GBZ to a smaller GFA file.

panct complexity --region chrTest:0-1 tests/data/basic.gbz

You may also specify a list of regions as a BED file, instead. In this case, it might also be helpful to write output to a file.

panct complexity --out basic.tsv --region tests/data/basic.bed tests/data/basic.gbz

All files used in these examples are described here.

Additional examples

Below are additional examples based on the HPRC .gbz format graph (not included in this repo but available here).

# Run on a single region
panct complexity \
  --region chr11:119077050-119178859 \
  --metrics sequniq-normwalk,sequniq-normnode \
  hprc-v1.1-mc-grch38.gbz

# Run on a file with a list of regions
panct complexity \
  --region regions.bed --out test.tsv \
  --metrics sequniq-normwalk,sequniq-normnode \
  hprc-v1.1-mc-grch38.gbz

Detailed Usage

panct

panct: A collection of tools for working with pangenomes

panct [OPTIONS] COMMAND [ARGS]...

Options

-v, --version

Show the application’s version and exit.

Default:: False

--install-completion: Install completion for the current shell.

--show-completion: Show completion for the current shell, to copy it or customize the installation.

complexity

Compute complexity scores

panct complexity [OPTIONS] GRAPH

Options

--region <region>

A region in which to compute complexity, or a BED file of regions

Default:: ''

--metrics <metrics>

Comma-separated list of which complexity metrics to compute. Options: sequniq-normwalk,sequniq-normnode

Default:: 'sequniq-normwalk'

-r, --reference <reference>

The ID of the reference sequence in the GFA file

Default:: 'GRCh38'

-o, --out <output_file>

Name of output file

Default:: PosixPath('/dev/stdout')

-v, --verbosity <verbosity>

The level of verbosity desired

Default:: <Verbosity.info: 'INFO'>
Options:: CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET

Arguments

GRAPH

Required argument

Path to the .gfa or .gbz file of a pangenome graph