complexity

Compute the complexity of a pangenome graph.

The complexity command outputs a file with complexity metrics for an entire graph or for a specified set of regions from a graph.

If a GFA file is provided, the whole graph is processed.

If a GBZ file is provided, you must specify a region or list of regions (as a BED file).

Formulas

The complexity of a region is computed with the following formula, where the sum runs over every node \(n\) in the region.

\[\sum_n \frac{|n| \cdot p_n \cdot (1-p_n)}{L}\]

For a node \(n\) in the region, \(|n|\) represents the length (in base pairs) of its sequence. For example, a node with sequence ‘ATGAC’ has \(|n|=5\).

The fraction (between 0 and 1) of sample haplotypes (aka “walks”) that visit node \(n\) is represented by \(p_n\).
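As a short worked note on the per-node term, and treating \(p_n\) as a proportion between 0 and 1 (an assumption consistent with the \(p_n(1-p_n)\) factor in the formula), the contribution of a single node is bounded as follows:

```latex
\[
0 \;\le\; |n| \cdot p_n \cdot (1 - p_n) \;\le\; \frac{|n|}{4},
\qquad
\text{with the maximum at } p_n = \tfrac{1}{2}
\text{ and the minimum at } p_n \in \{0, 1\}.
\]
```

In other words, a node visited by every walk (or by none) adds nothing to the sum, while a node visited by exactly half of the walks adds the most. For the ‘ATGAC’ node above with \(p_n = 0.5\), the term is \(5 \cdot 0.5 \cdot 0.5 = 1.25\).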

\(L\) can be computed in one of two ways:

  1. If --metrics sequniq-normwalk is specified, \(L\) is the average length of all walks in the region.

  2. If --metrics sequniq-normnode is specified, \(L\) is instead the average length of all nodes in the region.
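The formula and the two choices of \(L\) can be sketched in a few lines of Python. This is an illustration only, not panct's implementation: the function name region_complexity and the input representation (a dict of node lengths plus walks as lists of node IDs) are hypothetical. It also assumes that a walk's length is the sum of the lengths of the nodes it visits, and that a walk counts once toward \(p_n\) even if it visits node \(n\) more than once.

```python
from typing import Dict, List


def region_complexity(
    node_lengths: Dict[str, int],
    walks: List[List[str]],
    metric: str = "sequniq-normwalk",
) -> float:
    """Sketch of sum_n |n| * p_n * (1 - p_n) / L for one region."""
    if not walks or not node_lengths:
        raise ValueError("need at least one walk and one node")
    num_walks = len(walks)

    # p_n: fraction of walks that visit node n at least once (assumption)
    p = {
        n: sum(1 for walk in walks if n in walk) / num_walks
        for n in node_lengths
    }

    if metric == "sequniq-normwalk":
        # L: average walk length, taking a walk's length to be the
        # summed lengths of the nodes it visits (assumption)
        L = sum(sum(node_lengths[n] for n in walk) for walk in walks) / num_walks
    elif metric == "sequniq-normnode":
        # L: average node length in the region
        L = sum(node_lengths.values()) / len(node_lengths)
    else:
        raise ValueError(f"unknown metric: {metric}")

    return sum(node_lengths[n] * p[n] * (1 - p[n]) for n in node_lengths) / L
```

For a toy bubble with node lengths a=4, b=1, c=1, d=4 and two walks a-b-d and a-c-d, the numerator is 0.5 (only b and c contribute). Under these assumptions the normnode variant divides by 2.5 to give 0.2, and the normwalk variant divides by 9 to give roughly 0.056.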

Usage

panct complexity \
  --region REGION or PATH \
  --metrics sequniq-normwalk,sequniq-normnode \
  --reference REFERENCE_ID \
  --out PATH \
  --verbosity [CRITICAL|ERROR|WARNING|INFO|DEBUG|NOTSET] \
  GRAPH

Warning

If you are working with GBZ files, you need either an index for each file or an installation of gbz-base, which can be installed with conda.

conda install -c conda-forge aryarm::gbz-base

Output

The output is a tab-separated file with the following columns:

  1. numnodes: The number of nodes in the region

  2. total_length: The total length of all nodes in the region

  3. numwalks: The number of walks in the region

  4. The complexity metrics requested by --metrics. Refer to the formulas section.

If the --region option is specified, there will be one line in the output for every region. Each line will also be prefixed by the following columns:

  1. chrom: The chromosome of the region

  2. start: The start position of the region

  3. end: The end position of the region
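A minimal Python sketch for loading this output, assuming the column order described above. The function name parse_complexity_output is hypothetical, and whether panct writes a header line is not stated here, so the sketch skips a header only if a line exactly matches the expected column names. Values are kept as strings to avoid guessing at their formatting.

```python
import csv
import io
from typing import Dict, List, Sequence

REGION_COLUMNS = ["chrom", "start", "end"]
BASE_COLUMNS = ["numnodes", "total_length", "numwalks"]


def parse_complexity_output(
    text: str,
    metrics: Sequence[str] = ("sequniq-normwalk",),
    has_regions: bool = False,
) -> List[Dict[str, str]]:
    """Parse tab-separated complexity output into a list of dicts."""
    # Column order assumed from the documentation: optional region
    # columns, then the three base columns, then the requested metrics.
    columns = (REGION_COLUMNS if has_regions else []) + BASE_COLUMNS + list(metrics)
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        # Skip blank lines and a header line, if one is present (assumption)
        if not fields or fields == columns:
            continue
        if len(fields) != len(columns):
            raise ValueError(f"expected {len(columns)} columns, got {len(fields)}")
        rows.append(dict(zip(columns, fields)))
    return rows
```

Each returned row can then be indexed by column name, for example row["numnodes"] or row["sequniq-normwalk"], and converted with int() or float() as needed.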

Examples

By default, tab-separated output is written to standard output.

panct complexity tests/data/basic.gfa

If your input graph is in the GBZ format, you may also use the --region option to select a specific region of the graph in the coordinates of the reference genome. Internally, this uses the gbz-base library to first subset the GBZ to a smaller GFA file.

panct complexity --region chrTest:0-1 tests/data/basic.gbz

Alternatively, you may specify a list of regions as a BED file. In this case, it may also be helpful to write the output to a file.

panct complexity --out basic.tsv --region tests/data/basic.bed tests/data/basic.gbz

All files used in these examples are described here.

Additional examples

Below are additional examples based on the HPRC .gbz format graph (not included in this repo but available here).

# Run on a single region
panct complexity \
  --region chr11:119077050-119178859 \
  --metrics sequniq-normwalk,sequniq-normnode \
  hprc-v1.1-mc-grch38.gbz

# Run on a file with a list of regions
panct complexity \
  --region regions.bed --out test.tsv \
  --metrics sequniq-normwalk,sequniq-normnode \
  hprc-v1.1-mc-grch38.gbz
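If you would rather drive these runs from a script, the commands above can be assembled in Python. This is a hedged sketch that uses only the options documented on this page (--region, --metrics, --out). The helper names build_complexity_command and run_complexity are hypothetical and are not part of panct.

```python
import subprocess
from typing import List, Optional, Sequence


def build_complexity_command(
    graph: str,
    region: Optional[str] = None,
    metrics: Sequence[str] = ("sequniq-normwalk",),
    out: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for a `panct complexity` call."""
    cmd = ["panct", "complexity"]
    if region is not None:
        # Either a single region string or the path to a BED file
        cmd += ["--region", region]
    cmd += ["--metrics", ",".join(metrics)]
    if out is not None:
        cmd += ["--out", out]
    cmd.append(graph)  # the graph path is the positional argument
    return cmd


def run_complexity(*args, **kwargs) -> None:
    """Run the assembled command; requires panct to be installed."""
    subprocess.run(build_complexity_command(*args, **kwargs), check=True)
```

Passing the arguments as a list avoids shell quoting issues, and check=True raises an exception if panct exits with a non-zero status.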

Detailed Usage

panct

panct: A collection of tools for working with pangenomes

panct [OPTIONS] COMMAND [ARGS]...

Options

-v, --version

Show the application’s version and exit.

Default:

False

--install-completion

Install completion for the current shell.

--show-completion

Show completion for the current shell, to copy it or customize the installation.

complexity

Compute complexity scores

panct complexity [OPTIONS] GRAPH

Options

--region <region>

A region in which to compute complexity, or a BED file of regions

Default:

''

--metrics <metrics>

Comma-separated list of which complexity metrics to compute. Options: sequniq-normwalk,sequniq-normnode

Default:

'sequniq-normwalk'

-r, --reference <reference>

The ID of the reference sequence in the GFA file

Default:

'GRCh38'

-o, --out <output_file>

Name of output file

Default:

PosixPath('/dev/stdout')

-v, --verbosity <verbosity>

The level of verbosity desired

Default:

<Verbosity.info: 'INFO'>

Options:

CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET

Arguments

GRAPH

Required argument

Path to the .gfa or .gbz file of a pangenome graph