Analysis Workflow in R

xqubit is a Python package typically used in Python environments such as Jupyter notebooks or standalone scripts. However, gene expression data are often preprocessed in R using packages such as DESeq2 or edgeR. For this reason, this tutorial demonstrates how to use xqubit from R via an interface to Python.

We present an example workflow for constructing gene co-expression networks using xqubit, based on a time-series RNA-seq dataset from Arabidopsis thaliana collected after germination, originally generated to study flower initiation [1].

The analysis consists of five main steps:

  1. Data preparation

  2. Quantum state encoding

  3. Fidelity calculation

  4. Network construction

  5. Visualization

Data Preparation

xqubit requires the following inputs:

  • A normalized gene expression matrix

  • An experimental design data frame with condition, timepoint, and replicate columns

  • Gene identifiers (optional)

Expression data should be organized as a matrix with genes as rows and samples as columns. Assuming raw count data are stored in a tab-delimited file, they can be loaded into R as follows:

x <- read.table("ath.counts.txt", header = TRUE, sep = "\t", row.names = 1)

head(x)
##           T07_1 T07_2 T08_1 T08_2 T09_1 T09_2 T10_1 T10_2 T11_1 T11_2 T12_1 ...
## AT1G01010    89   183    57   123    60    75   224   105    62    86    98 ...
## AT1G01020   271   128   167   368   184   170   718   478   126   189   228 ...
## AT1G01030    34     2    20    30    11    14    17    37     1    10    30 ...
## AT1G01040  1021   645   653  1196   611   924  2474  1941   720   410  1036 ...
## AT1G01046     0     0     0     0     0     0     0     0     0     0     0 ...
## AT1G01050  1561  3334  1452  1651  1300  1224  1977  1403  2388  1864  1111 ...

As column names of the expression matrix encode experimental conditions and replicates (condition_replicate), we create the experimental design table by parsing the column names:

x_design <- data.frame(
    sample    = colnames(x),
    condition = sapply(strsplit(colnames(x), "_"), `[`, 1),
    replicate = sapply(strsplit(colnames(x), "_"), `[`, 2),
    timepoint = as.integer(gsub("T", "", sapply(strsplit(colnames(x), "_"), `[`, 1)))
)

x_design
##    sample condition replicate timepoint
## 1   T07_1       T07         1         7
## 2   T07_2       T07         2         7
## 3   T08_1       T08         1         8
## 4   T08_2       T08         2         8
## 5   T09_1       T09         1         9
## ...

Raw read counts often contain technical biases that can obscure biological signals. Normalization and gene filtering are therefore critical preprocessing steps before constructing networks.

Here we apply variance stabilizing transformation (VST), implemented in DESeq2 [2], for raw counts. One can use other normalization methods as well, such as regularized log transformation of TPM values or log transformed TMM counts.

library(DESeq2)

dds <- DESeqDataSetFromMatrix(x, x_design, design = ~ condition)
dds <- estimateSizeFactors(dds)
x_norm <- assay(vst(dds, blind = TRUE))

Since invariant or lowly expressed genes can generate spurious similarities, genes with low expression or low variability should be removed. The thresholds used below are example values and should be adjusted depending on the characteristics of the dataset.

f1 <- (rowSums(x_norm >= 5) >= (ncol(x_norm) / 3))
f2 <- (apply(x_norm, 1, mad) > 0.5)
x_norm <- x_norm[f1 & f2, ]

The resulting matrix x_norm contains normalized expression values for genes with sufficient expression and variability, ready for downstream analysis with xqubit.

dim(x_norm)
## [1] 1156   20

head(x_norm)
##               T07_1     T07_2     T08_1     T08_2     T09_1     T09_2    T10_1 ...
## AT1G01070 10.813749 11.768753 11.404468 11.101382 11.188867 10.918834 8.869768 ...
## AT1G01110  8.806264  9.565552  8.780190  8.890654  8.613121  9.180422 9.256020 ...
## AT1G01170 10.051471  9.732266 10.286921  9.787186 10.216392  9.812231 9.358475 ...
## AT1G01190 10.285683  9.978618 10.705898 10.702856 10.268512 10.328980 8.509345 ...
## AT1G01470 10.144925  9.706032 10.369087 10.252375 10.426127 10.194991 8.499688 ...
## AT1G01580  7.961483  8.341188  7.890097  8.035285  8.010770  8.045538 7.551140 ...

Quantum State Encoding

After preparing the data, gene expression profiles are embedded into quantum states using xqubit. Because xqubit is implemented in Python, we access it from R via the reticulate package, which provides an interface for calling Python functions from within R.

First, load reticulate and specify the Python environment where xqubit is installed:

library(reticulate)

use_python("/bin/python", required = TRUE)  # adjust path as needed

Next, import the xqubit module in R and create a xqubit.ExpressionData object using the normalized expression matrix and the experimental design data frame.

xqubit <- import("xqubit")

d <- xqubit$ExpressionData(as.matrix(x_norm), x_design, rownames(x_norm))

Then, build quantum-state embeddings.

psi <- xqubit$qstate$build(d, encoding='TDP')

The expression profile of each gene can be visualized using the xqubit.ExpressionData.plot method, where the x-axis represents conditions (time points) and the y-axis represents normalized expression values.

d$plot('AT5G61850', file_name='AT5G61850.exp.png')
Visualization of expression profile of AT5G61850

The corresponding quantum state can be visualized in the complex plane using the xqubit.qstate.plot method.

idx <- d$genes('AT5G61850')
xqubit$qstate$plot(psi, idx, file_name='AT5G61850.qstate.png')
Visualization of quantum state of AT5G61850 in the complex plane

Fidelity Calculation

Fidelity between genes is calculated using the xqubit.metrics.calc_fidelity function, which takes the psi as input and returns a fidelity matrix where rows and columns correspond to genes and values represent pairwise similarity.

fmat <- xqubit$metrics$calc_fidelity(psi)

dim(fmat)
## [1] 1156 1156

head(fmat)
##           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7] ...
## [1,] 1.0000000 0.8027602 0.7253503 0.9887822 0.9910231 0.9532281 0.8772183 ...
## [2,] 0.8027602 1.0000000 0.5417154 0.7918268 0.7628359 0.9059334 0.9328506 ...
## [3,] 0.7253503 0.5417154 1.0000000 0.6901691 0.7442489 0.6326834 0.7284880 ...
## [4,] 0.9887822 0.7918268 0.6901691 1.0000000 0.9875654 0.9556262 0.8796378 ...
## [5,] 0.9910231 0.7628359 0.7442489 0.9875654 1.0000000 0.9366521 0.8680249 ...
## [6,] 0.9532281 0.9059334 0.6326834 0.9556262 0.9366521 1.0000000 0.9344884 ...

Network Construction

To construct a gene-gene network based on fidelity similarities, we use functions from the xqubit.nx module, which implements a flexible framework for building and analyzing networks.

There are three parameters that control network construction and community detection:

  • f: minimum fidelity value required for a gene-gene connection

  • k: number of nearest neighbors retained per gene

  • resolution: resolution parameter for Leiden community detection

To identify suitable parameter values, a grid search can be performed. The xqubit.nx.scan_network_params method evaluates each network based on modularity, stability, number of detected communities, and other metrics.

nx_params <- xqubit$nx$scan_network_params(
            fmat,
            s_cutoffs = c(0.6, 0.7, 0.8, 0.9),
            k_cutoffs = c(10, 20, 30),
            resolutions = c(1.0, 1.5))

The xqubit.nx.rank_network_params function ranks parameter combinations based on predefined acceptance ranges and Pareto optimality with respect to selected evaluation metrics. Using the default acceptance ranges and optimization variables, we can inspect the ranked parameter combinations as follows:

nx_params <- xqubit$nx$rank_network_params(nx_params)

nx_params <- data.frame(nx_params$to_dict(orient='list'))
best_params <- nx_params[1, ]

Note that user can customize the acceptance ranges and optimization variables by passing the opt_vars argument to xqubit.nx.rank_network_params.

Using the selected parameters, the fidelity network is constructed and communities are detected:

nx <- xqubit$nx$build_network(fmat, s = best_params$s, k = best_params$k)
cl <- xqubit$nx$detect_communities(nx, resolution = best_params$resolution)

cl is a vector of community labels for each gene, where labels start from 0 and genes sharing the same label belong to the same community. Genes with no connections, as well as those in small communities that do not meet the minimum size requirement, are assigned to community 0. Thus, community 0 is typically excluded from downstream analyses.

table(cl)
##   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14
## 117 134 134  94  91  89  82  70  70  69  69  48  39  25  25

Communities can also be saved to a file:

comms_df <- data.frame(gene = rownames(x_norm), community = cl)

write.table(comms_df, "communities.txt",
            sep = "\t", row.names = FALSE, col.names = TRUE, quote = FALSE)

Visualization

Visualization utilities are available for inspecting expression patterns within communities.

z <- d$exp(avg_replicates = TRUE, zscore = TRUE)
xqubit$nx$plot_communities(cl, z, file_name = "comm_expression.png")
Gene expression profiles within communities

References