4.1 HOBIT

HOBIT (Sun et al. 2025) is a statistical method designed to detect homeologs whose expression ratios differ across multiple conditions. It employs a likelihood ratio test (LRT) to compare two hierarchical models: a full model that allows homeolog expression ratios to vary among conditions and a reduced model that assumes constant ratios across all conditions.

HOBIT models homeolog expression with a negative binomial (NB) distribution. Specifically, the expression level of a homeolog on the $i$-th subgenome, denoted as $x_i$, is assumed to follow an NB distribution with mean $\mu\theta_i$ and dispersion $\phi$,

\[ x_{i} \sim NB(\mu\theta_{i}, \phi) \]

where $\mu$ represents the total gene expression across all subgenomes and $\theta_{i}$ denotes the homeolog expression ratio for the $i$-th subgenome.

By default, HOBIT samples the parameters $\mu$ and $\boldsymbol{\theta} = \left(\theta_{1}, \theta_{2}, \cdots \right)$ from non-informative prior (NIP) distributions. The dispersion parameter $\phi$ is estimated using the estimateDisp() function from the edgeR package (Chen et al. 2025). Based on these assumptions, HOBIT performs Markov chain Monte Carlo (MCMC) sampling to fit both the full and reduced models and to compute their respective likelihoods. Then, a LRT is conducted using the likelihoods derived from the two models.

The hobit() function provides an end-to-end interface for parameter estimation, model fitting, likelihood computation, and hypothesis testing. It internally uses the sample() function from the cmdstanr package (Gabry et al. 2025) to perform MCMC sampling, inheriting its configuration options for flexible and scalable inference.

4.1.1 Inputs

The hobit() function requires an input object of class ExpMX. To create this object from real RNA-Seq data, users need to provide the following three components:

a gene expression matrix
a vector or data.frame of condition labels
a mapping table to organize gene expression data into homeolog expression

The following example shows how to load gene expression data, define experimental conditions, and load the mapping table from a file.

The gene expression matrix should be a data.frame or matrix, with row names corresponding to gene names.

gexp <- read.table("../data/c_flexuosa.0516.mini.txt.gz",
                   header = TRUE, sep = "\t", row.names = 1)
head(gexp)

##               wet_1 wet_2 wet_3 dry_1 dry_2 dry_3
## CARHR000190_H   126   109   150   175   113   124
## CARHR000660_H     3     4     8    10     2     5
## CARHR000770_H   129    95   170   233   316   238
## CARHR000890_H    45    32   117     5     6     1
## CARHR001940_H    23    22    73   176    46   108
## CARHR003740_H   541   436   576   552   344   284

The condition labels can be provided as a vector or a data.frame.

group <- c("wet", "wet", "wet", "dry", "dry", "dry")
group

## [1] "wet" "wet" "wet" "dry" "dry" "dry"

The mapping table (mapping_table) is a data.frame in which the first and second columns correspond to gene names from C. hirsuta and C. amara, respectively. It is assumed that all gene names in both columns appear in the row names of gene expression matrix (gexp).

mapping_table <- read.table("../data/c_flexuosa.homeolog.txt.gz",
                            header = TRUE, sep = "\t")
head(mapping_table)

##       C_hirsuta       C_amara
## 1 CARHR000010_H CARHR000010_A
## 2 CARHR000060_H CARHR000060_A
## 3 CARHR000070_H CARHR000070_A
## 4 CARHR000090_H CARHR000090_A
## 5 CARHR000110_H CARHR000110_A
## 6 CARHR000120_H CARHR000120_A

Once the three required components are prepared, an ExpMX object can be created using the newExpMX() function.

x <- newExpMX(gexp, group, mapping_table)
x

## # 2 subgenome sets (C_hirsuta, C_amara)
## # 500 homeolog tuples
## ---------------------
## Experiment Design:
##   group
## 1   wet
## 2   wet
## 3   wet
## 4   dry
## 5   dry
## 6   dry
## ---------------------
## > subgenome: C_hirsuta 
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]  126  109  150  175  113  124
## [2,]    3    4    8   10    2    5
## [3,]  129   95  170  233  316  238
## [4,]   45   32  117    5    6    1
## [5,]   23   22   73  176   46  108
## [6,]  541  436  576  552  344  284
## +++++++++++++++++++++
## > subgenome: C_amara 
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]  238  141   82  136   63   88
## [2,]   14   10   24   23    6   13
## [3,]   20    9   17   26   40   22
## [4,]   23   19   73    2    5    2
## [5,]   15   17   54  209   36   70
## [6,]  798  610  880  958  969  567
## ---------------------

4.1.2 Test

Usually, normalization is required to correct for differences in library sizes across replicates. If the given expression data (gexp) have not been normalized, apply an appropriate normalization method before analysis. In this example, we normalize the data using the norm_counts() function provided in the hespresso package.

x <- norm_counts(x)

By default, norm_counts() function performs TMM normalization by internally calling the edgeR package (Robinson and Oshlack 2010; Chen et al. 2025). However, users may choose to perform normalization using other appropriate packages such as DESeq2 (Love et al. 2014). Note that, this step can be skipped if the expression data has already been normalized, such as FPKM and TPM.

Once the ExpMX object has been prepared and the data have been normalized, homeologs with shifted expression ratios can be detected using HOBIT by running the hobit() function, as shown below.

x_output <- hobit(x)

4.1.3 Outputs

The result (x_output) is a data frame with one row per homeolog. The row order matches the input, and the columns summarize the results of the statistical test. The output includes the following statistics. All statistics are derived from MCMC samples and may differ from those calculated directly from the observed RNA-Seq read counts.

head(x_output)

##            gene    pvalue qvalue raw_pvalue raw_qvalue D__C_hirsuta__(dry-wet) D__C_amara__(dry-wet) OR__C_hirsuta__(dry/wet) OR__C_amara__(dry/wet)        Dmax    ORmax theta0__C_hirsuta theta0__C_amara theta1__C_hirsuta__dry theta1__C_amara__dry theta1__C_hirsuta__wet theta1__C_amara__wet logLik_H0 logLik_H1
## 1 CARHR000190_H 0.1233256      1 0.06541479  0.7656442             0.135262370          -0.135262370                1.7271228              0.5789976 0.135262370 1.727123         0.5139389       0.4860611              0.5834637            0.4165363              0.4460479            0.5539521 -64.49956 -62.84965
## 2 CARHR000660_H 1.0000000      1 1.00000000  1.0000000             0.053684855          -0.053684855                1.3411438              0.7456322 0.053684855 1.341144         0.2533909       0.7466091              0.2769508            0.7230492              0.2212910            0.7787090 -31.81579 -32.06581
## 3 CARHR000770_H 1.0000000      1 1.00000000  1.0000000            -0.002602245           0.002602245                0.9721639              1.0286332 0.002602245 1.028633         0.8930171       0.1069829              0.8936688            0.1063312              0.8960575            0.1039425 -53.30633 -53.69221
## 4 CARHR000890_H 0.7696578      1 0.72625945  1.0000000            -0.084993555           0.084993555                0.6983480              1.4319519 0.084993555 1.431952         0.5863827       0.4136173              0.5442493            0.4557507              0.6292444            0.3707556 -41.19201 -41.26375
## 5 CARHR001940_H 1.0000000      1 1.00000000  1.0000000            -0.048105955           0.048105955                0.8195665              1.2201572 0.048105955 1.220157         0.5520396       0.4479604              0.5294441            0.4705559              0.5774278            0.4225722 -57.14303 -57.51374
## 6 CARHR003740_H 0.3542458      1 0.26803105  1.0000000            -0.089427170           0.089427170                0.6749970              1.4814881 0.089427170 1.481488         0.3601068       0.6398932              0.3143448            0.6856552              0.4038831            0.5961169 -75.39992 -74.78379

gene: Homeolog ID. The gene names defined in the first column of the mapping table.
pvalue: p-value from the LRT using normalized likelihoods.
qvalue: Benjamini–Hochberg adjusted pvalue.
raw_pvalue: p-value from the LRT using raw likelihoods.
raw_qvalue: Benjamini–Hochberg adjusted raw_pvalue.
D__$__*: Difference in expression ratios for subgenome $ between the two groups represented by *.
OR__$__*: Odds ratio of expression ratios for subgenome $ between the two groups represented by *.
Dmax: Maximum absolute difference in expression ratios (D__$__*) observed across all subgenomes and group comparisons.
ORmax: Maximum odds ratio in expression ratios (OR__$__*) observed across all subgenomes and group comparisons.
theta0__$: Posterior expression ratio estimates shared across all conditions, where $ denotes the subgenome name.
theta1__$__*: Posterior expression ratio estimates specific to each condition, where $ denotes the subgenome name and * denotes the group name.
logLik_H0: Log-likelihood of the reduced model.
logLik_H1: Log-likelihood of the full model.

4.1.4 Options

4.1.4.1 Parallelization Options

Below is a basic example of running the hobit() function. By default, hobit() uses multiple threads, defined by the mc.cores option in the global environment, for MCMC sampling. If mc.cores is not set, it defaults to using a single thread.

x_output <- hobit(x)

Users can customize the number of threads to accelerate the testing process by either setting mc.cores (e.g., options(mc.cores = 8)) or specifying the n_threads argument directly. For example:

x_output <- hobit(x, n_threads = 8)

Additionally, user can control parallelization of the MCMC sampling process via the parallel_chains argument. When specified, this argument is passed directly to the sample() function from the cmdstanr package, allowing parallel execution of MCMC chains.

For example, to run four parallel chains, each on a separate thread:

x_output <- hobit(x, n_threads = 1, parallel_chains = 4, chains = 4)

Note that the n_threads and parallel_chains arguments control parallelization at different stages. The n_threads argument controls parallelization during the processing of homeologs, while parallel_chains controls parallelization during MCMC chains. Hence, if both are specified, users should reserve n_threads × parallel_chains threads.

4.1.4.2 Model Options

By default, HOBIT assumes that homeolog expression follows a NB distribution. In this model, the mean is sampled from a NIP distribution, and the dispersion parameter is estimated using the estimateDisp() function from the edgeR package (Chen et al. 2025). In addition to these defaults, HOBIT provides options for users to customize these assumptions as needed.

To switch from the default NB distribution to a zero-inflated negative binomial (ZINB) distribution, user can set dist = "ZINB".

x_output <- hobit(x, dist = "ZINB")

By default, homeolog expression ratios are sampled from NIP distributions, with the constraint that their sum equals 1. Beside the defaults, HOBIT also supports using a Dirichlet prior distribution, with parameters estimated from the observed homeolog expression values. This can be enabled by setting the use_Dirichlet option, as shown below.

x_output <- hobit(x, use_Dirichlet = TRUE)

The default dispersion estimation method requires multiple biological replicates per group. If the dataset contains only a single replicate per group, dispersion estimation will fail. In such cases, HOBIT provides the no_replicate option. When set to TRUE, all replicates are treated as if they belong to a single group for dispersion estimation.

x_output <- hobit(x, no_replicate = TRUE)

Note that while dispersion estimation without replicates can be enabled by setting no_replicate = TRUE, this approach is not recommended for biological experiments. For reliable statistical inference, we strongly encourage reconsidering the experimental design to include multiple biological replicates per group.

4.1.4.3 MCMC Options

Several options control the behavior of MCMC sampling and can significantly affect the accuracy and efficiency of inference. Although hobit() supports all options available in the sample() function from the cmdstanr package, the following are particularly important.

chains: Number of MCMC chains to run (default: 4).
iter_warmup: Number of warmup iterations per chain (default: 1000).
iter_sampling: Number of post-warmup iterations per chain (default: 1000).
save_warmup: Whether to save warmup iterations (default: FALSE).
thin: Interval at which samples are saved (default: 1). Typically unchanged unless memory issues arise.
max_treedepth: Maximum tree depth for the NUTS sampler (default: 10). Increasing it may help resolve convergence issues but slows execution.
adapt_delta: Target acceptance rate for the sampler (default: 0.8). Higher values improve stability but may reduce speed.

The following example increases the number of chains and iterations to improve stability and convergence, though at the cost of longer execution times.

x_output <- hobit(x, chains = 8, iter_warmup = 2000, iter_sampling = 10000)

If warnings or convergence issues arise with the default settings, consider increasing max_treedepth and adapt_delta. Note that increasing these parameters may result in slower sampling.

x_output <- hobit(x, max_treedepth = 12, adapt_delta = 0.95)

For more advanced configuration options, refer to the cmdstanr documentation.

4.1.4.4 Verbosity Options

Verbosity options are available to control the verbosity of progress messages during sampling.

refresh: Frequency of progress updates to the console (default: 100). Set to 0 to suppress updates.
show_messages: Whether to print all diagnostic and progress messages (default: TRUE).

To reduce console output during execution, adjust these settings as shown below.

x_output <- hobit(x, refresh = 0, show_messages = FALSE)