4.1 HOBIT
HOBIT (Sun et al. 2025) is a statistical method designed to detect homeologs whose expression ratios differ across multiple conditions. It employs a likelihood ratio test (LRT) to compare two hierarchical models: a full model that allows homeolog expression ratios to vary among conditions and a reduced model that assumes constant ratios across all conditions.
HOBIT models homeolog expression with a negative binomial (NB) distribution. Specifically, the expression level of a homeolog on the \(i\)-th subgenome, denoted as \(x_i\), is assumed to follow an NB distribution with mean \(\mu\theta_i\) and dispersion \(\phi\),
\[ x_{i} \sim NB(\mu\theta_{i}, \phi) \]
where \(\mu\) represents the total gene expression across all subgenomes and \(\theta_{i}\) denotes the homeolog expression ratio for the \(i\)-th subgenome.
By default, HOBIT samples the parameters
\(\mu\) and \(\boldsymbol{\theta} = \left(\theta_{1}, \theta_{2}, \cdots \right)\)
from non-informative prior (NIP) distributions.
The dispersion parameter \(\phi\) is estimated using the estimateDisp()
function
from the edgeR package (Chen et al. 2025).
Based on these assumptions,
HOBIT performs Markov chain Monte Carlo (MCMC) sampling
to fit both the full and reduced models
and to compute their respective likelihoods.
Then, a LRT is conducted using the likelihoods derived from the two models.
The hobit()
function provides an end-to-end interface for parameter estimation,
model fitting, likelihood computation, and hypothesis testing.
It internally uses the sample()
function from the cmdstanr package (Gabry et al. 2025)
to perform MCMC sampling,
inheriting its configuration options for flexible and scalable inference.
4.1.1 Inputs
The hobit()
function requires an input object of class ExpMX
.
To create this object from real RNA-Seq data,
users need to provide the following three components:
- a gene expression matrix
- a vector or data.frame of condition labels
- a mapping table to organize gene expression data into homeolog expression
The following example shows how to load gene expression data, define experimental conditions, and load the mapping table from a file.
The gene expression matrix should be a data.frame or matrix, with row names corresponding to gene names.
gexp <- read.table("../data/c_flexuosa.0516.mini.txt.gz",
header = TRUE, sep = "\t", row.names = 1)
head(gexp)
## wet_1 wet_2 wet_3 dry_1 dry_2 dry_3
## CARHR000190_H 126 109 150 175 113 124
## CARHR000660_H 3 4 8 10 2 5
## CARHR000770_H 129 95 170 233 316 238
## CARHR000890_H 45 32 117 5 6 1
## CARHR001940_H 23 22 73 176 46 108
## CARHR003740_H 541 436 576 552 344 284
The condition labels can be provided as a vector or a data.frame.
## [1] "wet" "wet" "wet" "dry" "dry" "dry"
The mapping table (mapping_table
) is a data.frame
in which the first and second columns correspond to gene names
from C. hirsuta and C. amara, respectively.
It is assumed that all gene names in both columns appear
in the row names of gene expression matrix (gexp
).
mapping_table <- read.table("../data/c_flexuosa.homeolog.txt.gz",
header = TRUE, sep = "\t")
head(mapping_table)
## C_hirsuta C_amara
## 1 CARHR000010_H CARHR000010_A
## 2 CARHR000060_H CARHR000060_A
## 3 CARHR000070_H CARHR000070_A
## 4 CARHR000090_H CARHR000090_A
## 5 CARHR000110_H CARHR000110_A
## 6 CARHR000120_H CARHR000120_A
Once the three required components are prepared,
an ExpMX
object can be created using the newExpMX()
function.
## # 2 subgenome sets (C_hirsuta, C_amara)
## # 500 homeolog tuples
## ---------------------
## Experiment Design:
## group
## 1 wet
## 2 wet
## 3 wet
## 4 dry
## 5 dry
## 6 dry
## ---------------------
## > subgenome: C_hirsuta
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 126 109 150 175 113 124
## [2,] 3 4 8 10 2 5
## [3,] 129 95 170 233 316 238
## [4,] 45 32 117 5 6 1
## [5,] 23 22 73 176 46 108
## [6,] 541 436 576 552 344 284
## +++++++++++++++++++++
## > subgenome: C_amara
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 238 141 82 136 63 88
## [2,] 14 10 24 23 6 13
## [3,] 20 9 17 26 40 22
## [4,] 23 19 73 2 5 2
## [5,] 15 17 54 209 36 70
## [6,] 798 610 880 958 969 567
## ---------------------
4.1.2 Test
Usually, normalization is required to correct for differences in library sizes
across replicates.
If the given expression data (gexp
) have not been normalized,
apply an appropriate normalization method before analysis.
In this example, we normalize the data using the norm_counts()
function
provided in the hespresso package.
By default, norm_counts()
function performs TMM normalization
by internally calling the edgeR package (Robinson and Oshlack 2010; Chen et al. 2025).
However, users may choose to perform normalization using other appropriate
packages such as DESeq2 (Love et al. 2014).
Note that, this step can be skipped
if the expression data has already been normalized, such as FPKM and TPM.
Once the ExpMX
object has been prepared and the data have been normalized,
homeologs with shifted expression ratios
can be detected using HOBIT by running the hobit()
function, as shown below.
4.1.3 Outputs
The result (x_output
) is a data frame with one row per homeolog.
The row order matches the input,
and the columns summarize the results of the statistical test.
The output includes the following statistics.
All statistics are derived from MCMC samples and
may differ from those calculated directly from the observed RNA-Seq read counts.
## gene pvalue qvalue raw_pvalue raw_qvalue D__C_hirsuta__(dry-wet) D__C_amara__(dry-wet) OR__C_hirsuta__(dry/wet) OR__C_amara__(dry/wet) Dmax ORmax theta0__C_hirsuta theta0__C_amara theta1__C_hirsuta__dry theta1__C_amara__dry theta1__C_hirsuta__wet theta1__C_amara__wet logLik_H0 logLik_H1
## 1 CARHR000190_H 0.1233256 1 0.06541479 0.7656442 0.135262370 -0.135262370 1.7271228 0.5789976 0.135262370 1.727123 0.5139389 0.4860611 0.5834637 0.4165363 0.4460479 0.5539521 -64.49956 -62.84965
## 2 CARHR000660_H 1.0000000 1 1.00000000 1.0000000 0.053684855 -0.053684855 1.3411438 0.7456322 0.053684855 1.341144 0.2533909 0.7466091 0.2769508 0.7230492 0.2212910 0.7787090 -31.81579 -32.06581
## 3 CARHR000770_H 1.0000000 1 1.00000000 1.0000000 -0.002602245 0.002602245 0.9721639 1.0286332 0.002602245 1.028633 0.8930171 0.1069829 0.8936688 0.1063312 0.8960575 0.1039425 -53.30633 -53.69221
## 4 CARHR000890_H 0.7696578 1 0.72625945 1.0000000 -0.084993555 0.084993555 0.6983480 1.4319519 0.084993555 1.431952 0.5863827 0.4136173 0.5442493 0.4557507 0.6292444 0.3707556 -41.19201 -41.26375
## 5 CARHR001940_H 1.0000000 1 1.00000000 1.0000000 -0.048105955 0.048105955 0.8195665 1.2201572 0.048105955 1.220157 0.5520396 0.4479604 0.5294441 0.4705559 0.5774278 0.4225722 -57.14303 -57.51374
## 6 CARHR003740_H 0.3542458 1 0.26803105 1.0000000 -0.089427170 0.089427170 0.6749970 1.4814881 0.089427170 1.481488 0.3601068 0.6398932 0.3143448 0.6856552 0.4038831 0.5961169 -75.39992 -74.78379
gene
: Homeolog ID. The gene names defined in the first column of the mapping table.pvalue
: p-value from the LRT using normalized likelihoods.qvalue
: Benjamini–Hochberg adjustedpvalue
.raw_pvalue
: p-value from the LRT using raw likelihoods.raw_qvalue
: Benjamini–Hochberg adjustedraw_pvalue
.D__$__*
: Difference in expression ratios for subgenome$
between the two groups represented by*
.OR__$__*
: Odds ratio of expression ratios for subgenome$
between the two groups represented by*
.Dmax
: Maximum absolute difference in expression ratios (D__$__*
) observed across all subgenomes and group comparisons.ORmax
: Maximum odds ratio in expression ratios (OR__$__*
) observed across all subgenomes and group comparisons.theta0__$
: Posterior expression ratio estimates shared across all conditions, where$
denotes the subgenome name.theta1__$__*
: Posterior expression ratio estimates specific to each condition, where$
denotes the subgenome name and*
denotes the group name.logLik_H0
: Log-likelihood of the reduced model.logLik_H1
: Log-likelihood of the full model.
4.1.4 Options
4.1.4.1 Parallelization Options
Below is a basic example of running the hobit()
function.
By default, hobit()
uses multiple threads,
defined by the mc.cores
option in the global environment, for MCMC sampling.
If mc.cores
is not set, it defaults to using a single thread.
Users can customize the number of threads to accelerate the testing process
by either setting mc.cores
(e.g., options(mc.cores = 8)
)
or specifying the n_threads
argument directly. For example:
Additionally, user can control parallelization of the MCMC sampling process via the parallel_chains
argument.
When specified, this argument is passed directly to the sample()
function from the
cmdstanr package, allowing parallel execution of MCMC chains.
For example, to run four parallel chains, each on a separate thread:
Note that the n_threads
and parallel_chains
arguments
control parallelization at different stages.
The n_threads
argument controls parallelization during the processing of homeologs,
while parallel_chains
controls parallelization during MCMC chains.
Hence, if both are specified, users should reserve n_threads
× parallel_chains
threads.
4.1.4.2 Model Options
By default, HOBIT assumes that homeolog expression follows a NB distribution.
In this model, the mean is sampled from a NIP distribution,
and the dispersion parameter is estimated using the estimateDisp()
function
from the edgeR package (Chen et al. 2025).
In addition to these defaults,
HOBIT provides options for users to customize these assumptions as needed.
To switch from the default NB distribution to a zero-inflated negative binomial
(ZINB) distribution, user can set dist = "ZINB"
.
By default, homeolog expression ratios are sampled from NIP distributions,
with the constraint that their sum equals 1.
Beside the defaults, HOBIT also supports using a Dirichlet prior distribution,
with parameters estimated from the observed homeolog expression values.
This can be enabled by setting the use_Dirichlet
option, as shown below.
The default dispersion estimation method requires multiple biological replicates per group.
If the dataset contains only a single replicate per group, dispersion estimation will fail.
In such cases, HOBIT provides the no_replicate
option.
When set to TRUE
, all replicates are treated as if they belong to a single group for dispersion estimation.
Note that while dispersion estimation without replicates can be enabled by setting no_replicate = TRUE
,
this approach is not recommended for biological experiments.
For reliable statistical inference,
we strongly encourage reconsidering the experimental design to include multiple biological replicates per group.
4.1.4.3 MCMC Options
Several options control the behavior of MCMC sampling and
can significantly affect the accuracy and efficiency of inference.
Although hobit()
supports all options available in the sample()
function
from the cmdstanr package,
the following are particularly important.
chains
: Number of MCMC chains to run (default:4
).iter_warmup
: Number of warmup iterations per chain (default:1000
).iter_sampling
: Number of post-warmup iterations per chain (default:1000
).save_warmup
: Whether to save warmup iterations (default:FALSE
).thin
: Interval at which samples are saved (default:1
). Typically unchanged unless memory issues arise.max_treedepth
: Maximum tree depth for the NUTS sampler (default:10
). Increasing it may help resolve convergence issues but slows execution.adapt_delta
: Target acceptance rate for the sampler (default:0.8
). Higher values improve stability but may reduce speed.
The following example increases the number of chains and iterations to improve stability and convergence, though at the cost of longer execution times.
If warnings or convergence issues arise with the default settings,
consider increasing max_treedepth
and adapt_delta
.
Note that increasing these parameters may result in slower sampling.
For more advanced configuration options, refer to the cmdstanr documentation.
4.1.4.4 Verbosity Options
Verbosity options are available to control the verbosity of progress messages during sampling.
refresh
: Frequency of progress updates to the console (default:100
). Set to0
to suppress updates.show_messages
: Whether to print all diagnostic and progress messages (default:TRUE
).
To reduce console output during execution, adjust these settings as shown below.