3.2 Allohexaploid T. aestivum
3.2.1 Introduction
Bread wheat, Triticum aestivum (2n = 6x = 42, AABBDD), is one of the world’s most important crops and also serves as a model for studying allopolyploid species. It originated through hybridization among three diploid progenitor species, Triticum urartu (2n = 2x = 14, AA), an unknown species related to Aegilops speltoides (2n = 2x = 14; SS, which is thought to be the progenitor of BB), and Aegilops tauschii (2n = 2x = 14, DD), and contains three distinct subgenomes (IWGSC 2018; Levy and Feldman 2022; Zhao et al. 2023). To demonstrate the applicability of HOBIT to allohexaploid species, we present a case study using wheat RNA-Seq data to detect homeologs with shifts in expression ratios between two tissues, shoot apex and leaf (Yang et al. 2021).
3.2.2 Data Preparation
In this example, 500 homeolog triads were randomly selected from the original dataset.
The expression data can be loaded from the following file (t_aestivum.w20.mini.txt.gz
).
Users may also choose to load the full dataset, which is stored
in the data directory, for analysis.
gexp <- read.table("../data/t_aestivum.w20.mini.txt.gz",
header = TRUE, sep = "\t", row.names = 1)
head(gexp)
## shootapex_1 shootapex_2 shootapex_3 leaf_1 leaf_2 leaf_3
## TraesCS1A02G024700 65 69 53 63 51 39
## TraesCS1A02G029300 206 164 227 1439 1125 1444
## TraesCS1A02G046900 49 44 41 1 0 0
## TraesCS1A02G063600 111 70 77 4 10 6
## TraesCS1A02G067300 1212 1247 1274 22 21 24
## TraesCS1A02G072800 133 143 143 0 0 0
Next, load the mapping table that links each gene to its corresponding homeolog triads. Since wheat has three subgenomes (i.e., A, B, and D), the table contains three columns, with each column representing one subgenome.
mapping_table <- read.table("../data/t_aestivum.homeolog.txt.gz",
header = TRUE, sep = "\t")
head(mapping_table)
## A B D
## 1 TraesCS1A02G000800 TraesCS6B02G173200LC TraesCS5D02G023700
## 2 TraesCS1A02G000900 TraesCS6B02G117800 TraesCS7D02G119300
## 3 TraesCS1A02G001600 TraesCS1B02G320300 TraesCS7D02G148100LC
## 4 TraesCS1A02G001900 TraesCS1B02G000800 TraesCS1D02G003900
## 5 TraesCS1A02G002000 TraesCS1B02G000900 TraesCS1D02G004000
## 6 TraesCS1A02G002100 TraesCS1B02G001000 TraesCS1D02G004100
We then construct an ExpMX
class object using the newExpMX()
function.
This converts the gene expression matrix (gexp
)
into a homeolog expression matrix based on the mapping table (mapping_table
).
## # 3 subgenome sets (A, B, D)
## # 500 homeolog tuples
## ---------------------
## Experiment Design:
## group
## 1 apex
## 2 apex
## 3 apex
## 4 leaf
## 5 leaf
## 6 leaf
## ---------------------
## > subgenome: A
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 65 69 53 63 51 39
## [2,] 206 164 227 1439 1125 1444
## [3,] 49 44 41 1 0 0
## [4,] 111 70 77 4 10 6
## [5,] 1212 1247 1274 22 21 24
## [6,] 133 143 143 0 0 0
## +++++++++++++++++++++
## > subgenome: B
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 91 92 119 60 87 64
## [2,] 4 0 0 622 285 383
## [3,] 85 118 100 0 1 7
## [4,] 122 108 116 5 16 12
## [5,] 1540 1426 1530 18 21 35
## [6,] 158 168 163 0 0 0
## +++++++++++++++++++++
## > subgenome: D
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 89 107 83 180 150 111
## [2,] 155 232 176 2639 1860 2286
## [3,] 107 115 82 0 0 0
## [4,] 59 48 63 4 6 4
## [5,] 1163 1097 1127 18 10 27
## [6,] 137 139 124 1 1 0
## ---------------------
Before performing the test,
we normalize the raw read counts using the TMM method (Robinson and Oshlack 2010)
to adjust for differences in library size.
However, if the expression data (gexp
) has already been normalized (e.g. FPKM),
this step can be skipped.
To visualize the distributions of homeolog expression ratios, use the plot_HER_distr()
function.
By default, the expression ratio is calculated as the proportion contributed
by the first subgenome relative to all subgenomes.
In this dataset, the first subgenome corresponds to the A-subgenome,
as indicated by the first column of the mapping table (mapping_table
).
## [1] "apex" "leaf"
library(ggplot2)
library(gridExtra)
grid.arrange(distr_plots_A[["apex"]] + ggtitle("apex on A subgenome"),
distr_plots_A[["leaf"]] + ggtitle("leaf on A subgenome"),
ncol = 2)

Figure 3.5: Distribution of homeolog expression ratios between shoot apex and leaf in the A-subgenome of wheat.
To calculate expression ratios from the second or third subgenome,
set the base
option accordingly.
For example, to calculate the ratios for the second and third subgenomes
(i.e., B-subgenome and D-subgenome), set base = 2
and base = 3
, respectively.
distr_plots_B <- plot_HER_distr(x, base = 2)
distr_plots_D <- plot_HER_distr(x, base = 3)
grid.arrange(distr_plots_A[["apex"]] + ggtitle("apex on A"),
distr_plots_A[["leaf"]] + ggtitle("leaf on A"),
distr_plots_B[["apex"]] + ggtitle("apex on B"),
distr_plots_B[["leaf"]] + ggtitle("leaf on B"),
distr_plots_D[["apex"]] + ggtitle("apex on D"),
distr_plots_D[["leaf"]] + ggtitle("leaf on D"),
ncol = 2)

Figure 3.6: Distribution of homeolog expression ratios between shoot apex and leaf across all subgenomes.
Histogram peaks at one-third for all subgenomes indicate that most homeologs in wheat are equally expressed from the three subgenomes.
3.2.3 Statistical Test
After visually inspecting the distributions of homeolog expression ratios,
use the hobit()
function to identify homeologs
showing significant changes in expression ratios between shoot apex and leaf tissues.
This step takes approximately five minutes using eight threads and may vary depending on hardware performance.
## gene pvalue qvalue raw_pvalue raw_qvalue D__A__(apex-leaf) D__B__(apex-leaf) D__D__(apex-leaf) OR__A__(apex/leaf) OR__B__(apex/leaf) OR__D__(apex/leaf) Dmax ORmax theta0__A theta0__B theta0__D theta1__A__apex theta1__B__apex theta1__D__apex theta1__A__leaf theta1__B__leaf theta1__D__leaf logLik_H0 logLik_H1
## 1 TraesCS1A02G024700 2.616739e-02 6.142580e-02 5.472214e-03 1.284557e-02 0.05167922 0.12996055 -0.17928711 1.3623240 1.81640254 0.4798122 0.17928711 2.084149 0.2173702 0.32747665 0.4526210 0.2429172 0.392673165 0.3628660 0.1907620 0.2621193 0.54235111 -81.88117 -76.75839
## 2 TraesCS1A02G029300 3.706660e-12 9.754368e-11 4.562229e-17 1.200587e-15 0.17852831 -0.10455004 -0.06845847 2.0997239 0.03141687 0.7594719 0.17852831 31.830030 0.4195390 0.05651968 0.5208880 0.5095908 0.003843973 0.4859704 0.3308907 0.1087523 0.55642154 -138.05460 -100.55997
## 3 TraesCS1A02G046900 4.416141e-08 6.494324e-07 3.061757e-11 4.502584e-10 0.03238216 -0.34732169 0.34336341 1.2643545 0.22193465 10.4774474 0.34732169 10.477447 0.1644786 0.58137584 0.2378548 0.1800823 0.408631990 0.4094569 0.1468518 0.7562532 0.06188941 -69.43195 -45.31271
## 4 TraesCS1A02G063600 5.812801e-01 7.221219e-01 4.604533e-01 5.723358e-01 0.03230882 -0.02971638 0.01087286 1.1652031 0.88727255 1.0652878 0.03230882 1.165203 0.3150442 0.46300573 0.2154362 0.3320959 0.446646310 0.2200791 0.2976223 0.4773411 0.21030973 -60.00160 -59.24261
## 5 TraesCS1A02G067300 2.436682e-02 5.857410e-02 4.942007e-03 1.187983e-02 -0.01115748 0.01800109 0.00755438 0.9511745 1.08020140 1.0380239 0.01800109 1.080201 0.3270065 0.37708495 0.2883377 0.3216624 0.386236395 0.2917494 0.3331768 0.3680521 0.28392523 -89.19414 -83.90700
## 6 TraesCS1A02G072800 3.086719e-09 6.018062e-08 6.825001e-13 1.338510e-11 0.15418804 0.20978820 -0.28546067 2.3896451 3.07768136 0.3032963 0.28546067 3.297106 0.2442037 0.26838213 0.4495816 0.3203725 0.373202695 0.3049541 0.1666320 0.1617500 0.59349763 -69.72687 -41.46689
The output can be sorted in ascending order of p-values using the following code.
## gene pvalue qvalue raw_pvalue raw_qvalue D__A__(apex-leaf) D__B__(apex-leaf) D__D__(apex-leaf) OR__A__(apex/leaf) OR__B__(apex/leaf) OR__D__(apex/leaf) Dmax ORmax theta0__A theta0__B theta0__D theta1__A__apex theta1__B__apex theta1__D__apex theta1__A__leaf theta1__B__leaf theta1__D__leaf logLik_H0 logLik_H1
## 441 TraesCS7A02G209600 1.712481e-34 8.562406e-32 5.375601e-49 2.687801e-46 0.47020891 -0.7240464 0.2527981 43.6406630 0.01364241 18.4817071 0.7240464 73.300842 0.2582908 0.5931562 0.1478044 0.4931408 0.23092207 0.2742402 0.02185092 0.9567519 0.02006728 -180.7215 -69.61950
## 62 TraesCS1A02G445500 6.065366e-29 1.516341e-26 4.604170e-41 1.151043e-38 -0.02651230 0.2702367 -0.2346247 0.8703762 4.38084864 0.3815521 0.2702367 4.380849 0.2607835 0.2704003 0.4634814 0.2480120 0.40542386 0.3459432 0.27416205 0.1346225 0.58084644 -191.3739 -98.74309
## 299 TraesCS5A02G022200 4.338958e-23 7.231597e-21 1.077170e-32 1.795283e-30 0.35991232 -0.1514078 -0.2057361 5.3638957 0.18918271 0.4329295 0.3599123 5.363896 0.3596230 0.1200954 0.5183583 0.5396559 0.04394851 0.4159984 0.17966007 0.1959991 0.62208898 -183.3925 -109.95285
## 268 TraesCS4A02G250600 3.364910e-21 4.206138e-19 5.413518e-30 6.766898e-28 -0.05530722 0.2734420 -0.1529501 0.7621023 4.46732813 0.5272217 0.2734420 4.467328 0.2874667 0.2725191 0.4078531 0.2600167 0.40953660 0.3297682 0.31379598 0.1350041 0.48497640 -119.9200 -52.04104
## 422 TraesCS6A02G314400 1.413693e-20 1.413693e-18 4.213261e-29 4.213261e-27 -0.30485015 0.2158651 0.1043494 0.2833779 3.28910794 1.7963450 0.3048501 3.528857 0.4963563 0.2568841 0.2380839 0.3444890 0.36464850 0.2904999 0.64884277 0.1486107 0.18558789 -150.7537 -85.65136
## 439 TraesCS7A02G197400 2.352092e-20 1.960077e-18 8.723259e-29 7.269383e-27 0.02933763 0.2274193 -0.2474215 1.1458380 4.28143124 0.3610695 0.2474215 4.281431 0.3168301 0.2190836 0.4591570 0.3313281 0.33279515 0.3354008 0.30198738 0.1048420 0.58218999 -143.3949 -78.97352
Next, we visualize changes in expression ratios of the first subgenome between the two tissues. In the plot, homeologs with q-values less than 0.01 are highlighted.

Figure 3.7: Changes in homeolog expression ratios of the A-subgenome between shoot apex and leaf tissues. Each point represents the ratio in shoot apex (x-axis) and leaf (y-axis). Orange points indicate homeologs with significant changes, gray points indicate homeologs without significant changes.
To change the subgenome used for calculating expression ratios,
set the base
argument to the corresponding subgenome number.
For example, to use the second subgenome, run the following.

Figure 3.8: Changes in homeolog expression ratios of the B-subgenome between shoot apex and leaf tissues. Each point represents the ratio in shoot apex (x-axis) and leaf (y-axis). Orange points indicate homeologs with significant changes, gray points indicate homeologs without significant changes.
Note that expression ratios are calculated for the selected subgenome, while highlighted homeologs are those that meet the significance threshold in any subgenome. As a result, some homeologs may appear highlighted even if they show only minor changes in the plot, because they may have significantly altered expression ratios in another subgenome.
3.2.4 Combining Subgenomes
Additionally, users who want to test expression ratios
between the combined A- and B-subgenomes versus the D-subgenome
can first merge the expression values of the A and B homeologs
using the combine_hexp()
function.
After combining, the statistical test can be performed on the merged dataset.
Assuming the A- and B-subgenomes are stored as the first and second subgenomes
in the x
object (as defined in the mapping_table
),
set subgenomes = c(1, 2)
to combine them.
## # 2 subgenome sets (D, A+B)
## # 500 homeolog tuples
## ---------------------
## Experiment Design:
## group
## 1 apex
## 2 apex
## 3 apex
## 4 leaf
## 5 leaf
## 6 leaf
## ---------------------
## > subgenome: D
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 69.91587 92.91579 69.32209 233.369104 182.885467 141.103672
## [2,] 121.76359 201.46227 146.99623 3421.450364 2267.779789 2905.972925
## [3,] 84.05615 99.86276 68.48688 0.000000 0.000000 0.000000
## [4,] 46.34872 41.68185 52.61797 5.185980 7.315419 5.084817
## [5,] 913.61969 952.60392 941.27700 23.336910 12.192364 34.322515
## [6,] 107.62330 120.70369 103.56553 1.296495 1.219236 0.000000
## +++++++++++++++++++++
## > subgenome: A+B
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 122.5492 139.8079 143.6554 159.468888 168.254630 130.93404
## [2,] 164.9700 142.4130 189.5917 2672.076241 1719.123389 2322.49017
## [3,] 105.2666 140.6762 117.7640 1.296495 1.219236 8.89843
## [4,] 183.0382 154.5702 161.1947 11.668455 31.700148 22.88168
## [5,] 2161.8929 2321.1579 2341.9172 51.859801 51.207931 75.00105
## [6,] 228.6013 270.0636 255.5730 0.000000 0.000000 0.00000
## ---------------------
HOBIT can then be applied to the combined dataset as below.
## gene pvalue qvalue raw_pvalue raw_qvalue D__D__(apex-leaf) D__A+B__(apex-leaf) OR__D__(apex/leaf) OR__A+B__(apex/leaf) Dmax ORmax theta0__D theta0__A+B theta1__D__apex theta1__A+B__apex theta1__D__leaf theta1__A+B__leaf logLik_H0 logLik_H1
## 62 TraesCS1A02G445500 3.848256e-17 1.924128e-14 7.962699e-24 3.981349e-21 -0.2487453 0.2487453 0.3603239 2.77528074 0.2487453 2.775281 0.4700614 0.5299386 0.3466064 0.6533936 0.59447553 0.4055245 -119.94996 -69.52158
## 6 TraesCS1A02G072800 1.544315e-13 3.860788e-11 1.069171e-18 2.672927e-16 -0.4803676 0.4803676 0.1181814 8.46157415 0.4803676 8.461574 0.5480605 0.4519395 0.3055404 0.6944596 0.78982973 0.2101703 -69.10290 -30.18006
## 107 TraesCS2A02G262400 1.914063e-12 3.190105e-10 3.832662e-17 6.387770e-15 -0.5210973 0.5210973 0.0979703 10.20720894 0.5210973 10.207209 0.4978944 0.5021056 0.2352690 0.7647310 0.75914356 0.2408564 -67.82316 -32.54267
## 441 TraesCS7A02G209600 1.102556e-11 1.378195e-09 4.616052e-16 5.770065e-14 0.2523203 -0.2523203 18.6952750 0.05348946 0.2523203 18.695275 0.1472152 0.8527848 0.2732705 0.7267295 0.01975070 0.9802493 -84.38861 -51.58576
## 355 TraesCS5A02G352700 2.695565e-11 2.603103e-09 1.644148e-15 1.644148e-13 0.3474770 -0.3474770 7.3613621 0.13584442 0.3474770 7.361362 0.2764225 0.7235775 0.4501455 0.5498545 0.09966664 0.9003334 -66.77285 -35.17498
## 268 TraesCS4A02G250600 3.123723e-11 2.603103e-09 2.027057e-15 1.689214e-13 -0.2796697 0.2796697 0.3156999 3.16756473 0.2796697 3.167565 0.4699997 0.5300003 0.3297345 0.6702655 0.61045817 0.3895418 -69.60987 -38.07149
3.2.5 Analysis Environment
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8 LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8
## [6] LC_MESSAGES=C.UTF-8 LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] UpSetR_1.4.0 gridExtra_2.3 ggplot2_4.0.0 hespresso_1.0.1
##
## loaded via a namespace (and not attached):
## [1] tensorA_0.36.2.1 sass_0.4.10 future_1.67.0 generics_0.1.4 lattice_0.22-7
## [6] hms_1.1.3 listenv_0.9.1 digest_0.6.37 magrittr_2.0.4 evaluate_1.0.5
## [11] grid_4.5.1 RColorBrewer_1.1-3 bookdown_0.44 fastmap_1.2.0 plyr_1.8.9
## [16] jsonlite_2.0.0 processx_3.8.6 progress_1.2.3 backports_1.5.0 limma_3.64.3
## [21] ps_1.9.1 scales_1.4.0 codetools_0.2-20 jquerylib_0.1.4 abind_1.4-8
## [26] cli_3.6.5 rlang_1.1.6 crayon_1.5.3 cmdstanr_0.9.0 parallelly_1.45.1
## [31] future.apply_1.20.0 splines_4.5.1 withr_3.0.2 cachem_1.1.0 yaml_2.3.10
## [36] tools_4.5.1 parallel_4.5.1 checkmate_2.3.3 locfit_1.5-9.12 globals_0.18.0
## [41] vctrs_0.6.5 posterior_1.6.1 R6_2.6.1 lifecycle_1.0.4 edgeR_4.6.3
## [46] pkgconfig_2.0.3 progressr_0.16.0 pillar_1.11.1 bslib_0.9.0 gtable_0.3.6
## [51] Rcpp_1.1.0 data.table_1.17.8 glue_1.8.0 statmod_1.5.0 xfun_0.53
## [56] tibble_3.3.0 knitr_1.50 farver_2.1.2 htmltools_0.5.8.1 labeling_0.4.3
## [61] rmarkdown_2.29 compiler_4.5.1 prettyunits_1.2.0 S7_0.2.0 distributional_0.5.0