3.2 Allohexaploid T. aestivum

3.2.1 Introduction

Bread wheat, Triticum aestivum (2n = 6x = 42, AABBDD), is one of the world’s most important crops and also serves as a model for studying allopolyploid species. It originated through hybridization among three diploid progenitor species, Triticum urartu (2n = 2x = 14, AA), an unknown species related to Aegilops speltoides (2n = 2x = 14; SS, which is thought to be the progenitor of BB), and Aegilops tauschii (2n = 2x = 14, DD), and contains three distinct subgenomes (IWGSC 2018; Levy and Feldman 2022; Zhao et al. 2023). To demonstrate the applicability of HOBIT to allohexaploid species, we present a case study using wheat RNA-Seq data to detect homeologs with shifts in expression ratios between two tissues, shoot apex and leaf (Yang et al. 2021).

3.2.2 Data Preparation

In this example, 500 homeolog triads were randomly selected from the original dataset. The expression data can be loaded from the following file (t_aestivum.w20.mini.txt.gz). Users may also choose to load the full dataset, which is stored in the data directory, for analysis.

gexp <- read.table("../data/t_aestivum.w20.mini.txt.gz",
                   header = TRUE, sep = "\t", row.names = 1)
head(gexp)
##                    shootapex_1 shootapex_2 shootapex_3 leaf_1 leaf_2 leaf_3
## TraesCS1A02G024700          65          69          53     63     51     39
## TraesCS1A02G029300         206         164         227   1439   1125   1444
## TraesCS1A02G046900          49          44          41      1      0      0
## TraesCS1A02G063600         111          70          77      4     10      6
## TraesCS1A02G067300        1212        1247        1274     22     21     24
## TraesCS1A02G072800         133         143         143      0      0      0
group <- c("apex", "apex", "apex", "leaf", "leaf", "leaf")

Next, load the mapping table that links each gene to its corresponding homeolog triads. Since wheat has three subgenomes (i.e., A, B, and D), the table contains three columns, with each column representing one subgenome.

mapping_table <- read.table("../data/t_aestivum.homeolog.txt.gz",
                            header = TRUE, sep = "\t")
head(mapping_table)
##                    A                    B                    D
## 1 TraesCS1A02G000800 TraesCS6B02G173200LC   TraesCS5D02G023700
## 2 TraesCS1A02G000900   TraesCS6B02G117800   TraesCS7D02G119300
## 3 TraesCS1A02G001600   TraesCS1B02G320300 TraesCS7D02G148100LC
## 4 TraesCS1A02G001900   TraesCS1B02G000800   TraesCS1D02G003900
## 5 TraesCS1A02G002000   TraesCS1B02G000900   TraesCS1D02G004000
## 6 TraesCS1A02G002100   TraesCS1B02G001000   TraesCS1D02G004100

We then construct an ExpMX class object using the newExpMX() function. This converts the gene expression matrix (gexp) into a homeolog expression matrix based on the mapping table (mapping_table).

x <- newExpMX(gexp, group, mapping_table)
x
## # 3 subgenome sets (A, B, D)
## # 500 homeolog tuples
## ---------------------
## Experiment Design:
##   group
## 1  apex
## 2  apex
## 3  apex
## 4  leaf
## 5  leaf
## 6  leaf
## ---------------------
## > subgenome: A 
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]   65   69   53   63   51   39
## [2,]  206  164  227 1439 1125 1444
## [3,]   49   44   41    1    0    0
## [4,]  111   70   77    4   10    6
## [5,] 1212 1247 1274   22   21   24
## [6,]  133  143  143    0    0    0
## +++++++++++++++++++++
## > subgenome: B 
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]   91   92  119   60   87   64
## [2,]    4    0    0  622  285  383
## [3,]   85  118  100    0    1    7
## [4,]  122  108  116    5   16   12
## [5,] 1540 1426 1530   18   21   35
## [6,]  158  168  163    0    0    0
## +++++++++++++++++++++
## > subgenome: D 
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]   89  107   83  180  150  111
## [2,]  155  232  176 2639 1860 2286
## [3,]  107  115   82    0    0    0
## [4,]   59   48   63    4    6    4
## [5,] 1163 1097 1127   18   10   27
## [6,]  137  139  124    1    1    0
## ---------------------

Before performing the test, we normalize the raw read counts using the TMM method (Robinson and Oshlack 2010) to adjust for differences in library size. However, if the expression data (gexp) has already been normalized (e.g. FPKM), this step can be skipped.

x <- norm_counts(x)

To visualize the distributions of homeolog expression ratios, use the plot_HER_distr() function. By default, the expression ratio is calculated as the proportion contributed by the first subgenome relative to all subgenomes. In this dataset, the first subgenome corresponds to the A-subgenome, as indicated by the first column of the mapping table (mapping_table).

distr_plots_A <- plot_HER_distr(x)
names(distr_plots_A)
## [1] "apex" "leaf"
library(ggplot2)
library(gridExtra)
grid.arrange(distr_plots_A[["apex"]] + ggtitle("apex on A subgenome"),
             distr_plots_A[["leaf"]] + ggtitle("leaf on A subgenome"),
             ncol = 2)
Distribution of homeolog expression ratios between shoot apex and leaf in the A-subgenome of wheat.

Figure 3.5: Distribution of homeolog expression ratios between shoot apex and leaf in the A-subgenome of wheat.

To calculate expression ratios from the second or third subgenome, set the base option accordingly. For example, to calculate the ratios for the second and third subgenomes (i.e., B-subgenome and D-subgenome), set base = 2 and base = 3, respectively.

distr_plots_B <- plot_HER_distr(x, base = 2)
distr_plots_D <- plot_HER_distr(x, base = 3)

grid.arrange(distr_plots_A[["apex"]] + ggtitle("apex on A"),
             distr_plots_A[["leaf"]] + ggtitle("leaf on A"),
             distr_plots_B[["apex"]] + ggtitle("apex on B"),
             distr_plots_B[["leaf"]] + ggtitle("leaf on B"),
             distr_plots_D[["apex"]] + ggtitle("apex on D"),
             distr_plots_D[["leaf"]] + ggtitle("leaf on D"),
             ncol = 2)
Distribution of homeolog expression ratios between shoot apex and leaf across all subgenomes.

Figure 3.6: Distribution of homeolog expression ratios between shoot apex and leaf across all subgenomes.

Histogram peaks at one-third for all subgenomes indicate that most homeologs in wheat are equally expressed from the three subgenomes.

3.2.3 Statistical Test

After visually inspecting the distributions of homeolog expression ratios, use the hobit() function to identify homeologs showing significant changes in expression ratios between shoot apex and leaf tissues. This step takes approximately five minutes using eight threads and may vary depending on hardware performance.

x_output <- hobit(x)
head(x_output)
##                 gene       pvalue       qvalue   raw_pvalue   raw_qvalue D__A__(apex-leaf) D__B__(apex-leaf) D__D__(apex-leaf) OR__A__(apex/leaf) OR__B__(apex/leaf) OR__D__(apex/leaf)       Dmax     ORmax theta0__A  theta0__B theta0__D theta1__A__apex theta1__B__apex theta1__D__apex theta1__A__leaf theta1__B__leaf theta1__D__leaf  logLik_H0  logLik_H1
## 1 TraesCS1A02G024700 2.616739e-02 6.142580e-02 5.472214e-03 1.284557e-02        0.05167922        0.12996055       -0.17928711          1.3623240         1.81640254          0.4798122 0.17928711  2.084149 0.2173702 0.32747665 0.4526210       0.2429172     0.392673165       0.3628660       0.1907620       0.2621193      0.54235111  -81.88117  -76.75839
## 2 TraesCS1A02G029300 3.706660e-12 9.754368e-11 4.562229e-17 1.200587e-15        0.17852831       -0.10455004       -0.06845847          2.0997239         0.03141687          0.7594719 0.17852831 31.830030 0.4195390 0.05651968 0.5208880       0.5095908     0.003843973       0.4859704       0.3308907       0.1087523      0.55642154 -138.05460 -100.55997
## 3 TraesCS1A02G046900 4.416141e-08 6.494324e-07 3.061757e-11 4.502584e-10        0.03238216       -0.34732169        0.34336341          1.2643545         0.22193465         10.4774474 0.34732169 10.477447 0.1644786 0.58137584 0.2378548       0.1800823     0.408631990       0.4094569       0.1468518       0.7562532      0.06188941  -69.43195  -45.31271
## 4 TraesCS1A02G063600 5.812801e-01 7.221219e-01 4.604533e-01 5.723358e-01        0.03230882       -0.02971638        0.01087286          1.1652031         0.88727255          1.0652878 0.03230882  1.165203 0.3150442 0.46300573 0.2154362       0.3320959     0.446646310       0.2200791       0.2976223       0.4773411      0.21030973  -60.00160  -59.24261
## 5 TraesCS1A02G067300 2.436682e-02 5.857410e-02 4.942007e-03 1.187983e-02       -0.01115748        0.01800109        0.00755438          0.9511745         1.08020140          1.0380239 0.01800109  1.080201 0.3270065 0.37708495 0.2883377       0.3216624     0.386236395       0.2917494       0.3331768       0.3680521      0.28392523  -89.19414  -83.90700
## 6 TraesCS1A02G072800 3.086719e-09 6.018062e-08 6.825001e-13 1.338510e-11        0.15418804        0.20978820       -0.28546067          2.3896451         3.07768136          0.3032963 0.28546067  3.297106 0.2442037 0.26838213 0.4495816       0.3203725     0.373202695       0.3049541       0.1666320       0.1617500      0.59349763  -69.72687  -41.46689

The output can be sorted in ascending order of p-values using the following code.

head(x_output[order(x_output$pvalue), ])
##                   gene       pvalue       qvalue   raw_pvalue   raw_qvalue D__A__(apex-leaf) D__B__(apex-leaf) D__D__(apex-leaf) OR__A__(apex/leaf) OR__B__(apex/leaf) OR__D__(apex/leaf)      Dmax     ORmax theta0__A theta0__B theta0__D theta1__A__apex theta1__B__apex theta1__D__apex theta1__A__leaf theta1__B__leaf theta1__D__leaf logLik_H0  logLik_H1
## 441 TraesCS7A02G209600 1.712481e-34 8.562406e-32 5.375601e-49 2.687801e-46        0.47020891        -0.7240464         0.2527981         43.6406630         0.01364241         18.4817071 0.7240464 73.300842 0.2582908 0.5931562 0.1478044       0.4931408      0.23092207       0.2742402      0.02185092       0.9567519      0.02006728 -180.7215  -69.61950
## 62  TraesCS1A02G445500 6.065366e-29 1.516341e-26 4.604170e-41 1.151043e-38       -0.02651230         0.2702367        -0.2346247          0.8703762         4.38084864          0.3815521 0.2702367  4.380849 0.2607835 0.2704003 0.4634814       0.2480120      0.40542386       0.3459432      0.27416205       0.1346225      0.58084644 -191.3739  -98.74309
## 299 TraesCS5A02G022200 4.338958e-23 7.231597e-21 1.077170e-32 1.795283e-30        0.35991232        -0.1514078        -0.2057361          5.3638957         0.18918271          0.4329295 0.3599123  5.363896 0.3596230 0.1200954 0.5183583       0.5396559      0.04394851       0.4159984      0.17966007       0.1959991      0.62208898 -183.3925 -109.95285
## 268 TraesCS4A02G250600 3.364910e-21 4.206138e-19 5.413518e-30 6.766898e-28       -0.05530722         0.2734420        -0.1529501          0.7621023         4.46732813          0.5272217 0.2734420  4.467328 0.2874667 0.2725191 0.4078531       0.2600167      0.40953660       0.3297682      0.31379598       0.1350041      0.48497640 -119.9200  -52.04104
## 422 TraesCS6A02G314400 1.413693e-20 1.413693e-18 4.213261e-29 4.213261e-27       -0.30485015         0.2158651         0.1043494          0.2833779         3.28910794          1.7963450 0.3048501  3.528857 0.4963563 0.2568841 0.2380839       0.3444890      0.36464850       0.2904999      0.64884277       0.1486107      0.18558789 -150.7537  -85.65136
## 439 TraesCS7A02G197400 2.352092e-20 1.960077e-18 8.723259e-29 7.269383e-27        0.02933763         0.2274193        -0.2474215          1.1458380         4.28143124          0.3610695 0.2474215  4.281431 0.3168301 0.2190836 0.4591570       0.3313281      0.33279515       0.3354008      0.30198738       0.1048420      0.58218999 -143.3949  -78.97352

Next, we visualize changes in expression ratios of the first subgenome between the two tissues. In the plot, homeologs with q-values less than 0.01 are highlighted.

is_sig <- ifelse(x_output$qvalue < 0.01, "q<0.01", "q>=0.01")
plot_HER(x, label = is_sig)
Changes in homeolog expression ratios of the A-subgenome between shoot apex and leaf tissues. Each point represents the ratio in shoot apex (x-axis) and leaf (y-axis). Orange points indicate homeologs with significant changes, gray points indicate homeologs without significant changes.

Figure 3.7: Changes in homeolog expression ratios of the A-subgenome between shoot apex and leaf tissues. Each point represents the ratio in shoot apex (x-axis) and leaf (y-axis). Orange points indicate homeologs with significant changes, gray points indicate homeologs without significant changes.

To change the subgenome used for calculating expression ratios, set the base argument to the corresponding subgenome number. For example, to use the second subgenome, run the following.

plot_HER(x, base = 2, label = is_sig)
Changes in homeolog expression ratios of the B-subgenome between shoot apex and leaf tissues. Each point represents the ratio in shoot apex (x-axis) and leaf (y-axis). Orange points indicate homeologs with significant changes, gray points indicate homeologs without significant changes.

Figure 3.8: Changes in homeolog expression ratios of the B-subgenome between shoot apex and leaf tissues. Each point represents the ratio in shoot apex (x-axis) and leaf (y-axis). Orange points indicate homeologs with significant changes, gray points indicate homeologs without significant changes.

Note that expression ratios are calculated for the selected subgenome, while highlighted homeologs are those that meet the significance threshold in any subgenome. As a result, some homeologs may appear highlighted even if they show only minor changes in the plot, because they may have significantly altered expression ratios in another subgenome.

3.2.4 Combining Subgenomes

Additionally, users who want to test expression ratios between the combined A- and B-subgenomes versus the D-subgenome can first merge the expression values of the A and B homeologs using the combine_hexp() function. After combining, the statistical test can be performed on the merged dataset. Assuming the A- and B-subgenomes are stored as the first and second subgenomes in the x object (as defined in the mapping_table), set subgenomes = c(1, 2) to combine them.

x_ABvsD <- combine_hexp(x, subgenomes = c(1, 2))
x_ABvsD
## # 2 subgenome sets (D, A+B)
## # 500 homeolog tuples
## ---------------------
## Experiment Design:
##   group
## 1  apex
## 2  apex
## 3  apex
## 4  leaf
## 5  leaf
## 6  leaf
## ---------------------
## > subgenome: D 
##           [,1]      [,2]      [,3]        [,4]        [,5]        [,6]
## [1,]  69.91587  92.91579  69.32209  233.369104  182.885467  141.103672
## [2,] 121.76359 201.46227 146.99623 3421.450364 2267.779789 2905.972925
## [3,]  84.05615  99.86276  68.48688    0.000000    0.000000    0.000000
## [4,]  46.34872  41.68185  52.61797    5.185980    7.315419    5.084817
## [5,] 913.61969 952.60392 941.27700   23.336910   12.192364   34.322515
## [6,] 107.62330 120.70369 103.56553    1.296495    1.219236    0.000000
## +++++++++++++++++++++
## > subgenome: A+B 
##           [,1]      [,2]      [,3]        [,4]        [,5]       [,6]
## [1,]  122.5492  139.8079  143.6554  159.468888  168.254630  130.93404
## [2,]  164.9700  142.4130  189.5917 2672.076241 1719.123389 2322.49017
## [3,]  105.2666  140.6762  117.7640    1.296495    1.219236    8.89843
## [4,]  183.0382  154.5702  161.1947   11.668455   31.700148   22.88168
## [5,] 2161.8929 2321.1579 2341.9172   51.859801   51.207931   75.00105
## [6,]  228.6013  270.0636  255.5730    0.000000    0.000000    0.00000
## ---------------------

HOBIT can then be applied to the combined dataset as below.

x_ABvsD_output <- hobit(x_ABvsD)
head(x_ABvsD_output[order(x_ABvsD_output$pvalue), ])
##                   gene       pvalue       qvalue   raw_pvalue   raw_qvalue D__D__(apex-leaf) D__A+B__(apex-leaf) OR__D__(apex/leaf) OR__A+B__(apex/leaf)      Dmax     ORmax theta0__D theta0__A+B theta1__D__apex theta1__A+B__apex theta1__D__leaf theta1__A+B__leaf  logLik_H0 logLik_H1
## 62  TraesCS1A02G445500 3.848256e-17 1.924128e-14 7.962699e-24 3.981349e-21        -0.2487453           0.2487453          0.3603239           2.77528074 0.2487453  2.775281 0.4700614   0.5299386       0.3466064         0.6533936      0.59447553         0.4055245 -119.94996 -69.52158
## 6   TraesCS1A02G072800 1.544315e-13 3.860788e-11 1.069171e-18 2.672927e-16        -0.4803676           0.4803676          0.1181814           8.46157415 0.4803676  8.461574 0.5480605   0.4519395       0.3055404         0.6944596      0.78982973         0.2101703  -69.10290 -30.18006
## 107 TraesCS2A02G262400 1.914063e-12 3.190105e-10 3.832662e-17 6.387770e-15        -0.5210973           0.5210973          0.0979703          10.20720894 0.5210973 10.207209 0.4978944   0.5021056       0.2352690         0.7647310      0.75914356         0.2408564  -67.82316 -32.54267
## 441 TraesCS7A02G209600 1.102556e-11 1.378195e-09 4.616052e-16 5.770065e-14         0.2523203          -0.2523203         18.6952750           0.05348946 0.2523203 18.695275 0.1472152   0.8527848       0.2732705         0.7267295      0.01975070         0.9802493  -84.38861 -51.58576
## 355 TraesCS5A02G352700 2.695565e-11 2.603103e-09 1.644148e-15 1.644148e-13         0.3474770          -0.3474770          7.3613621           0.13584442 0.3474770  7.361362 0.2764225   0.7235775       0.4501455         0.5498545      0.09966664         0.9003334  -66.77285 -35.17498
## 268 TraesCS4A02G250600 3.123723e-11 2.603103e-09 2.027057e-15 1.689214e-13        -0.2796697           0.2796697          0.3156999           3.16756473 0.2796697  3.167565 0.4699997   0.5300003       0.3297345         0.6702655      0.61045817         0.3895418  -69.60987 -38.07149

3.2.5 Analysis Environment

sessionInfo()
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8   
##  [6] LC_MESSAGES=C.UTF-8    LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C           LC_TELEPHONE=C        
## [11] LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] UpSetR_1.4.0    gridExtra_2.3   ggplot2_4.0.0   hespresso_1.0.1
## 
## loaded via a namespace (and not attached):
##  [1] tensorA_0.36.2.1     sass_0.4.10          future_1.67.0        generics_0.1.4       lattice_0.22-7      
##  [6] hms_1.1.3            listenv_0.9.1        digest_0.6.37        magrittr_2.0.4       evaluate_1.0.5      
## [11] grid_4.5.1           RColorBrewer_1.1-3   bookdown_0.44        fastmap_1.2.0        plyr_1.8.9          
## [16] jsonlite_2.0.0       processx_3.8.6       progress_1.2.3       backports_1.5.0      limma_3.64.3        
## [21] ps_1.9.1             scales_1.4.0         codetools_0.2-20     jquerylib_0.1.4      abind_1.4-8         
## [26] cli_3.6.5            rlang_1.1.6          crayon_1.5.3         cmdstanr_0.9.0       parallelly_1.45.1   
## [31] future.apply_1.20.0  splines_4.5.1        withr_3.0.2          cachem_1.1.0         yaml_2.3.10         
## [36] tools_4.5.1          parallel_4.5.1       checkmate_2.3.3      locfit_1.5-9.12      globals_0.18.0      
## [41] vctrs_0.6.5          posterior_1.6.1      R6_2.6.1             lifecycle_1.0.4      edgeR_4.6.3         
## [46] pkgconfig_2.0.3      progressr_0.16.0     pillar_1.11.1        bslib_0.9.0          gtable_0.3.6        
## [51] Rcpp_1.1.0           data.table_1.17.8    glue_1.8.0           statmod_1.5.0        xfun_0.53           
## [56] tibble_3.3.0         knitr_1.50           farver_2.1.2         htmltools_0.5.8.1    labeling_0.4.3      
## [61] rmarkdown_2.29       compiler_4.5.1       prettyunits_1.2.0    S7_0.2.0             distributional_0.5.0