5.1 Simulating Allotetraploid Data

The hespresso package provides functions to simulate artificial read counts for allotetraploid and allohexaploid species. By default, the simulation generates read counts for an allotetraploid with 10,000 homeologs across two conditions, each with three biological replicates. The mean and dispersion parameters are estimated from real RNA-Seq data from an allotetraploid Cardamine flexuosa dataset (Akiyama et al. 2021).

To generate a simulated dataset with default options, run as follows.

x <- sim_homeolog_counts()
x

## # 2 subgenome sets (A, B)
## # 10000 homeolog tuples
## ---------------------
## Experiment Design:
##     group replicate
## 1 group_1         1
## 2 group_1         2
## 3 group_1         3
## 4 group_2         1
## 5 group_2         2
## 6 group_2         3
## ---------------------
## > subgenome: A 
##        group_1__1 group_1__2 group_1__3 group_2__1 group_2__2 group_2__3
## gene_1         27         22         16         14         14         22
## gene_2        278          4         12         87        113         93
## gene_3        614        316        433        437        228        631
## gene_4        360        341        311        359        237        411
## gene_5        138         42        214        150        189        165
## gene_6       1028       1063       1171        383        385        342
## +++++++++++++++++++++
## > subgenome: B 
##        group_1__1 group_1__2 group_1__3 group_2__1 group_2__2 group_2__3
## gene_1          9         21         11         21         19         21
## gene_2        124        107        123        218        187        200
## gene_3        162        149        206        180        235        259
## gene_4        373        329        384        353        269        374
## gene_5         63         45         69         75         49         43
## gene_6        101         93        102         67         98         83
## ---------------------

Options can be flexibly customized by users. For example, to simulate data with 1,000 homeologs and 5 replicates per condition, run as follows.

x <- sim_homeolog_counts(n_gene = 1000, n_replicates = c(5, 5))
x

## # 2 subgenome sets (A, B)
## # 1000 homeolog tuples
## ---------------------
## Experiment Design:
##      group replicate
## 1  group_1         1
## 2  group_1         2
## 3  group_1         3
## 4  group_1         4
## 5  group_1         5
## 6  group_2         1
## 7  group_2         2
## 8  group_2         3
## 9  group_2         4
## 10 group_2         5
## ---------------------
## > subgenome: A 
##        group_1__1 group_1__2 group_1__3 group_1__4 group_1__5 group_2__1 group_2__2 group_2__3 group_2__4 group_2__5
## gene_1         59         58         58         51         47         70         43         71         53         45
## gene_2          8          2          5         36         13          6          9          6          7          6
## gene_3          4          4          4          4          9          0          0          1          8          0
## gene_4         27         34         34         28         18          8          6         10         18         15
## gene_5         52         43         32         54         37         31         10          5         22         10
## gene_6       1273       1749       1696       2733       1867       1506       1748       1757       2009       1948
## +++++++++++++++++++++
## > subgenome: B 
##        group_1__1 group_1__2 group_1__3 group_1__4 group_1__5 group_2__1 group_2__2 group_2__3 group_2__4 group_2__5
## gene_1         12          3         36          0          7         23         21         19         26         13
## gene_2          6          3          5          2          7          0          0          0          0          0
## gene_3          6          7          9         10          4          8          8         12          8          8
## gene_4         33         12          2          0          1          7          5         11          2          8
## gene_5        202        136        178        156        189        188        167        163        167        153
## gene_6        793        842        918        800        937        989       1112        845        945        992
## ---------------------

Simulated read counts are generated from a negative binomial (NB) distribution. The mean and dispersion values used for each homeolog are stored in the object returned by sim_homeolog_counts(). The get_sim_params() function can be used to retrieve these parameters. For example, to get means used for sampling read counts, run as follows.

param_mu <- get_sim_params(x, "mu")
names(param_mu)

## [1] "group_1" "group_2"

head(param_mu[["group_1"]])

##                  A          B
## gene_1   56.491374  17.328355
## gene_2    8.006597   4.818938
## gene_3    5.944184   6.766986
## gene_4   28.174125   7.362780
## gene_5   44.853340 166.388422
## gene_6 1902.944015 899.045310

The output is a list of data frames, one per condition, with columns representing individual subgenomes. These ground-truth mean values can be used to calculate true homeolog expression ratios for each condition. Users can then define homeologs that show significant changes in expression ratios between conditions.

Alternatively, the def_sigShift() function can automatically define homeologs with significant shifts without manual calculations. For example, to identify homeologs with a maximum absolute ratio difference (Dmax) greater than 0.2 or a maximum odds ratio (ORmax) exceeding 2.0, use the following code.

x <- sim_homeolog_counts()

is_sig <- def_sigShift(x, Dmax = 0.2, ORmax = 2, operator = "OR")
table(is_sig)

## is_sig
## FALSE  TRUE 
##  9260   740

The output shows the number of homeologs meeting these thresholds.

To visualize the distribution of expression ratios across conditions, use the plot_HER_distr() function. This returns a list of histograms, one per condition, with ratios calculated by default for the first subgenome (base = 1).

x <- sim_homeolog_counts()
distr_plots <- plot_HER_distr(x)

Since the simulated dataset contains two groups, the plot_HER_distr() function returns two histograms.

names(distr_plots)

## [1] "group_1" "group_2"

To display the distributions for each group, run as follows.

library(ggplot2)
library(gridExtra)

grid.arrange(distr_plots[["group_1"]] + ggtitle("group_1"), 
             distr_plots[["group_2"]] + ggtitle("group_2"),
             ncol = 2)

Figure 5.1: Distribution of homeolog expression ratios in two simulated groups.

To compare expression ratios between groups, use the plot_HER() function, which by default plots ratios for the first subgenome in a scatter plot.

plot_HER(x, alpha = 0.3)

Changes in homeolog expression ratios between two groups. Each point represents the expression ratio in group 1 (x-axis) and group 2 (y-axis).

Figure 5.2: Changes in homeolog expression ratios between two groups. Each point represents the expression ratio in group 1 (x-axis) and group 2 (y-axis).

If the dataset was generated using sim_homeolog_counts(), ground-truth significant shifts can be highlighted as follows using the def_sigShift() function.

is_sig <- def_sigShift(x, Dmax = 0.2, ORmax = 2.0)
plot_HER(x, label = is_sig, alpha = 0.3)

Changes in homeolog expression ratios between two groups. Orange points indicate homeologs with significant changes in expression ratios, while gray points represent those without significant changes.

Figure 5.3: Changes in homeolog expression ratios between two groups. Orange points indicate homeologs with significant changes in expression ratios, while gray points represent those without significant changes.