Whole-genome survey and phylogenetic analysis of Gadus macrocephalus

Abstract Gadus macrocephalus (Pacific cod) is an economically important species on the northern coast of the Pacific. Although numerous studies on G. macrocephalus exist, there are few reports on its genomic data. Here, we used whole-genome sequencing data to elucidate the genomic characteristics and phylogenetic relationship of G. macrocephalus. From the 19-mer frequency distribution, the genome size was estimated to be 658.22 Mb. The heterozygosity, repetitive sequence content and GC content were approximately 0.62%, 27.50% and 44.73%, respectively. The draft genome sequences were initially assembled, yielding a total of 500,760 scaffolds (N50 = 3565 bp). A total of 789,860 microsatellite motifs were identified from the genomic data, and dinucleotide repeat was the most dominant simple sequence repeat motif. As a byproduct of whole-genome sequencing, the mitochondrial genome was assembled to investigate the evolutionary relationships between G. macrocephalus and its relatives. On the basis of 13 protein-coding gene sequences of the mitochondrial genome of Gadidae species, the maximum likelihood phylogenetic tree showed that complicated relationships and divergence times among Gadidae species. Demographic history analysis revealed changes in the G. macrocephalus population during the Pleistocene by using the pairwise sequentially Markovian coalescent model. These findings supplement the genomic data of G. macrocephalus, and make a valuable contribution to the whole-genome studies on G. macrocephalus.


Introduction
Gadus macrocephalus (Pacific cod) belongs to the order Gadiformes, Gadidae ( Figure 1). It is mainly distributed in the northern coast of the Pacific, from the Yellow Sea of China through the Sea of Japan and Okhotsk and Bering Seas to California in the eastern North Pacific [1,2]. The species is economically important throughout its distribution range. However, the age structure of G. macrocephalus from the Yellow Sea of China is relatively simple, and its resources are easily affected by changes in the external environment [3]. In recent years, owing to human overfishing and environmental variations, the total number of catches of G. macrocephalus in Chinese waters has declined [4]. At present, research on G. macrocephalus mainly focuses on its chemical composition [5], migratory patterns [6], population genetics [7], feeding [8] and immune responses [9]. Complete mitochondrial genome sequences are available for this fish [10], but the genome sequences relatively less are known, which will be immensely useful for the management of genetic resources.
With the rapid development of sequencing technology, whole-genome sequencing (WGS) has been widely applied in genomic studies of fish species [11,12]. Genome survey analysis using WGS data was performed in a number of species to obtain fundamental genomic information, viz. genome size, heterozygosity levels, repetitive sequence content and GC content [13,14]. Information on genome-wide microsatellite markers as well as mitochondrial genomes, obtained from genomic data, could be helpful for population genetics and evolutionary analysis. In addition, WGS data can also provide better insight into the demographic history of a population [15].
In this report, WGS data were employed to investigate the genomic characteristics of G. macrocephalus. The mitochondrial genome of G. macrocephalus was assembled. Moreover, phylogenetic relationships, microsatellite motifs and demographic history inference were also carried out. The genomic data and phylogenetic analysis will be significant for genetic research in the future.

DNA extraction and sequencing
The experimental design, sequencing and data analysis pipeline were showed ( Figure 2). We collected muscle tissues from one G. macrocephalus in the fishing port, Qingdao, Shandong Province, China, in December 2020. Then, the muscle tissue was quickly frozen in liquid nitrogen for 1 h before storage at −80 • C. The DNeasy Blood and Tissue Kit (Qiagen, Germany) was used to extract total genomic DNA from the muscle tissue following the manufacturer's instructions. For sequencing, DNA libraries were constructed with insert sizes of 350 bp, and paired-end sequencing was performed on the Illumina HiSeq 4000 platform. The sequence data were deposited in the Short Read Archive (SRA) database (http://www.ncbi.nlm.nih.gov/sra) under accession number PRJNA793360.

Genome assembly and identification of microsatellite motifs
To produce clean data, the raw data from WGS were filtered by Fastp 0.20.1 [16], removing the adaptors, ambiguous reads and low-quality reads. To assess the characteristics of the G. macrocephalus genome, we used the clean data for K-mer analysis (K = 19) in Jellyfish software [17]. From the results of K-mer analysis, we obtained the size, heterozygosity and repetitive sequence content of the genome. Genome size was estimated by using the equation Genome size = K-mer num / peak depth, where K-mer num is the total number of best predicted K-mers and peak depth is the expected value of the K-mer depth [17]. After that, genome size was revised by excluding K-mer errors via the formula Revised genome size = Genome size × (1 − Error rate) [17].
To obtain the genomic sequence, we used SOAPdenovo software [18] to assemble the genome. Before assembly, we excluded clean reads with a length <100 bp to generate a high-quality genome. Clean reads were assembled into contigs by using the de Bruijn graph algorithm [18]. On the basis of their pair-end relations, contigs were connected into scaffolds, with 'N' representing unknown sequences between two contiguous contigs [18]. To search for simple sequence repeats (SSRs) in the G. macrocephalus genome, we used the Perl script MIcroSAtellite (MISA, http://pgrc.ipk-gatersleben.de/misa) to identify microsatellite motifs. The parameters were set as the detection of mono-, di-, tri-, tetra-, penta-and hexanucleotide SSR motifs with a minimum of 10, 6, 5, 5, 5 and 5 repeats, respectively [19].

Mitochondrial genome assembly, phylogenetic and demographic history analysis
To obtain the 13 protein-coding gene sequences of mitochondrial genome, the clean reads from Fastp software were used to construct the mitochondrial genome via MitoFinder v.1.4 [20]. MitoFinder employs the MEGAHIT method [21] for mitochondrial genome assembly. We used the mitochondrial genome sequence of G. macrocephalus in the NCBI GenBank database (accession number: NC036931) as the reference. Then, these contigs were used as a guide for structural annotation of the mitochondrial genome in MitoFish [22].
To investigate the phylogenetic relationships and divergence times of Gadidae species (especially for intraspecific relationships), we downloaded 13 protein-coding gene sequences of the mitochondrial genome of Gadidae species and an outgroup (Bregmaceros nectabanus, smallscale codlet) from the GenBank database. A phylogenetic tree was drawn by the maximum likelihood algorithm with 1000 replicates in MEGA 11 [23]. Additionally, the divergence time analysis of Gadidae species was performed with the clocks tool of MEGA 11. We selected three calibration points from the TimeTree tool (http://www.timetree.org) [24] to determine the divergence times: (1) G. macrocephalus and Gadiculus thori (silvery pout) diverged 25. To infer the demographic history of G. macrocephalus, the whole genome of G. morhua (Atlantic cod; accession number: GCA 902167405.1) was retrieved from the NCBI database and used as a reference genome. Then, the clean reads of G. macrocephalus were aligned to the reference genome of G. morhua by using the bwa-mem algorithm in Burrows-Wheeler Aligner (BWA) [25]. We estimated the history of effective population sizes (N e ) using the pairwise sequentially Markovian coalescent (PSMC) model [15]. The PSMC analysis command included the maximum of 25 model iterations (-N25), the upper limit for the most recent common ancestor (-t15), '-r5' for the initial θ/ρ,  and '-p 4+25*2+4+6' atomic intervals. The reconstructed demographic history was plotted using the 'psmc plot.pl' script with a generation time of g = 6 years and a mutation rate of μ = 8.29 × 10 −9 site −1 year −1 (5.0 × 10 −8 site −1 generation −1 ) [26]. By default, the variance of N e was estimated using 100 bootstrap replicates.

Genome assembly and identification of microsatellite motifs
A total of 56.63 Gb raw data were generated by the Illumina HiSeq platform (Table 1). After filtering and quality assessment, the clean data amounted to approximately 53.94 Gb. The Q20 (97.19%) and Q30 (93.08%) were over 90%, and the GC content was approximately 44.73%. These data indicated that the clean data were sufficient to capture most of the genomic information.
Jellyfish software was used for K-mer analysis. As a result, the peak of the 19-mer frequency distribution was 69, and the total number of K-mers was 47,386,140,246 ( Figure 3). After removing K-mer errors, the results of K-mer analysis  showed that the revised genome size of G. macrocephalus was approximately 658.22 Mb. In the K-mer distribution, there was no obvious heterozygous peak, and the heterozygosity was approximately 0.62%. The repetitive sequence content of G. macrocephalus was 27.50%. The clean reads were used to carry out de novo assembly based on K-mer analysis. Table 2 showed the overall statistics of genome assembly. After assembly, a total length of 532,671,928 bp contigs were obtained with the contig N50 value of 867 bp and N90 value of 154 bp, and the maximum contig was 36,036 bp in length. A total of 1,187,811 contigs were obtained, and only the length of 119 contigs >10 kb. The number of scaffolds was 500,760 and the total length of scaffolds was approximately 527,441,282 bp, and gaps accounted for 3.02% of the total length of scaffolds. The scaffold N50 value of 3565 bp and N90 value of 424 bp, and the maximum scaffold was 66,220 bp in length.

Mitochondrial genome assembly, phylogenetic and demographic history analysis
Mitochondrial genome assembly yielded a closed circular DNA molecule with a length of 16,564 bp (accession number: OM174271). A comparison of the lengths of the mitochondrial genomes of G. macrocephalus from previously reported and assembled in this study, revealed that they were extraordinarily similar (a difference of only 3 bp), and the identity of these two mitochondrial genome sequences reached 100%. These two mitochondrial genomes contained the same set of 37 genes, including 13 protein-coding genes, 22 tRNA genes and 2 rRNA genes. Among them, 8 tRNA genes (tRNA-Pro, tRNA-Gln, tRNA-Ala, tRNA-Asn, tRNA-Cys, tRNA-Tyr, tRNA-Ser and tRNA-Glu) and ND6 were located in the light chain, and the others were located in the heavy chain. Besides, structural annotation of the mitochondrial genome indicated that it was composed of a coding region and a D-loop ( Figure 6). The D-loop was located between tRNA-Pro and tRNA-Phe with location ranging from 1 to 53 bp and 15,752 to 16,564 bp.
A maximum likelihood phylogenetic tree was constructed based on the 13 protein-coding gene sequences of the mitochondrial genome of Gadidae species, and showed a bootstrap value > 90% (based on 1000 replicates). Divergence time analysis showed that Gadidae species shared a common ancestor approximately 25.90 Mya (Figure 7). And the divergence time of seven lineages were mainly concentrated 0.07-11.95 Mya. The genus Theragra and Gadus displayed the closest genetic relationship and diverged 5.69 Mya. Furthermore, G. macrocephalus underwent genetic The PSMC model was used to infer historical changes in N e (Figure 8). The BWA results indicated that the rate of comparison was 96.76%, and the PSMC results showed a relatively stable N e . Demographic history analysis revealed that G. macrocephalus experienced a bottleneck effect from 300,000 to 40,000 years ago. Subsequently, the population experienced significant expansion between 40,000 and 15,000 years ago.

Genome assembly and identification of microsatellite motifs
The order Gadiformes species include some of the most important commercial fish (e.g., G. morhua, G. macrocephalus) in the world and account for approximately 18% of the world's total marine fish catch [27]. So far, the only available high-quality genome sequences of the order Gadiformes species are for G. morhua [28], M. aeglefinus [29], G. chalcogrammus (walleye pollock) [30] and Lota lota (burbot) [31]. Additionally, in the genus Gadus, the genomic data of G. morhua are relatively comprehensive [28,32]. However, very limited genomic researches have been focused on G. macrocephalus both at home and abroad.
In the present study, WGS was applied to preliminarily reveal the genomic data of G. macrocephalus. The genome size of G. macrocephalus was relatively small. Similarly, the genome sizes of the other fishes of the order Gadiformes species were less than 1 Gb: G. morhua (830 Mb) [28], M. aeglefinus (653 Mb) [29], G. chalcogrammus (683.61 Mb) [30] and Lota lota (576 Mb) [31]. The repetitive sequence content of G. macrocephalus was similar to the repetitive sequence content of other Gadidae species, such as G. morhua (25.40%) [28] and G. chalcogrammus (38.89%) [30]. But Lota lota (66.74%) [31] had higher repetitive sequence content. Repetitive sequences not only affect the structure of chromosomes but also control genomic evolution and recombination [33]. Further investigation is required to understand the functions of repetitive sequences in G. macrocephalus. For a genome assembly, if the heterozygosity is less than 1%, the genome is relatively effortless to assemble [17]. Therefore, for G. macrocephalus, the slightly low proportion of heterozygosity reflected that the genome assembly is relatively easy. Overall, combined with the  heterozygosity and repetitive sequence content being slightly low, the small genome will make genome assembly easier in subsequent studies. This is the first time that the genome sequences of G. macrocephalus were assembled. For the genome sequences of G. macrocephalus, the lengths of N50 contigs and scaffolds were short. A genome assembly would have high utility with an N50 contig size > 30 kb and an N50 scaffold size > 250 kb [34]. According to the criteria, the assembly quality was low in the present study. In the process of genome assembly, we selected K-mer (K = 19) possibly led to the low assembly quality [17]. Considering the low assembly quality, the gene prediction and annotation of genome sequences were not included, these genome sequences were only used to identify SSRs in the study. Studies have indicated that a higher-quality, chromosome-level genome assembly provides more accurate information for biological research  Demographic history analysis of G. macrocephalus (thick line) and the 100 bootstrap replicates (thin lines) inferred from the whole-genome sequence data. The x-axis represents years (g: generation time, μ: mutation rate), and the y-axis indicates the effective population size of G. macrocephalus. [11,12]. The strategy of Illumina combined with third-generation and Hi-C techniques [35][36][37] should be applied to obtain a higher-quality genome of G. macrocephalus.
Recently, SSRs, which are chromosome-specific and frequently inherited in a Mendelian codominant fashion, have become the most commonly used molecular markers [38]. Many researchers have used DNA microsatellite loci to study the genetic diversity of G. macrocephalus [39,40]. In the present study, the frequency of microsatellites in the G. macrocephalus genome was very high (1497.5 microsatellites per Mb), which was markedly higher than that reported in Sebastiscus albofasciatus (556.8 microsatellites per Mb) [13] and Tor tambra (971.39 microsatellites per Mb) [14]. Among the motif types of microsatellites, the dinucleotide repeats were the most abundant because the number of repetitions is inversely proportional to the length of repetitions [41]. To ensure the usability of the microsatellite markers developed to date, subsequent validation studies are required.

Mitochondrial genome assembly, phylogenetic and demographic history analysis
Complete mitochondrial genome is a powerful and popular tool to unravel the higher-level relationships of teleosts [42]. Hence, we constructed the complete mitochondrial genome of G. macrocephalus to further decipher its taxonomic status and systematic evolution. The genome organization and gene composition of the mitochondrial genome of G. macrocephalus were consistent with the results of a previous study using fish-versatile PCR primers [10]. The AT content was higher than the GC content in the mitochondrial genome of G. macrocephalus, similar to patterns reported in most vertebrates [43]. Besides, we further provided structural annotation of the mitochondrial genome of G. macrocephalus. Through the structural annotation of mitochondrial genome, we obtained the detailed information of genetic distribution and genomic structure, which could provide important raw materials for functional annotation and evolutionary analysis.
The phylogenetic tree showed that the phylogenetic relationships and divergence times among Gadidae species. 0.07-11.95 Mya, the diversity of seven lineages became increasingly abundant because climatic conditions were warmer and wetter [44]. Importantly, G. macrocephalus and G. morhua diverged 3.59 Mya, subsequently, G. macrocephalus from three different places diverged 0.07 Mya. Quaternary climate oscillations and geographical heterogeneity play important roles in determining species and genetic diversity distribution patterns [45], but how these factors affect the divergence of G. macrocephalus at the population level remains poorly understood. More efforts are still to be made in the further research on the divergence of this species.
Demographic history can be used to clarify the impact of geological and climate changes on the current species distribution, and help to formulate reasonable and effective strategies to preserve endangered species [46]. The present genetic structure of populations, species and communities has been mainly formed by Pleistocene climatic fluctuations [47]. Many marine fish species experienced bottlenecks and expansions in response to climate changes during the Pleistocene [48,49]. In the present study, reconstructing the Pleistocene demographic history of G. macrocephalus, revealed that G. macrocephalus population also experienced bottleneck and expansion. Similarly, Canino et al. reported that G. macrocephalus expansions were evident by using the mismatch distributions and Bayesian skyline plots of mitochondrial genome, but the start of expansions appeared to predate the last glacial maximum (21,000 years ago) [26]. This finding was consistent with our results. The results of demographic history analysis have important implications for the fishery management and conservation efforts of G. macrocephalus.

Conclusions
In summary, the first whole-genome survey of G. macrocephalus was performed based on WGS data. The present study uncovered genomic characteristics of G. macrocephalus, including small genome, slightly low heterozygosity and repetitive sequence content. SSRs were identified from the genomic data, which could provide novel insight into population genetics and germplasm resource conservation in the future. Besides, the evolutionary relationships and divergence times of Gadidae species were determined, and the demographic history of G. macrocephalus was reconstructed. To improve the assembly quality of G. macrocephalus genome, we suggest that high-quality genome sequences should be generated by using third-generation sequencing and Hi-C techniques in the future.

Data Availability
The sequence data were deposited in the Short Read Archive (SRA) databank (http://www.ncbi.nlm.nih.gov/sra) and are available under accession number PRJNA793360. The mitochondrial genome sequence was deposited in GenBank database with accession number OM174271.