Genome survey and microsatellite motif identification of Pogonophryne albipinna

Abstract The genus Pogonophryne is a speciose group that includes 28 species inhabiting the coastal or deep waters of the Antarctic Southern Ocean. The genus has been divided into five species groups, among which the P. albipinna group is the most deep-living group and is characterized by a lack of spots on the top of the head. Here, we carried out genome survey sequencing of P. albipinna using the Illumina HiSeq platform to estimate the genomic characteristics and identify genome-wide microsatellite motifs. The genome size was predicted to be ∼883.8 Mb by K-mer analysis (K = 25), and the heterozygosity and repeat ratio were 0.289 and 39.03%, respectively. The genome sequences were assembled into 571624 contigs, covering a total length of ∼819.3 Mb with an N50 of 2867 bp. A total of 2217422 simple sequence repeat (SSR) motifs were identified from the assembly data, and the number of motifs decreased as the motif length and the number of repeats increased. These data will provide a useful foundation for the development of new molecular markers for the P. albipinna group as well as for further whole-genome sequencing of P. albipinna.

Taxonomically, the genus Pogonophryne is one of the complex taxa distinguished from other taxa by slight meristic differences, and its key diagnostic character, namely the mental barbel, is highly variable in some species [6,8]. It is difficult to compare the morphology of the species in this genus because many of them were described based on only a few specimens from a single sampling site [9,10]. Accordingly, taxonomists have divided the genus Pogonophryne into five species groups: the P. mentella, P. scotti, P. barsukovi, P. marmorata, and P. albipinna groups [5,11].
Phylogenetic studies have been carried out on these groups using several mitochondrial and nuclear markers, and the monophyly of these five species groups was supported by mitochondrial NADH dehydrogenase subunit 2 (ND2) and cytochrome c oxidase I (COI) gene markers [5,10]. However, molecular identification at the species level showed poor resolution due to low genetic variation related to a very recent divergence of the genus Pogonophryne, as is the case with other species in the family Artedidraconidae [10,12-14]. Therefore, it is necessary to develop markers with improved discriminatory ability for genome-wide analyses, such as microsatellite and single nucleotide polymorphism (SNP) markers. In particular, microsatellites, also termed simple sequence repeats (SSRs), have already been validated for their effectiveness in fish species delimitation [15]. The molecular data on Pogonophryne, mostly mitochondrial ND2 and COI, are available from the NCBI GenBank database [2,5] for fewer than half of the species (13 out of 28). Among these species, P. albipinna has recently been reported with its complete mitochondrial genome sequence [16], and this is the first genome survey study of Pogonophryne. Pogonophryne albipinna, also known as the white-fin plunderfish, belongs to the P. albipinna group, which is the most deep-living group of the genus and is mainly characterized by an absence of dark spots on the top of the head [1,5,11].
In the present study, based on next-generation sequencing (NGS), we estimated the genomic characteristics of P. albipinna and identified genome-wide SSR motifs. The present study can be used as a basis for further whole-genome sequencing of P. albipinna and the development of new molecular markers for distinguishing between Pogonophryne species.

Sample preparation and genome survey sequencing
A sample of P. albipinna was collected from the Ross Sea (77°05′S, 170°30′E in CCAMLR Subarea 88.1), Antarctica, and kept frozen while being transferred to the laboratory. The frozen sample was dissected to obtain muscle tissue, from which genomic DNA was extracted following the traditional phenol-chloroform method. DNA quantity and quality were checked using a Qubit fluorometer (Invitrogen, Life Technologies, CA, U.S.A.) and a fragment analyzer (Agilent Technologies, CA, U.S.A.). The species was identified by morphology as well as using mitochondrial COI markers [17]. The DNA was randomly fragmented into 350-bp fragments using a Covaris M220 focused-ultrasonicator (Covaris, MA, U.S.A.). A paired-end DNA library was prepared and sequenced on the Illumina HiSeq 2000 platform according to the manufacturer's protocol.

Data analysis
The quality values of Q20 (percentage of bases whose base call accuracy exceeds 99%) and Q30 (percentage of bases whose base call accuracy exceeds 99.9%) and the GC content were evaluated from the primary Illumina paired-end data. K-mer analysis was conducted using Jellyfish 2.1.4 [18] with K-values of 17, 19, and 25. To estimate the genome size, heterozygosity rate, and repeat content, we used GenomeScope [19] in R version 3.4.4 [20] with the K-mer distribution at K = 25, the K-value for which the GenomeScope model best matched the observed K-mer frequencies. The de novo draft genome was assembled using Maryland Super-Read Celera Assembler (MaSuRCA) version 3.3.4 [21], and contig-level assembly statistics were then calculated using the assemblathon_stats.pl script (available at: https://github.com/ucdavis-bioinformatics/assemblathon2-analysis/blob/master/assemblathon_stats.pl; accessed on 1 January 2021) [22]. Genome-wide identification of di- to hexanucleotide microsatellite motifs with a minimum of five repetitions and primer design were performed using the pipelines of QDD version 3.1.2 [23]. Microsatellites were extracted with 200-bp flanking regions on both sides, and sequences shorter than 80 bp were eliminated. The three QDD steps were run with default parameters, except that the options -contig 1 (step 1), -make_cons 0 (step 2), and -contig 1 (step 3) were added. Primer pairs were selected by Primer3 software [24] to meet the following criteria: an expected PCR product size of 100-150 bp, a primer melting temperature (Tm) of 59-60 °C, and a primer length of 20-25 bases.
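The motif criterion above (di- to hexanucleotide units repeated at least five times) can be illustrated with a minimal Python sketch. This is not the QDD algorithm: the function name find_ssrs and the regular expression are hypothetical, chosen only for illustration, and the sketch detects perfect tandem repeats only, without flanking-region extraction or primer design.

```python
import re


def find_ssrs(sequence, min_repeats=5):
    """Return (start, motif, repeat_count) for perfect SSRs.

    Hypothetical helper, not part of QDD: a unit of 2-6 bases must be
    repeated in tandem at least `min_repeats` times.
    """
    # The lazy {2,6}? tries the shortest unit first, so (AC)n is reported
    # as a dinucleotide repeat rather than as (ACAC)n.
    pattern = re.compile(r"([ACGT]{2,6}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(1)
        # Units such as "AA" are mononucleotide runs, not di- to
        # hexanucleotide motifs, so they are skipped.
        if len(set(motif)) == 1:
            continue
        repeats = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, repeats))
    return hits


if __name__ == "__main__":
    # Toy sequence with one (AC)6 repeat starting at position 4.
    print(find_ssrs("GGTT" + "AC" * 6 + "TTGA"))
```

Applied to each contig of an assembly, a scan of this kind yields counts per motif length and repeat number, which is the sort of tally summarized in the SSR results.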

Genome size estimation and sequence assembly
The genome survey sequencing of P. albipinna yielded a total of ∼57.1 Gb of raw reads through the Illumina paired-end library (Table 1). The Q20 and Q30 values of the raw reads were 96.6 and 91.8%, respectively (Table 1), indicating the high quality of these genome sequencing data [25]. In addition, the GC content of the raw reads was 41.7% (Table 1). The Illumina paired-end data were then used to predict the genomic characteristics of P. albipinna by K-mer analysis. Based on the 25-mer frequency distribution, the genome size was estimated to be 883.8 Mb, and the heterozygous and repetitive sequence rates were 0.289 and 39.03%, respectively (Table 2 and Figure 1). In earlier studies, the nuclear DNA content of P. scotti was measured to be 4.05 pg/diploid cell using the Feulgen staining method [26]. Converted into a haploid genome size, this measurement corresponds to 1.98 Gb, which is more than twice our estimate. Meanwhile, other research on notothenioid genome size by flow cytometry showed that their genome size was 0.78-1.43 Gb [27], and more recent studies based on NGS data indicated a genome size of 0.64-1.06 Gb [28-32]. These size ranges are comparable with our estimate; given the much larger Feulgen-based value, however, further studies are needed to acquire more accurate knowledge of the P. albipinna genome size.
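The conversion from 4.05 pg per diploid cell to a haploid size of 1.98 Gb can be checked with a short calculation. The text does not state the conversion factor; the sketch below assumes the commonly used value of 1 pg of DNA = 0.978 × 10^9 bp, and the function and constant names are illustrative only.

```python
# Assumed conversion factor (not stated in the text): 1 pg = 0.978e9 bp.
BP_PER_PG = 0.978e9


def haploid_size_gb(diploid_pg):
    """Convert a diploid DNA content in pg to a haploid genome size in Gb."""
    haploid_pg = diploid_pg / 2          # one genome copy
    return haploid_pg * BP_PER_PG / 1e9  # bp -> Gb


if __name__ == "__main__":
    feulgen_gb = haploid_size_gb(4.05)   # P. scotti, Feulgen staining [26]
    kmer_gb = 0.8838                     # P. albipinna, K-mer estimate
    print(f"Feulgen-based haploid size: {feulgen_gb:.2f} Gb")
    print(f"Ratio to K-mer estimate:    {feulgen_gb / kmer_gb:.2f}")
```

Under this assumption the Feulgen-based value is roughly 2.2 times the K-mer estimate, consistent with the statement that it is more than twice as high.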
Furthermore, the Illumina paired-end sequences of P. albipinna were assembled into contigs using MaSuRCA. We obtained 571624 contigs with a total length of 819289238 bp. The maximum and N50 contig lengths were 51460 and 2867 bp, respectively, with a GC content of 41.02% (Table 3). These results of genome survey sequencing provide useful preliminary data for further whole-genome studies to achieve more thorough assembly and chromosomal-level scaffolding using novel state-of-the-art genetic techniques.
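For readers unfamiliar with the N50 statistic reported above, the following minimal Python sketch shows the standard definition: the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. It is an illustration only and does not reproduce the assemblathon_stats.pl script; the function name n50 is chosen here for convenience.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig downwards until half of the total
    # assembly length is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50() requires at least one contig length")


if __name__ == "__main__":
    # Toy example: total length 32, and the two longest contigs (8 + 8)
    # already cover half of it.
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

An N50 of 2867 bp therefore means that half of the 819289238 assembled bases lie in contigs of at least 2867 bp.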

Conclusion
In the present study, genome survey sequencing of P. albipinna was conducted to investigate its genomic characteristics and identify microsatellite motifs. The genome size estimated by K-mer analysis (K = 25) was 883.8 Mb, and the heterozygosity and duplication rates were 0.289 and 0.751%, respectively. The assembled genome had a total size of 819.3 Mb, with an N50 of 2867 bp and a GC content of 41.02%. A total of 2217422 SSR motifs were identified from the genome data, among which dinucleotide motifs accounted for the majority of repeat motifs (86.87%). These data will be a useful basis for novel molecular marker development as well as for further whole-genome sequencing of P. albipinna.

Data Availability
The P. albipinna genome project has been registered in NCBI under the BioProject number PRJNA697561. The whole-genome sequence has been deposited in the Sequence Read Archive (SRA) database under accession numbers: SRS13617358 and SAMN17672856.

Competing Interests
The authors declare that there are no competing interests associated with the manuscript.

Ethics Approval
Ethical approval was not required for the present study because no endangered or live animals were involved. The specimen used in the present study was caught by line and hook fishing and was dead when collected. The present study, including sample collection and experimental research on the specimen, was conducted in accordance with the law on activities and environmental protection in Antarctica, as approved by the Minister of Foreign Affairs and Trade of the Republic of Korea (MOFA2794).