Whole genome survey and microsatellite motif identification of Artemia franciscana

Abstract Artemia is an industrially important genus used in aquaculture as a nutritious diet for fish and as an aquatic model organism for toxicity tests. However, despite the significance of Artemia, genomic research remains incomplete and knowledge on its genomic characteristics is insufficient. In particular, Artemia franciscana of North America has been widely used in fisheries of other continents, resulting in invasion of native species. Therefore, studies on population genetics and molecular marker development as well as morphological analyses are required to investigate its population structure and to discriminate closely related species. Here, we used the Illumina Hi-Seq platform to estimate the genomic characteristics of A. franciscana through genome survey sequencing (GSS). Further, simple sequence repeat (SSR) loci were identified for microsatellite marker development. The predicted genome size was ∼867 Mb using K-mer (a sequence of k characters in a string) analysis (K = 17), and heterozygosity and duplication rates were 0.655 and 0.809%, respectively. A total of 421467 SSRs were identified from the genome survey assembly, most of which were dinucleotide motifs with a frequency of 77.22%. The present study will be a useful basis in genomic and genetic research for A. franciscana.


Introduction
The genus Artemia (Crustacea: Branchiopoda: Anostraca), known as brine shrimp, is an aquatic invertebrate living mainly in salt lakes. To date, seven species have been assigned to the genus Artemia excluding the parthenogenetic populations called Artemia parthenogenetica [1]. Artemia species are important in aquaculture industry because their dormant cysts are easily hatched, and the nauplii can be used as a nutrient-rich food for fish [2,3]. In addition, they are also widely used as aquatic model organisms for ecotoxicity tests, along with Daphnia [4][5][6]. However, despite the significance of Artemia in aquaculture, genomic research is still incomplete, and the genomic characteristics are less known, compared with Daphnia, because of the relatively large estimated genome size of 0.93-2.93 Gb [7][8][9].
Artemia franciscana, a native species in North America, has been extensively used and introduced to other continents for commercial purposes, thereby affecting the local population's biodiversity [10,11]. Additionally, there were some incorrect identifications, especially A. franciscana as Artemia salina, in citing the species used as test organisms in literature [12]. These problems, including population structure changes and species misidentification, suggest the need for not only morphological analyses but also population genetic studies and additional marker development to discriminate closely related species.
Genome survey sequencing (GSS) using next-generation sequencing (NGS) is a time-and cost-effective way to evaluate genome information, such as genome size, heterozygosity level, and repeat content, and can be used to develop molecular markers [13,14]. Simple sequence repeats (SSRs) or microsatellites are short tandem repeats of one to six nucleotides that have been utilized as genetic markers because of their outstanding abundance and high variability [15]. In Artemia, several microsatellite markers have already been developed for use in population genetic studies [16,17], but they are limited because they are not based on genome-wide data.
In the present study, we aimed to estimate the genomic characteristics of A. franciscana through GSS and then identify SSRs from GSS for microsatellite marker development. The present study would be useful for population genetics and molecular species identification and as a framework for subsequent whole-genome sequencing of A. franciscana.

Materials and DNA extraction
Nauplii of A. franciscana, originating from the Great Salt Lake (Utah, U.S.A.), were hatched on commercial cysts (INVE Technologies NV, Dendermonde, Belgium). Cultures were maintained in 30 g/l salt water at 25 • C with aeration. Live Tetraselmis sp. was fed to A. franciscana during the culture period. One egg-bearing female individual was cultured separately and the progenies were used in the subsequent experiments. Genomic DNA was extracted from whole five adults using phenol/chloroform method. The quality and quantity of the DNA were checked using a BioAnalyzer (Agilent Technologies, Santa Clara, CA, U.S.A.) and Qubit fluorometer (Invitrogen, Life Technologies, Carlsbad, CA, U.S.A.).

Genome sequencing, assembly, and K-mer analysis
Genomic DNA was randomly sheared into 350-bp fragments using an ultrasonicator (Covaris, U.S.A.). A paired-end DNA library was prepared and sequenced with Illumina Hi-Seq 2000 platform. To ensure the quality of data, adaptors, poly(N) sequences, and low-quality reads were filtered out, and only clean reads were subjected to K-mer (a sequence of k characters in a string) analysis. K-mer analysis was performed using Jellyfish 2.1.4 [18] with a K-value of 17, 19 and 25. Based on the 17-mer distribution, GenomeScope [19] in R version 3.4.4 [20] was used to estimate genome size, heterozygosity rate, and repeat content. The de novo genome assembly was carried out using Maryland Super-Read Celera Assembler (MaSuRCA) version 3.3.4 [21].

SSR detection and primer design
Genome-wide SSR identification was conducted using QDD version 3.1.2 pipeline [22]. First, the assembled sequences were used to extract microsatellite sequences with di-to hexanucleotides motifs. Next, sequences were compared using all-against-all BLAST to detect unique singleton sequences. In the last step, primer pairs were designed using two iterative methods for each sequence.

Genome sequencing data statistics
In the present study, a total of 22.8 Gb of raw data for A. franciscana were generated by Illumina paired-end library ( Table 1). Quality value (Q) is regarded to be correctly sequenced when Q20 and Q30 values are at least 90 and 85%, respectively [23]. The Q20 and Q30 values for the present study were 99.3 and 96.0%, respectively (Table 1); hence, the sequencing accuracy of A. franciscana was high. Additionally, the GC content of the raw data was 35.8% (Table  1).

Genome size prediction
In the present study, Illumina paired-end data was used for K-mer analysis using a K-value of 17. The predicted genome size was approximately 867 Mb (Figure 1). In a previous study, the genome size of A. fransciscana based on flow cytometry was 930 Mb [9]. Our estimation and previous results, measured using different methods, were quite similar, with a difference of 63 Mb. These estimates are smaller than 1.47 Gb in A. salina and 2.93 Gb in tetraploid parthenogenetic population [7,8]. In addition, the heterozygosity and duplication rates were calculated to be 0.655 and 0.809%, respectively (Figure 1).

Genome assembly results
The results of preliminary genome assembly of A. franciscana are shown in Table 2. We obtained 122231 contigs with a total length of 841603395 bp. The maximum and N50 contig lengths were 1508123 and 14130 bp, respectively. The GC content of contigs was 35.50% (Table 2). Further assembly generated 46193 scaffolds with a total length of 938041450 bp. The maximum and N50 scaffold lengths were 2555521 and 67542 bp, respectively. The GC content of the scaffolds was 31.85% (Table 2). These genome survey data provide useful information for genomic research of A. franciscana and related species. However, further study combined with more advanced NGS technologies using PacBio long read sequencing and high-throughput chromosome conformation capture (Hi-C) method are necessary to improve whole genome sequencing and assembly data. If so, A. franciscana could be used in a wider range of research fields (e.g. comparative genomics) as a reference genome.  (Table 3). The percentage of dinucleotide repeats was the highest, and as the repeat motif length increased, the number of loci decreased, similar to other studies [13,24,25]. It has been suggested that longer repeats have higher mutation rates, causing instability [26,27] and shorter persistence times because of their downward mutation bias towards a reduction in repeat number [28].
Overall, the motifs including A or T were more abundant than those including C or G, consistent with the findings of Daphnia pulex genome-wide SSR study [29]. These results might be due to the high slippage rate of A/T motifs, addition of 3 poly(A) tail by retrotransposon elements or transition of methylated C to T residues at CpG sites [14,[30][31][32]. These data for SSRs in A. franciscana will be used as valuable references for the development of microsatellite markers, although further validation studies using various Artemia population are needed.

Conclusion
In the present study, the genome of A. franciscana was analyzed and assembled, and SSR loci were identified from the GSS data. The K-mer analysis (K = 17) estimated the genome size of A. franciscana to be ∼867 Mb, and the heterozygosity and duplication rates were 0.655 and 0.809%, respectively. Genome assembly results showed that contig N50 was 14130 bp, with a total length of 841603395 bp, whereas scaffold N50 was 67542 bp, with a total length of 938041450 bp. A total of 421467 SSRs were identified, of which dinucleotide motifs were the most abundant and hexanucleotide motifs were the least abundant. The present study will be a useful foundation for genomic and genetic studies on A. franciscana.

Data Availability
The A. franciscana genome project has been registered in NCBI under the BioProject number PRJNA449186. The whole-genome sequence has been deposited in the Sequence Read Archive (SRA) database under accession numbers SRS3156165 and SAMN08892388.