Evaluation and comparison of population genetics software in Rabari Tribe of Gujarat population

Today, when forensic experts talk about quantifiable hereditary traits, they do not depend only on the assessment and examination of DNA profiles but also relate them to population structure. The use of high-throughput molecular marker technologies and advanced statistical and software tools has improved the accuracy of human genetic diversity analysis in many populations with limited time and resources. The present study aimed to investigate the genomic diversity in Gujarat’s Rabari population using 20 autosomal genetic markers. Numerous biostatistical software programs are available for the interpretation of population data in forensics. These statistics deal with the measurement of uncertainty and also provide the probability of a random match. The present paper aims to provide a practical guide to the analysis of population genetics data. Three statistical software packages, Cervus, Genepop, and Fstat, are compared and contrasted. The comparison is performed on the profiles obtained from blood samples of fifty unrelated healthy male individuals. DNA was extracted using the organic extraction method, and 20 autosomal STR loci were amplified using the PowerPlex 21 kit (Promega, Madison, WI, USA) and detected on a 3100 Genetic Analyzer (Life Technologies Corporation, Carlsbad, CA, USA). A total of 170 alleles were observed in the Rabari Tribe of Gujarat population, and allele frequencies ranged from 0.010 to 0.480. The highest allele frequency detected was 0.480 for allele 9 at locus TH01. Based on heterozygosity and the polymorphism information content, FGA may be considered the most informative marker. Both the combined power of discrimination (CPD) and the combined power of exclusion (CPE) for the 20 analyzed loci were higher than 0.999999. The combined match probability (CPM) for all 20 loci was 2.5 × 10⁻²².
The results show that the 20 STR loci are highly polymorphic and discriminating in the Gujarat population and could be used for forensic practice and population genetics studies. Of the three programs compared, Fstat proved better suited to analyzing the demographic structure of a single population or a set of populations.


Background
Short tandem repeat (STR) markers have gained much popularity in forensic DNA analysis for human identity testing, paternity testing, and population genetics studies (Wyner et al. 2020). Genomic characteristics such as short sequence length, high polymorphism, and the ability to be amplified from minute quantities of template DNA make STRs useful genetic markers in forensic DNA typing (Butler 2011; Nwawuba Stanley et al. 2020). Allele frequency data obtained from unrelated individuals in a population are essential for obtaining reliable results in the analysis of DNA profiles (Butler 2009). However, to date, few studies have been reported on autosomal STRs in the Gujarat population. Hence, there is a need to report more data for the studied population.
Here, we have reported allele frequencies and forensic parameters of 20 autosomal STR loci in a sample of 50 unrelated healthy adults from the Rabari population.
'Rabari', also known as Rewari or Desai, is derived from a Sanskrit word meaning 'outsiders' (Kohler-Rollefson 1992). The Rabari are settled in the western part of India, which includes the states of Gujarat and Rajasthan. Their settlements are divided into 133 sub-tribes. This study reports the genetic portrait of the Rabari population using the PowerPlex 21 system (D1S1656, D2S1338, D3S1358, D5S818, D6S1043, D7S820, D8S1179, D12S391, D13S317, D16S539, D18S51, D19S433, D21S11, Amelogenin, CSF1PO, FGA, Penta D, Penta E, TH01, TPOX, and vWA). Genotype data were compared and evaluated using three population genetics software packages.
Genetic analysis based on sizeable datasets can provide high statistical confidence, which is useful in forensic cases (Arenas et al. 2017). Powerful new methods have been developed to analyze genetic data, sometimes relying on massive computations. These methods are implemented in various software packages and programs, whose number has grown tremendously in the past few years (Butler 2006; Kumawat et al. 2020). The choice of genetic software depends on the data that need to be analyzed. A population's demographic and genetic structure is defined by parameters such as allele frequencies, gene diversity, heterozygosity, F-statistics, kinship, parentage, and deviation from Hardy-Weinberg equilibrium (Mishra et al. 2019). In this study, three genetics programs, Cervus, Genepop (Rousset 2017), and Fstat, are compared and contrasted using the same dataset. The programs were selected based on (i) ease of downloading, (ii) open access, (iii) the ability to analyze co-dominant data, and (iv) ease of running through a Microsoft Windows interface (Coombs et al. 2008).
This research paper offers a concise and straightforward guide to the principles that form the basis of the most common analyses. It focuses on some of the most widely used population genetics software that runs on the Windows operating system. A detailed comparison describes the inner workings and applications of each program, thus facilitating appropriate selection and use.

Sample collection
The University Research Ethics Review Board approved the study. Settlements of the Rabari population were identified in the state of Gujarat. Individuals from these settlements were approached in person with the help of a Gramsevak (village co-ordinator) or the village head of that area. All the participants were briefed about the purpose of the study. To investigate the genetic diversity of the Rabari population of Gujarat, 50 unrelated healthy male individuals were randomly selected for this study. Peripheral blood was collected from each individual and stored in EDTA tubes. The participants were duly informed, and consent was obtained, as per the Helsinki Declaration (Rickham 1964). Participants ranged from 20 to 50 years of age.

DNA extraction and quantification
Genomic DNA was extracted from whole blood samples using the organic extraction method. Isolated DNA was quantified on an ABI 7500 Real-Time PCR system (Applied Biosystems, Foster City, CA, USA) using the Quantifiler DNA Quantification Kit (Applied Biosystems, Foster City, CA, USA).

DNA electrophoresis and analysis
The PCR products were size-separated by capillary electrophoresis on an ABI 3100 Genetic Analyzer (Life Technologies Corporation, Carlsbad, CA, USA) and sized with the GeneScan500-LIZ internal lane size standard (Thermo) as per the manufacturer's recommended protocol. GeneMapper ID-X Software Version 1.4 (Applied Biosystems, Foster City, CA, USA) was used to determine the size of the amplified fragments. All allele designations were based on a comparison with the allelic ladders provided in the PowerPlex 21 system. All steps were carried out according to the quality assurance standards recommended by the Scientific Working Group on DNA Analysis Methods (SWGDAM 2010).

Statistical analyses
Allele frequencies and parameters of forensic interest, such as genetic diversity, polymorphism information content (PIC), the Hardy-Weinberg equilibrium (HWE) test, observed heterozygosity (HO), expected heterozygosity (HE), null allele frequency, and F-statistics, were calculated using these software programs. The latest versions of the software were studied for functions and features. All three programs are freeware and operate on Windows, Linux, and Mac operating systems. All three were tested using a fixed data set of 50 individuals of the targeted population, typed at twenty markers. The first problem to be addressed was the input data file format, which varies between software packages. A single misplaced space or comma can make the data unreadable or misaligned. Organizing data into the proper format is time-consuming and often takes longer than the analysis itself. Some programs can import or export data in the required format, so that the data need not be reformatted manually. These programs allow experts to prepare the input data file in the required format and make the analysis easier and faster. This matters particularly when a data set has to be analyzed with more than one application. An overview of the software is illustrated with data generated from these twenty autosomal loci in the studied population. The alleles detected on the genetic analyzer were called with GeneMapper software and exported to an Excel sheet, which is a universal way to enter population data.
Cervus reads a text-based genotype file in comma-delimited (.csv) format. Cervus version 3.0.7 can be downloaded from (www.fieldgenetics.com). It provides a template that functions for both co-dominant and diploid data. The data entered in Cervus can be analyzed and converted into the Genepop format, i.e., (.txt) (if unable to read, try using a double extension like txt.txt). Genepop version 4.7.5 can be downloaded from (http://kimura.univ-montp2.fr/rousset/Genepop.htm). Genepop can convert the input file into formats readable by other software such as Fstat and Biosys. For this study, it was converted into the Fstat format, i.e., (.DAT extension). Fstat is a computer program that calculates F-statistics and can be downloaded from (https://www2.unil.ch/popgen/softwares/fstat.htm). All the necessary results were compiled and compared with each other. The significant features and functions of all three programs are noted in the comparative chart (Table 1).
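The reformatting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the conversion routine used by Cervus: the function name `csv_to_genepop`, the column naming (locus name followed by an `a` or `b` suffix), and the example rows are hypothetical, and alleles are assumed to be integer labels (microvariants such as 9.3 would first need recoding). The output follows the general Genepop layout of a title line, one locus name per line, a `Pop` keyword, and one line per individual with the identifier separated from the genotypes by a comma.

```python
import csv
import io


def csv_to_genepop(csv_text, title="Example dataset", digits=3):
    """Convert a genotype table (ID column, then two allele columns per
    locus) into Genepop-format text. Hypothetical helper for illustration."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, records = rows[0], rows[1:]
    # Allele columns come in pairs (e.g. TH01a, TH01b); drop the a/b suffix.
    loci = [header[i][:-1] for i in range(1, len(header), 2)]
    lines = [title] + loci + ["Pop"]
    for rec in records:
        sample_id, alleles = rec[0], rec[1:]
        genotypes = []
        for i in range(0, len(alleles), 2):
            # Empty cells are treated as missing data and coded as 0.
            pair = [int(a) if a.strip() else 0 for a in alleles[i:i + 2]]
            genotypes.append("".join(f"{a:0{digits}d}" for a in pair))
        lines.append(f"{sample_id} , " + " ".join(genotypes))
    return "\n".join(lines) + "\n"


# Made-up genotypes for two individuals at two loci.
example = "ID,TH01a,TH01b,TPOXa,TPOXb\nS1,6,9,8,11\nS2,9,9,,\n"
print(csv_to_genepop(example))
```

Generating the file programmatically avoids the stray spaces and commas that make hand-edited input files unreadable.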

Tools for the population genetic analyses
Cervus (Field Genetics) version 3.0.7
Cervus analyzes genetic data generated from co-dominant markers, namely microsatellites and SNPs. The software rests on two assumptions. Firstly, the genetic markers are independently inherited, i.e., in linkage equilibrium. Secondly, the species is diploid and the genetic markers are autosomal. Cervus uses a statistical likelihood method. It is mainly employed for parentage analysis and occasionally for genetic analysis. Cervus offers additional features such as allele frequency analysis, simulation of parentage analysis (Marshall et al. 1998), parentage and identity analysis, and conversion of the genotype file into other formats such as Genepop, Genetix, and Kinship. The software can handle datasets containing thousands of loci. It calculates the following parameters: (1) Hardy-Weinberg equilibrium; (2) polymorphism information content; (3) observed heterozygosity; (4) expected heterozygosity; (5) alleles per locus (k); (6) F-test; (7) non-exclusion probability for the first parent, second parent, parent pair, identity, and sib identity (Kalinowski et al. 2010).
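To make the per-locus summary statistics listed above concrete, the following Python sketch computes the number of alleles (k), observed heterozygosity, expected heterozygosity, and PIC from a list of diploid genotypes. It is a minimal illustration using the standard textbook definitions, assuming the unbiased form of expected heterozygosity; the source does not state which estimator each program applies, and the function name `locus_summary` and the example genotypes are hypothetical.

```python
from collections import Counter
from itertools import combinations


def locus_summary(genotypes):
    """genotypes: list of (allele1, allele2) tuples for a single locus."""
    n = len(genotypes)
    counts = Counter(allele for pair in genotypes for allele in pair)
    freqs = {allele: c / (2 * n) for allele, c in counts.items()}
    # Observed heterozygosity: fraction of individuals with two different alleles.
    h_obs = sum(1 for a, b in genotypes if a != b) / n
    sum_p2 = sum(p * p for p in freqs.values())
    # Expected heterozygosity with the small-sample correction 2n / (2n - 1).
    h_exp = (2 * n / (2 * n - 1)) * (1 - sum_p2)
    # PIC = 1 - sum(p_i^2) - sum over allele pairs of 2 * p_i^2 * p_j^2.
    pic = 1 - sum_p2 - sum(
        2 * (p * q) ** 2 for p, q in combinations(freqs.values(), 2)
    )
    return {"k": len(freqs), "Ho": h_obs, "He": h_exp, "PIC": pic, "freqs": freqs}


# Made-up genotypes for four individuals at one locus.
print(locus_summary([(9, 9), (9, 6), (6, 7), (9, 7)]))
```

PIC is always lower than the uncorrected expected heterozygosity, which is why the two measures rank loci similarly but not identically.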

Input data files
Cervus reads input data files in comma-delimited (.csv) and text (.txt) formats. All input files can be created in spreadsheet packages such as Microsoft Excel.

Output data files
Cervus reports the results of each analysis separately in a text file with a (.txt) extension. Each analysis also produces its own file type, for example (.sim) for the parentage analysis simulation and (.alf) for allele frequencies (refer to Table 2 and Table 3).

Comments
Floating-point overflow can occur with a large number of loci. Bugs reported in older versions of Cervus have been resolved in Cervus 3.0.7 (Kalinowski et al. 2010). Selecting certain input files, most commonly a genotype file, can occasionally cause a crash. The new version offers a workaround (turn off "preview input files" on the options menu) to resolve this issue. A reasonable starting point for the error rate is 1%. If the kinship relationships are known, Cervus can estimate the actual proportion of loci mistyped from the frequency of mismatches between parents and offspring (Konuma et al. 2000). It has applications in conservation genetics, as its accurate parentage and identity analyses can help wildlife researchers carry out population studies of wildlife species.

Genepop Version 4.7.5
It is a software package that is also available on the R platform. The software is developed and maintained by François Rousset (Package et al. 2020). It can be used for both haploid and diploid data. Genepop has two major functions: (1) it computes linkage disequilibrium, allele frequencies, gene diversity, Hardy-Weinberg exact tests, population differentiation tests, null allele frequencies, analysis of a single genotypic matrix, basic information such as genotypic matrices and observed and expected numbers of homozygotes and heterozygotes, estimates of Nm, F-statistics such as Fst and other correlations, and isolation by distance; (2) it converts files into other formats such as Fstat (data.DAT), Biosys (data.BIO), and Linkdos (data.LKD). Missing data in the datasets are easily handled by Genepop. It places no restrictions on the number of populations or loci (Raymond and Rousset 1995).

Input data files
Genepop accepts an input file in (.txt) format, which can be produced by conversion in Cervus. The input file must be a plain ASCII text file. Once the program is launched, a menu of statistical analyses appears, and the user can choose any of the options to be run.

Output data files
Results are stored automatically in files named data.D, data.E, and so on (where data is the chosen file name). Each analysis option saves its results with its own extension. The Genepop outputs are reported in Table 4 (Crawford 2010). Genepop needs a fast processor to obtain accurate results within a reasonable length of time (Rousset 2008). It places no limit on the number of populations or loci. There is also a web-based version of this program (Excoffier and Heckel 2006).

Fstat version 2.9.4
It is a computer program to calculate F-statistics, developed and maintained by Jérôme Goudet (Goudet 1994). Fstat also estimates Wright's fixation indices (Fis, Fst, and Fit), which assess different levels of population structure. Fis measures the heterozygote deficit within populations, which may reflect a Wahlund effect; Fit measures the global heterozygote deficit; and Fst measures the heterozygote deficit between populations. The program is limited to 3000 individuals and can run up to 200 samples. It can also be used for haploid datasets, and missing data are easily handled. It is a powerful tool for analyzing various aspects of population genetics compared with other software such as Powerstats, which is more time-consuming and labor-intensive.
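For orientation, the three fixation indices can be written in their classical heterozygosity-based form. This is the standard textbook formulation and is given only as a guide: the symbols H_I (mean observed heterozygosity within populations), H_S (mean expected heterozygosity within populations), and H_T (expected heterozygosity in the pooled population) are introduced here for illustration, and the estimators actually implemented in Fstat differ in their sample-size corrections.

```latex
\begin{align}
F_{IS} &= 1 - \frac{H_I}{H_S}, &
F_{ST} &= 1 - \frac{H_S}{H_T}, &
F_{IT} &= 1 - \frac{H_I}{H_T}, \\
(1 - F_{IT}) &= (1 - F_{IS})(1 - F_{ST}).
\end{align}
```

With a single population, as in the present data set, only F_IS can be estimated: a positive value indicates fewer heterozygotes than expected and a negative value indicates an excess.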

Input data files
For Fstat, it is necessary to create an input file with the (.DAT) extension. If alleles are coded with three digits (001-999), this must be declared in the file, and the genotypes must be separated by any number of spaces. Genepop can convert its input file into the .DAT format.
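The layout of such a file can be sketched as follows. This Python example is an illustration based on our reading of the Fstat input format (a first line giving the number of samples, the number of loci, the highest allele label, and the number of digits per allele, followed by one locus name per line and then one line per individual); the function name `to_fstat` and the example data are hypothetical, and the layout should be checked against the Fstat manual before use.

```python
def to_fstat(samples, loci, digits=3):
    """samples: list of (sample_number, [(a1, a2), ...]) with integer alleles;
    0 denotes a missing allele. Returns the text of an Fstat-style .DAT file."""
    n_samples = len({number for number, _ in samples})
    highest = max(a for _, gts in samples for pair in gts for a in pair)
    # Header: samples, loci, highest allele label, digits per allele.
    lines = [f"{n_samples} {len(loci)} {highest} {digits}"] + list(loci)
    for number, gts in samples:
        codes = [f"{a:0{digits}d}{b:0{digits}d}" for a, b in gts]
        lines.append(f"{number} " + " ".join(codes))
    return "\n".join(lines) + "\n"


# Made-up genotypes: two individuals from sample 1, typed at two loci.
data = [(1, [(6, 9), (8, 11)]), (1, [(9, 9), (0, 0)])]
print(to_fstat(data, ["TH01", "TPOX"]))
```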

Output data files
Output files use tab separators, which allow them to be read directly into most spreadsheet programs. Fstat also offers printing options and graphical presentation of data. The outputs of Fstat are reported in Table 5.

Comments
Version 1.2 had far fewer features; these have been updated and extended in the newer version (Goudet 1994). Fstat can process large data sets in a short time. Because it supports only one input data format, researchers may find it difficult to carry out all their calculations in a single program. Fstat has many features that help define the demographic structure of a population.

Result
A researcher may face difficulty in creating an input file. The three programs studied here are linked indirectly, as Cervus can convert a genotype file into Genepop format and Genepop can convert that file into Fstat format. The programs employed in this study thus make the task convenient by producing a readable file in the required format. These software tools were used to calculate various forensic parameters. To analyze a large data set, a time-saving and user-friendly program is necessary. Converting the input data file into the appropriate format is a must, and ideally the software should accept input files in all common formats. A graphical presentation makes the parameters easier to understand.
In the Rabari population, a total of 170 alleles with corresponding allele frequencies ranging from 0.010 to 0.480 were observed (Table 3). All loci were in Hardy-Weinberg equilibrium after applying the Bonferroni correction (Bland and Altman 1995) at a 95% confidence level. The locus D18S51 showed the maximum number of observed alleles, i.e., 14, whereas locus TPOX showed the least number of observed alleles, i.e., 4. The mean number of alleles per locus among the studied loci was 8.500. Allele 9 of locus TH01 (0.48) was the most frequent allele in this population. The observed heterozygosity (Hobs) ranged from 0.660 (PENTA-D) to 0.900 (PENTA-E, FGA), and the expected heterozygosity (Hexp) ranged from 0.671 (TPOX) to 0.866 (FGA). The most polymorphic locus in the studied population was FGA, with a PIC value of 0.841, and the least polymorphic locus was TPOX, with a PIC value of 0.616.
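The Bonferroni step mentioned above amounts to dividing the significance level by the number of tests, so that with 20 loci a locus departs significantly from Hardy-Weinberg equilibrium only if its p value falls below 0.05/20 = 0.0025. A minimal Python sketch follows; the function name `bonferroni_flags` and the p values are hypothetical and are not results of this study.

```python
def bonferroni_flags(p_values, alpha=0.05):
    """Return the corrected threshold and, per locus, whether the
    Hardy-Weinberg test remains significant after correction."""
    threshold = alpha / len(p_values)
    flags = {locus: p < threshold for locus, p in p_values.items()}
    return threshold, flags


# Made-up p values for three of twenty loci; the rest are set to 0.5.
p_values = {f"locus{i}": 0.5 for i in range(1, 18)}
p_values.update({"TH01": 0.04, "TPOX": 0.01, "FGA": 0.30})
threshold, flags = bonferroni_flags(p_values)
print(threshold, [locus for locus, sig in flags.items() if sig])
```

In this made-up example, p values of 0.04 and 0.01 would be significant without correction but do not pass the corrected threshold.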
The other forensic parameters, such as power of discrimination (PD), power of exclusion (PE), paternity index (PI), and matching probability (PM), were calculated with the PowerStats v1.2 spreadsheet program (Tereba 1999). The power of discrimination among the studied loci ranged from 0.826 to 0.947, so the loci can be considered highly discriminating for forensic and population genetics studies. The combined probability of match (CPM) and combined paternity index (CPI) for the studied loci are 2.5 × 10⁻²² and 2.42 × 10⁸, respectively. The combined probability of exclusion (CPE) and the combined power of discrimination (CPD) are 0.999999996 and 1, respectively. The locus-wise distribution of the most common allele (MCA) and least common allele (LCA) in the Rabari Tribe is shown in Table 6. The genetic diversity value was highest (0.862) at locus PENTA-E and lowest (0.670) at locus TPOX (Fig. 1). Fstat was also used to compute the Fis values (correlation of genes within individuals within the population) for each locus. The Fis value can help determine the level of inbreeding in one population compared with another. The associated p value should be assessed at the 95% confidence level, which makes the data more robust and informative. For example, if the Fis value of a population is observed to be 0.25 and two individuals from that population were mated, then the resulting offspring would be inbred. Their inbreeding coefficient would be ½ × 0.5 = 0.25. The highest Fis value (0.203) was found at locus PENTA-D and the lowest (−0.003) at D1S1656 and D16S539.
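The combined parameters follow from the per-locus values under the assumption that the loci are independent. The Python sketch below uses the usual product rules (CPM as the product of the per-locus PM values, CPD = 1 − CPM, CPE = 1 − the product of (1 − PE), and CPI as the product of the per-locus PI values). The function name `combined_parameters` and the input values are hypothetical, chosen only to show the arithmetic; they are not the values obtained in this study.

```python
from math import prod


def combined_parameters(pm, pe, pi):
    """pm, pe, pi: per-locus matching probability, power of exclusion,
    and paternity index. Loci are assumed to be independent."""
    cpm = prod(pm)
    return {
        "CPM": cpm,
        "CPD": 1 - cpm,
        "CPE": 1 - prod(1 - x for x in pe),
        "CPI": prod(pi),
    }


# Made-up per-locus values for two loci.
print(combined_parameters(pm=[0.1, 0.2], pe=[0.5, 0.5], pi=[2.0, 3.0]))

# A CPM near 1e-22 is far below double-precision resolution around 1,
# so 1 - CPM evaluates to exactly 1.0 in floating-point arithmetic.
print(1 - 2.5e-22 == 1.0)
```

This is one reason a CPD can be reported as 1 even though it is, strictly, slightly less than 1.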

Discussion
With the aim of estimating the genetic relatedness among the populations included in this study, their intrinsic genetic distance was also calculated. The neighbor joining (NJ) dendrogram was derived based on Nei's genetic distance (DA) through the POPTREE2 software (Takezaki et al. 2010). The robustness of the phylogenetic relationship established by the NJ dendrogram was estimated using bootstrap analysis with 1000 replications. The test was applied to compare the allelic frequencies of the presently studied population (Gujarat) with eight previously studied populations and their published data sets: Balmiki (Punjab) (Ghosh et al. 2011), Konkanastha Brahmin (Maharashtra) (Ghosh et al. 2011), Iyengar (Tamilnadu) (Ghosh et al. 2011), Gond (Madhya Pradesh) (Ghosh et al. 2011), Riang (Tripura) (Ghosh et al. 2011), Munda (Jharkhand) (Ghosh et al. 2011), Nepal (Kraaijenbrink et al. 2007), and Serbia (Takić Miladinov et al. 2020) (Table 7). As depicted in Fig. 2, the NJ dendrogram revealed the clustering of the studied populations into three groups. The Iyengar (Tamilnadu) and Konkanastha Brahmin (Maharashtra) formed one group. Populations from the Riang (Tripura) and Nepal formed the second group. Populations from the Rabari Tribe (Gujarat), Munda (Jharkhand), and Gond (Madhya Pradesh) formed the third group. In accordance with the observations recorded through the NJ dendrogram, close genetic affinity could be seen between the studied population (Gujarat) and the populations of Munda (Jharkhand) and Gond (Madhya Pradesh).
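The distance underlying the dendrogram can be illustrated with a short Python sketch of Nei's DA, which for each locus sums the square roots of the products of matching allele frequencies in the two populations, averages over loci, and subtracts the result from 1. This is a sketch of the distance formula only, not of the POPTREE2 implementation or of the neighbor joining and bootstrap steps; the function name `nei_da` and the allele frequencies are hypothetical.

```python
from math import sqrt


def nei_da(freqs_x, freqs_y):
    """freqs_x, freqs_y: {locus: {allele: frequency}} for two populations.
    Returns Nei's DA distance over the loci present in both populations."""
    loci = freqs_x.keys() & freqs_y.keys()
    if not loci:
        raise ValueError("the two populations share no loci")
    total = 0.0
    for locus in loci:
        fx, fy = freqs_x[locus], freqs_y[locus]
        # Alleles absent from either population contribute zero.
        total += sum(sqrt(fx[a] * fy[a]) for a in fx.keys() & fy.keys())
    return 1 - total / len(loci)


# Made-up allele frequencies at a single locus for two populations.
pop_a = {"TH01": {6: 0.5, 9: 0.5}}
pop_b = {"TH01": {6: 0.25, 9: 0.25, 7: 0.5}}
print(nei_da(pop_a, pop_a), nei_da(pop_a, pop_b))
```

The distance is 0 for populations with identical allele frequencies and 1 for populations that share no alleles, so smaller values correspond to closer clustering in the dendrogram.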
The present study on the Rabari population is the first report of data on polymorphism at the FGA autosomal STR locus in this population. A detailed analysis of the polymorphism of the 20 autosomal markers observed in this study clearly establishes the efficacy of the FGA marker for forensic casework, paternity testing, population genetics studies, and familial DNA searching in the Rabari population. This finding is consistent with the findings of similar studies on various Indian populations (Dubey et al. 2009; Ghosh et al. 2011; Chaudhari and Dahiya 2014; Ekka et al. 2020; Kakkar et al. 2020), reconfirming that the FGA marker exhibits the highest polymorphism and is thus the most useful marker for studying Indian populations. After testing and evaluating the latest versions of the software, their new features and modifications were identified. Genepop shares some features with Fstat but lacks newer additions such as tests of biased dispersal and simultaneous testing of Fis, Fst, and Fit values, among others. Cervus can be valuable for studies of wildlife populations, but its limited range of functions means that it cannot perform some statistical analyses. Both Genepop and Fstat can estimate F-statistics, whereas Fstat can additionally compute both the Nei and the Weir and Cockerham families of estimators of gene diversities and F-statistics. All tests were carried out using randomization methods, which effectively demonstrated the utility of the Fstat program over the remaining two. It overcomes the limitations of the other two programs, as it offers more features related to F-statistics, reduces the analysis time drastically, and shows the fewest inconsistencies between analyses.
In light of the findings of this study, the authors found Genepop and Fstat to be best suited for forensic applications and strongly advocate using these two over others in similar research on population genetics. Cervus was found to have limited applications in population genetics from a forensic perspective. Its merits and shortfalls have been cataloged for a clear understanding of its features. It was also felt that comparisons between some of these programs are not appropriate owing to fundamental differences in the purposes for which they were devised. For example, although Cervus helps in converting genotype files into Genepop format, it is predominantly meant for parentage analysis in plant and animal populations and thus should not be weighed against other genetic analysis software that has different or additional functionalities. The present research provides a template guide to the analysis of co-dominant data and the selection of appropriate software, and argues in favor of using more than one program to obtain a comparative evaluation of outputs for any parameter included in a study.

Conclusion
The present study established valuable genetic information on 20 autosomal STR loci in the Rabari Tribe of Gujarat population using the PowerPlex 21 kit (Promega, Madison, WI, USA). The calculated forensic parameters showed that the studied 20 STR markers are highly polymorphic and can be applied in forensic testing as well as in demographic and anthropological studies. Differences between populations are observed according to geographic or demographic location, as can be concluded from the genetic distance values. The studied population (Gujarat) is closely related to Munda (Jharkhand) and Gond (Madhya Pradesh) but distant from geographically distant populations such as that of Serbia. However, further research on this population with a larger sample size is recommended to confirm these results.

Limitations
The study provided genotype and allele frequency data for autosomal STR markers in the Indian Rabari Tribe for forensic practice; however, all the analyzed samples were from male individuals.