GENOTYPE DATA QUALITY
The Genotyping Group makes recommendations on how to deal with various genotyping quality issues. Several measures are being taken to ensure that the GAIN data are of high quality.
1. Initial HapMap Samples: Both centers (Perlegen Sciences Inc. and the Broad Institute) genotyped the 270 HapMap samples using their own platform and SNP set. Both datasets are publicly available. These genotype data provide information on the platforms and allow the data produced for each disease study to be integrated with related studies that use other platforms.
Click Here for the Perlegen HapMap Genotype QC data
Click Here for the Broad HapMap Genotype QC data
2. QA Samples: For each study, QA samples will be genotyped
in addition to the study samples. These samples
will include standard HapMap samples already
genotyped for 4 million SNPs, duplicate study
samples, and (when available) the parents of
some study samples. These samples will provide
information on data quality for each study,
including call (completeness) rates for
heterozygous and homozygous genotypes at each
SNP, and confirmation of Mendelian inheritance
of variants.
For studies with all or a substantial number of
mother-father-child trios, each plate of 96
samples will include 1 duplicate of a study
sample (duplicates of different study samples
for different plates) and one standard HapMap
sample (for the studies genotyped by the Broad
center), or half the plates will have a
duplicate and half a HapMap sample (for the
studies genotyped by the Perlegen center). The
HapMap sample will be chosen from a standard set
of HapMap trios, and may differ among plates.
For studies without a substantial number of trios, each plate
will include two parents of a study sample to
form a trio (with a different trio on each
plate), as well as 1 duplicate of a study sample
and 1 standard HapMap sample (Broad) or 1 study
duplicate on half the plates and 1 standard
HapMap sample on half the plates (Perlegen).
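The duplicate and trio QA samples described above support two simple checks: concordance between duplicate genotypings and Mendelian consistency within a trio. The following is a minimal sketch of both, not the project's actual pipeline. It assumes genotypes are represented as unordered two-letter strings such as "AG", with None for a missing call; the function names are hypothetical.

```python
from itertools import product


def is_mendelian_consistent(father, mother, child):
    """Return True if the child's genotype can be formed from one allele
    of each parent, False if not, and None if any call is missing.

    Assumed representation: two-letter strings such as "AG"; None = no call.
    """
    if father is None or mother is None or child is None:
        return None  # cannot be assessed without all three calls
    # Every genotype obtainable from one paternal and one maternal allele.
    possible = {"".join(sorted(a + b)) for a, b in product(father, mother)}
    return "".join(sorted(child)) in possible


def duplicate_concordance(calls_a, calls_b):
    """Fraction of SNPs with identical genotypes in two runs of one sample.

    Only SNPs called in both runs are compared; returns None if there are none.
    """
    pairs = [(a, b) for a, b in zip(calls_a, calls_b)
             if a is not None and b is not None]
    if not pairs:
        return None
    agree = sum(sorted(a) == sorted(b) for a, b in pairs)
    return agree / len(pairs)
```

For example, a child called "AG" with parents "AA" and "AA" would be reported as inconsistent. As item 4 notes, such a call may reflect a genotyping error or a real phenomenon such as a polymorphic deletion.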
3. QC for Genotyping: The samples will be genotyped in a way
that maximizes data quality for interpretation
of the association results. For example, case
and control samples will be on the same plates
and processed at the same time. Plate layouts will
differ to catch any sample mix-ups.
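One way to place cases and controls on the same plates is to shuffle each group and deal the samples across plates in turn. The sketch below illustrates that idea only; it is not the procedure used by either genotyping center. The function name, the number of wells reserved for QA samples (qa_wells) and the fixed random seed are all assumptions made for the example.

```python
import random


def assign_plates(cases, controls, wells_per_plate=96, qa_wells=2, seed=0):
    """Distribute case and control sample IDs across plates.

    qa_wells is an assumed number of wells per plate reserved for QA samples
    (duplicates, HapMap samples, parents); adjust it to the study design.
    Returns a list of plates, each a list of sample IDs in well order.
    """
    rng = random.Random(seed)
    study_wells = wells_per_plate - qa_wells
    cases, controls = list(cases), list(controls)
    total = len(cases) + len(controls)
    n_plates = -(-total // study_wells)  # ceiling division
    plates = [[] for _ in range(n_plates)]
    rng.shuffle(cases)
    rng.shuffle(controls)
    # Dealing round-robin gives every plate a near-equal share of both groups.
    for i, sample in enumerate(cases + controls):
        plates[i % n_plates].append(sample)
    # Shuffle within each plate so that the layout differs between plates.
    for plate in plates:
        rng.shuffle(plate)
    return plates
```

With 100 cases and 100 controls and the defaults above, this produces three plates, each holding a mix of cases and controls in a different well order.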
4. Data Released: For each study, the data released will
include the genotype calls with quality
measures, genotype cluster data, and the CEL
files. All the genotype data produced will be
released. Data that are considered bad will be
flagged, but will still be available. These bad
data are useful to calibrate platforms and
calling algorithms, and to search for real
phenomena such as Hardy-Weinberg deviations or
polymorphic insertions or deletions causing
Mendelian inconsistencies.
5. Data QA Pipeline: When a genotyping vendor has genotyped
the samples, it will send the data to NCBI to be
put through a data quality assessment pipeline,
which will provide information on the genotyping
data completeness and quality. This pipeline is
being developed by Gonçalo Abecasis and
implemented at NCBI. Any issues that arise
after the data are run through the pipeline will
be resolved between the study principal
investigator, the genotyping vendor, and NCBI.
Once any issues have been resolved, the genotype
and phenotype data will be released.
6. Genotype Data Quality Standards: Prior to genotyping a study set of samples, each genotyping
vendor will perform a quality check on the DNA samples to ensure
they are suitable for genotyping. If any DNA
samples fail to meet the requirements for
quantity, concentration, and quality at this
stage, sample replacement will be worked out
between the study principal investigator and the
genotyping vendor. When the production
genotyping has been done, bad samples will be
re-genotyped once.
The genotype data should meet, and ideally exceed,
these quality standards:
Remove samples with fewer than 80% of the SNPs called.
Of the > 480k SNPs for Perlegen and 500k SNPs for Broad, at least
90% of the SNPs will be good:
- any SNPs out of Hardy-Weinberg (HW) equilibrium will not count as good (where “out of HW” means a deviation more significant than p = 0.001 for 2000 samples, but the p value will be adjusted for larger sample sizes that can produce statistically significant but not meaningful HW deviations);
- the call rate minimum per SNP = 90% and the average across SNPs > 97%;
- for HapMap QA samples the average call rates for heterozygotes and homozygotes are both > 97%; and
- the concordance rate in duplicates is > 99.5%.
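The per-SNP criteria above can be checked from genotype counts. The sketch below is illustrative and is not the NCBI pipeline. The source does not say which HW test is used, so a one-degree-of-freedom chi-square test is assumed here (an exact test is a common alternative). The function names and the layout of the counts are hypothetical, and the sample-size adjustment of the p-value threshold is left to the caller through hwe_alpha.

```python
import math


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """P value of a 1-df chi-square test for Hardy-Weinberg equilibrium.

    Assumption: the source does not name the test; chi-square is used here.
    Returns None when no genotypes were called.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        return None
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        return 1.0  # monomorphic SNP: no deviation can be tested
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 df.
    return math.erfc(math.sqrt(chi2 / 2))


def snp_is_good(n_aa, n_ab, n_bb, n_missing,
                hwe_alpha=0.001, min_call_rate=0.90):
    """Apply the per-SNP call-rate and HW criteria to one SNP."""
    n_called = n_aa + n_ab + n_bb
    total = n_called + n_missing
    if total == 0 or n_called / total < min_call_rate:
        return False
    return hwe_chi2_pvalue(n_aa, n_ab, n_bb) >= hwe_alpha


def summarize(snps):
    """Summarize a list of (n_aa, n_ab, n_bb, n_missing) tuples."""
    rates = [(a + b + c) / (a + b + c + m)
             for a, b, c, m in snps if a + b + c + m > 0]
    good = sum(snp_is_good(*s) for s in snps)
    return {
        "fraction_good": good / len(snps) if snps else None,
        "mean_call_rate": sum(rates) / len(rates) if rates else None,
    }
```

A study would then be compared against the stated targets: fraction_good of at least 0.90 and mean_call_rate above 0.97. The duplicate concordance and HapMap call-rate criteria need sample-level data and are not covered by this sketch.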