Genetic Relationship Matrix (GRM)

GRM, short for genetic relationship matrix, is a important matrix used in genetic analysis of human complex traits.

This blog is my study note for GRM.

Mixed linear model (MLM)

Based on Yang et al., AJHG, 2011

\[\pmb{y}_{n \times 1}=\pmb{X}_{n \times k}\pmb{\beta}_{k \times 1}+\pmb{W}_{n \times m}\pmb{u}_{m \times 1}+\pmb{\epsilon}_{n \times 1}\]

$\pmb{y}$ is a vector of phenotypes; n: the number of individuals;
$\pmb{X}$ is a matrix for variables with fixed effects; k: the number of fixed effect factors;
$\pmb{\beta}$ is a vector of fixed effects;
$\pmb{u}$ is a vector of random effects; $\pmb{u} \sim N(0, \pmb{I}\sigma_u^2)$, where $\pmb{I}$ is a $n \times n$ identity matrix, and m i the number of SNPs;
$\pmb{W}$ is a matrix for standardized genotype matrix, where its element for i-th SNP of j-th individual $w_{ij}=(x_{ij}-2p_i)/\sqrt{2p_i(1-p_i)}$ with $x_{ij}$ being the genotype value coded as the number of copies of the reference alleles {0, 1, 2};

\[Var(\pmb{y})=\pmb{V}_{n \times n}=\pmb{W}\pmb{W}^T\sigma_u^2+\pmb{I}\sigma_{\epsilon}^2\]

If define $\sigma_g^2=m\sigma_u^2$ (the variance explained by all the SNPs), $\pmb{g}=\pmb{W}\pmb{u}$ and $\pmb{A}=\pmb{W}\pmb{W}^T/m$ (genetic relationship matrix (GRM) between individuals),

\[\pmb{y}=\pmb{X}\pmb{\beta}+\pmb{g}+\pmb{\sigma}\text{ and }\pmb{V}=\pmb{A}\sigma_g^2+\pmb{I}\sigma_{\epsilon}^2\]

Practice

Simulate the genotype matrix

I use the R language to simulate a genotype matrix for n individuals and m SNPs

R code

n=5; m=3
set.seed(10)
p = runif(m, min=0.2, max=0.5) ### allele frequency were draw from a uniform distribution
x_A1 = t(replicate(n, rbinom(m, 1, p)))   ### for the plink ped file
x_A2 = t(replicate(n, rbinom(m, 1, p)))
x = x_A1 + x_A2
colnames(x) = paste("SNP",1:m,sep="")
rownames(x) = paste("indi", 1:n, sep="")
x
#### now we save the matrix x, convert to plink bfile, and then calculate GRM using GCTA
x_A1 = ifelse(x_A1==0, "a", "A")
x_A2 = ifelse(x_A2==0, "a", "A")
dir.create("toy")
write.table(cbind(data.frame(FID=rownames(x), IID=rownames(x), 0, 0, 0, 0), cbind(x_A1, x_A2)[,order(c(1:m, 1:m+0.5))]), "toy/toy.ped", quote=F, sep="\t", row.names=F, col.names=F)
write.table(data.frame(CHR=1, SNP=colnames(x), cM=0, BP=10000+1:ncol(x)), "toy/toy.map", quote=F, sep="\t", row.names=F, col.names=F)

Output

##       SNP1 SNP2 SNP3
## indi1    1    1    1
## indi2    0    1    0
## indi3    0    0    1
## indi4    0    1    0
## indi5    0    0    0

After standardization

R code

p_hat = apply(x, 2, sum)/(2*n)
w = apply(rbind(x,p_hat), 2, function(x) (x-2*x[length(x)])/sqrt(2*x[length(x)]*(1-x[length(x)])))[1:n,]
w

Output

##             SNP1       SNP2       SNP3
## indi1  1.8856181  0.6172134  1.0606602
## indi2 -0.4714045  0.6172134 -0.7071068
## indi3 -0.4714045 -0.9258201  1.0606602
## indi4 -0.4714045  0.6172134 -0.7071068
## indi5 -0.4714045 -0.9258201 -0.7071068

Calculate the GRM

The diagonal elements in GRM

R code

A = w %*% t(w) /m
A

Output

##            indi1       indi2      indi3       indi4       indi5
## indi1  1.6871693 -0.41931217 -0.1117725 -0.41931217 -0.73677249
## indi2 -0.4193122  0.36772487 -0.3664021  0.36772487  0.05026455
## indi3 -0.1117725 -0.36640212  0.7347884 -0.36640212  0.10978836
## indi4 -0.4193122  0.36772487 -0.3664021  0.36772487  0.05026455
## indi5 -0.7367725  0.05026455  0.1097884  0.05026455  0.52645503

R code

diag(A)

Output

##     indi1     indi2     indi3     indi4     indi5 
## 1.6871693 0.3677249 0.7347884 0.3677249 0.5264550

The off-diagonal elelments in GRM

R code

A[row(A)<col(A)]

Output

##  [1] -0.41931217 -0.11177249 -0.36640212 -0.41931217  0.36772487 -0.36640212
##  [7] -0.73677249  0.05026455  0.10978836  0.05026455

R code

A[row(A)>col(A)]

Output

##  [1] -0.41931217 -0.11177249 -0.41931217 -0.73677249 -0.36640212  0.36772487
##  [7]  0.05026455 -0.36640212  0.10978836  0.05026455

Run the “make-grm” using GCTA

Save the genotype matrix in Plink bfile formate

bash code

~/Software/PLINK2/plink --ped toy/toy.ped --map toy/toy.map --make-bed --out toy/toy   ### generate PLINK bfiles: .bim, .fam, .bed

log message

## PLINK v1.90b4.4 64-bit (21 May 2017)           www.cog-genomics.org/plink/1.9/
## (C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
## Logging to toy/toy.log.
## Options in effect:
##   --make-bed
##   --map toy/toy.map
##   --out toy/toy
##   --ped toy/toy.ped
## 
## 16384 MB RAM detected; reserving 8192 MB for main workspace.
## Scanning .ped file... 0%20%40%60%80%100%
.ped scan complete (for binary autoconversion).
## Performing single-pass .bed write (3 variants, 5 people).
## 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%
--file: toy/toy-temporary.bed + toy/toy-temporary.bim + toy/toy-temporary.fam
## written.
## 3 variants loaded from .bim file.
## 5 people (0 males, 0 females, 5 ambiguous) loaded from .fam.
## Ambiguous sex IDs written to toy/toy.nosex .
## Using 1 thread (no multithreaded calculations invoked).
## Before main variant filters, 5 founders and 0 nonfounders present.
## Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99% done.
## 3 variants and 5 people pass filters and QC.
## Note: No phenotypes present.
## --make-bed to toy/toy.bed + toy/toy.bim + toy/toy.fam ... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99%done.

bash code

~/Software/gcta_1.93.2beta_mac/gcta64 --bfile toy/toy --make-grm --out toy/toy.grm  ### generate grm using GCTA

log message

## *******************************************************************
## * Genome-wide Complex Trait Analysis (GCTA)
## * version 1.93.2 beta Mac
## * (C) 2010-present, Jian Yang, The University of Queensland
## * Please report bugs to Jian Yang <jian.yang.qt@gmail.com>
## *******************************************************************
## Analysis started at 16:41:12 AEST on Tue Aug 03 2021.
## Hostname: imb20-015312-lt
## 
## Options: 
##  
## --bfile toy/toy 
## --make-grm 
## --out toy/toy.grm 
## 
## Note: GRM is computed using the SNPs on the autosome.
## Reading PLINK FAM file from [toy/toy.fam]...
## 5 individuals to be included from FAM file.
## 5 individuals to be included. 0 males, 0 females, 5 unknown.
## Reading PLINK BIM file from [toy/toy.bim]...
## 3 SNPs to be included from BIM file(s).
## Computing the genetic relationship matrix (GRM) v2 ...
## Subset 1/1, no. subject 1-5
##   5 samples, 3 markers, 15 GRM elements
## IDs for the GRM file has been saved in the file [toy/toy.grm.grm.id]
## Computing GRM...
##   100% finished in 0.0 sec
## 3 SNPs have been processed.
##   Used 3 valid SNPs.
## The GRM computation is completed.
## Saving GRM...
## GRM has been saved in the file [toy/toy.grm.grm.bin]
## Number of SNPs in each pair of individuals has been saved in the file [toy/toy.grm.grm.N.bin]
## 
## Analysis finished at 16:41:12 AEST on Tue Aug 03 2021
## Overall computational time: 0.00 sec.

Read GCTA-GRM output in R

R code

### ReadGRMBin function copied from GCTA website
ReadGRMBin=function(prefix, AllN=F, size=4){
  sum_i=function(i){
    return(sum(1:i))
  }
  BinFileName=paste(prefix,".grm.bin",sep="")
  NFileName=paste(prefix,".grm.N.bin",sep="")
  IDFileName=paste(prefix,".grm.id",sep="")
  id = read.table(IDFileName)
  n=dim(id)[1]
  BinFile=file(BinFileName, "rb");
  grm=readBin(BinFile, n=n*(n+1)/2, what=numeric(0), size=size)
  NFile=file(NFileName, "rb");
  if(AllN==T){
    N=readBin(NFile, n=n*(n+1)/2, what=numeric(0), size=size)
  }
  else N=readBin(NFile, n=1, what=numeric(0), size=size)
  i=sapply(1:n, sum_i)
  return(list(diag=grm[i], off=grm[-i], id=id, N=N))
}
data = ReadGRMBin("toy/toy.grm")
data

Output

## $diag
## [1] 1.6871693 0.3677249 0.7347884 0.3677249 0.5264550
## 
## $off
##  [1] -0.41931218 -0.11177249 -0.36640212 -0.41931218  0.36772487 -0.36640212
##  [7] -0.73677248  0.05026455  0.10978836  0.05026455
## 
## $id
##      V1    V2
## 1 indi1 indi1
## 2 indi2 indi2
## 3 indi3 indi3
## 4 indi4 indi4
## 5 indi5 indi5
## 
## $N
## [1] 3