Plink bfile manipulation
Plink is one of the most widely used software in GWAS field and its relative file format bfile, a binary file used to store genotype information, may be more popular as supported by other softwares (e.g. GCTA) as input files.
“ped/map” files
In the plink file formats, .ped/map
files are text files used together to store pedigree + genotype information and variant information, respectively. For example, the below commands show the basic information of the test.map
and test.ped
, which is converted from test.bed/bim/fam files in gcta 1.26 software package.
$ head -2 test.map
1 rs4475691 0 836671
1 rs28705211 0.0151734 890368
$ wc -l test.map
1000 test.map
$ head -2 test.ped |cut -f 1-8 -d " "
1 11 0 0 1 -9 C C
2 21 0 0 2 -9 C C
$ wc -l test.ped
3925 test.ped
$ head -1 test.ped|awk '{print NF}'
2006
test.map
has 1000 lines, which represent 1000 variants with chromosome, SNP id, position in morgans and position in BP per line.
test.ped
has 3925 lines, which represent 3925 individuals with pedigree information in the first 6 columns and genotype information for the other columns. As one genotype has two alleles, test.ped
has 6 + 2 * 1000 = 2006 columns.
“bed/bim/fam” files
The .bed/bim/fam
files are also called bfile, which is the binary file of .ped/map
files. The bed file is the test file for genotype, the bim file is for variant information (similiar with map file, added with reference and alternative allele) and the fam file is the text file for pedigree information (first 6 columns of ped file).
$ head -2 test.bim
1 rs4475691 0 836671 T C
1 rs28705211 0.0151734 890368 C G
$ head -2 test.fam
1 11 0 0 1 -9
2 21 0 0 2 -9
The .bed/bim/fam
files are easy to read/write for softwares and can reduce the storage by more than 10-fold (30840+15776286/982003+34840+76286).
$ ls -l test.ped test.map
-rw-r--r-- 1 huanwei.wang Group-Yang 30840 18 Aug 13:42 test.map
-rw-r--r-- 1 huanwei.wang Group-Yang 15776286 18 Aug 13:42 test.ped
$ ls -l test.bed test.bim test.fam
-rw-r----- 1 huanwei.wang Group-Yang 982003 18 Aug 10:44 test.bed
-rw-r----- 1 huanwei.wang Group-Yang 34840 18 Aug 10:44 test.bim
-rw-r----- 1 huanwei.wang Group-Yang 76286 18 Aug 10:44 test.fam
$ echo ""|awk '{print (30840+15776286)/(982003+34840+76286)}'
14.4604
The .bed/bim/fam
and .ped/map
files can be converted to each other by plink software.
# bfile to ped/map
$ plink2 --bfile test --recode --out test
# ped/map to bfile
$ plink2 --file test --make-bed --out test
R package “plink2R”
Sometimes we need to use bfile to do more statsitcial analysis in R, so we need to read the bfile. One strategy is to convert it into text files ped/map
, while we can also use R package plink2R to read directly.
Installation
devtools::install_github("gabraham/plink2R/plink2R")
Usage
> library(plink2R)
Loading required package: Rcpp
Loading required package: RcppEigen
Warning message:
package ‘Rcpp’ was built under R version 3.4.1
> dat <- read_plink("test")
> dim(dat$bed)
[1] 3925 1000
> dim(dat$fam)
[1] 3925 6
> dim(dat$bim)
[1] 1000 6
But the plink2R is relatively slow and takes more memory, so try to manipulate bfile using plink as much as possible.
> object.size(dat)
31908168 bytes
C ad Python library “plinkio”
There is also packages for C and Python lauguage, e.g. plinkio.
from plinkio import plinkfile
plink_file = plinkfile.open( "test" )