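Here is an efficient tool for PCA analysis: a one-step workflow that goes from a VCF file straight to plots, which makes it very beginner-friendly. I have been using it since Xiaoming first started developing it, back when it was still called MingPCACluster. The paper was finally published this May. Congratulations!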
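VCF2PCACluster is a PCA and clustering tool built for population SNP data in VCF format. It also accepts other input formats such as Genotype, so a single input file is enough to run PCA, grouping, and plotting in one step. It is simple, easy to use, and efficient. Its main functions are:
- SNP site filtering, e.g. tri-allelic sites, MAF
- Five algorithms for computing the kinship matrix
- PCA based on the kinship matrix
- Three clustering algorithms applied to the PCA results
- Visualization of the PCA and clustering results
Main highlight: PCA and clustering plots are generated efficiently in one step. The emphasis is on speed and low memory use: one command takes a single input file through to the PCA results and plots, which is friendly to users.
Address: https://github.com/hewm2008/VCF2PCACluster
In my experience it is indeed much faster than most other software.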
Installation
git clone https://github.com/hewm2008/VCF2PCACluster.git
cd VCF2PCACluster; chmod 755 -R bin/*
./bin/VCF2PCACluster -h ### print help information
Main parameters
# for more Help document please see the manual. Para [-i] is show for [-InVCF], Para [-o] is show for [-OutPut]
Usage: VCF2PCACluster -InVCF in.vcf.gz -OutPut outPrefix [options]
-InVCF <str> Input SNP VCF Format
-InKinship <str> Input SNP K Kinship File Format
-OutPut <str> OutPut File Prefix(Kinship PCA etc)
-KinshipMethod <int> Method of Kinship [1-5],defaut [1]
1:Normalized_IBS(Yang/BaldingNicolsKinship)
2:Centered_IBS(VanRaden)
3:IBSKinshipImpute 4:IBSKinship 5:p_dis
-ClusterMethod <str> Method For Cluster[EM/Kmean/DBSCAN/None] [EM]
-help v1.40 Show more Parameters and help [hewm2008]
InFile:
-InGenotype <str> InPut Genotype File for no VCF file
-InSubSample <str> Only keep samples from subsample List for PCA[ALLsample]
-InSampleGroup <str> InFile of sample Group info,format(sample groupA)
SNP Filtering:
-MAF <float> Min minor allele frequency filter [0.001]
-Miss <float> Max ratio of miss allele filter [0.25]
-Het <float> Max ratio of het allele filter [1.00]
-HWE <float> Exact test of Hardy-Weinberg Equilibrium for SNP Pvalue[0]
-Fchr <str> Filter the chrX chr[chrX,chrY,X,Y]
-KeepRemainVCF keep the VCF after filter
Clustering:
-RandomCenter Random diff-center to Re-Run Cluster for Kmean
-BestKManually <int> manually set the Best K (Num of Cluster) (auto)
-BestKRatio <float> Get the best K Cluster by deta-SSE Ratio[0.15]
-MinPointNum <int> Minimum point number of D-cluster[4]
-Epsilon <float> Epsilon for DBSCAN_Distance/EM_convergence (auto)
-Iterations <int> iterations number for EM clustering[1000]
OutPut:
-PCnum <int> Num of PC eig [10]
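To make the help text above more concrete, here is a minimal sketch of a typical run that combines a sample group file with SNP filtering. It uses only flags listed in the help output. The file names (`in.vcf.gz`, `demo_out`, `sample.group`), the sample and group labels, the tab separator in the group file, and the threshold values are all assumptions for illustration, not recommendations from the software's documentation.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical file names; replace them with your own data.
vcf="in.vcf.gz"
prefix="demo_out"
group_file="sample.group"

# Group file: one sample per line, "sample group".
# The tab separator is an assumption; check the manual for the exact format.
printf 'sampleA\tpop1\nsampleB\tpop1\nsampleC\tpop2\n' > "$group_file"

# Build the command from flags shown in the help text above.
cmd=(./bin/VCF2PCACluster
     -InVCF "$vcf"
     -OutPut "$prefix"
     -InSampleGroup "$group_file"
     -MAF 0.05
     -Miss 0.25
     -KinshipMethod 1
     -ClusterMethod Kmean)

# Print the command so it can be reviewed before running.
printf '%s ' "${cmd[@]}"; echo

# Run only when the binary and the input VCF are actually present.
if [[ -x "${cmd[0]}" && -f "$vcf" ]]; then
    "${cmd[@]}"
fi
```

Keeping the command in a Bash array avoids quoting problems when file names contain spaces, and printing it first gives a simple dry run.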
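The software's Chinese and English documentation is very detailed; see: https://github.com/hewm2008/VCF2PCACluster
One drawback is that we often do not need the clustering results, so clustering would ideally be optional. Otherwise I still have to draw the plots myself, and in that case why not just use Plink?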
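On this point, the help text quoted earlier lists `None` as one of the accepted values for `-ClusterMethod`. The sketch below assumes that this value skips the clustering step. That behavior has not been verified in this post, so check the manual before relying on it. The file names `in.vcf.gz` and `pca_only` are hypothetical.

```shell
#!/usr/bin/env bash

# Assumption: "-ClusterMethod None" turns clustering off (not verified here).
pca_cmd=(./bin/VCF2PCACluster
         -InVCF in.vcf.gz
         -OutPut pca_only
         -ClusterMethod None
         -PCnum 10)

# Show the command for review.
printf '%s ' "${pca_cmd[@]}"; echo

# Run only when the binary and the input VCF are present.
if [[ -x "${pca_cmd[0]}" && -f in.vcf.gz ]]; then
    "${pca_cmd[@]}"
fi
```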
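A few more words
The bioinformatics tools Xiaoming develops are all about being low-key, simple, and practical, for example LDBlockShow, PopLDdecay, RectChr, Reseqtools, and NGenomeSyn. As his former colleague, I have long been a loyal fan of these tools, and I hope he keeps developing useful bioinformatics software. Everyone is welcome to use and cite them.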