日期:2023-01-09 10:59:54
来源:互联网
点击:19
以下为您整理了grf only task effects on the statistical map的答案
我搞了个godex ez124-u plus条码打印机,,用来打印标签的,我不会用的,但是让我无语的是连说明书都没有,驱动盘也没有,我在网上找了驱动也安装完成了,不过找不到条码打印软件,找到的都是收费的, 速印标签软件我也用了,不过怎么只能打印三张,而且前面总要空一个,我连着打印急需答案驱动文件和条码打印软件请向销售商索取,这个是免费的,希望可以帮到你,卫视中国有你更精彩
如何在ARCGIS中把降水由点插值到面上谢谢您的帮助!用3D analyst 或Geostatistical Analyst
都可以做到。可以看一下相关的入门文件。
1、如果用3D analyst,相当于打降水点作为高度点输入,再转化为三维的地形面。进入help 找到extentions,其中的3D analyst ,看一下Create surfaces.
2、如果用Geostatistical Analyst,进入help 找到extentions,Geostatistical Analyst.
另外这两个功能,ArcGIS Desktop 都有很详细的入门教程文件。如果你安装的是ArcGIS 9.0系统在C:Program FilesArcGISDocumentation文件夹下可以找到
3D_Analyst_Tutorial.pdf
Geostatistical_Analyst_Tutorial.pdf
这两个文件,按照里面的步骤先把练习题做一遍就能基本掌握了。
曾经在网上找到过一个译成中文的地统计分析教程,图贴不上,内容参考一下吧。
ArcGIS 地统计上机指南
本指南根据ESRI的地统计分析TUTOR编写,共六个练习
练习1:利用缺省参数创建一个臭氧浓度分布图
练习2:数据检查
练习3:创建一个臭氧浓度分布图
练习4:模型对比
练习5:污染超标预测--创建臭氧浓度超出某一临界值的概率图
练习6:整饰,生成最终成果图
练习数据:
所在路径D:arcgisArcTutorGeostatistics
数据集 描述 备注
Ca_outline 美国加州轮廓图
Ca_ozone_pts 臭氧采样点数据(单位:ppm)
Ca_cities 主要城市位置图 参照解释污染分布现象
Ca_hillshade 山体阴影图 整饰用图
ca_NO2_pts NO2采样点数据(单位:ppm) 自由练习用
ca_ PM10_pts PM10采样点数据(单位:ppm) 自由练习用
练习1:利用缺省参数创建一个臭氧浓度分布图
目的:熟悉地统计工作一般流程
1.1 地统计扩展模块简介
ArcGIS地统计分析模块在地统计学与GIS之间架起了一座桥梁。使得复杂的地统计方法可以在软件中轻易实现。体现了以人为本、可视化发展的趋势。
地统计学的功能在地统计分析模块的都能实现,包括:
(1)ESDA:探索性空间数据分析,即数据检查;
(2)表面预测(模拟)和误差建模;
(3)模型检验与对比。
地统计学起源于克里格。当时他用此法预测矿产分布,后来经过别人改进修改发展成为现在所用的克里格方法。虽然空间数据分析还有其他方法,如IDW(反距离加权插值法)等,但克里格方法是最主要、最常用的空间分析方法,下面也以此法为主进行。
1.2表面预测主要过程
ArcGIS地统计扩展模块的菜单非常简单,如下所示,但由此却可以完成完整的空间数据分析过程。
一个完整的空间数据分析过程,或者说表面预测模型,一般为。拿到数据,首先要检查数据,发现数据的特点,比如是否为正态分布、有没有趋势效应、各向异性等等(此功能主要由Explore Data菜单及其下级菜单完成);然后选择合适的模型进行表面预测,这其中包括半变异模型的选择和预测模型的选择;最后检验模型是否合理或几种模型进行对比;(后两种功能主要由Geostatistical Wizard…菜单完成)。Create Subsets…菜单的作用是为把采样点数据分成两部分,一部分作为训练样本,一部分作为检验样本。
下面将按上述表面预测过程进行叙述。
(注:[1]文章示例中所使用的数据为ArcGIS扩展模块中所带的学习数据(某地测得的臭氧含量样本),整个过程均使用此数据;[2]文章以操作方法介绍为主,所涉及到的地统计方法和基本理论一般未进行解释,可查阅相关地统计理论资料;操作中所用到的某些参数为地统计中的标准名称的也未进行解释。)
我们下面的任务是根据测量所得到的某地臭氧浓度数据进行全区的臭氧浓度预测。首先检查数据的特点,然后根据数据特点用不同参数进行表面模型预测,随后比较不同模型的精确程序,选择最佳模型,最后制作成果图。
我们下面的任务是根据测量所得到的某地臭氧浓度数据进行全区的臭氧浓度预测。首先检查数据的特点,然后根据数据特点用不同参数进行表面模型预测,随后比较不同模型的精确程序,选择最佳模型,最后制作成果图。
1.3 操作步骤
(1) 建立一个新空白地图
(2) 用add layer功能增加Ca_outline和Ca_ozone_pts两个图层
(3) 保存地图,命名为Ozone Prediction Map.mxd
(4) 在extention中将geostatices打钩
(5) 在工具栏中增加Geostatistical Analyst工具条
(6) Geostatistical Wizard
(7) 在Attribute中选择OZONE, 单击next
(8) 在Methods中选择Driging, 单击next
(9) 利用缺省参数,一路按next,最终按finish得到臭氧浓度分布图。
注意:观察过程中出现的对话框和图表并思考其作用。
练习2:数据检查
目的: 检测数据分布;发现数据可能存在的趋势;找出数据间的空间自相关及方向效应。
数据检查,即空间数据探索分析(ESDA)
此功能主要通过Explore Data菜单中实现。
扩展模块提供了多种分析工具,这些工具主要是通过生成各种视图,进行交互性分析。如直方图、QQ plot图、半变异函数/协方差图等。
• (1)直方图显示数据的概率分布特征以及概括性的统计指标。
Geostatistical Analyst-Explore Data-Histogram
下图中所展示的数据,中值接近均值、峰值指数接近3。从图中观察可认为近似于正态分布。克里格方法对正态数据的预测精度最高,而且有些空间分析方法特别要求数据为正态分布。
• (2)正态QQ Plot图: Geostatistical Analyst-Explore Data-Normal QQplot
检查数据的正态分布情况。作图原理是用分位图思想。利用QQ图可以将现有数据的分布与标准正态分布对比,直线表示正态分布,如果数据点越接近直线,则它们越接近于服从正态分布。从图中可以看出数据很接近正态分布(左上角几个偏离的点被选中)。
• (3)趋势分析图。Geostatistical Analyst-Explore Data-Trend Analysis
只有在你的数据中存在某种均势时,你才可能利用某些数学公式对表面的非随机(确定性)成分进行表达。Trend Analysis工具使你能够找出在输入数据集中是否存在趋势。
趋势分析图中的每一根竖棒代表 了一个数据点的值(高度和位置)。这些点被 投影到一个东西向和一个南北向的正交平面上。通过投影点可以作出一条最佳拟合线(一个多项式),并用它来模拟特定方向上存在的趋势。如果该线是平直的,则表明没有趋势存在。
蓝线表示南北方向,呈水平,可见南北方向无趋势。绿线表示东西方向,呈倒U形,可用二阶曲线拟合,在后面进行表面预测时将会去除。
点击Rotete右边的方向旋转箭头(横向箭头),可旋转趋势图,更明显地显示某一个方向的趋势。
(4)Voronoi图
用来发现离群值。Voronoi图的生成方法:每个多边形内有一个样点,多变形内任一点到该点的距离都小于其他多边形到该点的距离,生成多边形后。某个样点的相邻样点便会与该样点的多边形有相邻边。至于多边形值的计算有多种方法,可以用生成多边形的样点值作为多边形的值(Simple方法),也可以以相邻样点的平均值为多边形的值(Mean方法),具体计算方法可以在Type下拉菜单中选择。
(5)理解数据的空间自相关和方向效应
Geostatistical Analyst-Explore Data-Semivariogram/Covariance Cloud
半变异函数/协方差函数。
该图可以反映数据的空间相关程度,只有数据空间相关,才有必要进行空间插值法。图表的横坐标表示任两点的空间距离,纵标表示该两点的半变异函数值。空间自相关理论认为彼此之间距离越近的事物越相象,因而x值越小,y值应该越小。
在半变异函数/协方差函数支图中,每一个红点表示一对采样点.随着样点对间距离的增加,半变异函数值也相应增加.然而,当到达一定的距离后,云图变平,这表明,超出这个距离后,样点间不再具有相关关系。
单击并拖动光标使选择的点高亮显示,然后回到地图窗口可看到这些样点对间通过直线相连,用以表示是一对采样点。
观察方向效应:选中 Show Search Direction复选框,单击并将方向指针移动到任意角度。指针指向的方向决定了哪些样点会出现在半变异函数图中。例如,如果批的是东西方向,那么只有那些处于彼此东西方向的点对才会在半变异函数图中显示,就使你能够去除你不感兴趣的那些点对从而来检查你数据中的方向效应。
检查后请单击Selection 菜单,然后单击Clear Selected Features 以释放地图中高亮显示的采样点。
如果任意两点的值都要计算,当采样点很多时,数据量便很大,因而根据距离和方向对样点距离进行了分组。下列参数便是为此要求而设置:Lag,步长值;Number of,步长组数。步长值和步长组数之乘积应小于采样点区域的坐标范围的一半。如下图。
最后的两个图表是针对两个数据集而言的。
(6)普通Qqplot分布图
评估两个数据集分布的相似程度。利用两个数据集中具有相同累积分布值的数据值来作图。
(7)正交协方差函数云。
横坐标:两点间的距离;
纵坐标:两点间的距离所对应的样点对的理论正交协方差。
这些图彼此相关联,并与ArcMap中的图层相关联。即,在某个分析图中选择某些样点,在ArcMap图层及其他分析图中同样会选中这些点。如下图。
练习3:创建一个臭氧浓度分布图
目的:在数据检查的基础上进行表面预测对练习一中生成的臭氧浓度图进行改进,学习基本地统计思想。
3.1 制作表面预测图:
通过上面的数据检查,发现数据接近正态分布、有空间相关、无离群值、东西方向有倒U形趋势。决定使用普通克里格方法进行表面预测。下面的步骤是针对此数据进行的。
将使用地统计模块的第二个菜单Geostatistical Analyst……。
第一步:选择输入数据和方法面板(Choose Input Data and Method)
选择使用的数据及其属性:分别在Input和Attribute中选择
选择预测方法:在Methods中选择。预测方法的选择要根据数据分析的结果而定。现在假如选择Kriging方法(其实所谓地统计方法,最主要并且用的最多的就是Kriging方法的几种变化形式)。
Validate是个可选项,选择使用何种方法对生成的预测图进行检验,如果想用检验方法,则选中此项并设置检验数据集和属性;如果对结果进行交叉检验,则不要选择此项。
第二步:地统计方法选择面板(Geostatistical Method Selection)
选择Ordinary Kriging中的Prediction Map,即使用普通克里格方法生成一个表面预测图。普通克里格方法是最常用的地统计分析方法。其他几种依次为简单克里格、泛克里格、指示克里格、概率克里格、析取克里格。这几种克里格的区别是由于克里格的形式及其数据特点的不同。
Transmition选项:对数据集进行转换,由于某些方法要求数据正态分布,因此如果数据与正态分布差距很大,可以在此选择一种方法对数据进行转换。
Order of trend:如果数据在某方向上存在趋势,则为了提高预测的准确性,一般要剔除趋势。在此处选择趋势方程的阶数:线性、一阶、或无趋势等。数据的趋势有无以及阶数在数据检查时得到,即用Explore Data菜单下的Trend analysis来分析得到。
第三步:趋势剔除面板(Detrending)
在练习2中我们已经发现数据中存在一个全局趋势,为南东-北西方向,并且可以用一个二阶多项式进行拟合,该趋势可以从数据中剔除,并可以用一个数学公式表达。一旦剔除全局趋势后,就可以对表面残差或表面的短程变异成分进行统计分析。在创建最终预测图之前,该还将自动添加回来以产生更有意义的结果。全局趋势剔除后所进行的分析将不再受其影响。而一旦将全局趋势添加进来,就能够生成一个更加精确的的预测图。
此面板只有在第二步中选择了Order of trend选项时才会出现。
在Geostatistical Method Selection对话框中,单击Order of Trend Removal 下拉箭头,选择Second.
Click Next on the Geostatistical Method seleciton dialog box. 在Geostatistical Method Selection 对话框中单击next按钮。
上图中的椭圆表示的是数据集的全局地拉南西-北东向变化最快段北西-变化则较平缓。
单击Detreding 对话框中的Next
第四步:半变异函数/协方差模型面板(Semivariogram/covariance Modeling)
剔除趋势后,半变异函数就可以模拟数据点间的空间自相关而不用考虑数据中存在的趋势。该趋势将在生成最终表面之前用于计算。
此步的主要功能为半变异函数建模,是预测过程中的实质性阶段。在此面板中需要设定许多与拟合半变异函数相关的选项以及半变异函数的参数。是克里格预测中十分关键的部分。
Semivariogram/covariance部分显示的是拟和的模型,黄线即半变异函数曲线。
Models部分:model1,model2,model3表示可以用多个通用函数来拟和半变异函数模型。如果数据为各向异性,则需要选中Anisotropy(其实大多数空间数据是各向异性的,各向同性只是相对的),当选中此选项时,黄线变为多条,表示多个方向的拟合函数。
Show Search Direction选项选中后,表示只搜索某个方向的半变异函数。
Nugget:块金值,函数参数之一,即函数与y轴相交的y值。
Error Modeling:如果数据中有测量误差(比如仪器原因等)的话,则选中此项,预测表面将光滑许多。
上图中Lag Size表示步长值,Number of为步长的组数,可采用不同的步长及组数来重新拟合缺省球面模型。减小步长意味着你可以有效地放大并模拟相邻采样点间避部变民的细节。
第五步:搜索区域面板(Searching Neighbourhood)
定义一个圆(或椭圆)利用其内的点来预测那些未知点的值。
此外为避免某一特定方向上的偏差,可以把这个圆(或椭圆)分成若干个小扇区,在各扇区内先取相同数目的点,你可在上图对话框中指定点的数目,半径或长/短轴,以及用来预测的圆或椭圆中的扇区数。
数据视图窗口中高亮显示的点表明了在预测未知点时,各相关点的权重(%),某点的权重越大,其对示知值的预测影响也越大。
此面板的主要功能是设定预测某点数值时如何搜索邻近的已测量点。主要有样点数(neighbours to)和搜索形状(shape)两个选项。
Neighbours to:最大搜索数目,离预测点太远的样点对预测无意义。
Include at least:最小样点数目。
Shape:设置如何搜索样点,有图解。
第六步:交叉验证面板(Cross Validation)
在此面板中查看预测的精度,有四个图表,现以最左边的预测图表进行说明。
图表的横坐标为测量制值,纵坐标为预测值,最理想的情况是数据呈1:1线,即图中的破折线。
左下方的预测误差(precited error)项是预测误差的一些统计值,可很好的体现预测的好坏。其中,Mean:0.0005718(预测误差的均值);Root-Mean-Square:0.01154(预测误差的均方根);Average Standard Error:0.01456(平均预测标准差)、Mean Standardized:0.02688(平均标准差);Root-Mean-Square Standardized:0.8463(标准均方根预测误差)。其中前四项越小越好,最后一项越接近1越好。
右下方的项含有每个点的误差、标准差等数据,
第七步:数据图层信息面板(Output Layer Information)
该面板中显示了在数据预测过程中设置的参数,可以查看。
点击OK,即可生成预测图。
实际操作中经常要重复以上过程,并结合经验以期获得尽量精确并能作出合理解释的预测图。
练习4:模型对比
目的:结合交叉验证统计表,判断不同模型与参数下哪组预测结果更为准确。
一般情况下,有时候某些参数难以判断,因而会生成几个预测表面,然后比较不同表面的精度,选择精度最高的作为结果。(Ordinary Kriging表面是用上述过程中的方法生成的预测表面,default是用缺省的参数得到的预测表面)
右键点击Ordinary Kriging并选择Compare…,即会出现下面的检验面板。To后面即为要对比的预测表面。通过下面的预测参数,很容易便可看出,Ordinary Kriging的精度明显高于Default。
练习5:污染超标预测--创建臭氧浓度超出某一临界值的概率图
本练习使用指示克里格法。
(1) 单击Geostatistical Analyst - Geostatistical Wizard.
(2) 选择Ca_ozone_pts
(3) 在Attribute中选择ozone
(4) 在Method 中 选择Kriging
(5) 在Choose Input Data and Method中单击next
(6) 选择Indicator Kring
(7) Primary Threshod Value 设为0.12ppm
(8) 点击选中Exceed radial
(9) 在Geostatistical Method Selection对话框中点next
(10) 在Additional Cutoffs Selection 中点next
(11) 点Anisotropy说明数据的方向性
(12) 设定步长为25000,组数为10
(13) 在Cross Validation 点Finish
(14) Output layer information 中 点击ok
(15) 该图显示的是指示预测值是超出0.12ppm臭氧浓度概率。
练习6:整饰,生成最终成果图
缺省情况下,生成的预测图按照采样数据的坐标范围显示成一个矩形。(如前面所示)现在要把它的范围显示到州界的范围。思路为先把预测表面外推,覆盖整个州界,然后再用州界进行限定,把表面限制在州界的范围。
第一步:外推。
在ArcMap目录表中右键单击预测表面名,
在快捷菜单中选择Properties,
在Layer Properties面板中点击Extent页;
在Set the extent to下拉菜单中选择a custom extent entered below,
然后在下面的Visible Extent项中设置坐标范围。(此图中分别设置为左:-240000,右:-1600000,上:860000,下-400000)。
设置后结果如图。
点击确定,得到结果:
第二步:范围限制。
在ArcMap目录表中右键单击Layers,
选择Properties,
点击Data Frame,
在Clip to Shape项中选中Enable前的复选框,然后点击Specify shape…,在Data Frame Cliping面板中指定限制图形为ca_outline,点击OK,点击确定。
结果如下图。
后面可继续进行比例尺、图例等的设置,在此不一一赘述。(完)
借助NMF的力量对单细胞RNA和单细胞ATAC进行联合分析有偿求助!联合分析 scRNA-seq 和 scATAC-seq 的流程类似于整合多个 scRNA-seq 数据集的流程,都依赖于联合矩阵分解和分位数归一化。 主要区别在于: (1)scATAC-seq数据需要处理成基因级别的值; (2) 仅对 scRNA-seq 数据进行基因选择; (3) 下游分析可以使用基因水平和基因间信息 。
为了联合分析 scRNA 和 scATAC-seq 数据, 首先需要将 scATAC-seq 数据(一种全基因组表观基因组测量)转换为与来自 scRNA-seq 的基因表达数据相当的基因水平计数 。大多数以前的单细胞研究都使用了一种受传统bulk ATAC-seq 分析启发的方法:识别染色质可及峰,然后将与每个基因重叠的所有峰相加。这种策略也很有吸引力,因为 10X CellRanger pipeline是一种常用的商业软件包,可以自动输出这样的峰值计数。然而,分析发现这种策略不太理想,因为: (1)使用所有细胞执行峰值调用,这会偏向稀有细胞群; (2) 基因可及性通常比特定调控元件更分散,因此可能会被峰值调用算法遗漏; (3) 峰值之外的读数信息被丢弃,进一步减少了已经稀疏的测量中的数据量 。发现最简单的策略似乎效果很好,而不是求和峰值计数:计算每个细胞中每个基因的基因体和启动子区域(通常为上游 3 kb)内的 ATAC-seq 读数的总数。
必须首先使用 sort 命令行实用程序按染色体、开始和结束位置对 Fragments.tsv 进行排序。 -k 选项让用户在某个列上对文件进行排序;包括多个 -k 选项允许同时按多列排序。 -k 后面的 n 代表数字排序。这里排序的 .bed 文件顺序首先由字典染色体顺序定义(使用参数 -k1,1),然后是升序整数起始坐标顺序(使用参数 -k2,2n),最后是升序整数结束坐标顺序(使用参数-k3,3n)。
Important flags of bedmap command are as follows:
We can then use LIGER’s makeFeatureMatrix function to calculate accessibility counts for gene body and promoter indivially. This function takes the output from bedmap and efficiently counts the number of fragments overlapping each gene and promoter. We could count the genes and promoters in a single step, but choose to calculate them separately in case it is necessary to look at gene or promoter accessibility indivially in downstream analyses.
Next, these two count matrices need to be re-sorted by gene symbol. We then add the matrices together, yielding a single matrix of gene accessibility counts in each cell.
5. Once the gene-level scATAC-seq counts are generated, the read10X function from LIGER can be used to read scRNA-seq count matrices output by CellRanger. You can pass in a directory (or a list of directories) containing raw outputs (for example, /Sample_1/outs/filtered_feature_bc_matrix) to the parameter sample.dirs. Next, a vector of names to use for the sample (or samples, corresponding to sample.dirs) should be passed to parameter sample.names as well. LIGER can also use data from any other protocol, as long as it is provided in a genes x cells R matrix format.
6. We can now create a LIGER object with the createLiger function. We also remove unneeded variables to conserve memory.
7. Preprocessing steps are needed before running iNMF. Each dataset is normalized to account for differences in total gene-level counts across cells using the normalize function. Next, highly variable genes from each dataset are identified and combined for use in downstream analysis. Note that by setting the parameter datasets.use to 2, genes will be selected only from the scRNA-seq dataset (the second dataset) by the selectGenes function. We recommend not using the ATAC-seq data for variable gene selection because the statistical properties of the ATAC-seq data are very different from scRNA-seq, violating the assumptions made by the statistical model we developed for selecting genes from RNA data. Finally, the scaleNotCenter function scales normalized datasets without centering by the mean, giving the nonnegative input data required by iNMF.
8. We next perform joint matrix factorization (iNMF) on the normalized and scaled RNA and ATAC data. This step calculates metagenes–sets of co-expressed genes that distinguish cell populations–containing both shared and dataset-specific signals. The cells are then represented in terms of the expression level of each metagene, providing a low-dimensional representation that can be used for joint clustering and visualization. To run iNMF on the scaled datasets, we use the optimizeALS function with proper hyperparameter settings.
To run iNMF on the scaled datasets, use optimizeALS function with proper hyperparameters setting:
Important parameters are as follows:
9. Using the metagene factors calculated by iNMF, we then assign each cell to the factor on which it has the highest loading, giving joint clusters that correspond across datasets. We then perform quantile normalization by dataset, factor, and cluster to fully integrate the datasets. To perform this analysis, typing in:
Important parameters of quantile_norm are as follows:
10. The quantile_norm function gives joint clusters that correspond across datasets, which are often completely satisfactory and sufficient for downstream analyses. However, if desired, after quantile normalization, users can additionally run the Louvain algorithm for community detection, which is widely used in single-cell analysis and excels at merging small clusters into broad cell classes. This can be achieved by running the louvainCluster function. Several tuning parameters, including resolution, k, and prune control the number of clusters proced by this function. For this dataset, we use a resolution of 0.2, which yields 16 clusters (see below).
11. In order to visualize the clustering results, the user can use two dimensionality rection methods supported by LIGER: t-SNE and UMAP. We find that often for datasets containing continuous variation such as cell differentiation, UMAP better preserves global relationships, whereas t-SNE works well for displaying discrete cell types, such as those in the brain. The UMAP algorithm (called by the runUMAP function) scales readily to large datasets. The runTSNE function also includes an option to use FFtSNE, a highly scalable implementation of t-SNE that can efficiently process huge datasets. For the BMMC dataset, we expect to see continuous lineage transitions among the differentiating cells, so we use UMAP to visualize the data in two dimensions:
12. We can then visualize each cell, colored by cluster or dataset.
13. LIGER employs the Wilcoxon rank-sum test to identify marker genes that are differentially expressed in each cell type using the following settings. We provide parameters that allow the user to select which datasets to use (data.use) and whether to compare across clusters or across datasets within each cluster (compare.method). To identify marker genes for each cluster combining scATAC and scRNA profiles, typing in:
Important parameters of runWilcoxon are as follows:
14. The number of marker genes identified by runWilcoxon varies and depends on the datasets used. The function outputs a data frame that the user can then filter to select markers which are statistically and biologically significant. For example, one strategy is to filter the output by taking markers which have padj (Benjamini-Hochberg adjusted p-value) less than 0.05 and logFC (log fold change between observations in group versus out) larger than 3:
You can then re-sort the markers by its padj value in ascending order and choose the top 100 for each cell type. For example, we can subset and re-sort the output for Cluster 1 and take the top 20 markers by typing these commands:
15. We also provide functions to check these markers by visualizing their expression and gene loadings across datasets. You can use the plotGene to visualize the expression or accessibility of a marker gene, which is helpful for visually confirming putative marker genes or investigating the distribution of known markers across the cell sequenced. Such plots can also confirm that divergent datasets are properly aligned.
For instance, we can plot S100A9, which the Wilcoxon test identified as a marker for Cluster 1, and MS4A1, a marker for Cluster 4:
These plots indicate that S100A9 and MS4A1 are indeed specific markers for Cluster 1 and Cluster 4, respectively, with high values in these cell groups and low values elsewhere. Furthermore, we can see that the distributions are strikingly similar between the RNA and ATAC datasets, indicating that LIGER has properly aligned the two data types.
16. A key advantage of using iNMF instead of other dimensionality rection approaches such as PCA is that the dimensions are indivially interpretable. For example, a particular cell type is often captured by a single dimension of the space. Furthermore, iNMF identifies both shared and dataset-specific features along each dimension, giving insight into exactly how corresponding cells across datasets are both similar and different. The function plotGeneLoadings allows visual exploration of such information. It is recommended to call this function into a PDF file e to the large number of plots proced.
Alternatively, the function can return a list of plots. For example, we can visualize the factor loading of Factor 7 typing in:
These plots confirm that the expression and accessibility of these genes show clear differences. CCR6 shows nearly ubiquitous chromatin accessibility but is expressed only in clusters 2 and 4. The accessibility is highest in these clusters, but the ubiquitous accessibility suggests that the expression of CCR6 is somewhat decoupled from its accessibility, likely regulated by other factors. Conversely, NCF1 shows high expression in clusters 1, 3, 4, 9 and 11, despite no clear enrichment in chromatin accessibility within these clusters. This may again indicate decoupling between the expression and chromatin accessibility of NCF1. Another possibility is that the difference is e to technical effects–the gene body of NCF1 is short (~15KB), and short genes are more difficult to capture in scATAC-seq than in scRNA-seq because there are few sites for the ATAC-seq transposon to insert.
17. Single-cell measurements of chromatin accessibility and gene expression provide an unprecedented opportunity to investigate epigenetic regulation of gene expression. Ideally, such investigation would leverage paired ATAC-seq and RNA-seq from the same cells, but such simultaneous measurements are not generally available. However, using LIGER, it is possible to computationally infer pseudo-multi-omic profiles by linking scRNA-seq profiles–using the jointly inferred iNMF factors–to the most similar scATAC-seq profiles. After this imputation step, we can perform downstream analyses as if we had true single-cell multi-omic profiles. For example, we can identify putative enhancers by correlating the expression of a gene with the accessibility of neighboring intergenic peaks across the whole set of single cells.
Again, for convenience, we have prepared the pre-processed peak-level count data which is ready to use. The data can be downloaded here .
You can also follow the following tutorial to start from the beginning.
To achieve this, we first need a matrix of accessibility counts within intergenic peaks. The CellRanger pipeline for scATAC-seq outputs such a matrix by default, so we will use this as our starting point. The count matrix, peak genomic coordinates, and source cell barcodes output by CellRanger are stored in a folder named filtered_peak_matrix in the output directory. The user can load these and convert them into a peak-level count matrix by typing these commands:
18. The peak-level count matrix is usually large, containing hundreds of thousands of peaks. We next filter this set of peaks to identify those showing cell-type-specific accessibility. To do this, we perform the Wilcoxon rank-sum test and pick those peaks which are differentially accessible within a specific cluster. Before running the test, however, we need to: (1) subset the peak-level count matrix to include the same cells as the gene-level counts matrix; (2) replace the original gene-level counts matrix in the LIGER object by peak-level counts matrix; and (3) normalize peak counts to sum to 1 within each cell.
Now we can perform the Wilcoxon test:
19. We can now use the results of the Wilcoxon test to retain only peaks showing differential accessibility across our set of joint clusters. Here we kept peaks with Benjamini-Hochberg adjusted p-value 0.05 and log fold change 2.
20. Using this set of differentially accessible peaks, we now impute a set of pseudo-multi-omic profiles by inferring the intergenic peak accessibility for scRNA-seq profiles based on their nearest neighbors in the joint LIGER space. LIGER provides a function named imputeKNN that performs this task, yielding a set of profiles containing both gene expression and chromatin accessibility measurements for the same single cells:
Important parameters of imputeKNN are as follows:
21. Now that we have both the (imputed) peak-level counts matrix and the (observed) gene expression counts matrix for the same cells, we can evaluate the relationships between pairs of genes and peaks, linking genes to putative regulatory elements. We use a simple strategy to identify such gene-peak links: Calculate correlation between gene expression and peak accessibility of all peaks within 500 KB of a gene, then retain all peaks showing statistically significant correlation with the gene. The linkGenesAndPeaks function performs this analysis:
Important parameters of linkGenesAndPeaks are as follows:
22. The output of this function is a sparse matrix with peak names as rows and gene symbols as columns, with each element indicating the correlation between peak i and gene j (or 0 if the gene and peak are not significantly linked). For example, we can subset the results for marker gene S100A9, which is a marker gene of Cluster 1 identified in the previous section, and rank these peaks by their correlation:
We also provide a function to transform the peaks-gene correlation matrix into an Interact Track supported by UCSC Genome Browser for visualizing the calculated linkage between genes and correlated peaks. To do this, tying in:
Important parameters of makeInteractTrack are as follows:
The output of this function will be a UCSC Interact Track file named ‘Interact_Track.bed’ containing linkage information of the specified genes and correlated peaks stored in given directory. The user can then upload this file as a custom track using this page https://genome.ucsc.e/cgi-bin/hgCustom and display it in the UCSC Genome browser.
As an example, the three peaks most correlated to S100A9 expression are shown below in the UCSC genome browser. One of the peaks overlaps with the TSS of S100A8, a neighboring gene that is co-expressed with S100A9, while another peak overlaps with the TSS of S100A9 itself. The last peak, chr1:153358896-153359396, does not overlap with a gene body and shows strong H3K27 acetylation across ENCODE cell lines, indicating that this is likely an intergenic regulatory element.
If we plot the accessibility of this peak and the expression of S100A9, we can see that the two are indeed very correlated and show strong enrichment in clusters 1 and 3. Thus, the intergenic peak likely serves as a cell-type-specific regulator of S100A9.
生活很好,有你更好
对于grf only task effects on the statistical map来源于网友整理,仅供参考。提示:想了解更多grf only task effects on the statistical map(grf o)相关的内容,请尝试通过上方搜索框搜索。