Epiviz is an interactive and integrative web application for visual analysis and exploration of functional genomic datasets. We currently support two ways of providing data to epiviz: 1) the epivizr R/Bioconductor package, which lets users interactively visualize and explore genomic data loaded in R, and 2) a MySQL database that stores each genomic dataset as a table. Storing genomic data in a database is a challenge, especially for huge datasets (millions of rows): importing such a dataset into a database table is time consuming, and optimizing the table for faster queries (even with table partitioning) is difficult.
Based on the concepts of the NoDB paradigm, we developed the epiviz file server, a data query system over indexed genomic files. Genomic data repositories like the Roadmap Epigenomics project or the ENCODE project host their raw and processed datasets as files. Using the epiviz file server, users can visually explore data from these publicly hosted files with epiviz. We currently support BigBed, BigWig, SAM/BAM/CRAM (with sai/bai/tbi indexes), and tabix-indexed files: tab-separated files (e.g. gene expression) or bed, gtf, and gff files indexed using tabix. We also plan to support the HDF5 file format.
We use Dask distributed to manage, distribute, and schedule multiple queries on genomic data files. We also require the server hosting the data files to support HTTP range requests, so that the file server's parser module only requests the byte ranges needed to process a query.
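To make the range-request idea concrete, here is a minimal sketch; the `range_header` helper is ours for illustration and is not part of the file server API.

```python
# Sketch of how a parser can fetch only part of a remote file
# (helper name `range_header` is ours, not the file server's API).

def range_header(start, end):
    """HTTP Range header requesting bytes start..end (inclusive)."""
    return {"Range": "bytes=%d-%d" % (start, end)}

# A server that supports range requests answers with status 206
# (Partial Content) and only the requested bytes, e.g.:
#
#   import requests
#   resp = requests.get(
#       "https://egg2.wustl.edu/roadmap/data/byFileType/signal/consolidated/"
#       "macs2signal/foldChange/E079-H3K27me3.fc.signal.bigwig",
#       headers=range_header(0, 4095))
#   # resp.status_code == 206 if the server honors the range
print(range_header(0, 4095))
```

Requesting only a few kilobytes of a multi-gigabyte bigwig file is what makes querying remote files practical without importing them into a database first.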
For this part of the tutorial, imagine a scenario where the user already has access to a list of publicly available files, for example from NIH's Roadmap Epigenomics project (File Browser, FTP Site, metadata). The user can define a configuration file (in JSON) listing the files to query. The following configuration file (roadmap.json) queries Esophagus (E079), Sigmoid Colon (E106), Gastric (E094), and Small Intestine (E109) ChIP-seq data for the "H3K27me3" histone marker. Most fields in the configuration file are self explanatory: url, file_type (bigwig, bigbed, etc.), name of the dataset, and any annotation that can be associated with the file.
```json
[
  {
    "url": "https://egg2.wustl.edu/roadmap/data/byFileType/signal/consolidated/macs2signal/foldChange/E079-H3K27me3.fc.signal.bigwig",
    "file_type": "bigwig",
    "datatype": "bp",
    "name": "E079-H3K27me3",
    "annotation": {
      "group": "digestive",
      "tissue": "Esophagus",
      "marker": "H3K27me3"
    }
  },
  {
    "url": "https://egg2.wustl.edu/roadmap/data/byFileType/signal/consolidated/macs2signal/foldChange/E106-H3K27me3.fc.signal.bigwig",
    "file_type": "bigwig",
    "datatype": "bp",
    "name": "E106-H3K27me3",
    "annotation": {
      "group": "digestive",
      "tissue": "Sigmoid Colon",
      "marker": "H3K27me3"
    }
  },
  {
    "url": "https://egg2.wustl.edu/roadmap/data/byFileType/signal/consolidated/macs2signal/foldChange/E094-H3K27me3.fc.signal.bigwig",
    "file_type": "bigwig",
    "datatype": "bp",
    "name": "E094-H3K27me3",
    "annotation": {
      "group": "digestive",
      "tissue": "Gastric",
      "marker": "H3K27me3"
    }
  },
  {
    "url": "https://egg2.wustl.edu/roadmap/data/byFileType/signal/consolidated/macs2signal/foldChange/E109-H3K27me3.fc.signal.bigwig",
    "file_type": "bigwig",
    "datatype": "bp",
    "name": "E109-H3K27me3",
    "annotation": {
      "group": "digestive",
      "tissue": "Small Intestine",
      "marker": "H3K27me3"
    }
  }
]
```
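Before handing a configuration like this to the file server, it can be worth checking that every entry carries the fields described above. The following sketch is ours (the `load_config` helper and `REQUIRED_FIELDS` tuple are not part of the epiviz API); it simply parses the JSON and validates each entry.

```python
import json

# Fields every entry in the configuration needs (our assumption,
# based on the fields used in roadmap.json above).
REQUIRED_FIELDS = ("url", "file_type", "name")

def load_config(text):
    """Parse a JSON configuration string and check required fields."""
    entries = json.loads(text)
    for entry in entries:
        missing = [f for f in REQUIRED_FIELDS if f not in entry]
        if missing:
            raise ValueError("entry %r missing fields: %s"
                             % (entry.get("name"), missing))
    return entries

config = load_config("""
[
  {
    "url": "https://egg2.wustl.edu/roadmap/data/byFileType/signal/consolidated/macs2signal/foldChange/E079-H3K27me3.fc.signal.bigwig",
    "file_type": "bigwig",
    "datatype": "bp",
    "name": "E079-H3K27me3",
    "annotation": {"group": "digestive", "tissue": "Esophagus", "marker": "H3K27me3"}
  }
]
""")
print(config[0]["name"])  # E079-H3K27me3
```

A check like this catches typos (e.g. a missing url) before they surface as query-time errors.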
Now that we have the configuration file, we can load the measurements from these files using the `measurements` module of the file server.
```python
from measurements import MeasurementManager
import os

# create the measurements manager
mMgr = MeasurementManager()
# import measurements from the json configuration
file_measurements = mMgr.import_files(os.getcwd() + "/roadmap.json")
file_measurements[1:]
```
We can use the `get_data` function to query a file for a given genomic region. Queries are processed asynchronously on the Dask server, hence the use of the `await` keyword.
```python
result, _ = await file_measurements[0].get_data("chr11", 10550488, 11554489)
result.head()
```
We can write a custom statistical function to apply to each row of a dataset, or use any of the available functions from `numpy`, `pandas`, or other Python packages. We call these measurements derived from existing files computed measurements. In this example, we use `numpy.mean` to compute the average ChIP-seq signal of the digestive group across the following tissues: Sigmoid Colon, Esophagus, Gastric, and Small Intestine. We first create a computed measurement using the measurements we loaded from the file earlier:
```python
import numpy

computed_measurement = mMgr.add_computed_measurement(
    "computed", "ChipSeq_avg_digestive", "ChipSeq average digestive tissue",
    measurements=file_measurements, computeFunc=numpy.mean)
computed_measurement
```
We can also query the computed measurement for data in a given genomic region. This will in turn query the underlying measurements and apply the statistical function.
```python
result, _ = await computed_measurement.get_data("chr11", 10550488, 11554489)
result.head()
```
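To build intuition for what `computeFunc` does, here is a small self-contained sketch on a toy DataFrame. The column names and the row-wise `apply` are illustrative only; they mimic the idea of combining per-tissue measurements, not the file server's actual internals.

```python
import numpy
import pandas

# Toy frame standing in for query results: one column per tissue
# measurement (column names are illustrative, not the server's schema).
df = pandas.DataFrame({
    "E079-H3K27me3": [1.0, 2.0, 3.0],
    "E106-H3K27me3": [3.0, 2.0, 1.0],
})

# numpy.mean applied across measurements (row-wise), as in the
# computed measurement above:
avg = df.apply(numpy.mean, axis=1)

# Any function mapping a row of values to a scalar works the same
# way, e.g. the per-row signal range:
def signal_range(row):
    return numpy.max(row) - numpy.min(row)

rng = df.apply(signal_range, axis=1)
print(avg.tolist())  # [2.0, 2.0, 2.0]
print(rng.tolist())  # [2.0, 0.0, 2.0]
```

Any row-to-scalar function (a difference of groups, a log ratio, a z-score) can be plugged in the same way `numpy.mean` was above.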
To convert the above measurements into a web API, we wrote a helper function `setup_app` to initialize the web server. We use Sanic as the Python web server framework to manage web requests.
Note: this might not work properly when run inside Jupyter/IPython notebooks, since notebooks already run on a local web server.
```python
from server import setup_app

app = setup_app(mMgr)
app.run(port=8000)
```
The web API provides two kinds of calls: one to list all measurements served by the API, and getData calls to query data for a specific genomic region. To query the list of all measurements/files served through the API, use `action=getMeasurements`:
```python
import requests
import pandas

resp = requests.get("http://localhost:8000/?requestId=0&action=getMeasurements")
df = pandas.DataFrame(resp.json()["data"])
df
```
We'll send a request to the web API to query the computed measurement (`action=getData`).
Query parameters:
- action: getData
- seqName, start, end: genomic region of interest
- measurement: id from the getMeasurements request
- datasource: datasourceGroup from the getMeasurements request
Note: to keep the notebook short we only print the response status code. To print the entire JSON response, use `resp.json()`.
```python
resp = requests.get("http://localhost:8000/?requestId=0&action=getData&start=10550488&end=11554489&seqName=chr11&measurement=ChipSeq_avg_digestive&datasource=computed")
resp.status_code
```
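Rather than hand-assembling the query string, the same request can be built by letting `requests` encode the parameters from a dict. This is a sketch using the parameter names from the URL above; the server endpoint itself is assumed to be running as shown earlier.

```python
import requests

# Same getData request as above, with requests encoding the query
# string for us (parameter names taken from the hand-built URL):
params = {
    "requestId": 0,
    "action": "getData",
    "seqName": "chr11",
    "start": 10550488,
    "end": 11554489,
    "measurement": "ChipSeq_avg_digestive",
    "datasource": "computed",
}
# prepare() builds the final URL without sending the request,
# so this works even when the local server is not running.
req = requests.Request("GET", "http://localhost:8000/", params=params).prepare()
print(req.url)
```

Building the URL from a dict avoids escaping mistakes when region coordinates or measurement ids come from variables.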
We can use the epiviz web components to visualize data queried from the file server API. For this dataset, we will visualize the signal data using a line track: we create an `epiviz-line-track` component with its json-data attribute set from the previous response.
Note: if using an IPython notebook, the chart renders after exporting the notebook as HTML.
```html
%%html
<script src="bower_components/jquery/dist/jquery.js"></script>
<script src="bower_components/jquery-ui/jquery-ui.js"></script>
<script src="bower_components/webcomponentsjs/webcomponents-lite.js"></script>
<link rel="import" href="bower_components/epiviz-charts/epiviz-charts.html">
```
```python
from IPython.display import HTML, IFrame
import ujson

HTML("<epiviz-line-track dim-s=['ChipSeq_avg_digestive'] json-data='" + ujson.dumps(resp.json()["data"]) + "'></epiviz-line-track>")
```
We can also use Bioconductor's AnnotationHub to search for files and showcase the features of the file server. The AnnotationHub API is hosted at https://annotationhub.bioconductor.org/. We first download the AnnotationHub sqlite database of available data resources.
Note: Jupyter notebooks support execution of shell commands using the `!` prefix.
```shell
!wget http://annotationhub.bioconductor.org/metadata/annotationhub.sqlite3
```
```python
import pandas
import os
import sqlite3
```
Query the downloaded annotationhub sqlite database for available resources.
```python
conn = sqlite3.connect("annotationhub.sqlite3")
cur = conn.cursor()
cur.execute("select * from resources r JOIN input_sources inp_src ON r.id = inp_src.resource_id;")
results = cur.fetchall()

# note: `pd` here is a DataFrame, not the pandas module; the join returns
# two id columns, so the second is renamed input_source_id
pd = pandas.DataFrame(results, columns=[
    "id", "ah_id", "title", "dataprovider", "species", "taxonomyid", "genome",
    "description", "coordinate_1_based", "maintainer", "status_id",
    "location_prefix_id", "recipe_id", "rdatadateadded", "rdatadateremoved",
    "record_id", "preparerclass", "input_source_id", "sourcesize", "sourceurl",
    "sourceversion", "sourcemd5", "sourcelastmodifieddate", "resource_id",
    "source_type"])
pd.head()
```
For the purpose of the tutorial, we will filter for the Sigmoid Colon ("E106") and Esophagus ("E079") tissues, and the ChIP-seq "H3K27me3" histone marker files from the Roadmap Epigenomics project. The metadata file is available at https://egg2.wustl.edu/roadmap/web_portal/meta.html.
```python
roadmap = pd.query('dataprovider=="BroadInstitute" and genome=="hg19"')
# .str methods inside query() need the python engine when numexpr is installed
roadmap = roadmap.query('title.str.contains("H3K27me3") and (title.str.contains("E106") or title.str.contains("E079"))',
                        engine="python")
# only use fold-change (fc) signal files
roadmap = roadmap.query('title.str.contains("fc")', engine="python")
roadmap
```
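The filtering pattern above can be tried on a small stand-in frame; the toy column values below are ours, chosen to mirror the AnnotationHub titles, and the `engine="python"` argument is what allows `.str` accessors inside `query()`.

```python
import pandas

# Toy metadata frame mimicking the AnnotationHub columns used above
# (titles are illustrative, modeled on the roadmap file names):
meta = pandas.DataFrame({
    "title": ["E106-H3K27me3.fc.signal.bigwig",
              "E079-H3K27me3.fc.signal.bigwig",
              "E094-H3K4me1.pval.signal.bigwig"],
    "dataprovider": ["BroadInstitute"] * 3,
    "genome": ["hg19"] * 3,
})

# Chained str.contains filters, as in the roadmap filtering above;
# engine="python" is required for .str methods when numexpr is installed.
hits = meta.query('title.str.contains("H3K27me3") and title.str.contains("fc")',
                  engine="python")
print(len(hits))  # 2
```

Only the two H3K27me3 fold-change files survive the filter; the pval file for a different marker is dropped.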
The measurements manager provides a function to import AnnotationHub resources. We can then query these resources, create computed measurements, convert them to a web API, and visualize them, exactly as before.
```python
# create the measurements manager
mMgr = MeasurementManager()
# import measurements from the AnnotationHub resources
file_measurements = mMgr.import_ahub(roadmap)
file_measurements
```
The epiviz file server is available on GitHub. Let us know what you think, and send us any feedback to improve the library!