Epiviz is an interactive and integrative web application for visual analysis and exploration of functional genomic datasets. We currently support two ways of providing data to epiviz: 1) the epivizr R/Bioconductor package, which lets users interactively visualize and explore genomic data loaded in R, and 2) a MySQL database that stores each genomic dataset as a table. Storing genomic data in a database is a challenge, especially for huge datasets (millions of rows): importing such a dataset into a database table is time consuming, and optimizing the table for faster queries (even with table partitioning) is difficult.
Based on the concepts of the NoDB paradigm, we developed the epiviz file server, a data query system over indexed genomic files. Genomic data repositories like the Roadmap Epigenomics project or the ENCODE project host their raw and processed datasets as files. Using the epiviz file server, users can visually explore data from these publicly hosted files with epiviz. We currently support BigBed, BigWig, SAM/BAM/CRAM (with sai/bai/tbi indexes), and tabix-indexed files: tab-separated files (e.g. gene expression) or bed, gtf, and gff files indexed using tabix. We also plan to support the HDF5 file format.
We use Dask distributed to manage, distribute, and schedule multiple queries on genomic data files. We also require the server hosting the data files to support HTTP range requests, so that the file server's parser module only requests the byte ranges needed to process a query.
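To make the range-request idea concrete, here is a minimal sketch; the `range_header` helper is ours for illustration and is not part of the file server API.

```python
# Sketch of how a parser can fetch only part of a remote file
# (helper name `range_header` is ours, not the file server's API).

def range_header(start, end):
    """HTTP Range header requesting bytes start..end (inclusive)."""
    return {"Range": "bytes=%d-%d" % (start, end)}

# A server that supports range requests answers with status 206
# (Partial Content) and only the requested bytes, e.g.:
#
#   import requests
#   resp = requests.get(
#       "https://egg2.wustl.edu/roadmap/data/byFileType/signal/consolidated/"
#       "macs2signal/foldChange/E079-H3K27me3.fc.signal.bigwig",
#       headers=range_header(0, 4095))
#   # resp.status_code == 206 if the server honors the range
print(range_header(0, 4095))
```

Requesting only a few kilobytes of a multi-gigabyte bigwig file is what makes querying remote files practical without importing them into a database first.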
For this part of the tutorial, imagine a scenario where the user already has access to a list of publicly available files, for example from NIH's Roadmap Epigenomics project (File Browser, FTP Site, metadata). The user can define a configuration file (in JSON) listing the files to query. The following configuration file (roadmap.json) queries Esophagus (E079), Sigmoid Colon (E106), Gastric (E094), and Small Intestine (E109) ChIP-seq data for the "H3K27me3" histone marker. Most fields in the configuration file are self explanatory: url, file_type (bigwig, bigbed, etc.), name of the dataset, and any annotation that can be associated with the file.
```json
[
  {
    "url": "https://egg2.wustl.edu/roadmap/data/byFileType/signal/consolidated/macs2signal/foldChange/E079-H3K27me3.fc.signal.bigwig",
    "file_type": "bigwig",
    "datatype": "bp",
    "name": "E079-H3K27me3",
    "annotation": {
      "group": "digestive",
      "tissue": "Esophagus",
      "marker": "H3K27me3"
    }
  },
  {
    "url": "https://egg2.wustl.edu/roadmap/data/byFileType/signal/consolidated/macs2signal/foldChange/E106-H3K27me3.fc.signal.bigwig",
    "file_type": "bigwig",
    "datatype": "bp",
    "name": "E106-H3K27me3",
    "annotation": {
      "group": "digestive",
      "tissue": "Sigmoid Colon",
      "marker": "H3K27me3"
    }
  },
  {
    "url": "https://egg2.wustl.edu/roadmap/data/byFileType/signal/consolidated/macs2signal/foldChange/E094-H3K27me3.fc.signal.bigwig",
    "file_type": "bigwig",
    "datatype": "bp",
    "name": "E094-H3K27me3",
    "annotation": {
      "group": "digestive",
      "tissue": "Gastric",
      "marker": "H3K27me3"
    }
  },
  {
    "url": "https://egg2.wustl.edu/roadmap/data/byFileType/signal/consolidated/macs2signal/foldChange/E109-H3K27me3.fc.signal.bigwig",
    "file_type": "bigwig",
    "datatype": "bp",
    "name": "E109-H3K27me3",
    "annotation": {
      "group": "digestive",
      "tissue": "Small Intestine",
      "marker": "H3K27me3"
    }
  }
]
```
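Before handing a configuration like this to the file server, it can be worth checking that every entry carries the fields described above. The following sketch is ours (the `load_config` helper and `REQUIRED_FIELDS` tuple are not part of the epiviz API); it simply parses the JSON and validates each entry.

```python
import json

# Fields every entry in the configuration needs (our assumption,
# based on the fields used in roadmap.json above).
REQUIRED_FIELDS = ("url", "file_type", "name")

def load_config(text):
    """Parse a JSON configuration string and check required fields."""
    entries = json.loads(text)
    for entry in entries:
        missing = [f for f in REQUIRED_FIELDS if f not in entry]
        if missing:
            raise ValueError("entry %r missing fields: %s"
                             % (entry.get("name"), missing))
    return entries

config = load_config("""
[
  {
    "url": "https://egg2.wustl.edu/roadmap/data/byFileType/signal/consolidated/macs2signal/foldChange/E079-H3K27me3.fc.signal.bigwig",
    "file_type": "bigwig",
    "datatype": "bp",
    "name": "E079-H3K27me3",
    "annotation": {"group": "digestive", "tissue": "Esophagus", "marker": "H3K27me3"}
  }
]
""")
print(config[0]["name"])  # E079-H3K27me3
```

A check like this catches typos (e.g. a missing url) before they surface as query-time errors.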
Now that we have the configuration file, we can load the measurements from these files using the `measurements` module of the file server.
```python
from measurements import MeasurementManager
import os

# create the measurements manager
mMgr = MeasurementManager()
# import measurements from the json configuration
file_measurements = mMgr.import_files(os.getcwd() + "/roadmap.json")
file_measurements[1:]
```
We can use the `get_data` function to query a file for a given genomic region. Queries are processed asynchronously on the Dask server, hence the use of the `await` keyword.
```python
result, _ = await file_measurements[0].get_data("chr11", 10550488, 11554489)
result.head()
```
We can write a custom statistical function to apply to each row of a dataset, or use any of the available functions from `numpy`, `pandas`, or other Python packages. We call these measurements derived from existing files computed measurements. In this example, we use `numpy.mean` to compute the average ChIP-seq signal of the digestive group across the following tissues: Sigmoid Colon, Esophagus, Gastric, and Small Intestine. We first create a computed measurement using the measurements we loaded from the file earlier:
```python
import numpy

computed_measurement = mMgr.add_computed_measurement(
    "computed", "ChipSeq_avg_digestive", "ChipSeq average digestive tissue",
    measurements=file_measurements, computeFunc=numpy.mean)
computed_measurement
```
We can also query the computed measurement for data in a given genomic region. This will in turn query the underlying measurements and apply the statistical function.
```python
result, _ = await computed_measurement.get_data("chr11", 10550488, 11554489)
result.head()
```
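To build intuition for what `computeFunc` does, here is a small self-contained sketch on a toy DataFrame. The column names and the row-wise `apply` are illustrative only; they mimic the idea of combining per-tissue measurements, not the file server's actual internals.

```python
import numpy
import pandas

# Toy frame standing in for query results: one column per tissue
# measurement (column names are illustrative, not the server's schema).
df = pandas.DataFrame({
    "E079-H3K27me3": [1.0, 2.0, 3.0],
    "E106-H3K27me3": [3.0, 2.0, 1.0],
})

# numpy.mean applied across measurements (row-wise), as in the
# computed measurement above:
avg = df.apply(numpy.mean, axis=1)

# Any function mapping a row of values to a scalar works the same
# way, e.g. the per-row signal range:
def signal_range(row):
    return numpy.max(row) - numpy.min(row)

rng = df.apply(signal_range, axis=1)
print(avg.tolist())  # [2.0, 2.0, 2.0]
print(rng.tolist())  # [2.0, 0.0, 2.0]
```

Any row-to-scalar function (a difference of groups, a log ratio, a z-score) can be plugged in the same way `numpy.mean` was above.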
To convert the above measurements into a web API, we wrote a helper function `setup_app` to initialize the web server. We use Sanic as the Python web server framework to manage web requests.
Note: this might not work properly when run inside Jupyter/IPython notebooks, since notebooks already run on a local web server.
```python
from server import setup_app

app = setup_app(mMgr)
app.run(port=8000)
```
The web API provides two kinds of calls: one to list all measurements served by the API, and getData calls to query data for a specific genomic region. To query the list of all measurements/files served through the API, use `action=getMeasurements`:
```python
import requests
import pandas

resp = requests.get("http://localhost:8000/?requestId=0&action=getMeasurements")
df = pandas.DataFrame(resp.json()["data"])
df
```
We'll send a request to the web API to query the computed measurement (`action=getData`).
Query parameters:
- action: getData
- seqName, start, end: genomic region of interest
- measurement: id from the getMeasurements request
- datasource: datasourceGroup from the getMeasurements request
Note: to keep the notebook short we only print the response status code. To print the entire JSON response, use `resp.json()`.
```python
resp = requests.get("http://localhost:8000/?requestId=0&action=getData&start=10550488&end=11554489&seqName=chr11&measurement=ChipSeq_avg_digestive&datasource=computed")
resp.status_code
```
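Rather than hand-assembling the query string, the same request can be built by letting `requests` encode the parameters from a dict. This is a sketch using the parameter names from the URL above; the server endpoint itself is assumed to be running as shown earlier.

```python
import requests

# Same getData request as above, with requests encoding the query
# string for us (parameter names taken from the hand-built URL):
params = {
    "requestId": 0,
    "action": "getData",
    "seqName": "chr11",
    "start": 10550488,
    "end": 11554489,
    "measurement": "ChipSeq_avg_digestive",
    "datasource": "computed",
}
# prepare() builds the final URL without sending the request,
# so this works even when the local server is not running.
req = requests.Request("GET", "http://localhost:8000/", params=params).prepare()
print(req.url)
```

Building the URL from a dict avoids escaping mistakes when region coordinates or measurement ids come from variables.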
We can use the epiviz web components to visualize data queried from the file server API. For this dataset, we will visualize the signal data using a line track: we create an `epiviz-line-track` component with its json-data attribute set from the previous response.
Note: if using an IPython notebook, the chart renders after exporting the notebook as HTML.
```html
%%html
<script src="bower_components/jquery/dist/jquery.js"></script>
<script src="bower_components/jquery-ui/jquery-ui.js"></script>
<script src="bower_components/webcomponentsjs/webcomponents-lite.js"></script>
<link rel="import" href="bower_components/epiviz-charts/epiviz-charts.html">
```
```python
from IPython.display import HTML, IFrame
import ujson

HTML("<epiviz-line-track dim-s=['ChipSeq_avg_digestive'] json-data='" + ujson.dumps(resp.json()["data"]) + "'></epiviz-line-track>")
```
We can also use Bioconductor's AnnotationHub to search for files and showcase the features of the file server. The AnnotationHub API is hosted at https://annotationhub.bioconductor.org/. We first download the AnnotationHub sqlite database of available data resources.
Note: Jupyter notebooks support execution of shell commands using the `!` prefix.
```shell
!wget http://annotationhub.bioconductor.org/metadata/annotationhub.sqlite3
```
```python
import pandas
import os
import sqlite3
```
Query the downloaded annotationhub sqlite database for available resources.
```python
conn = sqlite3.connect("annotationhub.sqlite3")
cur = conn.cursor()
cur.execute("select * from resources r JOIN input_sources inp_src ON r.id = inp_src.resource_id;")
results = cur.fetchall()

# note: `pd` here is a DataFrame, not the pandas module; the join returns
# two id columns, so the second is renamed input_source_id
pd = pandas.DataFrame(results, columns=[
    "id", "ah_id", "title", "dataprovider", "species", "taxonomyid", "genome",
    "description", "coordinate_1_based", "maintainer", "status_id",
    "location_prefix_id", "recipe_id", "rdatadateadded", "rdatadateremoved",
    "record_id", "preparerclass", "input_source_id", "sourcesize", "sourceurl",
    "sourceversion", "sourcemd5", "sourcelastmodifieddate", "resource_id",
    "source_type"])
pd.head()
```
For the purpose of the tutorial, we will filter for the Sigmoid Colon ("E106") and Esophagus ("E079") tissues, and the ChIP-seq "H3K27me3" histone marker files from the Roadmap Epigenomics project. The metadata file is available at https://egg2.wustl.edu/roadmap/web_portal/meta.html.
```python
roadmap = pd.query('dataprovider=="BroadInstitute" and genome=="hg19"')
# .str methods inside query() need the python engine when numexpr is installed
roadmap = roadmap.query('title.str.contains("H3K27me3") and (title.str.contains("E106") or title.str.contains("E079"))',
                        engine="python")
# only use fold-change (fc) signal files
roadmap = roadmap.query('title.str.contains("fc")', engine="python")
roadmap
```
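The filtering pattern above can be tried on a small stand-in frame; the toy column values below are ours, chosen to mirror the AnnotationHub titles, and the `engine="python"` argument is what allows `.str` accessors inside `query()`.

```python
import pandas

# Toy metadata frame mimicking the AnnotationHub columns used above
# (titles are illustrative, modeled on the roadmap file names):
meta = pandas.DataFrame({
    "title": ["E106-H3K27me3.fc.signal.bigwig",
              "E079-H3K27me3.fc.signal.bigwig",
              "E094-H3K4me1.pval.signal.bigwig"],
    "dataprovider": ["BroadInstitute"] * 3,
    "genome": ["hg19"] * 3,
})

# Chained str.contains filters, as in the roadmap filtering above;
# engine="python" is required for .str methods when numexpr is installed.
hits = meta.query('title.str.contains("H3K27me3") and title.str.contains("fc")',
                  engine="python")
print(len(hits))  # 2
```

Only the two H3K27me3 fold-change files survive the filter; the pval file for a different marker is dropped.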
The measurements manager provides a function to import AnnotationHub resources. We can then query these resources, create computed measurements, convert them to a web API, and visualize them, exactly as before.
```python
# create the measurements manager
mMgr = MeasurementManager()
# import measurements from the AnnotationHub resources
file_measurements = mMgr.import_ahub(roadmap)
file_measurements
```
The epiviz file server is available on GitHub. Let us know what you think, and send us any feedback to improve the library!