Querying and computing on genomic data files using the epiviz file server

Table of Contents

Epiviz is an interactive and integrative web application for visual analysis and exploration of functional genomic datasets. We currently support a couple of different ways in which data can be provided to epiviz. 1) using the epivizr R/Bioconductor package, users can interactively visualize and explore genomic data loaded in R. 2) MySQL database which stores each genomic dataset as a table. Storing genomic data into a database is a challenge especially when the data sets are huge (millions of rows). When the dataset is huge, it is time consuming to import the dataset into a database table and challenging to then optimize the table for faster queries (even after table partitioning).

Based on the concepts of a NoDB paradigm, we developed the epiviz file server, a data query system on indexed genomic files. Genomic data respositories like the Roadmap Epigenomic project or the ENCODE project host the generated raw and processed datasets as files. Using the epiviz file server users will be able to visually explore data from these publicly hosted files with epiviz. We currently support BigBed, BigWig, SAM/BAM/CRAM (indexed as sai/bai/tbi) and tabix indexed files for tab separated files (gene expression) or bed, gtf, gff files indexed using tabix. We also plan to support HDF5 file format.

We use dask distributed to manage, distribute and schedule multiple queries on genomic data files. We also require the server hosting the data files to support HTTP range requests so that the file server's parser module only requests the necessary byte-ranges needed to process the query.

Querying a list of publicly available files

For this part of the tutorial, imagine a scenario where the user already has access to a list of publicly avaialble files for example NIH's roadmap epigenomics project (File Browser, FTP Site, metadata). User can define a configuration file (in json) for the list of files to query.

The following is a configuration file (roadmap.json) to query Esophagus (E079), Sigmoid Colon (E106), Gastric (E094) and Small Intestine (E106) ChipSeq Data for "H3K27me3" histone marker files. Most fields in the configuration file are self explanatory(url, file_type (bigwig, bigbed etc), name of the dataset and any annotation that can be associated with the file.

{json}
[
  {
    url: "https://egg2.wustl.edu/roadmap/data/byFileType/signal/consolidated/macs2signal/foldChange/E079-H3K27me3.fc.signal.bigwig",
    file_type: "bigwig",
    datatype: "bp",
    name: "E079-H3K27me3",
    annotation: {
        group: "digestive",
        tissue: "Esophagus",
        marker: "H3K27me3"
    }
  }, {
    url: "https://egg2.wustl.edu/roadmap/data/byFileType/signal/consolidated/macs2signal/foldChange/E106-H3K27me3.fc.signal.bigwig",
    file_type: "bigwig",
    datatype: "bp",
    name: "E106-H3K27me3",
    annotation: {
        group: "digestive",
        tissue: "Sigmoid Colon",
        marker: "H3K27me3"
    }
  }, {
    url: "https://egg2.wustl.edu/roadmap/data/byFileType/signal/consolidated/macs2signal/foldChange/E094-H3K27me3.fc.signal.bigwig",
    file_type: "bigwig",
    datatype: "bp",
    name: "E094-H3K27me3",
    annotation: {
        group: "digestive",
        tissue: "Gastric",
        marker: "H3K27me3"
    }
  }, {
    url: "https://egg2.wustl.edu/roadmap/data/byFileType/signal/consolidated/macs2signal/foldChange/E109-H3K27me3.fc.signal.bigwig",
    file_type: "bigwig",
    datatype: "bp",
    name: "E109-H3K27me3",
    annotation: {
        group: "digestive",
        tissue: "Small Intestine",
        marker: "H3K27me3"
    }
  }
]

Now that we have the configuration file, we can now load the measurements from these files using the measurements module from the file parser

In [5]:
from measurements import MeasurementManager
import os

# create measurements manager
mMgr = MeasurementManager()

# import measurements from json
file_measurements = mMgr.import_files(os.getcwd() + "/roadmap.json")
file_measurements[1:]
Out[5]:
[<measurements.measurementClass.FileMeasurement at 0x7f7cf0187f28>,
 <measurements.measurementClass.FileMeasurement at 0x7f7cebde6940>,
 <measurements.measurementClass.FileMeasurement at 0x7f7cebde6978>]

Query measurements

we can use the get_data function to query the file for a given genomic region. The queries are processed asynchronously on the dask server and hence the use of the await keyword

In [6]:
result, _ = await file_measurements[0].get_data("chr11", 10550488, 11554489)
result.head()
Out[6]:
chr start end E079-H3K27me3
0 chr11 10550272 10550784 0.586341
1 chr11 10550784 10551296 0.561396
2 chr11 10551296 10551808 0.122578
3 chr11 10551808 10552320 0.385395
4 chr11 10552320 10552832 0.429076

Compute a function over files

We can write a custom statistical function to apply on each row of a dataset or use any of the available functions from numpy, pandas, or other python packages. We call these derived measurements from existing files as computed measurements. In this example, we use the numpy.mean to compute average ChIPSeq expression of digestive group across the following tissues (Sigmoid Colon, Esophagus, Gastric & Small Intestine). We first create a computed measurement using the measurements we loaded from the file earlier -

In [7]:
import numpy
computed_measurement = mMgr.add_computed_measurement("computed", "ChipSeq_avg_digestive", "ChipSeq average digestive tissue", 
                                                     measurements=file_measurements, computeFunc=numpy.mean)
computed_measurement
Out[7]:
<measurements.measurementClass.ComputedMeasurement at 0x7f7d24cf7f28>

We can also query the computed measurement to get data for a given genomic region. This will inturn query the measurements and compute the statistical function.

In [8]:
result,_ = await computed_measurement.get_data("chr11", 10550488, 11554489)
result.head()
Out[8]:
chr E079-H3K27me3 start end E106-H3K27me3 E094-H3K27me3 E109-H3K27me3 ChipSeq_avg_digestive
(10550488, 10550990] chr11 0.561396 10550488 10550990 0.110963 0.177986 0.260039 0.277596
(10550990, 10551492] chr11 0.122578 10550990 10551492 0.353104 1.09457 0 0.392563
(10551492, 10551994] chr11 0.385395 10551492 10551994 0.322769 0.349998 0.500089 0.389563
(10551994, 10552496] chr11 0.429076 10551994 10552496 0.466317 0.411907 0.00547445 0.328194
(10552496, 10552998] chr11 0.122049 10552496 10552998 0.329585 0.990553 0.17792 0.405027

Setting up the web API

To convert the above measurements into a web API, we wrote a helper function setup_app to intialize the web server. We use Python Sanic as the web server framework to manage web requests.

Note: This might not work properly if running inside Jupyter/IPython notebooks since notebooks already run on a local webserver.

In [ ]:
from server import setup_app
app = setup_app(mMgr)
app.run(port=8000)
Querying the web API

The Web API provides a couple of methods to get a list of all measurements served with the API and the get Data calls to query data for a specific genomic region.

List of all measurements served through the API

To query for list of all measurements/files served through the API (action = getMeasurements)

In [1]:
import requests
import pandas
resp = requests.get("http://localhost:8000/?requestId=0&action=getMeasurements")
df = pandas.DataFrame(resp.json()["data"])
df
Out[1]:
annotation datasourceGroup datasourceId defaultChartType id maxValue minValue name type metadata
0 {'group': 'digestive', 'tissue': 'Esophagus', ... files http://egg2.wustl.edu/roadmap/data/byFileType/... track E079-H3K27me3 0.0 5.0 E079-H3K27me3 feature None
1 {'group': 'digestive', 'tissue': 'Sigmoid Colo... files http://egg2.wustl.edu/roadmap/data/byFileType/... track E106-H3K27me3 0.0 5.0 E106-H3K27me3 feature None
2 {'group': 'digestive', 'tissue': 'Gastric', 'm... files http://egg2.wustl.edu/roadmap/data/byFileType/... track E094-H3K27me3 0.0 5.0 E094-H3K27me3 feature None
3 {'group': 'digestive', 'tissue': 'Small Intest... files http://egg2.wustl.edu/roadmap/data/byFileType/... track E109-H3K27me3 0.0 5.0 E109-H3K27me3 feature None
4 None computed computed track ChipSeq_avg_digestive NaN NaN ChipSeq average digestive tissue feature None
Query for data from a measurements

We'll send a request to the web API to query the computed measurement (action=getData)

Query Parameters:

action:               getData
chr, start, end:      genomic region of interest
measurement:          ID from the getMeasurements request
datasource:           datasourceGroup from getMeasurements request

Note: To keep the notebook short we are only printing the response status code. To print the entire json response, use resp.json()

In [2]:
resp = requests.get("http://localhost:8000/?requestId=0&action=getData&start=10550488&end=11554489&seqName=chr11&measurement=ChipSeq_avg_digestive&datasource=computed")
resp.status_code
Out[2]:
200

Visualize data from files

We can use the epiviz web components to visualize data queries from the file server api. For this dataset, we will visualize the signal data using a line track. We create a epiviz-line-track component with the json-data assigned from the previous response.

Note: if using IPython notebook, the chart is visualized after exporting as HTML.

In [9]:
%%html
<script src="bower_components/jquery/dist/jquery.js"></script>
<script src="bower_components/jquery-ui/jquery-ui.js"></script>
<script src="bower_components/webcomponentsjs/webcomponents-lite.js"></script>
<link rel="import" href="bower_components/epiviz-charts/epiviz-charts.html">
In [15]:
from IPython.display import HTML, IFrame
import ujson

HTML("<epiviz-line-track dim-s=['ChipSeq_avg_digestive'] json-data='" + ujson.dumps(resp.json()["data"]) + "'></epiviz-line-track>")
Out[15]:

Query using Annotation Hub

We can also use the Bioconductor's AnnotationHub to search for files and showcase the features of the file server. Annotation Hub API is hosted at https://annotationhub.bioconductor.org/. We first download the annotationhub sqlite database for available data resources.

Note: Jupyter notebooks support execution of shell commands using the !

In [ ]:
!wget http://annotationhub.bioconductor.org/metadata/annotationhub.sqlite3
In [11]:
import pandas
import os
import sqlite3

Query the downloaded annotationhub sqlite database for available resources.

In [12]:
conn = sqlite3.connect("annotationhub.sqlite3")
cur = conn.cursor()
cur.execute("select * from resources r JOIN input_sources inp_src ON r.id = inp_src.resource_id;")
results = cur.fetchall()
pd = pandas.DataFrame(results, columns = ["id", "ah_id", "title", "dataprovider", "species", "taxonomyid", "genome", 
                                          "description", "coordinate_1_based", "maintainer", "status_id",
                                          "location_prefix_id", "recipe_id", "rdatadateadded", "rdatadateremoved",
                                          "record_id", "preparerclass", "id", "sourcesize", "sourceurl", "sourceversion",
                                          "sourcemd5", "sourcelastmodifieddate", "resource_id", "source_type"])
pd.head()
Out[12]:
id ah_id title dataprovider species taxonomyid genome description coordinate_1_based maintainer ... record_id preparerclass id sourcesize sourceurl sourceversion sourcemd5 sourcelastmodifieddate resource_id source_type
0 2 AH2 Ailuropoda_melanoleuca.ailMel1.69.dna.toplevel.fa Ensembl Ailuropoda melanoleuca 9646.0 ailMel1 FASTA DNA sequence for Ailuropoda melanoleuca 1 Martin Morgan <mtmorgan@fhcrc.org> ... 0 EnsemblFastaImportPreparer 2 693412448 ftp://ftp.ensembl.org/pub/release-69/fasta/ail... release-69 None 2012-10-12 2 FASTA
1 3 AH3 Ailuropoda_melanoleuca.ailMel1.69.dna_rm.tople... Ensembl Ailuropoda melanoleuca 9646.0 ailMel1 FASTA DNA sequence for Ailuropoda melanoleuca 1 Martin Morgan <mtmorgan@fhcrc.org> ... 1 EnsemblFastaImportPreparer 3 465381938 ftp://ftp.ensembl.org/pub/release-69/fasta/ail... release-69 None 2012-10-12 3 FASTA
2 4 AH4 Ailuropoda_melanoleuca.ailMel1.69.dna_sm.tople... Ensembl Ailuropoda melanoleuca 9646.0 ailMel1 FASTA DNA sequence for Ailuropoda melanoleuca 1 Martin Morgan <mtmorgan@fhcrc.org> ... 2 EnsemblFastaImportPreparer 4 742059555 ftp://ftp.ensembl.org/pub/release-69/fasta/ail... release-69 None 2012-10-12 4 FASTA
3 5 AH5 Ailuropoda_melanoleuca.ailMel1.69.ncrna.fa Ensembl Ailuropoda melanoleuca 9646.0 ailMel1 FASTA ncRNA sequence for Ailuropoda melanoleuca 1 Martin Morgan <mtmorgan@fhcrc.org> ... 3 EnsemblFastaImportPreparer 5 181232 ftp://ftp.ensembl.org/pub/release-69/fasta/ail... release-69 None 2012-10-16 5 FASTA
4 6 AH6 Ailuropoda_melanoleuca.ailMel1.69.pep.all.fa Ensembl Ailuropoda melanoleuca 9646.0 ailMel1 FASTA pep sequence for Ailuropoda melanoleuca 1 Martin Morgan <mtmorgan@fhcrc.org> ... 4 EnsemblFastaImportPreparer 6 7152696 ftp://ftp.ensembl.org/pub/release-69/fasta/ail... release-69 None 2012-10-16 6 FASTA

5 rows × 25 columns

For the purpose of the tutorial, we will filter for Sigmoid Colon ("E106") and Esophagus ("E079") tissues, and the ChipSeq Data for "H3K27me3" histone marker files from the roadmap epigenomics project.

metadata file is available at - https://egg2.wustl.edu/roadmap/web_portal/meta.html

In [13]:
roadmap = pd.query('dataprovider=="BroadInstitute" and genome=="hg19"')
roadmap = roadmap.query('title.str.contains("H3K27me3") and (title.str.contains("E106") or title.str.contains("E079"))')
# only use fc files
roadmap = roadmap.query('title.str.contains("fc")')
roadmap
Out[13]:
id ah_id title dataprovider species taxonomyid genome description coordinate_1_based maintainer ... record_id preparerclass id sourcesize sourceurl sourceversion sourcemd5 sourcelastmodifieddate resource_id source_type
35713 32614 AH32614 E079-H3K27me3.fc.signal.bigwig BroadInstitute Homo sapiens 9606.0 hg19 Bigwig File containing fold enrichment signal ... 0 Bioconductor Maintainer <maintainer@bioconduct... ... 23320 EpigenomeRoadMapPreparer 38349 574504704 http://egg2.wustl.edu/roadmap/data/byFileType/... 2013-09-02_00:47:26 None 2013-09-02 32614 BigWig
35894 32795 AH32795 E106-H3K27me3.fc.signal.bigwig BroadInstitute Homo sapiens 9606.0 hg19 Bigwig File containing fold enrichment signal ... 0 Bioconductor Maintainer <maintainer@bioconduct... ... 23501 EpigenomeRoadMapPreparer 38530 599027743 http://egg2.wustl.edu/roadmap/data/byFileType/... 2013-09-02_00:31:26 None 2013-09-02 32795 BigWig

2 rows × 25 columns

The measurements manager provides a function to import annotation hub resources. We can also query, create computed measurements, convert to a web API and visualize these resources.

In [14]:
# create measurements manager
mMgr = MeasurementManager()

# import measurements from json
file_measurements = mMgr.import_ahub(roadmap)
file_measurements
Out[14]:
[<measurements.measurementClass.FileMeasurement at 0x7f7cebc51240>,
 <measurements.measurementClass.FileMeasurement at 0x7f7cebc515c0>]

The epiviz file server is available on GitHub. Let us know what you think and any feedback to improve the library!