Wednesday, 5 April 2017

Technical internships at ChEMBL

We are looking for students in Computer Science (and related fields) with strong programming skills to join our team for 3-6 month internships. This is not necessarily a summer internship program: you can start whenever is convenient for you after being accepted. Please take a look at some of the research ideas / candidate profiles below:

1. Java programmer - we are looking for a person with experience in Java to develop a prototype of new KNIME nodes for interacting with the ChEMBL API. Experience with REST and/or KNIME is a plus but not a requirement - you can learn it during your internship. It is important that you are excited about UX and about creating user-friendly, pragmatic GUIs.

2. C++ programmer - we would like to invite a person passionate about C++ and pattern recognition / image processing to experiment with optimising the open-source OSRA code. OSRA is like OCR but for molecules. We want to make it faster and more accurate.

3. C++ programmer with graph theory knowledge - chemical compounds are represented as graphs in silico. We want to be able to quickly generate random graphs that would also be valid compounds. Experience with distributed computing, computing grids, network file systems and map-reduce is a plus but not required.

4. JavaScript programmer - "any application that can be written in JavaScript, will eventually be written in JavaScript". This is why we are looking for a person with JS experience to experiment with:
  • Creating prototypes of reusable chemical web widgets using Polymer.
  • Using Emscripten to cross-compile some core chemical software written in C++ to JS.

5. A person with data visualisation skills to explore the Kibana and Kibi tools and create beautiful and informative datavis widgets from ChEMBL data.

6. Someone with a Natural Language Processing background to:
  • Create a dictionary of common spelling mistakes in chemistry patents.
  • Create a network of patent relations using the TextRank algorithm.
  • Explore different approaches to the Named Entity Classification problem.

How to apply?

Just send your CV to kholmes @ with the subject 'ChEMBL Tech Internships'.

When to apply?

You can apply anytime but we will only contact selected candidates.

Will all those internships start at the same time?

No, in fact we plan to select at most the two most interesting candidates at any given time.

Will I get paid?

The internship is either paid at 800 GBP per month or funded by your alma mater (whichever is better for you).

Sunday, 19 March 2017

Finding Compounds in Databases using UniChem

Have you ever identified an interesting compound and wondered what else is known about it? For example, is there any bioactivity data on it in ChEMBL or PubChem? Is there any toxicity data on it (CompTox)? Then, having found interesting data on a compound, have you wondered whether it can be purchased or whether it has been patented? All this can be done using UniChem. Interested?

Come along to our webinar on 29th March at 2pm BST (3pm CEST, 9am EDT).
You will, however, need to register by emailing chembl-help. Places are limited, so please let us know as soon as possible if you register but are then unable to attend.

If you want to know more about UniChem please read on.

UniChem is a simple system we have developed to cross-reference compounds across databases, both internal and external to EMBL-EBI. Currently we have cross-references to 140 million compounds in 30 different databases. Information about the sources indexed in UniChem can be found here. UniChem is updated weekly with new compounds from these source databases.

So, for example, you can input a database identifier or an InChIKey into UniChem and see links to all the other indexed databases that have information about that compound.

If we take the drug paroxetine and search for it in UniChem, it is found in 22 databases and the UniChem webpage gives links to the paroxetine entries in those databases.

You don’t have to do this compound by compound using the web interface though. UniChem has a comprehensive set of web services that you can use to retrieve data; alternatively, all the database files and source-to-source mapping files are available for download.

UniChem relies on the InChIKey to do the mapping between databases, and this works fine if two databases have exactly the same structure for a compound. We all know, however, that this isn’t always the case. Sometimes a different salt or isotope was tested, or a mistake was made in the stereocentre assignment, meaning the InChIKeys no longer match.

However, don’t despair. UniChem connectivity searching can help. Because the InChI is built up in layers, it can be deconstructed, and mapping can be done so that compounds that differ by stereochemistry, isotopes, protonation state, etc. are identified and related to one another. You can do this on single components or mixtures.
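As a rough illustration of the layered idea (this is not UniChem's actual implementation), the first 14-character block of a standard InChIKey is a hash of the connectivity-related part of the InChI, so grouping keys by that block gives a crude connectivity match. The function name is hypothetical and the keys below are made-up placeholders, not real compounds.

```python
from collections import defaultdict


def group_by_connectivity(inchikeys):
    """Group InChIKeys that share the same first (connectivity) block."""
    groups = defaultdict(list)
    for key in inchikeys:
        # An InChIKey looks like XXXXXXXXXXXXXX-YYYYYYYYFV-P; the part
        # before the first hyphen hashes the connectivity information.
        connectivity_block = key.split("-")[0]
        groups[connectivity_block].append(key)
    return dict(groups)


# Placeholder keys for illustration only
keys = [
    "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
    "AAAAAAAAAAAAAA-CCCCCCCCSA-N",
    "DDDDDDDDDDDDDD-UHFFFAOYSA-N",
]
groups = group_by_connectivity(keys)
```

Keys that land in the same group differ only in the layers hashed into the second block (for example stereochemistry or isotopes) or in the protonation flag.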

Taking our paroxetine example:

We have paroxetine and a number of related compounds in ChEMBL. For example:
Maybe someone genuinely wanted to test these related compounds, or maybe they are errors (or a mixture of both). Whatever the reason, by using the UniChem connectivity searching feature we can identify any compounds that match paroxetine on the InChI connectivity layer.
The matches identified from a connectivity search starting with paroxetine can be found here:

At the webinar on 29th March we will describe how this is done in more detail and discuss some use cases.  If you are interested don’t forget to register.

If you want to read more here are links to two papers about UniChem:
Chambers, J., Davies, M., Gaulton, A., Hersey, A., Velankar, S., Petryszak, R., Hastings, J., Bellis, L., McGlinchey, S. and Overington, J.P. 
UniChem: A Unified Chemical Structure Cross-Referencing and Identifier Tracking System.
Journal of Cheminformatics 2013, 5:3 (January 2013).

Chambers, J., Davies, M., Gaulton, A., Papadatos, G., Hersey, A. and Overington, J.P.
UniChem: extension of InChI-based compound mapping to salt, connectivity and stereochemistry layers.
Journal of Cheminformatics 2014, 6:43 (September 2014).

Tuesday, 14 March 2017

Chemogenomics Analyst Wanted

We are looking to recruit a scientist to support our work for the Horizon 2020 project “Coordinated Research Infrastructures Building Enduring Life-science services” (CORBEL). The role is to facilitate scientists in their use of chemogenomics resources by enabling database searching and evaluation of data.
  • To be responsible for liaising with scientists engaged in CORBEL and advising on the use of chemogenomics resources to progress their projects;
  • To help in the identification and analysis of bioactivity data from multiple database resources;
  • To construct and utilize appropriate workflows to facilitate the pharmacological profiling of molecules and chemotypes, the identification of potential off-target effects and the development of target prediction models;
  • To identify interoperability gaps between resources and help with developing solutions;
  • To organize and run appropriate training courses for scientists engaged in the CORBEL project.

For full details of the position, or to apply, see:

The closing date is 9th April 2017.

Monday, 27 February 2017

Position to work on tractability in Open Targets

There is currently an opening for a Protein Computational Scientist to work on methods to assess and quantify the tractability (druggability) of potential new targets for drug discovery. This is a two year position funded by the Open Targets initiative.

The appointee will work with scientists from the Open Targets partners to assess, validate and develop methods for quantifying target tractability with the ultimate goal of incorporating such methodologies into the target validation platform. The initial focus will be on “small molecule” tractability but we are also interested in other modalities in due course (e.g. antibody therapies). Many of the current methods to assess small molecule tractability are based on the use of 3D protein structures, but such information is only available for a subset of potential targets; a key component of the project is to determine robust methods and pipelines that can be applied to novel targets where there is much more limited information.

For more details or to apply, click here

Closing date is 9th March

Thursday, 9 February 2017

ChEMBL Webinars

We will be running a new series of webinars over the next few months. These will cover a range of topics, including basic introductions to the chemogenomics resources (ChEMBL, SureChEMBL, UniChem) as well as more detailed topics such as a schema walkthrough and ChEMBL web services.

The first webinar will be a basic introduction to ChEMBL and will be on 22nd February at 2pm GMT (3pm CET, 9am EST).

If you would like to attend the webinar, please email to register.
Please note, spaces are limited so please let us know as soon as possible if you register but are then unable to attend.

We will post further details of upcoming webinars here, so watch this space!

The ChEMBL Team

Friday, 16 December 2016

Merry Christmas from ChEMBL

Wishing all of our many users and collaborators a very Merry Christmas and a Happy New Year!
The ChEMBL Team

Monday, 5 December 2016

A comprehensive map of molecular drug targets

Within the ChEMBL database we spend a lot of time manually curating links between FDA approved drugs and their efficacy targets. With collaborators from the University of New Mexico and the Institute of Cancer Research, we have now published an analysis of these drug efficacy targets:

Santos R, Ursu O, Gaulton A, Bento AP, Donadi RS, Bologa CG, Karlsson A, Al-Lazikani B, Hersey A, Oprea TI & Overington JP.
A comprehensive map of molecular drug targets
Nature Reviews Drug Discovery (2016) doi:10.1038/nrd.2016.230

In the article we address the complexities of assigning drug targets, describe the 667 human proteins and 189 pathogen proteins through which 1,578 FDA-approved drugs act and map each drug to its therapeutic indication via the WHO ATC classification system.

We show that 70% of small molecule drugs still act through privileged families (GPCRs, ion channels, kinases and nuclear receptors), highlight the differences in innovation between different therapeutic areas, look at conservation of targets across different model organisms and demonstrate that only 5% of identified cancer driver genes are targeted by current cancer therapies.

As an aside, the drug-target data within ChEMBL is used in a number of other platforms such as Pharos (the portal for the NIH Illuminating the Druggable Genome project), Open Targets (a resource for pre-competitive target validation) and DrugCentral (a drug compendium from the University of New Mexico), all of which have papers in the 2017 Database Issue of Nucleic Acids Research, alongside ChEMBL:

Pharos: Collating protein information to shed light on the druggable genome

Open Targets: a platform for therapeutic target identification and validation

DrugCentral: online drug compendium

Tuesday, 29 November 2016

New ChEMBL database paper out

The latest ChEMBL database paper is now available online:

This paper describes some of the additions to ChEMBL over the last few releases (ChEMBL_18 to ChEMBL_22) such as drug indications and clinical candidates, patent bioactivity data from BindingDB, drug metabolism information and richer assay annotation. A number of papers from our collaborators will also feature in the 2017 NAR database issue, so watch this space...

Thursday, 17 November 2016

ChEMBL_22 Data and Web Services Update

ChEMBL_22_1 data update:

We would like to inform users that an update to ChEMBL_22 has been released. 

The new version, ChEMBL_22_1, corrects an issue with the targets assigned to some BindingDB assays in ChEMBL (src_id = 37). If you are using the BindingDB data from ChEMBL, we recommend you download this update. This update also incorporates the mol file/canonical smiles correction announced previously.

Updates have been made to BindingDB data in the ASSAYS, ACTIVITIES, CHEMBL_ID_LOOKUP, LIGAND_EFF and PREDICTED_BINDING_DOMAINS tables. Corrections have also been made to molfiles and canonical_smiles in the COMPOUND_STRUCTURES table. No changes have been made to other data sets or to other drug/compound/target tables in ChEMBL_22.

The new release files can be downloaded from:

A new version of the ChEMBL RDF is also available from:

Improvements to Web Services:

1. Support for SDF format.

The "molecule" endpoint now supports the SDF format. For example, if you access this URL: you will get information about the first 20 compounds in JSON format. This URL will return an SDF file of the same molecule page. Please note that there will be only 18 compounds in the SDF output because two of the compounds on that page (CHEMBL6961 and CHEMBL6963) have no structure defined. You can easily join the information about a compound provided in JSON, XML or YAML format with its structure by inspecting the "> <chembl_id>" SDF property.
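As a rough sketch of that join (not taken from the ChEMBL codebase), the snippet below indexes SDF records by their "> <chembl_id>" data item and attaches each record to the JSON entry with the same ID. The helper name, the molecule_chembl_id field name and the sample data are assumptions made for illustration; the structure block is elided.

```python
def index_sdf_by_chembl_id(sdf_text):
    """Map each record's '> <chembl_id>' value to the raw record text."""
    index = {}
    # SDF records are terminated by a line containing '$$$$'
    for record in sdf_text.split("$$$$"):
        lines = record.splitlines()
        for i, line in enumerate(lines[:-1]):
            if line.strip() == "> <chembl_id>":
                # The value of a data item is on the line after its header
                index[lines[i + 1].strip()] = record.strip("\n")
                break
    return index


# Made-up stand-ins for the parsed JSON page and the SDF response
json_molecules = [
    {"molecule_chembl_id": "CHEMBL_A", "pref_name": "EXAMPLE A"},
    {"molecule_chembl_id": "CHEMBL_B", "pref_name": "EXAMPLE B"},
]
sdf_text = "...mol block A...\n> <chembl_id>\nCHEMBL_A\n\n$$$$\n"

structures = index_sdf_by_chembl_id(sdf_text)
joined = [
    {**mol, "sdf": structures.get(mol["molecule_chembl_id"])}
    for mol in json_molecules
]
```

Entries with no matching SDF record end up with "sdf" set to None, which mirrors compounds that have no structure defined.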

Obviously the same format works for a single compound, so this URL: will provide information about Aspirin while this URL (or will return its structure.

The same can be applied to filters. For example, this URL returns information about compounds with molecular weight <= 300 AND pref_name ending with nib. The SDF version of the same URL will in turn return the corresponding structures.
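The sketch below shows one way such a pair of URLs could be built so that the JSON and SDF requests share exactly the same filters. The base URL, the helper name and the filter parameter names are assumptions for illustration; please check them against the web services documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumed base URL of the ChEMBL web services
BASE_URL = "https://www.ebi.ac.uk/chembl/api/data"


def molecule_url(fmt, **filters):
    """Build a molecule-list URL for the given output format and filters."""
    query = urlencode(filters)
    url = f"{BASE_URL}/molecule.{fmt}"
    return f"{url}?{query}" if query else url


# Assumed filter names: molecular weight <= 300 and pref_name ending in "nib"
filters = {
    "molecule_properties__mw_freebase__lte": 300,
    "pref_name__iendswith": "nib",
}
json_url = molecule_url("json", **filters)  # compound information
sdf_url = molecule_url("sdf", **filters)    # corresponding structures
```

Only the format suffix differs between the two URLs, so the results describe the same set of compounds.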

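We also released a new version of the Python client (version 0.8.50, available from PyPI and GitHub) that is aware of molfile support. Example code:

from chembl_webresource_client.new_client import new_client
molecules = new_client.molecule
# Ask for SDF output rather than the default JSON (assumes the client's set_format method)
molecules.set_format('sdf')
# Fetch the first molecule; further pages are requested by the client as needed
molstring = molecules.all()[0]

By iterating through all molecules you can build an SDF file with all the structures from ChEMBL; pagination is handled by the client.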

2. Structural alerts.

This new API endpoint provides information about a compound's structural alerts. For example, in order to get structural alerts for CHEMBL266429, you can use this URL:

Then you can render each of the alerts to image, for example

As you can see, the corresponding fragment is highlighted. You can add all the parameters that are present in the standard "image" endpoint: format (png or svg), engine (rdkit or indigo), ignoreCoords (to recompute coordinates from scratch) and dimensions (to change the image size).

3. Document terms (keywords)

We used the pytextrank package to extract the most relevant terms from all document abstracts stored in ChEMBL, along with their significance score against each document (the code we used to perform the extraction is available).

For example, in order to get all the relevant terms for the CHEMBL1124199 document, ordered by the significance score descending, you can use this URL:

By parsing the results you can extract (term, score) pairs and multiply the score to get this list:

590 Inverse agonist activity
548 Thien-2-yl analogues
493 Pentylenetetrazole-induced convulsions
490 5'-alkyl group
477 Agonist activity
472 Inverse agonist
449 5-methylthien-3-yl derivative
427 Potent compounds
417 Vivo activity
403 Magnitude higher affinity

You can now feed this list into the HTML5-based word cloud tool, providing the following configuration:

{
  gridSize: Math.round(16 * $('#canvas').width() / 1024),
  drawOutOfBound: true,
  weightFactor: function (size) {
    return Math.pow(size/100.0, 2.3) * $('#canvas').width() / 1024;
  },
  fontFamily: 'Times, serif',
  hover: function(){},
  color: function (word, weight) {
    return (weight > 500) ? '#f02222' : '#c09292';
  },
  rotateRatio: 0.0,
  backgroundColor: '#ffe0e0'
}

and you will get this wordcloud:

We are planning to add this component to the new document report card.
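The step of turning parsed results into a weighted list for a word cloud could look roughly like the sketch below. The record field names (term_text, score), the helper name and the scale factor are assumptions for illustration, and the sample records are made up.

```python
def to_wordcloud_list(terms, scale):
    """Turn document-term records into [term, weight] pairs, highest first."""
    pairs = [
        [term["term_text"], round(float(term["score"]) * scale)]
        for term in terms
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Made-up records shaped like a parsed API response
sample_terms = [
    {"term_text": "example term b", "score": "0.25"},
    {"term_text": "example term a", "score": "0.5"},
]
cloud_list = to_wordcloud_list(sample_terms, scale=1000)
```

The scale factor only needs to bring the weights into the range that the weightFactor and color functions of the word cloud configuration expect.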

It may also be interesting to ask for all the documents for a given keyword. For example, in order to get all the documents for the "inverse agonist activity" term, ordered by score descending, the following URL can be used:

4. Document similarity

The last endpoint we added is "document_similarity". For example, to get all documents similar to the CHEMBL1122254 document, this URL can be used:

The endpoint uses the same protocol we use to generate the "Related Documents" section in the Document Report Card.

The current protocol is fairly simple (measuring overlap in compounds and targets between the two documents) and not very granular (it can be difficult to choose N most relevant documents from the 50 documents that the protocol returns). However, we are currently investigating alternative methods such as topic modelling.
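To make "overlap in compounds and targets" concrete, here is a minimal sketch that scores two documents with the Tanimoto (Jaccard) coefficient on their compound and target sets. Using Tanimoto here is an assumption chosen for illustration, not a statement of the exact protocol, and the function names and identifiers are made up.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient of two sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def document_overlap(doc1, doc2):
    """Score two documents by the overlap of their compounds and targets."""
    return {
        "compounds": tanimoto(doc1["compounds"], doc2["compounds"]),
        "targets": tanimoto(doc1["targets"], doc2["targets"]),
    }


# Made-up documents with placeholder compound and target identifiers
doc_a = {"compounds": {"C1", "C2", "C3"}, "targets": {"T1"}}
doc_b = {"compounds": {"C2", "C3", "C4"}, "targets": {"T1", "T2"}}
overlap = document_overlap(doc_a, doc_b)
```

Two coarse set-overlap scores like these give many near-ties, which is one way to see why picking the N most relevant documents from such a result can be hard.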

5. Other improvements

There are some minor improvements as well:
 - Molecule endpoint includes three more properties as described in GitHub issue #106.
 - Target endpoint can be filtered by synonym name, in other words you can get a list of targets for a given gene name, for example:
or using a shortcut:
 - Target relation endpoint can be accessed by primary ID as described in GitHub issue #114.
 - The parent_chembl_id filter now works correctly for the molecule_form endpoint, as described in GitHub issue #113.

The ChEMBL Team

Thursday, 6 October 2016

ChEMBL 22 release - technical notes

The ChEMBL 22 release brings lots of new data. But we also released some new software so if you are interested in technical details please read on.

1. First of all, please note that ChEMBL 22 is the last release for which we provide Oracle 9i dumps.
Oracle 9i has been out of support for nearly a decade and shouldn't be in use anymore, but please let us know if this is a problem. On the other hand, we will do our best to provide Oracle 12c dumps for the next release.

2. If you are using the Python API client, please upgrade it by running:

[sudo] pip install -U chembl_webresource_client

This will upgrade the client to the latest version, which fixes some minor bugs and adds the ability to search in document abstracts. It will also create a new cache so you will see new ChEMBL data immediately. Otherwise, you will need to clear your cache manually.

3. New version (2.4.9) of the ChEMBL API has been released as well. This version includes:
 - new endpoints: tissue and target_relation
 - mechanism endpoint contains references now
 - a Solr index has been added to documents so their abstracts can be searched, for example searching for 'cytokine': api/data/document/search.json?q=cytokine
 - the outdated chemical cartridge used by API (Biovia Direct) has been updated from 6.3 to 2016 Direct. The result is better handling of SMILES string, for example this API call:[O--].[Fe++].OCC1OC(OC2C(CO)OC(OC3C(O)C(CO)OC(OCC4OC(OCC5OC(O)C(O)C(OC6OC(CO)C(O)C(OC7OC(COC8OC(COC9OC(CO)C(O)C(O)C9O)C(O)C(O)C8O)C(O)C(OC8OC(CO)C(O)C(OC9OC(CO)C(O)C(OC%2510OC(COC%2511OC(COC%2512OC(COC%2513OC(COC%2514OC(COC%2515OC(CO)C(O)C(O)C%2515O)C(O)C(OC%2515OC(CO)C(O)C%2515O)C%2514O)C(O)C(O)C%2513O)C(O)C(O)C%2512O)C(O)C(O)C%2511O)C(O)C(OC%2511OC(CO)C(O)C(O)C%2511O)C%2510O)C9O)C8O)C7O)C6O)C5O)C(O)C(O)C4O)C3O)C2O)C(O)C1O/70
works fine now.
 - status endpoint provides API software version as well as ChEMBL release version.
 - there are many smaller bug fixes and improvements.
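When building such calls programmatically, the query term and any SMILES string need to be percent-encoded; the %25 sequences in the SMILES call above are the encoded form of the % ring-closure marker. A minimal sketch follows, where the helper names are hypothetical and the SMILES is a small illustrative ring rather than anything taken from ChEMBL.

```python
from urllib.parse import quote, urlencode


def document_search_path(term):
    """Build the relative path for a document abstract search."""
    return "api/data/document/search.json?" + urlencode({"q": term})


def encode_smiles(smiles):
    """Percent-encode a SMILES string for use as a single URL path segment."""
    # safe="" also encodes '/' so stereo bonds cannot split the path
    return quote(smiles, safe="")


search_path = document_search_path("cytokine")
encoded = encode_smiles("C%10CCCCC%10")
```

With safe="" every reserved character is encoded, which is stricter than the call shown above but still decodes to the same SMILES on the server side.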

4. Since our API is maturing, we have started preparing a collection of embeddable widgets written in JS/CSS/HTML that you can use on your website/blog/web application. This will be the basis for our new ChEMBL website. An example widget providing some basic information about a ChEMBL compound can be found below; the code used to embed it is:

<object data="" width="800px" height="350px"></object>

Another example is an assay co-occurrence matrix for compounds extracted from a single document. Again, the code to embed it is:

<object data="" width="800px" height="800px"></object>