Well, we finally did it. Readers of this blog may recall us announcing our intent to deposit our complete collection of patent chemistry into PubChem. Today we announced that there are now more than 8 million SureChem structures available in PubChem, covering US, EP and WO patents, and Japanese patent abstracts, from 1976 to the present. More than 4 million of these are unique to SureChem as a source, meaning a big jump in the amount of publicly available patent chemistry.
While we have deposited only the structure data into PubChem, users can view related patents for free at SureChemOpen. Links bring you to a chemical landing page that gives information about the structure and its occurrence in the SureChem database. Registered users can go on to view patents containing those structures or perform related searches. Our deposition marks the first time anyone has made a complete patent chemistry collection publicly available. This is a key step in our effort to integrate patent chemistry into the online research community and make it more accessible to a wider range of users.
Some may notice that while the SureChem database contains nearly 13 million unique structures, we deposited only 8.2 million into PubChem. This is because we biased the upload towards medicinal chemistry by filtering out common chemistry, structures with a molecular weight above 900, reactive groups and fragments.
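For readers who like to see this kind of filter spelled out, here is a rough sketch of the idea. It is an illustration only, not our actual pipeline: the `Structure` record, its field names and the boolean flags are hypothetical, and it assumes the molecular weight and the other flags were computed upstream. Only the 900 molecular weight cutoff comes from the description above.

```python
from dataclasses import dataclass

# Cutoff taken from the post: structures above 900 MW were left out.
MAX_MOLECULAR_WEIGHT = 900.0


@dataclass(frozen=True)
class Structure:
    """Hypothetical record; the real SureChem schema is not described here."""
    name: str
    molecular_weight: float
    is_common_chemistry: bool = False
    has_reactive_group: bool = False
    is_fragment: bool = False


def keep_for_deposition(s: Structure) -> bool:
    """Return True if the structure passes all four filters named in the post."""
    return (
        s.molecular_weight <= MAX_MOLECULAR_WEIGHT
        and not s.is_common_chemistry
        and not s.has_reactive_group
        and not s.is_fragment
    )


def select_for_deposition(structures):
    """Keep only structures that pass every filter, preserving input order."""
    return [s for s in structures if keep_for_deposition(s)]
```

In this sketch a structure at exactly 900 is kept, since only those above 900 are described as filtered out. How the real flags for common chemistry, reactive groups and fragments are derived is not covered here.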
A few interesting facts:
- With this deposition PubChem has reached 46 million structures (so the big "50" can't be far away...)
- There are now more than 14 million structures in PubChem from patent sources
- About half of our submissions have been merged with preexisting CIDs from other sources. This independent corroboration of 4 million structures is an indicator of the quality of the SureChem chemistry mining pipeline.
- The SureChem deposition is strongly weighted toward the molecular weight range of 300-600, making the distribution of 'drug-like' compounds similar to PubChem data from ChEMBL and Thomson Reuters.
For more useful analysis, check out this post at the blog of Chris Southan, chemical information guru and a member of the SureChem Advisory Board.
We will initially update this data manually, then transition to an automated schedule in 2013 that will keep the data in PubChem in sync with that in SureChem (more details to come).
We are constantly improving the quality of our dataset, optimising our pipeline to improve our rates of chemical name recognition and name-to-structure conversion. We are also still processing our backfile of structures extracted from images as well as the USPTO complex work units, which we expect to complete in the first half of 2013. Because of this, we will likely make several re-deposits of the entire SureChem dataset, so you can expect both the total number of structures and the number of novel compounds to keep increasing.
We are very excited about this latest chapter in the SureChem story. I've been working for years toward this goal, and thanks to the excellent team here at SureChem it's happened. We hope that researchers around the world make use of this data, and that it helps people realise that patents are a key research resource that anyone can now access.