NLP Content Store

Benefits: Why use the content store?

350m+ biomedical documents

Ready to text mine to get the insights you need

6m+ biomedical terms

Documents enhanced to include matches against ontologies

24/7 access to content

Accessible in the cloud, with all content updated weekly

Our solution: High-value content – ready to text mine

The Linguamatics Content Store provides access to the largest set of ready-to-text-mine life science, biomedical and healthcare documents.

Answer key questions such as:

What targets are involved in lung cancer?
What companies are patenting a particular technology?
What are the safety risks of my drug?
How can I find the best site for my clinical trials?


What content is available?

Our store is constantly being updated to add valuable content ready to mine. We can also add custom content if you have the required licenses (e.g. Embase, Copyright Clearance Center).

PubMed

PubMed is an excellent source of biomedical research knowledge, spanning decades of published articles from academic journals in biochemistry, medicine, nursing, pharmacy, dentistry, veterinary medicine, and health care.

PubMed Central Open Subset

PubMed Central provides a valuable source of biomedical research knowledge; in particular, access to the full-text papers can facilitate extraction of specific methods, assays, or details of healthcare costs, patient outcomes, and other in-depth information.

Insightmeme

The collaboration with Insightmeme and Pharmaspectra grants access to an industry-leading scientific conference database. With nearly 2 million abstracts uploaded annually, 55% of which are from major conferences and available before the event, users can stay ahead of the science and identify new key opinion leaders (KOLs) and rising stars before the competition.

Preprints - bioRxiv

bioRxiv is a free online archive and distribution service for unpublished preprints in the life sciences. By posting preprints on bioRxiv, authors are able to make their findings immediately available to the scientific community and receive feedback on draft manuscripts before they are submitted to journals. By posting on bioRxiv, authors explicitly consent to text mining of their work.

Preprints - medRxiv

medRxiv is a free online archive and distribution service for complete but unpublished preprints in medical, clinical and related health sciences. medRxiv provides free and unrestricted access to all the articles posted on the server for both human readers and machine analysis.

Drug Labels

Easy access to reference drug label sources allows pharmaceutical regulatory, safety and medical affairs teams to quickly verify critical information about drug indications, dosages, and safety warnings. This ensures that they can provide accurate and timely responses to stakeholders (including healthcare professionals, patients, and regulatory bodies), enhancing patient safety and compliance. Additionally, it streamlines the process of updating and maintaining drug information, which is crucial for effective communication and decision-making within the industry. The Linguamatics Content Store provides access to a suite of drug labels from key reference regulatory sources, including the FDA and EMA as well as the UK, France, Canada and Spain.

FDA Adverse Event Reporting System (FAERS)

As well as the standard ontologies, the FAERS index includes its own domain-specific ontologies containing classes which can be used to filter or display FAERS documents by the contents of their structured fields (sections within the documents).

ClinicalTrials.gov

Access to Linguamatics ClinicalTrials.gov for text mining enables researchers to assess clinical trial inclusion/exclusion criteria for patient selection, trial site evaluation and study design as well as to discover competitive intelligence around companies, diseases, targets and novel drugs.

Patent - Abstracts

As well as the standard ontologies, the Patents index includes its own domain-specific ontologies providing concepts specific to patent classification using Cooperative Patent Classification (CPC).

Patent - Full Text

The Patent Solution allows users to generate powerful and bespoke queries for patent search and analysis, for patent landscapes, white space analyses, freedom-to-operate searches, research methodologies, competitive intelligence and state-of-the-art reviews for confident decision making.

NIH Grants

Access to NIH grants can facilitate the development of new collaborations and provide information on the most recent research challenges, through the rapid discovery and recommendation of researchers, key opinion leaders, current expertise, and resources.

Gene Expression Omnibus (GEO)

The Content Store GEO index contains enriched versions of each series, with ontology mapping that makes it possible to search for genes, organisms, chemicals, numerical information and more via synonyms and common names.

Enriched with proprietary and standard ontologies

All Linguamatics content sources are indexed with the Linguamatics standard set of domain-specific ontologies, for enriched semantic searches. Find details of all our ontologies below.

Biomedical Terminologies

Linguamatics biomedical terminologies enable identification, extraction and normalization of over a million concepts, covering a wide variety of life science domains: diseases, genes, proteins, biomarkers, gene variants & mutations, phenotypes, drugs, adverse events, biological processes, organs, tissues and cells.

Healthcare Terminologies

Healthcare terminologies are integrated into the Linguamatics platform, covering key medical domains and categories. These are recognized using a combination of standard ontologies, pattern-based approaches and linguistic rules, so that the context around any patient variable can be taken into account (e.g. a family history of disease). They are often used alongside the biomedical terminologies to maximize the amount of information that can be extracted from medical records.

Healthcare terminologies are valuable for identifying key patient data from a variety of medical records, including patient problem lists, disease history, vital signs (blood pressure, heart rate, pulse, respiratory rate and temperature) and demographics (gender and age). Lifestyle factors such as smoking, drug use, alcohol consumption, exercise, diet and sexual activity can also be analyzed.

Chemical Entities

Chemical entities can be found using ChEBI, MeSH and the NCI Thesaurus. In addition, the Linguamatics ChemAxon add-on identifies known and novel chemical structures within documents: by name, structure, substructure or similarity.

Drug Terminologies

Linguamatics provides a pattern ontology that enables the identification and extraction of many different pharmaceutical company chemical identifiers (such as LY-170053, SQ 34676 and ICI 204,219).
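
To give a feel for what pattern-based identification of company compound codes involves, here is a minimal Python sketch. It is not the Linguamatics pattern ontology; the regular expression and the function name `find_company_codes` are assumptions made purely for illustration, and a production pattern would need to cover far more formats and filter out false positives.

```python
import re

# Hypothetical pattern (an assumption for this sketch, not the Linguamatics one):
#   - 2-4 uppercase letters, an optional hyphen or space, then either
#   - digits with comma-grouped thousands (e.g. "204,219"), or
#   - a plain run of 3-6 digits (e.g. "170053").
COMPANY_CODE = re.compile(
    r"\b[A-Z]{2,4}[- ]?\d{1,3}(?:,\d{3})+\b"
    r"|\b[A-Z]{2,4}[- ]?\d{3,6}\b"
)


def find_company_codes(text: str) -> list[str]:
    """Return every substring that looks like a company compound identifier."""
    return COMPANY_CODE.findall(text)


if __name__ == "__main__":
    sample = "Compounds LY-170053, SQ 34676 and ICI 204,219 were compared."
    print(find_company_codes(sample))
```

A simple pattern like this will also match unrelated strings such as an uppercase abbreviation followed by a year, which is one reason real systems combine patterns with dictionaries and context rules.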

Numerical Data

Linguamatics provides pattern ontologies that identify numerical data, such as times, dates, numerics, and units of measurement. These allow for the identification of concepts that can be expressed in many ways, extend search by annotating novel textual descriptions of key concepts or concept types and normalize results to greatly simplify downstream analysis.
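
The idea of recognizing a quantity written in several ways and normalizing it for downstream analysis can be sketched in a few lines of Python. This is an illustrative example only, not the Linguamatics implementation: the unit table `TO_MG`, the `Mass` record and the `find_masses` function are hypothetical names, and only a handful of mass units are handled.

```python
import re
from dataclasses import dataclass

# Assumed conversion factors to milligrams, limited to a few units for brevity.
TO_MG = {"g": 1000.0, "mg": 1.0, "mcg": 0.001, "ug": 0.001, "µg": 0.001}

# A number (optionally with a decimal part) followed by one of the known units.
MASS = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|mcg|ug|µg|g)\b")


@dataclass
class Mass:
    text: str   # the mention exactly as it appears in the document
    mg: float   # the same quantity normalized to milligrams


def find_masses(text: str) -> list[Mass]:
    """Find mass mentions and normalize each one to milligrams."""
    return [
        Mass(m.group(0), float(m.group(1)) * TO_MG[m.group(2)])
        for m in MASS.finditer(text)
    ]


if __name__ == "__main__":
    for hit in find_masses("Patients received 0.5 g daily, then 250 mg, then 500 mcg."):
        print(hit.text, "->", hit.mg, "mg")
```

Once every mention carries a value in a single unit, results from different documents can be sorted, filtered or aggregated without further parsing.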

Organizations and People

Information on organizations can be extracted and categorized by sector, type and geographic location. Searching by sector allows named pharmaceutical companies, universities or government agencies to be extracted. Organization types are also available, using linguistic rules and patterns to automatically detect whether an entity is a corporation, division, hospital or institute. Organizations can also be identified by geographical location (region, country, state or city). In addition, pattern ontologies allow for the identification of telephone numbers, names of people, and email addresses.

Bespoke Vocabularies

Linguamatics supports bespoke or custom vocabularies. These can be imported from academic or commercial sources. In-house vocabularies can also be employed, for example: a dictionary of employees from an organizational chart, or a controlled vocabulary for an internal drug development project.
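
As a rough illustration of how an in-house controlled vocabulary can drive term matching, the Python sketch below maps synonyms to a preferred label. Everything in it is hypothetical: the project names, the codes, and the `build_matcher` and `annotate` helpers are invented for this example and do not reflect the Linguamatics import format or API.

```python
import re

# Hypothetical in-house vocabulary: preferred label -> synonyms seen in text.
VOCAB = {
    "Project Alpha": ["project alpha", "PA-001", "alpha compound"],
    "Project Beta": ["project beta", "PB-002"],
}


def build_matcher(vocab: dict[str, list[str]]):
    """Compile one case-insensitive pattern covering every synonym."""
    lookup = {syn.lower(): label for label, syns in vocab.items() for syn in syns}
    # Longest synonyms first so longer phrases win over shorter overlapping ones.
    alternatives = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(a) for a in alternatives) + r")\b",
        re.IGNORECASE,
    )
    return pattern, lookup


def annotate(text: str, vocab: dict[str, list[str]] = VOCAB) -> list[tuple[str, str]]:
    """Return (matched text, preferred label) pairs in document order."""
    pattern, lookup = build_matcher(vocab)
    return [(m.group(0), lookup[m.group(0).lower()]) for m in pattern.finditer(text)]


if __name__ == "__main__":
    print(annotate("Results for PA-001 and the Alpha compound; PB-002 is pending."))
```

Mapping every synonym to one preferred label means a search for the label finds all mentions, however the project is referred to in a given document.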

Source-specific Dictionaries

Linguamatics incorporates data from the sources in the Content Store to provide source-specific dictionaries. These include patent classification codes, listings of product names in FDA Drug Labels, and specific FAERS terms.