Using NLP to Build a Market Intelligence Platform for the Biotech Industry
Today’s chapter of our NLP blog series is written by Andrii Buvailo. Andrii is a co-founder and director at BPT Analytics, and also Editor at BiopharmaTrend.com, responsible for all content, analytics, and product development in the project. He has been writing about research and business trends in the pharmaceutical industry for over four years, mainly focusing on the digital transformation of drug discovery. Before moving to the pharma space, Andrii held a number of executive positions in various hi-tech companies. Prior to his industrial career, he spent years as a practising scientist, having participated in numerous research projects in Belgium, Germany, the United States, and Ukraine. Andrii holds a BSc and an MSc in Inorganic Chemistry, and a PhD in Physical Chemistry from Kyiv National Taras Shevchenko University. Outside his professional career, Andrii is a big fan of travel, chess, and digital drawing.
Building an initial knowledge base about the pharmaceutical industry
The BPT Analytics project started back in 2016 with a simple drug discovery market research blog at BiopharmaTrend.com, where we posted our own regular observations about innovations and technology trends in the pharmaceutical industry, focusing on what companies did to advance the field. At that time, we began a systematic effort to collect data about as many drug discovery and biotech companies and startups as we possibly could. The idea was to create a large enough database of properly labelled companies to see whether we could later train machine learning models on it. In 2019 we were awarded a Catalyst Grant to advance our efforts.
Today, we have a database of more than 7,000 pharma/biotech companies and over 3,000 investors active in the area. The list of companies is matched with numerous other databases and information resources, including clinical trials, marketed drugs, research papers and patents, funding rounds, R&D partnerships, and other aspects important for understanding each company’s role and position in the pharmaceutical landscape. We gather data from numerous sources, including our web-parsing engine, connections to external APIs, data supplied by users, and data collected manually.
Importantly, we have built the infrastructure that lets our freelancers manually curate the incoming data, which has allowed us to accelerate and scale up our curation effort. To make this data useful for pharmaceutical professionals and other decision-makers, we are building BPT Analytics, a subscription-based web interface where users can conduct their own market research using our data, with advanced filters and powerful visualization tools. The interface is currently in private beta testing for basic functionality.
On the horizon: using NLP to automate ontology construction
While well-organised manual data curation is one way to build a useful market intelligence service for the pharma industry, it is certainly a limited value proposition. For example, our search is limited to exact keyword-based indexing, without any semantic search options. It means that we can only find information using exact terms and parameters. If a document contains a slight variation of the same term, our search will not be able to find that document.
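To make the limitation concrete, here is a minimal sketch in Python that contrasts exact keyword lookup with a tolerant lookup. It is an illustration only, not our production search: the function names, the sample documents, and the 0.75 similarity cutoff are all assumptions, and the tolerant variant uses simple fuzzy string matching from the standard library rather than true semantic search.

```python
import difflib
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def exact_search(query, documents):
    """Return indices of documents that contain the query as an exact token."""
    q = query.lower()
    return [i for i, doc in enumerate(documents) if q in tokenize(doc)]


def fuzzy_search(query, documents, cutoff=0.75):
    """Return indices of documents with at least one token similar to the query.

    The cutoff is an assumed value chosen for this toy example.
    """
    q = query.lower()
    hits = []
    for i, doc in enumerate(documents):
        if difflib.get_close_matches(q, tokenize(doc), n=1, cutoff=cutoff):
            hits.append(i)
    return hits


# Hypothetical sample documents, for illustration only.
DOCS = [
    "The startup develops antibody therapeutics.",
    "New monoclonal antibodies entered phase II trials.",
    "The company focuses on small molecules.",
]

print(exact_search("antibody", DOCS))  # misses the plural form in the second document
print(fuzzy_search("antibody", DOCS))  # also finds the close variant "antibodies"
```

Fuzzy matching only catches spelling-level variation. Synonyms or paraphrases would still be missed, which is where semantic approaches come in.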
Another limitation is that all labelling has to be done manually for each entity, and all entities have to be manually associated in the database, which is extremely resource-demanding and inefficient. To provide a new level of data mining capabilities for our future customers, we are now exploring ways to apply natural language processing (NLP) technologies in our project. One of the key tasks we hope to solve with NLP models is automating domain-specific entity recognition (identifying biotech companies, drugs, diseases, therapeutic modalities, etc.) in vast amounts of mostly unstructured data, and grouping the recognised entities by a number of requirements.
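As a rough picture of what entity recognition and grouping produce, the sketch below tags known terms in a sentence using a small dictionary (a "gazetteer") and then groups them by type. This is a deliberately simple baseline, not a trained NLP model and not our actual pipeline. The entity names, labels, and function names are hypothetical.

```python
import re

# Hypothetical gazetteer: entity type -> known surface forms (illustrative only).
GAZETTEER = {
    "COMPANY": ["Acme Biotech"],
    "DRUG": ["examplumab"],
    "DISEASE": ["melanoma", "rheumatoid arthritis"],
    "MODALITY": ["monoclonal antibody", "gene therapy"],
}


def extract_entities(text, gazetteer=GAZETTEER):
    """Return (surface form, label) pairs in order of appearance in the text."""
    found = []
    for label, terms in gazetteer.items():
        for term in terms:
            # Whole-word, case-insensitive match of the dictionary term.
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                found.append((match.start(), match.group(0), label))
    found.sort()
    return [(surface, label) for _, surface, label in found]


def group_by_type(entities):
    """Group recognised surface forms under their entity type."""
    grouped = {}
    for surface, label in entities:
        grouped.setdefault(label, []).append(surface)
    return grouped


# Hypothetical input sentence.
SAMPLE = "Acme Biotech is testing examplumab, a monoclonal antibody, in melanoma."

entities = extract_entities(SAMPLE)
print(entities)
print(group_by_type(entities))
```

A dictionary lookup like this only finds terms someone has already listed. The appeal of a trained model is that it can recognise entities that have never been curated by hand.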
In time, this should allow us to extract relations between the entities and build knowledge graphs, a key component in understanding the pharmaceutical R&D market and deriving macro- and micro-trends and business insights for the user.
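For readers unfamiliar with the term, a knowledge graph can be thought of as a set of (subject, relation, object) triples that can be queried in either direction. The following minimal sketch assumes the relations have already been extracted. The class, the relation names, and the triples are hypothetical and serve only to show the data structure.

```python
from collections import defaultdict


class KnowledgeGraph:
    """A tiny in-memory store of (subject, relation, object) triples."""

    def __init__(self):
        # subject -> set of (relation, object) pairs
        self._edges = defaultdict(set)

    def add(self, subject, relation, obj):
        self._edges[subject].add((relation, obj))

    def objects(self, subject, relation):
        """All objects linked from a subject by the given relation."""
        return sorted(o for r, o in self._edges.get(subject, ()) if r == relation)

    def subjects(self, relation, obj):
        """All subjects linked to an object by the given relation."""
        return sorted(
            s for s, edges in self._edges.items() if (relation, obj) in edges
        )


# Hypothetical triples, as an upstream relation-extraction step might emit them.
TRIPLES = [
    ("Acme Biotech", "develops", "examplumab"),
    ("Beta Pharma", "develops", "samplinib"),
    ("examplumab", "targets", "melanoma"),
    ("samplinib", "targets", "melanoma"),
    ("Acme Biotech", "partners_with", "Beta Pharma"),
]

graph = KnowledgeGraph()
for subject, relation, obj in TRIPLES:
    graph.add(subject, relation, obj)

print(graph.objects("Acme Biotech", "develops"))  # drugs developed by one company
print(graph.subjects("targets", "melanoma"))      # drugs aimed at one disease
```

Queries such as "which drugs target this disease, and who develops them" then become simple traversals over the graph rather than manual lookups.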
Challenges to overcome
Integrating NLP models into the existing project is a tricky endeavour, and we will need to expand our expertise substantially to achieve this goal. What we already have is unique domain-specific expertise in the life sciences industry and biotech market, and a large corpus of quality data on which to train models. We are now exploring our potential customers’ needs in order to formulate use cases and pipeline requirements for the NLP system and its output.