Bioinformatics FAQ Bot

SD Python Monthly Meetup

Hobson Lane, Travis Harper

Dec 19, 2019


Bioinformatics for Real

UCSD Extension Data Science for Digital Health
Enroll at: bit.ly/ucsd-ds
Discount Code ($100 off): UCSDDSDHWI20

Thank you Travis!

  • Analyzing WikiQA
  • Architecting a Transformer

QA Bot

  • WikiQA
  • ASNQ (Answer Sentence Natural Questions)

SOTA

WikiQA State of the Art

Basic Search QA

  • Find the question in a DB (key)
  • Respond with the associated answer (value), as in the sketch below
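
A minimal sketch of the key/value idea, assuming a plain Python dict stands in for the question DB (the questions and answers here are placeholders):

# Toy question DB: normalized question text is the key, a canned answer is the value.
faq = {
    'what is a nucleotide?': 'A basic building block of DNA and RNA.',
    'who discovered radiation?': 'Marie Curie (the answer from the search example later).',
}

def basic_search_qa(question):
    """Look up the question (key) and respond with the associated answer (value)."""
    return faq.get(question.strip().lower(), "Sorry, I don't know that one.")

print(basic_search_qa('What is a Nucleotide?'))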

Infinite Search QA

  • Translate question to a statement
  • Search Wikipedia

Example: Q -> A

>>> question = "Who discovered radiation?"
>>> statement = question.replace('Who', '[MASK] [MASK]').replace('?', '.')
>>> statement
'[MASK] [MASK] discovered radiation.'

Search Results

DuckDuckGo: “discovered radiation”

Person that discovered radiation

Answer

Marie Curie

Scalable Search: O(log(N))

  • Discrete index
  • Sparse BOW (bag-of-words) vectors, as in the sketch below
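
A rough sketch of those two ideas together, assuming scikit-learn for the sparse BOW vectors (the documents are placeholders). The inverted index is what makes lookups sub-linear instead of a full scan:

from collections import defaultdict

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    'Marie Curie studied radiation and discovered radium.',
    'A nucleotide is a basic building block of DNA.',
    'WikiQA is a question answering benchmark.',
]

# Sparse BOW vectors: a scipy CSR matrix that is mostly zeros.
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)

# Discrete (inverted) index: token -> ids of the documents containing it.
inverted_index = defaultdict(set)
analyze = vectorizer.build_analyzer()
for doc_id, doc in enumerate(docs):
    for token in analyze(doc):
        inverted_index[token].add(doc_id)

print(inverted_index['radiation'])   # which docs to score for a 'radiation' query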

Synonyms & Typos

  • Stemming
  • Lemmatizing (both sketched below)
  • Spelling corrector
  • BPE (byte pair encoding)
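
A small sketch of the first two bullets, assuming NLTK for stemming and the same spaCy model the bot loads for lemmatizing:

from nltk.stem.porter import PorterStemmer
import spacy

stemmer = PorterStemmer()
nlp = spacy.load('en_core_web_md')

words = 'genes genetic sequenced sequencing'

# Stemming chops suffixes with rules; lemmatizing maps tokens to dictionary forms.
print([stemmer.stem(w) for w in words.split()])
print([token.lemma_ for token in nlp(words)])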

Examples

  • Full text search in Postgres
  • Trigram indexes in databases (toy version sketched below)
  • Elasticsearch
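
To show what a trigram index buys you (Postgres's pg_trgm extension works on roughly this principle), here is a toy pure-Python version: index character trigrams, then rank terms by trigram overlap so typos still match. The glossary terms are placeholders.

def trigrams(text):
    """Character trigrams of a padded, lowercased string."""
    padded = '  ' + text.lower() + ' '
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

terms = ['nucleotide', 'allele', 'genome', 'chromosome']
index = {term: trigrams(term) for term in terms}

def fuzzy_lookup(query):
    """Return (similarity, term) for the best trigram-overlap (Jaccard) match."""
    q = trigrams(query)
    return max((len(q & t) / len(q | t), term) for term, t in index.items())

print(fuzzy_lookup('nucleotyde'))   # the typo still finds 'nucleotide'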

Prefilter

  • PageRank
  • Sparse TFIDF vectors, as in the sketch below
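
A sketch of the TFIDF half of the prefilter, assuming scikit-learn and placeholder documents: score every document cheaply with sparse vectors, then keep only the top few candidates for the slower semantic stage.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    'Marie Curie discovered radium and studied radiation.',
    'A nucleotide is a basic building block of DNA and RNA.',
    'WikiQA is a benchmark dataset for question answering.',
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)   # sparse TFIDF matrix

def prefilter(question, top_n=2):
    """Return the ids of the top_n documents worth sending to a slower reranker."""
    scores = cosine_similarity(vectorizer.transform([question]), doc_vectors)[0]
    return scores.argsort()[::-1][:top_n]

print(prefilter('Who discovered radiation?'))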

Examples

  • Full Text (keywords): O(log(N))
  • TFIDF (Elasticsearch): O(log(N))
  • TFIDF + Semantic Search: O(L)

Academic Search Approaches

  • Edit distance (sketched below)
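
Edit (Levenshtein) distance is simple enough to sketch directly; this is the textbook dynamic-programming version, not tied to any particular paper:

def edit_distance(a, b):
    """Minimum number of single-character inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # delete ca
                            curr[j - 1] + 1,           # insert cb
                            prev[j - 1] + (ca != cb),  # substitute ca -> cb
                            ))
        prev = curr
    return prev[-1]

print(edit_distance('nucleotide', 'nucleotyde'))   # 1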

Knowledge-based QA

  • Extract information from Wikipedia
  • Build a Knowledge Graph in a DB
  • Query the Knowledge Graph
  • Inference on the Knowledge Graph (toy version sketched below)
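
A toy end-to-end illustration of those four steps; the hand-written triples and the dict-of-sets "graph" stand in for real extraction from Wikipedia and a real graph database.

# Build a tiny knowledge graph: subject -> relation -> set of objects.
kg = {}

def add_triple(subject, relation, obj):
    kg.setdefault(subject, {}).setdefault(relation, set()).add(obj)

# "Extract information from Wikipedia" (hand-written triples here).
add_triple('Marie Curie', 'discovered', 'radium')
add_triple('Marie Curie', 'discovered', 'polonium')
add_triple('radium', 'is_a', 'chemical element')

# "Query the Knowledge Graph"
print(kg['Marie Curie']['discovered'])

# "Inference": chain two hops to answer what kind of thing Marie Curie discovered.
for thing in kg['Marie Curie']['discovered']:
    for category in kg.get(thing, {}).get('is_a', set()):
        print(thing, 'is a', category)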

Transformer

Transformer Test Example Output

Transformer

Transformer Wizard of Oz Question Answers

nlpia-bot imports

import re               # used by Bot.reply() below
import pandas as pd     # used by Bot.__init__() below

from qary.etl import glossaries
from qary import spacy_language_model

nlp = spacy_language_model.load('en_core_web_md')

qary.glossary_bots.Bot

class Bot:
    """Glossary bot: loads a glossary and builds word vectors for its terms and definitions."""

    def __init__(self, domains=('dsdh',)):
        global nlp
        self.nlp = nlp
        self.glossary = glossaries.load(domains=domains)
        self.glossary.fillna('', inplace=True)
        self.vector = dict()
        self.vector['term'] = pd.DataFrame({s: nlp(s or '').vector for s in self.glossary['term']})
        self.vector['definition'] = pd.DataFrame({s: nlp(s or '').vector for s in self.glossary['definition']})

qary.glossary_bots.Bot.reply

    def reply(self, statement):
        """ Suggest responses to a user statement string with [(score, reply_string)..]"""
        responses = []
        match = re.match(r'\b(what\s(is|are))\b([^\?]*)(\?*)', statement.lower())
        if match:
            responses.append((1, str(match.groups())))
        else:
            responses = [(1.0, "I don't understand")]
        return responses

glossary_bots test

>>> from qary.skills import glossary_bots
>>> bot = glossary_bots.Bot()
>>> bot.nlp.lang
'en'
>>> list(bot.vector['term'].keys())
['ACP (American College of Physicians)',
 'AKI (Acute Kidney Injury)',
 'Allele',
 ...
 'Xiaoice']

glossary_bots test

>>> bot.reply('Nucleotide')
[(1.0, "I don't understand")]
>>> bot.reply('What is a Nucleotide')
[(1, "('what is', 'is', ' a nucleotide', '')")]

Now strip whitespace and stop words and look up the definition in bot.glossary.

Or use the semantic vectors…
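
One way to "use the semantic vectors" (a sketch, not the actual qary code): compare the question's spaCy document vector to the term vectors the Bot already built in self.vector['term'] and rank the glossary terms by cosine similarity.

import numpy as np

def semantic_term_match(bot, statement, top_n=3):
    """Rank glossary terms by cosine similarity between spaCy vectors."""
    q = bot.nlp(statement).vector
    q_norm = np.linalg.norm(q) or 1.0
    scores = []
    for term in bot.vector['term'].keys():
        v = bot.vector['term'][term].values
        v_norm = np.linalg.norm(v) or 1.0
        scores.append((float(np.dot(q, v)) / (q_norm * v_norm), term))
    return sorted(scores, reverse=True)[:top_n]

The top-ranked term can then be looked up in bot.glossary to build the (score, definition) replies. The regex hack below takes the first route instead: pattern-match the question and look the term up directly.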

regex hack

match = re.match(
    r"\b(what\s+(is|are)\s*(not|n't)?\s+(a|an|the))\b([^\?]*)(\?*)",
    statement.lower())
if match:
    try:
        responses.append((1,
            self.glossary['definition'][match.groups()[-2].strip().lower()]))
    except KeyError:
        responses.append((1,
            str(match.groups())))

glossary_bots works!

>>> bot = Bot()
>>> bot.reply('allele')
[(1.0, "I don't understand")]
>>> bot.reply('What is a nucleotide?')
[(1,
  'The basic building blocks of DNA and ... Guanine (G), Cytosine ... ')]