GATE & PYTHON¶


In this tutorial, we will see various ways of interacting with Java GATE from Python.

  • The Python modules are still experimental! If you find any bugs, please report them!
  • For the Python documentation and more, you can visit https://gatenlp.github.io/python-gatenlp.
  • This presentation mainly uses a Jupyter notebook.

Mehmet E. Bakir

GATE and PYTHON¶


Currently, GATE and Python can interact in three main ways:

  1. Python GateNLP : a Python package for NLP tasks, similar to Java GATE.
  2. GATE Slave : a module that enables running almost anything in Java GATE from Python.
  3. GATE Python Plugin : a GATE Processing Resource that allows editing and running Python code within Java GATE.

Part-1 Python GateNLP¶


An NLP framework written in pure Python. Its document and annotation representations are very similar to those of the Java GATE framework.

  • A document can have any number of named annotation sets and features.
  • An annotation set can have any number of annotations.
  • Annotations are metadata associated with a span of a document. Each annotation can have any number of features.

Python GateNLP and Java GATE exchange documents in the bdocjs/bdocym/bdocmp formats, via the Java GATE Format Bdoc plugin.
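To make the model concrete, here is a minimal sketch (previewing the gatenlp API used in the hands-on part below; the file name is just an example) that creates a document, adds a feature and an annotation, and saves it in bdocjs format for Java GATE:

from gatenlp import Document

doc = Document("Barack Obama visited Vienna.")        # a document with some text
doc.features["source"] = "example"                    # a document feature
anns = doc.annset("MySet")                            # a named annotation set
anns.add(0, 12, "Person", {"kind": "name_surname"})   # an annotation over offsets 0..12 ("Barack Obama")
doc.save("example.bdocjs")                            # bdocjs file, readable by Java GATE with the Format Bdoc plugin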

Get Your Hands Dirty!¶


Python GateNLP - Installation first!¶

Main documentation page: https://gatenlp.github.io/python-gatenlp/installation.html¶

Environment Setup¶


  • First make sure you have Python. We recommend installing Python via Conda with one of:
    • Anaconda (https://www.anaconda.com/products/individual), or
    • Miniconda (https://docs.conda.io/en/latest/miniconda.html)
  • Then, create a virtual environment. Virtual environments give each project its own isolated environment, so each project can have its own dependencies.
    • Create an environment named "gatenlp" with Python 3.9: conda create -n gatenlp python=3.9
    • To activate this environment, use: conda activate gatenlp

Installation Cont.¶

Install from GitHub¶


To install the latest released version of the gatenlp package with all its dependencies: python -m pip install gatenlp[all]

The RECOMMENDED way (at least for now!) is to install the latest gatenlp code with all dependencies from GitHub:

  1. Go to the GitHub page https://github.com/GateNLP/python-gatenlp.
  2. Clone the repository to a local directory and change directory into python-gatenlp.
  3. Run python -m pip install -e .[all]

Installation Cont.¶

Run with Jupyter¶


Requirements for running gatenlp in a Jupyter notebook (these should already be installed if you installed from GitHub):

  • ipython
  • jupyter
  • ipykernel

To create a kernel for your conda environment run:

python -m ipykernel install --user --name gatenlp --display-name "Python gatenlp"

To use an existing notebook (for example this notebook), run the following and change the kernel to the gatenlp environment:

jupyter notebook notebookname.ipynb

Let's Code!¶

Import the classes of interest from gatenlp

In [1]:
from gatenlp import Document

Now load a plain text document!

  • To load a document, use the load method of the Document class, which can load local or remote files.
In [2]:
# You can just add some text to the document 
# text = """Some text forms the document content"""
# doc = Document(text)
# OR load from a local file
doc = Document.load('./data/document-testing.txt')
# OR
# load a remote file with the same method
# i.e., doc = Document.load("https://gatenlp.github.io/python-gatenlp/testdocument1.txt")

Print the document content

In [3]:
print(doc)
Document(This is a test document.

It contains just a few sentences. 
Here is a sentence that mentions a few named entities like 
the persons Barack Obama or Ursula von der Leyen, locations
like New York City, Vienna or Beijing or companies like 
Google, UniCredit or Huawei. And here is Donald Trump, it may not be the real one :P

Lets say Boris Johnson aka Bojo tweets from his BorisJohnson account, would be nice to match them!

Here we include a URL https://gatenlp.github.io/python-gatenlp/ 
and a fake email address john.doe@hiscoolserver.com as well 
as #some #cool #hastags and a bunch of emojis like 😽 (a kissing cat),
👩‍🏫 (a woman teacher), 🧬 (DNA), 
🧗 (a person climbing), 

Here we test a few different scripts, e.g. Hangul 한글 or 
simplified Hanzi 汉字 or Farsi فارسی and Arabic ,اَلْعَرَبِيَّةُ, which goes from right to left.

,features=Features({}),anns=[])

Printing the document shows the document text and indicates that there are no document features and no annotations! We expect this, since we just loaded a plain text file.

In the Jupyter notebook/lab, a gatenlp document can also be displayed graphically, either by:

  1. passing it to the IPython display function, or
  2. leaving it as the last expression in a cell.

Here we pass the doc to the display function!¶

In [4]:
from IPython.display import display
display(doc)
We get the same view by putting doc at the end of the cell (it can be the only code in the cell).¶
In [5]:
doc
Out[5]:

There are three areas in this layout:

  • Upper-left contains the document text,
  • Upper-right lists the annotation sets and annotation types,
  • Bottom bar lists the document or annotation features.

Note: The layout currently shows only the document text, since we have not added any features to the document or annotations yet!

Document features¶

Let's add some document features.

In [6]:
doc.features["loaded_from"] = "Local file system."
import datetime
doc.features["loading_date"] = str(datetime.datetime.now())
doc.features["purpose"] = "Testing gatenlp."
doc.features["numeric_value"] = 22
doc.features["dict_of_objects"] = {"dict_key": "dict_value", "a_list": [1,2,3,4,5]}

The document features map feature keys to feature values and behave like a Python dictionary. The keys have to be strings; the values can be anything, as long as they can be serialized with JSON, e.g. dictionaries, lists, numbers, strings and booleans. (This is needed for exchanging documents with Java GATE.)
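This is also why loading_date above was stored as str(...): a raw datetime object is not JSON-serializable, so it could not be exchanged with Java GATE. A quick sketch to illustrate:

In [ ]:
import json, datetime
# A dict of JSON-serializable values works fine:
json.dumps({"numbers": [1, 2, 3], "when": str(datetime.datetime.now())})
# A raw datetime object would fail:
# json.dumps({"when": datetime.datetime.now()})   # raises TypeError: not JSON serializable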

Now view doc with features!

In [7]:
doc
Out[7]:

Retrieve a feature value¶

In [8]:
doc.features["purpose"]
Out[8]:
'Testing gatenlp.'

If a feature does not exist, None is returned, or a default value if one is specified:¶

In [9]:
print(doc.features.get("doesntexist"))
print(doc.features.get("doesntexist", "NA!"))
None
NA!

Annotations & Annotation Sets¶


Annotations cover a range of characters within the document. Annotations can overlap arbitrarily, and as many as needed can be created.

Annotations consist of the following parts:

  • The start and end offsets, which identify the span of text that the annotation covers,
  • A type, which is an arbitrary name that identifies what kind of thing the annotation describes, e.g. "Token", "Person",
  • Features; these are similar to document features, i.e. an arbitrary set of feature key/value pairs which provide more information about the annotation, e.g. a Token annotation can include lemma, POS tag and orthography features.

Annotations are organized in annotation sets. Each annotation set:¶

  • Holds a collection of Annotations.
  • Usually has a name; only the default set does not have one!
  • As many annotation sets as needed can be created.

Let us manually add a few annotations to the document!¶

Firstly, create an annotation set with the annset method¶

In [10]:
# create and get an annotation set with the name "Set1"
annset = doc.annset("Set1")

Add an annotation to the set which refers to the first word in the document, "This". The range of characters for this word starts at offset 0 and the word is 4 characters long, so the "start" offset is 0 and the "end" offset is 0+4=4. The end offset always points to the offset just after the last character of the range.
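As a quick check of the offset arithmetic, slicing the document text with those offsets gives back the word (the end offset is exclusive):

In [ ]:
# Characters 0..4 (end exclusive) cover the first word
doc.text[0:4]   # -> 'This'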

In [11]:
#Now, add an annotation
annset.add(0,4,"Annot1",{"feature1_key": "feature1_value"})
Out[11]:
Annotation(0,4,Annot1,features=Features({'feature1_key': 'feature1_value'}),id=0)
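The add call returns the annotation object it created. Annotations can also be inspected programmatically; a small sketch, assuming the start, end, type and features attributes suggested by the printed representation above:

In [ ]:
# Iterate over the annotations in the set and print their parts
for ann in annset:
    print(ann.start, ann.end, ann.type, ann.features)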

Add more¶

In [12]:
# Add some more!
annset.add(0,4,"Token",{"kind": "token1'"})
annset.add(5,7,"Token",{"kind": "token2'"})
annset.add(8,9,"Token",{"kind": "token3'"})
annset.add(10,14,"Token",{"kind": "token4'"})
annset.add(15,24,"Token",{"kind": "token5"})
annset.add(0,24,"Sentence",{"what": "The first 'sentence' annotation"});
In [14]:
# Now, Visualise the document.
doc
Out[14]:
  • To show the features of an annotation, click on the coloured text in the document.
  • To show the document features, click on the "Document" title at the top.
  • Clicking on overlapping annotations opens a dialog which allows selecting which of the overlapping annotations' features to show.

Gazetteers¶


Reminder: Gazetteers allow matching document text against lists of words.

Firstly, Tokenize the Document¶

In [15]:
# tokenize with NLTK tokenizer (doesn't create space token by default)
from gatenlp.processing.tokenizer import NLTKTokenizer

Pass an NLTKWordTokenizer object to NLTKTokenizer, then run the tokenizer.¶

See the resulting PyToken annotations.

In [16]:
# Tokenize the document; let's use an NLTK tokenizer
from nltk.tokenize.destructive import NLTKWordTokenizer
nltk_tokenizer = NLTKTokenizer(nltk_tokenizer=NLTKWordTokenizer(), out_set="", token_type="PyToken")
doc = nltk_tokenizer(doc)
doc
Out[16]:

Now we can use the gazetteer module.¶

The gatenlp.processing.gazetteer module provides gazetteer classes.

In [17]:
from gatenlp.processing.gazetteer import TokenGazetteer

Let's use an existing Java GATE gazetteer first¶

The mp_lists.def file available in the gate-hate app

In [18]:
# you may need to modify this path
gazFile = "./../../gate-hate-cloud-small/hate-resources/gazetteer/politics/mp_lists.def"
In [19]:
gazetteer = TokenGazetteer(gazFile, fmt="gate-def", all=True, skip=False, outset="", tokentype="PyToken");
doc = gazetteer(doc);
doc
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-19-fe41e7ae2d19> in <module>
----> 1 gazetteer = TokenGazetteer(gazFile, fmt="gate-def", all=True, skip=False, outset="", tokentype="PyToken");
      2 doc = gazetteer(doc);
      3 doc

/data/johann/work-git/python-gatenlp/gatenlp/processing/gazetteer.py in __init__(self, source, fmt, source_sep, source_encoding, cache_source, tokenizer, all, skip, outset, outtype, annset, tokentype, feature, septype, splittype, withintype, mapfunc, ignorefunc, getterfunc, listfeatures, listtype)
    170         self.logger = init_logger(__name__)
    171         self.logger.setLevel(logging.DEBUG)
--> 172         self.append(source, fmt=fmt, listfeatures=listfeatures, listtype=listtype)
    173 
    174     def append(

/data/johann/work-git/python-gatenlp/gatenlp/processing/gazetteer.py in append(self, source, fmt, source_sep, source_encoding, listfeatures, listtype)
    218             if listtype is None:
    219                 listtype = self.outtype
--> 220             with open(source, "rt", encoding=source_encoding) as infp:
    221                 for line in infp:
    222                     line = line.rstrip("\n\r")

FileNotFoundError: [Errno 2] No such file or directory: './../../gate-hate-cloud-small/hate-resources/gazetteer/politics/mp_lists.def'
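Since the gazetteer path above is specific to one machine, the traceback is expected when the gate-hate resources are not available locally. A small guard (just a sketch, using the same assumed path as above) applies the gazetteer only when the list file actually exists:

In [ ]:
import os

if os.path.exists(gazFile):
    gazetteer = TokenGazetteer(gazFile, fmt="gate-def", all=True, skip=False, outset="", tokentype="PyToken")
    doc = gazetteer(doc)   # only apply the gazetteer if the .def file was found
else:
    print("Gazetteer file not found, skipping:", gazFile)
doc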

Display the doc¶

Check the newly created annotations

In [ ]:
doc = gazetteer(doc);
doc

Or, we can create new gazetteers from Python lists¶

In [ ]:
# Create a gazetteer list in a custom format
gazlist = [
    ("Barack Obama", dict(url="https://en.wikipedia.org/wiki/Barack_Obama", kind="name_surname")),
    ("Obama", dict(url="https://en.wikipedia.org/wiki/Barack_Obama", kind="surname")),
    ("Donald Trump", dict(url="https://en.wikipedia.org/wiki/Donald_Trump",kind="name_surname"))
]
In [ ]:
# Tokenize the strings from our gazetteer list as well
def text2tokenstrings(text):
    tmpdoc = Document(text)
    tmpdoc = nltk_tokenizer(tmpdoc)
    tokens = list(tmpdoc.annset().with_type("PyToken"))
    return [tmpdoc[tok] for tok in tokens]

gazlist = [(text2tokenstrings(txt), feats) for txt, feats in gazlist]
gazlist

Ready to apply the Python list as a gazetteer to the document¶

In [ ]:
gazetteer = TokenGazetteer(gazlist, fmt="gazlist", all=True, skip=False, outset="", outtype="PyLookup",
                          annset="", tokentype="PyToken")

Display the doc¶

Check the PyLookup annotations

In [ ]:
doc = gazetteer(doc)
doc

Loading and saving using various document formats¶

Documents can also be converted to and from a plain Python dictionary representation using the methods doc.to_dict() and Document.from_dict(thedict), which can be used to exchange the document with other tools.

In [ ]:
# Convert the document to a dictionary representation:
as_dict = doc.to_dict()
as_dict
In [ ]:
# Get a copy by creating a new Document from the dictionary representation
doc_copy = Document.from_dict(as_dict)
doc_copy

Store the document for Java GATE use¶

In [ ]:
# Save the document in bdocjs format
doc.save("./data/docPy2Java.bdocjs")

Exchanging annotated files between Python and Java GATE¶

  1. Now start Java GATE
    • Load the Format Bdoc plugin
  2. Load the docPy2Java.bdocjs file
  3. View the annotations
  4. Add new annotations; you can run ANNIE (do not forget to disable Document Reset)
  5. Store the modified document and name it docJava2Py.bdocjs

Load the docJava2Py.bdocjs document into Python GateNLP¶

In [ ]:
jDoc = Document.load("./data/docJava2Py.bdocjs")
In [ ]:
jDoc

GATE and PYTHON¶


Currently, GATE and Python can interact in three main ways:

  1. Python GateNLP : COMPLETED!
  2. GATE Slave : a module that enables running almost anything in Java GATE from Python.
  3. GATE Python Plugin : a GATE Processing Resource that allows editing and running Python code within Java GATE.

Part-2 GATE Slave¶

  • Allows running a Java GATE process from Python.
  1. Java and Python interact via a socket connection.
  2. The Python side sends requests to Java; the Java side executes each request via GATE and then sends the results back to Python.


Import the gateslave module; also import json for loading the tweet files.¶

In [ ]:
import json
from gatenlp.gateslave import GateSlaveAnnotator

Create a corpus¶

In [ ]:
corpus = []
with open("./data/tweetGateHate2GateSlave.jsonl", "rt") as infp:
    for line in infp:
        tweet = json.loads(line)
        text=tweet["text"]
        if "full_text" in tweet:
            text= tweet["full_text"]
        doc = Document(text)
        for fname in ["id", "lang", "reply_count", "retweet_count", "quoted_status_id"]:
            doc.features[fname] = tweet[fname]
        corpus.append(doc)
print("Corpus created, number of documents:", len(corpus))

Show a loaded document (not yet annotated)!¶

In [ ]:
corpus[1]

Create a GateSlaveAnnotator object!¶

In [ ]:
gs_app = GateSlaveAnnotator(pipeline="./../../gate-hate-cloud-small/application.xgapp", 
                           gatehome="/home/memin/gate_developer/GATE_Developer_8.6.1")
  1. The pipeline parameter should point to the location of a GATE application (.xgapp file).
  2. gatehome points to the GATE installation folder.
In [ ]:
# Send documents to the Java Gate
In [ ]:
gs_app.start()
for idx, doc in enumerate(corpus):
    doc = gs_app(doc)
    corpus[idx] = doc
gs_app.finish()

Display the annotated documents¶

In [ ]:
from IPython.display import display   # display several documents from one cell
display(corpus[0])
display(corpus[1])
# display(corpus[2])

To dig into greater detail¶

visit https://gatenlp.github.io/python-gatenlp/gateslave

GATE and PYTHON¶


Currently, GATE and Python can interact in three main ways:

  1. Python GateNLP : COMPLETED!
  2. GATE Slave : COMPLETED!
  3. GATE Python Plugin : a GATE Processing Resource that allows editing and running Python code within Java GATE.

Part-3 GATE Python Plugin¶

  1. The GATE Python Plugin is a GATE (Java) processing resource.
  2. It is called PythonPr and allows editing and running Python code within Java GATE.
  3. The Python API for processing documents is the Python gatenlp package.
    • The plugin provides its own copy of a specific version of the gatenlp package, which is used by default, but it is possible to use whatever version of the gatenlp package is installed on the system instead.

For more: http://gatenlp.github.io/gateplugin-Python/PythonPr

Requirements:¶

Install Python version 3.6 or later (3.7 or later is highly recommended!) and the sortedcontainers package.

If you have not created an environment yet:

  • First create an environment: conda create -n gatenlp python=3.9
  • To activate it, run: conda activate gatenlp
  • Then install sortedcontainers: conda install -c conda-forge sortedcontainers

Loading the Python Plugin into the GATE GUI (GATE Developer)¶

  • Requires GATE 8.6.1 or later
  • In the CREOLE plugin manager, click the “+” button then enter the following Maven coordinates
    • Group: uk.ac.gate.plugins
    • Artifact: python
    • Version: 2.4-SNAPSHOT

Now run the following code in Java GATE¶

  1. Create an empty Python file somewhere.
  2. Try the code written below.

PythonPr Init Parameters¶

Parameters that have to be set when the processing resource is created:

  • pythonProgram (ResourceReference, default: empty): the location of the Python source code that you want to run. Use the file selection dialog.

When a pipeline that contains the PythonPr processing resource is run, the following main steps are involved:

  • The Python program runs in a separate process.
  • The Python program must:
    • implement a function or callable class that uses the @gatenlp.GateNlpPr decorator
    • invoke the gatenlp.interact() function (see examples below)
  • The processing resource sends each document to the Python program.
  • The @GateNlpPr-decorated function, or the __call__ method of the @GateNlpPr-decorated class, is invoked with the document passed to it.
  • The function can use the gatenlp API to modify the document; all the changes are recorded.
  • The recorded changes are sent back to PythonPr, which applies them to the document in Java GATE.

Here is a simple example Python program which splits the document into whitespace-separated tokens using a simple regular expression and creates an annotation of type "Token" in the default annotation set for each token.

import re
from gatenlp import GateNlpPr, interact

@GateNlpPr
def run(doc, **kwargs):
    set1 = doc.annset()
    set1.clear()
    text = doc.text
    whitespaces = [m for m in re.finditer(r"[\s,.!?]+|^[\s,.!?]*|[\s,.!?]*$", text)]
    for k in range(len(whitespaces) - 1):
        fromoff = whitespaces[k].end()
        tooff = whitespaces[k + 1].start()
        set1.add(fromoff, tooff, "Token", {"tokennr": k})
    doc.features["nr_tokens"] = len(whitespaces) - 1

interact()  # must be called

Instead of a function, a callable class can be implemented with the @GateNlpPr decorator.

  • The class must implement the __call__ method, but in addition can also implement the start, finish, reduce and result methods.
  • The following example implements the same tokenizer as above in a class but also counts and prints out the total number of tokens over all documents. Again the interact() call must be placed at the end of the Python script.

import re
from gatenlp import GateNlpPr, interact, logger

@GateNlpPr
class MyProcessor:
    def __init__(self):
        self.tokens_total = 0

    def start(self, **kwargs):
        self.tokens_total = 0

    def finish(self, **kwargs):
        logger.info("Total number of tokens: {}".format(self.tokens_total))

    def __call__(self, doc, **kwargs):
        set1 = doc.annset()
        set1.clear()
        text = doc.text
        whitespaces = [m for m in re.finditer(r"[\s,.!?]+|^[\s,.!?]*|[\s,.!?]*$", text)]
        nrtokens = len(whitespaces) - 1
        for k in range(nrtokens):
            fromoff = whitespaces[k].end()
            tooff = whitespaces[k + 1].start()
            set1.add(fromoff, tooff, "Token", {"tokennr": k})
        doc.features["nr_tokens"] = nrtokens
        self.tokens_total += nrtokens

interact()  # must be called at the end of the script

GATE and PYTHON¶


Currently, GATE and Python can interact in three main ways:

  1. Python GateNLP : COMPLETED!
  2. GATE Slave : COMPLETED!
  3. GATE Python Plugin : COMPLETED!

THANK YOU¶

Now you can dig into https://gatenlp.github.io/python-gatenlp/ to learn more!