In this tutorial, we will see various ways of interacting with Java GATE from Python.
Currently, GATE and Python can interact in three ways.
Python GateNLP is an NLP framework written in pure Python. Its document and annotation representations are very similar to those of the Java GATE framework.
Python GateNLP and Java GATE exchange documents in the bdocjs/bdocym/bdocmp formats, via the Java GATE Format Bdoc plugin.
conda create -n gatenlp python=3.9
conda activate gatenlp
To install the latest released version of the gatenlp package with all its dependencies:
python -m pip install gatenlp[all]
The recommended way (at least for now) is to install the latest gatenlp code with all dependencies from GitHub. Run the following from the root directory of a local clone of the repository:
python -m pip install -e .[all]
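After either install, it can be handy to confirm from Python which version ended up in the active environment. The sketch below uses only the standard library (importlib.metadata, available from Python 3.8); the helper name installed_version is made up for this example and is not part of gatenlp.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if gatenlp is installed in this environment, otherwise None.
print(installed_version("gatenlp"))
```

If this prints None inside the notebook but not in the terminal, the notebook is probably running a kernel from a different environment.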
Running gatenlp in a Jupyter notebook has some additional requirements. They should already be installed if you installed from GitHub.
To create a kernel for your conda environment run:
python -m ipykernel install --user --name gatenlp --display-name "Python gatenlp"
To use an existing notebook (for example, this one), run the following and change the kernel to the gatenlp environment:
jupyter notebook notebookname.ipynb
Import the classes of interest from gatenlp:
from gatenlp import Document
Now load a plain text document. Use the load method of the Document class, which can load local or remote files.
# You can just add some text to the document
# text = """Some text forms the document content"""
# doc = Document(text)
# OR load from a local file
doc = Document.load('./data/document-testing.txt')
# OR
# load a remote file with the same method, e.g.:
# doc = Document.load("https://gatenlp.github.io/python-gatenlp/testdocument1.txt")
Print the document content
print(doc)
Document(This is a test document. It contains just a few sentences. Here is a sentence that mentions a few named entities like the persons Barack Obama or Ursula von der Leyen, locations like New York City, Vienna or Beijing or companies like Google, UniCredit or Huawei. And here is Donald Trump, it may not be the real one :P Lets say Boris Johnson aka Bojo tweets from his BorisJohnson account, would be nice to match them! Here we include a URL https://gatenlp.github.io/python-gatenlp/ and a fake email address john.doe@hiscoolserver.com as well as #some #cool #hastags and a bunch of emojis like 😽 (a kissing cat), 👩🏫 (a woman teacher), 🧬 (DNA), 🧗 (a person climbing), Here we test a few different scripts, e.g. Hangul 한글 or simplified Hanzi 汉字 or Farsi فارسی and Arabic ,اَلْعَرَبِيَّةُ, which goes from right to left. ,features=Features({}),anns=[])
Printing the document shows the document text and indicates that there are no document features and no annotations. This is expected, since we just loaded a plain text file.
In a Jupyter notebook or JupyterLab, a gatenlp document can also be displayed graphically in two ways. The first is the display function (or method):
from IPython.display import display
display(doc)
The second is to put doc at the end of the cell (it can be the only code in the cell):
doc
There are three areas in this layout. For now it shows only doc's text, since we have not yet added any features or annotations to the document.
Let's add some document features.
doc.features["loaded_from"] = "Local file system."
import datetime
doc.features["loading_date"] = str(datetime.datetime.now())
doc.features["purpose"] = "Testing gatenlp."
doc.features["numeric_value"] = 22
doc.features["dict_of_objects"] = {"dict_key": "dict_value", "a_list": [1,2,3,4,5]}
The document features object maps feature keys to feature values and behaves like a Python dictionary. The keys have to be strings. The values can be anything that can be serialized with JSON, e.g., dictionaries, lists, numbers, strings and booleans. (This is needed for exchanging documents with Java GATE.)
Now view doc with its features:
doc
doc.features["purpose"]
'Testing gatenlp.'
print(doc.features.get("doesntexist"))
print(doc.features.get("doesntexist", "NA!"))
None
NA!
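The constraint on feature values can be checked before a value is stored. The following is a minimal sketch that uses a plain dict as a stand-in for doc.features and a hypothetical helper, is_json_serializable; it does not use gatenlp itself, so treat it as an illustration of the rule rather than of the library.

```python
import json


def is_json_serializable(value):
    """Return True if json.dumps accepts the value, False otherwise."""
    try:
        json.dumps(value)
        return True
    except (TypeError, ValueError):
        return False


# A plain dict standing in for doc.features (assumption for illustration only).
features = {
    "purpose": "Testing gatenlp.",
    "numeric_value": 22,
    "dict_of_objects": {"dict_key": "dict_value", "a_list": [1, 2, 3, 4, 5]},
}

# Check that every value above can be serialized with JSON.
print(all(is_json_serializable(v) for v in features.values()))

# A set cannot be serialized with JSON, so it would not be a suitable feature value.
print(is_json_serializable({1, 2, 3}))

# Dictionary-style lookup with a fallback for missing keys.
print(features.get("doesntexist", "NA!"))
```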
Annotations cover a range of characters within the document. They can overlap arbitrarily, and as many as needed can be created.
Annotations consist of the following parts: a start and an end offset, a type, features, and an id.
Annotation sets are created and retrieved with the annset method of the document:
# create and get an annotation set with the name "Set1"
annset = doc.annset("Set1")
Add an annotation to the set that refers to the first word in the document, "This". The range of characters for this word starts at offset 0 and its length is 4, so the "start" offset is 0 and the "end" offset is 0+4=4. The end offset always points to the offset after the last character of the range.
#Now, add an annotation
annset.add(0,4,"Annot1",{"feature1_key": "feature1_value"})
Annotation(0,4,Annot1,features=Features({'feature1_key': 'feature1_value'}),id=0)
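The offset convention can be tried out with plain string slicing, since Python slices are half-open in the same way. This sketch uses only built-in string operations; the spans_overlap helper is hypothetical and only illustrates what it means for two annotation ranges to overlap.

```python
text = "This is a test document."

# The word "This" starts at offset 0 and is 4 characters long.
start = 0
end = start + len("This")

# The slice text[start:end] excludes the character at `end`,
# so it returns exactly the covered text.
covered = text[start:end]
print(start, end, repr(covered))


def spans_overlap(start_a, end_a, start_b, end_b):
    """Return True if two half-open ranges [start, end) share at least one character."""
    return start_a < end_b and start_b < end_a


# A range over "This" (0 to 4) overlaps a range over the whole sentence (0 to 24) ...
print(spans_overlap(0, 4, 0, 24))
# ... but not a range over "is" (5 to 7).
print(spans_overlap(0, 4, 5, 7))
```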
# Add some more!
annset.add(0,4,"Token",{"kind": "token1"})
annset.add(5,7,"Token",{"kind": "token2"})
annset.add(8,9,"Token",{"kind": "token3"})
annset.add(10,14,"Token",{"kind": "token4"})
annset.add(15,24,"Token",{"kind": "token5"})
annset.add(0,24,"Sentence",{"what": "The first 'sentence' annotation"});
# Now, Visualise the document.
doc
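The offsets typed in by hand above do not have to be counted manually. As a rough sketch in plain Python (no gatenlp involved), the spans of whitespace-separated words can be computed with re.finditer; the function name whitespace_spans is an assumption made for this example.

```python
import re

first_sentence = "This is a test document."


def whitespace_spans(text):
    """Return (start, end) offsets for every run of non-whitespace characters."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


spans = whitespace_spans(first_sentence)
for start, end in spans:
    # Show each span together with the text it covers.
    print(start, end, repr(first_sentence[start:end]))
```

Each (start, end) pair could then be passed to annset.add together with a type and a feature dictionary, in the same way as the manual calls above.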
# tokenize with NLTK tokenizer (doesn't create space token by default)
from gatenlp.processing.tokenizer import NLTKTokenizer
Tokenize the document and look for the PyToken annotations in the result:
# Tokenize the document, lets use an NLTK tokenizer
from nltk.tokenize.destructive import NLTKWordTokenizer
nltk_tokenizer = NLTKTokenizer(nltk_tokenizer=NLTKWordTokenizer(), out_set="", token_type="PyToken")
doc = nltk_tokenizer(doc)
doc
The gatenlp.processing.gazetteer module provides gazetteer classes.
from gatenlp.processing.gazetteer import TokenGazetteer
This example uses the mp_lists.def file, which is available in the gate-hate app.
# you may need to modify this path
gazFile = "./../../gate-hate-cloud-small/hate-resources/gazetteer/politics/mp_lists.def"
gazetteer = TokenGazetteer(gazFile, fmt="gate-def", all=True, skip=False, outset="", tokentype="PyToken");
doc = gazetteer(doc);
doc
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-19-fe41e7ae2d19> in <module>
----> 1 gazetteer = TokenGazetteer(gazFile, fmt="gate-def", all=True, skip=False, outset="", tokentype="PyToken");
      2 doc = gazetteer(doc);
      3 doc

/data/johann/work-git/python-gatenlp/gatenlp/processing/gazetteer.py in __init__(self, source, fmt, source_sep, source_encoding, cache_source, tokenizer, all, skip, outset, outtype, annset, tokentype, feature, septype, splittype, withintype, mapfunc, ignorefunc, getterfunc, listfeatures, listtype)
    170         self.logger = init_logger(__name__)
    171         self.logger.setLevel(logging.DEBUG)
--> 172         self.append(source, fmt=fmt, listfeatures=listfeatures, listtype=listtype)
    173
    174     def append(

/data/johann/work-git/python-gatenlp/gatenlp/processing/gazetteer.py in append(self, source, fmt, source_sep, source_encoding, listfeatures, listtype)
    218         if listtype is None:
    219             listtype = self.outtype
--> 220         with open(source, "rt", encoding=source_encoding) as infp:
    221             for line in infp:
    222                 line = line.rstrip("\n\r")

FileNotFoundError: [Errno 2] No such file or directory: './../../gate-hate-cloud-small/hate-resources/gazetteer/politics/mp_lists.def'
Check newly created annotations
doc = gazetteer(doc);
doc
# Create a gazetteer list in a custom format
gazlist = [
("Barack Obama", dict(url="https://en.wikipedia.org/wiki/Barack_Obama", kind="name_surname")),
("Obama", dict(url="https://en.wikipedia.org/wiki/Barack_Obama", kind="surname")),
("Donald Trump", dict(url="https://en.wikipedia.org/wiki/Donald_Trump",kind="name_surname"))
]
# Tokenize the strings from our gazetteer list as well
def text2tokenstrings(text):
    tmpdoc = Document(text)
    tmpdoc = nltk_tokenizer(tmpdoc)
    tokens = list(tmpdoc.annset().with_type("PyToken"))
    return [tmpdoc[tok] for tok in tokens]
gazlist = [(text2tokenstrings(txt), feats) for txt, feats in gazlist]
gazlist
gazetteer = TokenGazetteer(gazlist, fmt="gazlist", all=True, skip=False, outset="", outtype="PyLookup",
annset="", tokentype="PyToken")
Check PyLookup annotation
doc = gazetteer(doc)
doc
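To get a feel for what a token-based gazetteer lookup involves, here is a simplified pure-Python illustration. It is not gatenlp's implementation: find_matches is a hypothetical function, tokens are plain strings, and it simply reports every entry found at every position, including overlapping ones. Whether that corresponds exactly to TokenGazetteer with all=True and skip=False is an assumption.

```python
def find_matches(tokens, gazlist):
    """Return (start_index, end_index, features) for every gazetteer entry found.

    tokens  -- list of token strings from the document
    gazlist -- list of (entry_tokens, features) pairs
    The end index is exclusive, like annotation end offsets.
    """
    matches = []
    for start in range(len(tokens)):
        for entry_tokens, features in gazlist:
            end = start + len(entry_tokens)
            # Compare the document tokens at this position with the entry tokens.
            if entry_tokens and tokens[start:end] == entry_tokens:
                matches.append((start, end, features))
    return matches


# Already-tokenized gazetteer entries, in the same (tokens, features) shape as gazlist above.
gaz = [
    (["Barack", "Obama"], {"kind": "name_surname"}),
    (["Obama"], {"kind": "surname"}),
    (["Donald", "Trump"], {"kind": "name_surname"}),
]

doc_tokens = "the persons Barack Obama or Ursula von der Leyen".split()
for start, end, features in find_matches(doc_tokens, gaz):
    print(start, end, doc_tokens[start:end], features)
```

Note that indices here count tokens, not characters; a real gazetteer would map them back to the character offsets of the token annotations.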
Documents can also be converted to and from a Python-only representation using the methods doc.to_dict() and Document.from_dict(thedict). This can be used to convert the document for use with other tools.
# Convert the document to a dictionary representation:
as_dict = doc.to_dict()
as_dict
# Get a copy by creating a new Document from the dictionary representation
doc_copy = Document.from_dict(as_dict)
doc_copy
# Save the document in bdocjs format
doc.save("./data/docPy2Java.bdocjs")
Open docPy2Java.bdocjs in Java GATE, process it there, and save the result as docJava2Py.bdocjs.
Then load the docJava2Py.bdocjs document into Python GateNLP:
jDoc = Document.load("./data/docJava2Py.bdocjs")
jDoc
To send documents to Java GATE from Python, import the gateslave module, and also import json for loading the tweet files:
import json
from gatenlp.gateslave import GateSlaveAnnotator
corpus = []
with open("./data/tweetGateHate2GateSlave.jsonl", "rt") as infp:
    for line in infp:
        tweet = json.loads(line)
        text = tweet["text"]
        if "full_text" in tweet:
            text = tweet["full_text"]
        doc = Document(text)
        for fname in ["id", "lang", "reply_count", "retweet_count", "quoted_status_id"]:
            doc.features[fname] = tweet[fname]
        corpus.append(doc)
print("Corpus created, number of documents:", len(corpus))
corpus[1]
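The loop above indexes tweet[fname] directly, which raises a KeyError if a tweet lacks one of the listed fields. A more defensive variant is sketched below in plain Python. The two sample tweets are made up, io.StringIO stands in for the file on disk, and the results are kept as (text, features) tuples instead of gatenlp documents so the snippet runs without gatenlp.

```python
import io
import json

# Two made-up tweets in JSON Lines form; io.StringIO stands in for the real file.
sample_jsonl = io.StringIO(
    '{"id": 1, "lang": "en", "text": "short text"}\n'
    '{"id": 2, "lang": "en", "text": "truncated", "full_text": "the full text"}\n'
)

FEATURE_NAMES = ["id", "lang", "reply_count", "retweet_count", "quoted_status_id"]

records = []
for line in sample_jsonl:
    line = line.strip()
    if not line:
        continue  # skip blank lines
    tweet = json.loads(line)
    # Prefer the untruncated text when it is present.
    text = tweet.get("full_text", tweet.get("text", ""))
    # Missing fields become None instead of raising KeyError.
    features = {name: tweet.get(name) for name in FEATURE_NAMES}
    records.append((text, features))

print("Records created:", len(records))
```

Storing None for absent fields keeps every record's feature keys uniform, and None is still serializable with JSON.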
gs_app = GateSlaveAnnotator(pipeline="./../../gate-hate-cloud-small/application.xgapp",
gatehome="/home/memin/gate_developer/GATE_Developer_8.6.1")
The pipeline parameter should point to the location of a GATE application file.
The gatehome parameter points to the GATE installation folder.
# Send documents to Java GATE
gs_app.start()
for idx, doc in enumerate(corpus):
    doc = gs_app(doc)
    corpus[idx] = doc
gs_app.finish()
corpus[0]
corpus[1]
# corpus[2]
PythonPr is a processing resource that allows editing and running Python code within GATE. For more, see http://gatenlp.github.io/gateplugin-Python/PythonPr
Install Python 3.6 or later (3.7 or later is highly recommended) and the sortedcontainers package.
If you have not created the conda environment yet:
conda create -n gatenlp python=3.9
conda activate gatenlp
conda install -c conda-forge sortedcontainers
pythonProgram (ResourceReference, default: empty): the location of the Python source code that you want to run. Use the file selection dialog.
When a pipeline that contains the PythonPr processing resource is run, the following main steps are involved:
The Python program must define a function or class with the @gatenlp.GateNlpPr decorator.
The Python program must call the gatenlp.interact() function (see examples below).
For each document, the @GateNlpPr function or the __call__ method of the implemented @GateNlpPr class is invoked and the document is passed to that function.
The function can use the gatenlp API to modify the document. All the changes are recorded.
The recorded changes are sent back to PythonPr, which applies the changes to the documents in Java GATE.
Here is a simple example Python program which splits the document into white-space separated tokens using a simple regular expression and creates an annotation with the type “Token” in the default annotation set for each token.
import re
from gatenlp import GateNlpPr, interact
@GateNlpPr
def run(doc, **kwargs):
    set1 = doc.annset()
    set1.clear()
    text = doc.text
    # match runs of separators, plus the (possibly empty) start and end of the text
    whitespaces = [m for m in re.finditer(r"[\s,.!?]+|^[\s,.!?]*|[\s,.!?]*$", text)]
    for k in range(len(whitespaces) - 1):
        fromoff = whitespaces[k].end()
        tooff = whitespaces[k + 1].start()
        set1.add(fromoff, tooff, "Token", {"tokennr": k})
    doc.features["nr_tokens"] = len(whitespaces) - 1

interact()  # must be called
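The separator-based tokenization idea can be tried outside GATE. This is a minimal standalone sketch using only the re module: token_spans is a hypothetical helper, and it works on a plain string instead of a gatenlp document. Unlike the program above, it also skips zero-length gaps, which may arise between adjacent separator matches (for example when the text ends with punctuation).

```python
import re

# Runs of separators, plus the (possibly empty) start and end of the text.
SEPARATORS = re.compile(r"[\s,.!?]+|^[\s,.!?]*|[\s,.!?]*$")


def token_spans(text):
    """Return (start, end) offsets of the text between consecutive separator matches."""
    seps = list(SEPARATORS.finditer(text))
    spans = []
    for left, right in zip(seps, seps[1:]):
        start, end = left.end(), right.start()
        if start < end:  # ignore zero-length gaps between adjacent separator matches
            spans.append((start, end))
    return spans


sample = "Hello, world"
for start, end in token_spans(sample):
    print(start, end, repr(sample[start:end]))
```

A token starts where one separator match ends and stops where the next one begins, which is why the pattern also has to match at the very start and end of the text.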
Instead of a function, a callable class can be implemented with the @GateNlpPr decorator. The class must implement the __call__ method, but in addition can also implement the start, finish, reduce and result methods. The interact() call must be placed at the end of the Python script.
import re
from gatenlp import GateNlpPr, interact, logger
@GateNlpPr
class MyProcessor:
    def __init__(self):
        self.tokens_total = 0

    def start(self, **kwargs):
        self.tokens_total = 0

    def finish(self, **kwargs):
        logger.info("Total number of tokens: {}".format(self.tokens_total))

    def __call__(self, doc, **kwargs):
        set1 = doc.annset()
        set1.clear()
        text = doc.text
        whitespaces = [m for m in re.finditer(r"[\s,.!?]+|^[\s,.!?]*|[\s,.!?]*$", text)]
        nrtokens = len(whitespaces) - 1
        for k in range(nrtokens):
            fromoff = whitespaces[k].end()
            tooff = whitespaces[k + 1].start()
            set1.add(fromoff, tooff, "Token", {"tokennr": k})
        doc.features["nr_tokens"] = nrtokens
        self.tokens_total += nrtokens

interact()
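The role of the start, __call__ and finish methods is easier to see with a toy driver. The sketch below is an illustration under assumptions, not how PythonPr works internally: run_pipeline is a hypothetical stand-in for the plugin, documents are plain strings, and the call order (start once, __call__ per document, finish once) is inferred from the method names.

```python
class TokenCounter:
    """Stand-in for a @GateNlpPr class; it only counts whitespace-separated words."""

    def start(self, **kwargs):
        # Reset the running total at the beginning of a run.
        self.tokens_total = 0

    def __call__(self, doc, **kwargs):
        # Process one document and update the running total.
        self.tokens_total += len(doc.split())

    def finish(self, **kwargs):
        # Report the total once all documents have been seen.
        return self.tokens_total


def run_pipeline(processor, documents):
    """Hypothetical driver: start once, call once per document, finish once."""
    processor.start()
    for doc in documents:
        processor(doc)
    return processor.finish()


total = run_pipeline(TokenCounter(), ["one two three", "four five"])
print(total)
```

Resetting the counter in start rather than only in the constructor means the same processor object can be reused for several runs.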
Now you can dig into https://gatenlp.github.io/python-gatenlp/ to learn more!