gatenlp.document module

class gatenlp.document.Document(text: str = None, features=None, changelog: gatenlp.changelog.ChangeLog = None)[source]

Bases: gatenlp.feature_bearer.FeatureBearer

Represent a GATE document. This is different from the original Java GATE representation in several ways:

  • the text is not mutable and can only be set at creation time, so there is no “edit” method

  • as a feature bearer, all the methods to set, get and manipulate features are part of this class, there is no separate “FeatureMap” to store them

  • does not support listener callbacks

  • there is no separate abstraction for “content”; the only possible content is text, a Unicode string that can be accessed through the “text” property

  • Spans of text can be directly accessed using doc[from:to]

  • features are not stored in a separate feature map object, but are directly set on the document, e.g. doc.set_feature(“x”,y) or doc.get_feature(“x”, defaultvalue)

  • Features may only have string keys and values which can be json-serialised

  • Annotation offsets are by default counted in Unicode code points; this differs from Java, where offsets are counted in UTF-16 code units

  • Offsets of all annotations can be changed from/to Java (from python index of unicode codepoint to Java index of UTF-16 code unit and back)

  • No part of the document has to be present, not even the text (this allows saving just the annotations separately from the text)

  • Once the text has been set, it is immutable (no support to edit text and change annotation offsets accordingly)
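
The difference between the two offset conventions can be seen without gatenlp at all. The following is a minimal sketch in plain Python; the helper name utf16_len and the variable names are illustrative assumptions and are not part of the gatenlp API.

```python
def utf16_len(text: str) -> int:
    """Number of UTF-16 code units needed for text (what Java's String.length() counts)."""
    # Every UTF-16 code unit is two bytes; "utf-16-le" avoids the byte order mark.
    return len(text.encode("utf-16-le")) // 2


# 'a', one character outside the Basic Multilingual Plane, 'b'
text = "a\U0001F600b"

python_length = len(text)      # counts Unicode code points
java_length = utf16_len(text)  # counts UTF-16 code units

# The offset of 'b' differs between the two conventions.
python_offset_of_b = text.index("b")
java_offset_of_b = utf16_len(text[:python_offset_of_b])
```

For text that contains only characters from the Basic Multilingual Plane, both conventions yield the same offsets; they diverge only after the first character that needs a surrogate pair in UTF-16.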

Parameters
  • text – the text of the document. The text can be None to indicate that no initial text should be set. Once the text has been set for a document, it is immutable and cannot be changed.

  • features – the initial document features to set, a sequence of key/value tuples

  • changelog – a ChangeLog instance to use to log changes.

Initialise any features, if necessary.

Parameters

initialfeatures – an iterable containing tuples of initial feature key/value pairs

static from_dict(dictrepr, **kwargs)[source]

Return a Document instance as represented by the dictionary dictrepr.

Parameters

dictrepr – the dictionary representation of the document

Returns

the initialized Document instance

get_annotation_set_names() → KeysView[str][source]

Return the set of known annotation set names.

Returns

annotation set names

get_annotations(name: str = '') → gatenlp.annotation_set.AnnotationSet[source]

Get the named annotation set, if name is not given or the empty string, the default annotation set. If the annotation set does not already exist, it is created.

Parameters

name – the annotation set name, the empty string is used for the “default annotation set”.

Returns

the specified annotation set.
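
The "created if it does not already exist" behaviour described above can be sketched with a plain dictionary. ToyDocument below is a hypothetical stand-in written for illustration only; it is not the gatenlp implementation, and a plain list takes the place of an AnnotationSet.

```python
from typing import Dict, KeysView, List


class ToyDocument:
    """Hypothetical stand-in, not the gatenlp Document class."""

    def __init__(self) -> None:
        # Maps annotation set name -> annotation set (a plain list here).
        self._annotation_sets: Dict[str, List[dict]] = {}

    def get_annotations(self, name: str = "") -> List[dict]:
        # Create the set on first access, then always return the same object.
        return self._annotation_sets.setdefault(name, [])

    def get_annotation_set_names(self) -> KeysView[str]:
        return self._annotation_sets.keys()


doc = ToyDocument()
default_set = doc.get_annotations()          # "" names the default annotation set
tokens = doc.get_annotations("Tokens")       # created because it did not exist
same_tokens = doc.get_annotations("Tokens")  # second call returns the existing set
```

One consequence of this design, at least in the sketch, is that merely asking for a set registers its name, so the name then shows up in get_annotation_set_names().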

static load(wherefrom, fmt='json', offset_type=None, mod='gatenlp.serialization.default', **kwargs)[source]

Parameters
  • wherefrom

  • fmt

  • offset_type – make sure to store using the given offset type

  • kwargs

Returns

static load_mem(wherefrom, fmt='json', mod='gatenlp.serialization.default', **kwargs)[source]

Note: the offset type is always converted to PYTHON when loading!

Parameters
  • wherefrom – the string to deserialize

  • fmt

  • kwargs

Returns

remove_annotation_set(name: str)[source]

Completely remove the annotation set.

Parameters

name – name of the annotation set to remove

save(whereto, fmt='json', offset_type=None, mod='gatenlp.serialization.default', **kwargs)[source]

Save the document in the given format.

Additional keyword parameters for format “json”:

  • as_array – boolean, if True stores as array instead of dictionary, using to

Parameters
  • whereto – either a file name or something that has a write(string) method.

  • fmt – serialization format, one of “json”, “msgpack” or “pickle”

  • offset_type – store using the given offset type or keep the current if None

  • mod – module to use

  • kwargs – additional parameters for the format

Returns

save_mem(fmt='json', offset_type=None, mod='gatenlp.serialization.default', **kwargs)[source]

Serialize and save to a string.

Additional keyword parameters for format “json”:

  • as_array – boolean, if True stores as array instead of dictionary, using to

Parameters
  • fmt – serialization format, one of “json”, “msgpack” or “pickle”

  • offset_type – store using the given offset type or keep the current if None

  • mod – module to use

  • kwargs – additional parameters for the format

Returns

set_changelog(chlog: gatenlp.changelog.ChangeLog) → gatenlp.changelog.ChangeLog[source]

Make the document use the given changelog to record all changes from this moment on.

Parameters

chlog – the new changelog to use or None to not use any

Returns

the changelog used previously or None

size() → int[source]

Return the size of the document text. Note: this converts the offset type of the document to Python!

Returns

size of the document (length of the text)

property text

Get the text of the document. For a partial document, the text may be None.

Returns

the text of the document

to_dict(offset_type=None, **kwargs)[source]

Convert this instance to a dictionary that can be used to re-create the instance with from_dict. NOTE: if there is an active changelog, it is not included in the output as this field is considered a transient field!

Parameters

offset_type – convert to the given offset type on the fly

Returns

the dictionary representation of this instance
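
The round trip through to_dict and from_dict, including the omission of the transient changelog, can be sketched as follows. ToyDocument and the dictionary keys "text" and "features" are assumptions made for this illustration; the actual layout of the dictionary produced by gatenlp is not specified here.

```python
import json
from typing import Any, Dict, List, Optional


class ToyDocument:
    """Hypothetical stand-in used only to illustrate the round trip."""

    def __init__(self, text: Optional[str] = None,
                 features: Optional[Dict[str, Any]] = None,
                 changelog: Optional[List[Any]] = None) -> None:
        self.text = text
        self.features = dict(features or {})
        self.changelog = changelog  # transient: never serialized

    def to_dict(self) -> Dict[str, Any]:
        # The changelog is deliberately left out of the representation.
        return {"text": self.text, "features": self.features}

    @staticmethod
    def from_dict(dictrepr: Dict[str, Any]) -> "ToyDocument":
        return ToyDocument(text=dictrepr.get("text"),
                           features=dictrepr.get("features"))


original = ToyDocument("Some text", {"lang": "en"}, changelog=[])
# Passing through JSON shows that the dictionary holds only serializable values.
restored = ToyDocument.from_dict(json.loads(json.dumps(original.to_dict())))
```

In this sketch the restored instance has the same text and features as the original but no changelog, which mirrors the note above about the changelog being a transient field.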

to_type(offsettype: str) → None[source]

Convert all the offsets of all the annotations in this document to the required type, either OFFSET_TYPE_JAVA or OFFSET_TYPE_PYTHON. If the offsets are already of that type, this does nothing.

NOTE: if the document has a ChangeLog, it is NOT also converted!

The method returns the offset mapper if anything actually was converted, otherwise None.

Parameters

offsettype – either OFFSET_TYPE_JAVA or OFFSET_TYPE_PYTHON

Returns

offset mapper or None
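
To make the idea of an offset mapper concrete, here is a minimal sketch that precomputes lookup tables in both directions. ToyOffsetMapper and its method names are hypothetical and chosen for illustration; the mapper returned by gatenlp may be structured differently.

```python
from typing import List


class ToyOffsetMapper:
    """Hypothetical mapper between code point and UTF-16 code unit offsets."""

    def __init__(self, text: str) -> None:
        # python2java[i] is the UTF-16 offset at which code point i starts;
        # the final entry is the total length in UTF-16 code units.
        self.python2java: List[int] = []
        # java2python[j] is the code point that UTF-16 code unit j belongs to;
        # the final entry is the total length in code points.
        self.java2python: List[int] = []
        java_offset = 0
        for python_offset, char in enumerate(text):
            self.python2java.append(java_offset)
            units = 2 if ord(char) > 0xFFFF else 1  # surrogate pair or single unit
            self.java2python.extend([python_offset] * units)
            java_offset += units
        self.python2java.append(java_offset)
        self.java2python.append(len(text))

    def to_java(self, offset: int) -> int:
        return self.python2java[offset]

    def to_python(self, offset: int) -> int:
        return self.java2python[offset]


text = "x\U0001F600yz"
mapper = ToyOffsetMapper(text)

# An annotation over "yz" expressed in Python (code point) offsets.
python_span = (2, 4)
java_span = (mapper.to_java(python_span[0]), mapper.to_java(python_span[1]))
round_trip = (mapper.to_python(java_span[0]), mapper.to_python(java_span[1]))
```

Building both tables once costs time and memory linear in the text length, after which each annotation offset converts with a single list lookup.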