gatenlp.document module
-
class
gatenlp.document.
Document
(text: str = None, features=None, changelog: gatenlp.changelog.ChangeLog = None) Bases:
gatenlp.feature_bearer.FeatureBearer
Represent a GATE document. This is different from the original Java GATE representation in several ways:
the text is not mutable and can only be set at creation time, so there is no “edit” method
as a feature bearer, all the methods to set, get and manipulate features are part of this class, there is no separate “FeatureMap” to store them
does not support listener callbacks
there is no separate abstraction for “content”; the only content possible is text, which is a Unicode string that can be accessed through the “text” property
Spans of text can be directly accessed using doc[from:to]
features are not stored in a separate feature map object, but are directly set on the document, e.g. doc.set_feature("x", y) or doc.get_feature("x", defaultvalue)
Features may only have string keys and values which can be json-serialised
Annotation offsets are by default counted in Unicode code points; this differs from Java, where offsets count UTF-16 code units
Offsets of all annotations can be changed from/to Java (from python index of unicode codepoint to Java index of UTF-16 code unit and back)
No part of the document has to be present, not even the text (this allows saving just the annotations separately from the text)
Once the text has been set, it is immutable (no support to edit text and change annotation offsets accordingly)
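The difference between the two offset types can be illustrated without gatenlp at all. The following is a minimal sketch in plain Python; the helper names utf16_units, python_to_java_offset and java_to_python_offset are hypothetical and are not part of the gatenlp API. It only relies on the fact that a code point above U+FFFF takes two UTF-16 code units.

```python
def utf16_units(ch: str) -> int:
    """Number of UTF-16 code units needed for one code point."""
    return 2 if ord(ch) > 0xFFFF else 1


def python_to_java_offset(text: str, offset: int) -> int:
    """Map a code point offset (Python) to a UTF-16 code unit offset (Java)."""
    return sum(utf16_units(ch) for ch in text[:offset])


def java_to_python_offset(text: str, offset: int) -> int:
    """Map a UTF-16 code unit offset (Java) back to a code point offset."""
    units = 0
    for index, ch in enumerate(text):
        if units >= offset:
            return index
        units += utf16_units(ch)
    return len(text)


text = "a\U0001F600b"  # 'a', one emoji outside the BMP, 'b'
print(len(text))                       # 3 code points
print(python_to_java_offset(text, 2))  # 3: the emoji takes two UTF-16 units
print(java_to_python_offset(text, 3))  # 2
```

For text that stays within the Basic Multilingual Plane both offset types coincide, so a conversion only changes anything when the text contains supplementary characters.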
- Parameters
text – the text of the document. The text can be None to indicate that no initial text should be set. Once the text has been set for a document, it is immutable and cannot be changed.
features – the initial document features to set, a sequence of key/value tuples
changelog – a ChangeLog instance to use to log changes.
Initialise any features, if necessary.
- Parameters
initialfeatures – an iterable containing tuples of initial feature key/value pairs
-
static
from_dict
(dictrepr, **kwargs) Return a Document instance as represented by the dictionary dictrepr.
- Parameters
dictrepr – the dictionary representation of the document
- Returns
the initialized Document instance
-
get_annotation_set_names
() → KeysView[str] Return the set of known annotation set names.
- Returns
annotation set names
-
get_annotations
(name: str = '') → gatenlp.annotation_set.AnnotationSet Get the named annotation set or, if name is not given or is the empty string, the default annotation set. If the annotation set does not already exist, it is created.
- Parameters
name – the annotation set name, the empty string is used for the “default annotation set”.
- Returns
the specified annotation set.
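The get-or-create behaviour described here can be sketched with a plain dictionary. SketchDocument below is a hypothetical stand-in written only for illustration, not the gatenlp Document class, and a plain list stands in for an AnnotationSet.

```python
class SketchDocument:
    """Hypothetical stand-in, not the gatenlp Document class."""

    def __init__(self):
        self._annotation_sets = {}  # set name -> set contents (a list here)

    def get_annotations(self, name: str = ""):
        # Create the set on first access, then always return the same object.
        return self._annotation_sets.setdefault(name, [])

    def get_annotation_set_names(self):
        return self._annotation_sets.keys()

    def remove_annotation_set(self, name: str) -> None:
        # Assumption: unknown names are ignored; the source does not say
        # what the real method does in that case.
        self._annotation_sets.pop(name, None)


doc = SketchDocument()
default_set = doc.get_annotations()      # the default set, named ""
tokens = doc.get_annotations("Tokens")   # created because it did not exist
print(sorted(doc.get_annotation_set_names()))  # ['', 'Tokens']
```

One consequence of this design is that merely asking for a set makes its name appear among the known annotation set names.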
-
static
load
(wherefrom, fmt='json', offset_type=None, mod='gatenlp.serialization.default', **kwargs)
- Parameters
wherefrom –
fmt –
offset_type – make sure to store using the given offset type
kwargs –
- Returns
-
static
load_mem
(wherefrom, fmt='json', mod='gatenlp.serialization.default', **kwargs) Note: the offset type is always converted to PYTHON when loading!
- Parameters
wherefrom – the string to deserialize
fmt –
kwargs –
- Returns
-
remove_annotation_set
(name: str) Completely remove the annotation set.
- Parameters
name – name of the annotation set to remove
-
save
(whereto, fmt='json', offset_type=None, mod='gatenlp.serialization.default', **kwargs) Save the document in the given format.
Additional keyword parameters for format “json”: * as_array: boolean, if True stores as array instead of dictionary, using to
- Parameters
whereto – either a file name or something that has a write(string) method.
fmt – serialization format, one of “json”, “msgpack” or “pickle”
offset_type – store using the given offset type or keep the current if None
mod – module to use
kwargs – additional parameters for the format
- Returns
-
save_mem
(fmt='json', offset_type=None, mod='gatenlp.serialization.default', **kwargs) Serialize and save to a string.
Additional keyword parameters for format “json”: * as_array: boolean, if True stores as array instead of dictionary, using to
- Parameters
fmt – serialization format, one of “json”, “msgpack” or “pickle”
offset_type – store using the given offset type or keep the current if None
mod – module to use
kwargs – additional parameters for the format
- Returns
-
set_changelog
(chlog: gatenlp.changelog.ChangeLog) → gatenlp.changelog.ChangeLog Make the document use the given changelog to record all changes from this moment on.
- Parameters
chlog – the new changelog to use or None to not use any
- Returns
the changelog used previously or None
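Returning the previously used changelog lets a caller restore it later. A minimal sketch of that swap pattern follows; SketchDocument is a hypothetical stand-in rather than the gatenlp class, and a plain list stands in for a ChangeLog.

```python
class SketchDocument:
    """Hypothetical stand-in, not the gatenlp Document class."""

    def __init__(self):
        self.changelog = None

    def set_changelog(self, chlog):
        # Install the new changelog and hand back the one it replaces.
        previous, self.changelog = self.changelog, chlog
        return previous


doc = SketchDocument()
first = doc.set_changelog(["log A"])  # nothing was installed before
second = doc.set_changelog(None)      # stop logging and get "log A" back
```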
-
size
() → int Return the size of the document text. Note: this converts the offset type of the document to Python!
- Returns
size of the document (length of the text)
-
property
text
Get the text of the document. For a partial document, the text may be None.
- Returns
the text of the document
-
to_dict
(offset_type=None, **kwargs) Convert this instance to a dictionary that can be used to re-create the instance with from_dict. NOTE: if there is an active changelog, it is not included in the output as this field is considered a transient field!
- Parameters
offset_type – convert to the given offset type on the fly
- Returns
the dictionary representation of this instance
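The round trip through a dictionary, with the changelog treated as transient, can be sketched as follows. SketchDocument and its dictionary keys are assumptions made for illustration; the real dictionary layout used by gatenlp is not described in this section.

```python
class SketchDocument:
    """Hypothetical stand-in, not the gatenlp Document class."""

    def __init__(self, text=None, features=None, changelog=None):
        self.text = text
        self.features = dict(features or {})
        self.changelog = changelog  # transient: never serialised

    def to_dict(self) -> dict:
        # Only persistent fields are included; the changelog is left out.
        return {"text": self.text, "features": dict(self.features)}

    @staticmethod
    def from_dict(dictrepr: dict) -> "SketchDocument":
        return SketchDocument(dictrepr.get("text"), dictrepr.get("features"))


original = SketchDocument("Some text", {"lang": "en"}, changelog=["edit 1"])
copy = SketchDocument.from_dict(original.to_dict())
print(copy.text, copy.features, copy.changelog)  # Some text {'lang': 'en'} None
```

Because the changelog is dropped on the way out, a document restored this way starts without one, and a caller who needs logging has to install a changelog again.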
-
to_type
(offsettype: str) → None Convert all the offsets of all the annotations in this document to the required type, either OFFSET_TYPE_JAVA or OFFSET_TYPE_PYTHON. If the offsets are already of that type, this does nothing.
NOTE: if the document has a ChangeLog, it is NOT also converted!
The method returns the offset mapper if anything actually was converted, otherwise None.
- Parameters
offsettype – either OFFSET_TYPE_JAVA or OFFSET_TYPE_PYTHON
- Returns
offset mapper or None