INDRA Database tools (indra.db)

Database Manager (indra.db.database_manager)

class indra.db.database_manager.DatabaseManager(host, sqltype='postgresql', label=None)[source]

An object used to access INDRA’s database.

This object can be used to access and manage INDRA’s database. It includes both basic methods and some useful higher-level methods. It is designed to be used with PostgreSQL or SQLite.

This object is primarily built around sqlalchemy, which is a required package for its use. It also optionally makes use of the pgcopy package for large data transfers.

If you wish to access the primary database, you can simply use get_primary_db to get an instance of this object with the default settings.

Parameters:
  • host (str) – The database to which you want to interface.
  • sqltype (OPTIONAL[str]) – The type of sql library used. Use one of the sql types provided by sqltypes. Default is sqltypes.POSTGRESQL.
  • label (OPTIONAL[str]) – A short string to indicate the purpose of the db instance. Set as primary when initialized by get_primary_db.

Example

If you wish to access the primary database and find the metadata for a particular pmid, 1234567:

>>> from indra.db import get_primary_db
>>> db = get_primary_db()
>>> res = db.select_all(db.TextRef, db.TextRef.pmid == '1234567')

You will get a list of objects whose attributes give the metadata contained in the columns of the table.

For more sophisticated examples, several use cases can be found in indra.tests.test_db.

commit(err_msg)[source]

Commit, and give useful info if there is an exception.

copy(tbl_name, data, cols=None)[source]

Use pg_copy to copy over a large amount of data.

create_tables(tbl_list=None)[source]

Create the tables for INDRA database.

delete_all(entry_list)[source]

Remove the given records from the given table.

drop_tables(tbl_list=None, force=False)[source]

Drop the tables for INDRA database given in tbl_list.

If tbl_list is None, all tables will be dropped. Note that if force is False, a prompt will ask for confirmation, as this action will remove all data from the affected tables.
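
The confirmation behaviour described above can be sketched as follows. This is a minimal illustration using the standard-library sqlite3 module, not INDRA's implementation; the function name drop_tables_sketch and the confirm callback are hypothetical.

```python
import sqlite3


def drop_tables_sketch(conn, tbl_list=None, force=False, confirm=input):
    """Drop tables, asking for confirmation unless force is True.

    Hypothetical helper for illustration; not part of indra.db.
    """
    cur = conn.cursor()
    if tbl_list is None:
        # No list given: target every table in the database.
        cur.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
        tbl_list = [row[0] for row in cur.fetchall()]
    dropped = []
    for name in tbl_list:
        if not force:
            answer = confirm("Drop table %s and all its data? (y/N): " % name)
            if answer.strip().lower() != 'y':
                continue
        # Table names cannot be bound as parameters, so quote the identifier.
        cur.execute('DROP TABLE "%s"' % name.replace('"', '""'))
        dropped.append(name)
    conn.commit()
    return dropped
```

Passing the prompt function as an argument keeps the sketch usable in scripts and tests, where an interactive prompt would otherwise block.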

filter_query(tbls, *args)[source]

Query a table and filter results.

get_active_tables()[source]

Get the tables currently active in the database.

get_column_names(tbl_name)[source]

Get a list of the column labels for a table.

get_column_objects(table)[source]

Get a list of the column objects for the given table.

get_tables()[source]

Get a list of available tables.

get_values(entry_list, col_names=None, keyed=False)[source]

Get the column values from the entries in entry_list.

grab_session()[source]

Get an active session with the database.

has_entry(tbls, *args)[source]

Check whether an entry/entries matching given specs live in the db.

insert(tbl_name, ret_info='id', **input_dict)[source]

Insert an entry into the specified table, and return the id.

insert_many(tbl_name, input_dict_list, ret_info='id')[source]

Insert many records into the table given by tbl_name.

select_all(tbls, *args)[source]

Select any and all entries from the table or tables given by tbls.

The results will be filtered by the clauses you pass as positional arguments. For example, if you want to get a text ref with pmid ‘10532205’, you would call:

db.select_all('text_ref', db.TextRef.pmid == '10532205')

Note that a double equals is required, not a single equals. Equivalently, you could call:

db.select_all(db.TextRef, db.TextRef.pmid == '10532205')
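
The double equals matters because the comparison is evaluated first and its result is passed to select_all as a positional argument; a single equals in that position is a Python syntax error. The toy classes below are hypothetical and far simpler than sqlalchemy's own machinery, but they illustrate how overloading == can return a clause object rather than a boolean.

```python
class Clause:
    """A recorded comparison such as pmid == '10532205'."""

    def __init__(self, column, value):
        self.column = column
        self.value = value

    def matches(self, row):
        return row.get(self.column) == self.value


class Column:
    """Toy column whose == builds a Clause rather than returning a bool."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return Clause(self.name, other)

    # Defining __eq__ disables default hashing, so restore it explicitly.
    __hash__ = object.__hash__


def select_all_sketch(rows, *clauses):
    """Return the rows (dicts) that satisfy every clause."""
    return [row for row in rows if all(c.matches(row) for c in clauses)]


pmid = Column('pmid')
rows = [{'pmid': '10532205'}, {'pmid': '1234567'}]
result = select_all_sketch(rows, pmid == '10532205')
```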

For a more complicated example, suppose you want to get all text refs that have full text from PMC OA. You could select:

db.select_all(
    [db.TextRef, db.TextContent],
    db.TextContent.text_ref_id == db.TextRef.id,
    db.TextContent.source == 'pmc_oa',
    db.TextContent.text_type == 'fulltext'
    )
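
In SQL terms, the call above amounts to a join between the two tables with two extra filters. The sketch below reproduces that with sqlite3 on an in-memory database. The table names and layout are a guess limited to the columns named in the example, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE text_ref (id INTEGER PRIMARY KEY, pmid TEXT);
    CREATE TABLE text_content (
        id INTEGER PRIMARY KEY,
        text_ref_id INTEGER REFERENCES text_ref (id),
        source TEXT,
        text_type TEXT
    );
    INSERT INTO text_ref (id, pmid) VALUES (1, '111'), (2, '222');
    INSERT INTO text_content (text_ref_id, source, text_type) VALUES
        (1, 'pmc_oa', 'fulltext'),
        (2, 'pubmed', 'abstract');
""")

# Each positional clause in the select_all call becomes one condition here.
rows = conn.execute("""
    SELECT text_ref.id, text_ref.pmid, text_content.id
    FROM text_ref, text_content
    WHERE text_content.text_ref_id = text_ref.id
      AND text_content.source = ?
      AND text_content.text_type = ?
""", ('pmc_oa', 'fulltext')).fetchall()
```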

select_one(tbls, *args)[source]

Select the first value that matches requirements.

Requirements are given as positional arguments (*args) for the table or tables indicated by tbls. See select_all.

Note that if your specification yields multiple results, this method will just return the first result without raising an exception.

show_tables()[source]

Print a list of all the available tables.

exception indra.db.database_manager.IndraDatabaseError[source]

Content Manager (indra.db.content_manager)

class indra.db.content_manager.ContentManager[source]

Abstract class for all upload/update managers.

This abstract class provides the API required for any object that is used to manage content between the database and the content source.

add_to_review(desc, msg)[source]

Add an entry to the review document.

copy_into_db(db, tbl_name, data, cols=None, retry=True)[source]

Wrapper around the db.copy feature; pickles args upon exception.
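
A minimal sketch of this pattern, assuming the goal is to preserve the inputs of a failed transfer for later inspection. The names copy_with_dump, failing_copy, and the dump file name are hypothetical, and the retry option is not modelled.

```python
import os
import pickle
import tempfile


def copy_with_dump(copy_func, tbl_name, data, cols=None,
                   dump_path='copy_failure.pkl'):
    """Call copy_func and pickle its arguments if it raises.

    copy_func stands in for db.copy; dump_path is a hypothetical file name.
    """
    try:
        return copy_func(tbl_name, data, cols)
    except Exception:
        # Save the inputs so the failed transfer can be inspected or retried.
        with open(dump_path, 'wb') as f:
            pickle.dump({'tbl_name': tbl_name, 'data': data, 'cols': cols}, f)
        raise


def failing_copy(tbl_name, data, cols):
    """Simulate a copy that always fails."""
    raise RuntimeError('simulated copy failure')


dump_path = os.path.join(tempfile.mkdtemp(), 'copy_failure.pkl')
try:
    copy_with_dump(failing_copy, 'text_ref', [('1234567',)], ('pmid',),
                   dump_path)
except RuntimeError:
    # The original exception still propagates; the arguments are on disk.
    with open(dump_path, 'rb') as f:
        saved = pickle.load(f)
```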

filter_text_refs(db, tr_data_set, primary_id_types=None)[source]

Try to reconcile the data we have with what’s already on the db.

Note that this method is VERY slow in general, and therefore should be avoided whenever possible.

The process can be sped up by multiple orders of magnitude if you specify a limited set of id types to query when getting text refs. This does leave some possibility of missing relevant refs.

make_text_ref_str(tr)[source]

Make a string from a text ref using tr_cols.

populate(db)[source]

A stub for the method used to initially populate the database.

update(db)[source]

A stub for the method used to update the content on the database.

class indra.db.content_manager.DatabaseError[source]

Using this in a try-except will catch nothing. (That’s the point.)
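
One common reason for a class like this is to keep an except clause valid when the library that defines the real exception is not installed. Whether that is the motivation here is an assumption. A sketch of the pattern, with a made-up module name:

```python
try:
    # Hypothetical optional dependency; this module name is made up.
    from some_optional_db_driver import DatabaseError
except ImportError:
    class DatabaseError(Exception):
        """Placeholder that nothing raises, so catching it catches nothing."""


def run_safely(func):
    """Return func(), or None if a DatabaseError is raised."""
    try:
        return func()
    except DatabaseError:
        return None
```

With the placeholder in effect, any other exception raised by func still propagates as usual.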

class indra.db.content_manager.Manuscripts(*args, **kwargs)[source]

ContentManager for the pmc manuscripts.

enrich_textrefs(db)[source]

Method to add manuscript_ids to the text refs.

get_tarname_from_filename(fname)[source]

Get the name of the tar file based on the file name (or a pmcid).

update(db, n_procs=1)[source]

Add any new content found in the archives.

Note that this is very much the same as populating for manuscripts, as there is no finer-grained means of getting manuscripts than looking through the massive archive files. We do check whether there are any new listings in each file, which minimizes the time spent downloading and searching; however, this will in general be the slowest of the update methods.

The continuing feature isn’t implemented yet.

class indra.db.content_manager.NihFtpClient(my_path, ftp_url='ftp.ncbi.nlm.nih.gov', local=False)[source]

High level access to the NIH FTP repositories.

Parameters:
  • my_path (str) – The path to the subdirectory around which this client operates.
  • ftp_url (str) – The url to the ftp site. May be a local directory (see local). By default this is ‘ftp.ncbi.nlm.nih.gov’.
  • local (bool) – If True, the methods run on a local directory (intended for testing). Default is False.

download_file(f_path, dest=None)[source]

Download a file into a file given by f_path.

ftp_ls(ftp_path=None)[source]

Get a list of the contents in the ftp directory.

ftp_ls_timestamped(ftp_path=None)[source]

Get all contents and metadata in mlsd format from the ftp directory.

get_csv_as_dict(csv_file, cols=None, header=None)[source]

Get the content from a csv file as a list of dicts.
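
The list-of-dicts shape can be illustrated with the standard-library csv module. This is a sketch operating on text already in memory rather than on a file fetched over FTP; the function name csv_to_dicts is hypothetical, and the header parameter of the real method is not modelled.

```python
import csv
import io


def csv_to_dicts(csv_text, cols=None):
    """Parse CSV text into a list of dicts, one per data row.

    If cols is None, the first row is used as the header.
    """
    reader = csv.DictReader(io.StringIO(csv_text), fieldnames=cols)
    return [dict(row) for row in reader]


records = csv_to_dicts("pmcid,pmid\nPMC1,111\nPMC2,222\n")
```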

get_file(f_path, force_str=True, decompress=True)[source]

Get the contents of a file as a string.

get_uncompressed_bytes(f_path, force_str=True)[source]

Get a file that is gzipped, and return the unzipped string.
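
The decompression step can be sketched with the standard-library gzip module. The function name uncompress is hypothetical, and decoding as UTF-8 is an assumption about the file contents.

```python
import gzip


def uncompress(raw_bytes, force_str=True):
    """Decompress gzipped bytes, optionally decoding them as UTF-8."""
    content = gzip.decompress(raw_bytes)
    # UTF-8 is assumed here; the real encoding depends on the source file.
    return content.decode('utf8') if force_str else content


payload = gzip.compress('<article/>'.encode('utf8'))
```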

get_xml_file(xml_file)[source]

Get the content from an xml file as an ElementTree.

ret_file(f_path, buf)[source]

Load the content of a file into the given buffer.

class indra.db.content_manager.NihManager(*args, **kwargs)[source]

Abstract class for all the managers that use the NIH FTP service.

See NihFtpClient for parameters.

class indra.db.content_manager.PmcManager(*args, **kwargs)[source]

Abstract class for uploaders of PMC content.

For parameters, see NihManager.

filter_text_content(db, tc_data)[source]

Filter the text content to identify pre-existing records.

get_data_from_xml_str(xml_str, filename)[source]

Get the data out of the xml string.

get_missing_pmids(db, tr_data)[source]

Try to get missing pmids using the pmc client.

populate(db, n_procs=1, continuing=False)[source]

Perform the initial population of the pmc content into the database.

Parameters:
  • db (indra.db.DatabaseManager instance) – The database to which the data will be uploaded.
  • n_procs (int) – The number of processes to use when parsing xmls.
  • continuing (bool) – If true, assume that we are picking up after an error, or otherwise continuing from an earlier process. This means we will skip over source files contained in the database. If false, all files will be read and parsed.
Returns:

completed – If True, an update was completed. Otherwise, the upload was aborted for some reason, often because the upload was already completed at some earlier time.

Return type:

bool

process_archive(archive, q=None, db=None, continuing=False)[source]

Download an archive and begin unpacking it.

Either q or db must be specified. The uncompressed contents of the archive will be loaded onto the database or placed on the queue in batches.

Parameters:
  • archive (str) – The path of the archive beneath the head of this source's FTP directory.
  • q (multiprocessing.Queue) – When this method is called as a separate process, the contents of the archive are posted to a queue in batches to be handled externally.
  • db (indra.db.DatabaseManager) – When not multiprocessing, the contents of the archive are uploaded to the database directly by this method.
  • continuing (bool) – True if this method is being called to complete an earlier failed attempt to execute this method; will not download the archive if an archive of the same name is already downloaded locally. Default is False.

unpack_archive_path(archive_path, q=None, db=None, batch_size=10000)[source]

Unpack the contents of an archive.

If q is given, then the data is put into the queue to be handed off for upload by another process. Otherwise, if db is provided, upload the batches of data on this process. One or the other MUST be provided.
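
The queue-or-database choice can be sketched as below. This is an illustration under simplifying assumptions: queue.Queue stands in for multiprocessing.Queue, the upload callable stands in for a database upload, and the function names are hypothetical.

```python
import queue


def _dispatch(batch, q, upload):
    """Send one batch to the queue if given, otherwise to the uploader."""
    if q is not None:
        q.put(batch)
    else:
        upload(batch)


def unpack_in_batches(items, q=None, upload=None, batch_size=10000):
    """Group items into batches and hand each to a queue or an uploader.

    Returns the number of batches dispatched.
    """
    if q is None and upload is None:
        raise ValueError('Either q or upload must be provided.')
    batch = []
    n_batches = 0
    for item in items:
        batch.append(item)
        if len(batch) >= batch_size:
            _dispatch(batch, q, upload)
            n_batches += 1
            batch = []
    if batch:
        # Send whatever is left over after the last full batch.
        _dispatch(batch, q, upload)
        n_batches += 1
    return n_batches


q = queue.Queue()
n = unpack_in_batches(range(5), q=q, batch_size=2)
batches = [q.get() for _ in range(n)]
```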

upload_archives(db, archives, n_procs=1, continuing=False)[source]

Do the grunt work of downloading and processing a list of archives.

upload_batch(db, tr_data, tc_data)[source]

Add a batch of text refs and text content to the database.

class indra.db.content_manager.PmcOA(*args, **kwargs)[source]

ContentManager for the pmc open access content.

update(db, n_procs=1)[source]

A stub for the method used to update the content on the database.

class indra.db.content_manager.Pubmed(*args, **kwargs)[source]

Manager for the pubmed/medline content.

fix_doi(doi)[source]

Sometimes the doi is doubled (no idea why). Fix it.
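
A minimal sketch of one way to handle this, assuming "doubled" means the same DOI string appears twice back to back with no separator. The function name fix_doubled and the sample DOI are hypothetical, and the actual method may use a different rule.

```python
def fix_doubled(doi):
    """Return the first half of doi if the string is two identical halves."""
    half, remainder = divmod(len(doi), 2)
    if doi and remainder == 0 and doi[:half] == doi[half:]:
        return doi[:half]
    return doi
```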

load_files(db, dirname, n_procs=1, continuing=False, carefully=False)[source]

Load the files in subdirectory indicated by dirname.

load_text_refs(db, article_info, carefully=False)[source]

Sanitize, update old, and upload new text refs.

populate(db, n_procs=1, continuing=False)[source]

Perform the initial input of the pubmed content into the database.

Parameters:
  • db (indra.db.DatabaseManager instance) – The database to which the data will be uploaded.
  • n_procs (int) – The number of processes to use when parsing xmls.
  • continuing (bool) – If true, assume that we are picking up after an error, or otherwise continuing from an earlier process. This means we will skip over source files contained in the database. If false, all files will be read and parsed.

update(db, n_procs=1)[source]

Update the contents of the database with the latest articles.

upload_article(db, article_info, carefully=False)[source]

Process the content of an xml dataset and load into the database.

exception indra.db.content_manager.UploadError[source]

Utilities (indra.db.util)

indra.db.util.get_defaults()[source]

Get the default database hosts provided in the specified DEFAULTS_FILE.

indra.db.util.get_primary_db(force_new=False)[source]

Get a DatabaseManager instance for the primary database host.

The primary database host is defined in the defaults.txt file, or in a file given by the environment variable DEFAULTS_FILE. Alternatively, it may be defined by the INDRADBPRIMARY environment variable. If none of the above are specified, this function will raise an exception.

Note: by default, calling this function twice will return the same DatabaseManager instance. In other words:

>>> db1 = get_primary_db()
>>> db2 = get_primary_db()
>>> db1 is db2
True

This also means that, for example, db1.select_one(db2.TextRef) will work in the above context.
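
The caching behaviour can be sketched with a module-level variable. FakeManager and get_primary_sketch are hypothetical stand-ins, the default host string is made up, and it is an assumption that force_new leaves the cached instance in place.

```python
class FakeManager:
    """Stand-in for DatabaseManager; holds only the host and label."""

    def __init__(self, host, label=None):
        self.host = host
        self.label = label


_primary = None  # module-level cache of the first instance created


def get_primary_sketch(host='sqlite:///example.db', force_new=False):
    """Return the cached manager, creating one if needed or if forced."""
    global _primary
    if force_new or _primary is None:
        new = FakeManager(host, label='primary')
        if _primary is None:
            # Only the first instance is cached (an assumption, see above).
            _primary = new
        return new
    return _primary
```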

It is still recommended that a script, function, or other general application not rely on this feature for database access, as it can make substituting a different database host complicated and messy. Rather, a database instance should be passed explicitly between users, as is done in the by_gene_role_type function’s call to get_statements in indra.db.query_db_stmts.

Parameters:
  • force_new (bool) – If true, a new instance will be created and returned, regardless of whether there is an existing instance or not. Default is False, so that if this function has been called before within the global scope, the instance that was first created will be returned.
Returns:

primary_db – An instance of the database manager that is attached to the primary database.

Return type:

DatabaseManager instance

indra.db.util.insert_agents(db, stmts, *other_clauses)[source]

Insert the agents associated with the list of statements.

indra.db.util.insert_db_stmts(db, stmts, db_ref_id)[source]

Insert statements, their database, and any affiliated agents.

Note that this method is for uploading statements that came from a database to our database, not for inserting arbitrary statements into the database.

indra.db.util.get_abstracts_by_pmids(db, pmid_list, unzip=True)[source]

Get abstracts using the pmids in pmid_list.

indra.db.util.get_statements_by_gene_role_type(agent_id=None, agent_ns='HGNC', role=None, stmt_type=None, count=1000, do_stmt_count=True, db=None)[source]

Get statements from the DB by stmt type, agent, and/or agent role.

Parameters:
  • agent_id (str) – String representing the identifier of the agent from the given namespace. Note: if the agent namespace argument, agent_ns, is set to ‘HGNC’, this function will treat agent_id as an HGNC gene symbol and perform an internal lookup of the corresponding HGNC ID.
  • agent_ns (str) – Namespace for the identifier given in agent_id.
  • role (str) – String corresponding to the role of the agent in the statement. Options are ‘SUBJECT’, ‘OBJECT’, or ‘OTHER’ (in the case of Complex, SelfModification, and ActiveForm Statements).
  • stmt_type (str) – Name of the Statement class.
  • count (int) – Number of statements to retrieve in each batch (passed to get_statements()).
  • do_stmt_count (bool) – Whether or not to perform an initial statement counting step to give more meaningful progress messages.
  • db (indra.db.DatabaseManager object) – Optionally specify a database manager that attaches to something besides the primary database, for example a local database instance.
Returns:

A list of Statements from the database corresponding to the query.

Return type:

list

indra.db.util.get_statements(clauses, count=1000, do_stmt_count=True, db=None)[source]

Select statements according to a given set of clauses.

Parameters:
  • clauses (list) – list of sqlalchemy WHERE clauses to pass to the filter query.
  • count (int) – Number of statements to retrieve and process in each batch.
  • do_stmt_count (bool) – Whether or not to perform an initial statement counting step to give more meaningful progress messages.
  • db (indra.db.DatabaseManager object) – Optionally specify a database manager that attaches to something besides the primary database, for example a local database instance.
Returns:

A list of Statements from the database corresponding to the query.

Return type:

list
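
Retrieving results in batches of count rows can be sketched with sqlite3 and fetchmany. The statements table, its columns, and the function name fetch_in_batches are hypothetical; the real function takes sqlalchemy clauses rather than an SQL string.

```python
import sqlite3


def fetch_in_batches(conn, where_sql, params=(), count=1000):
    """Yield lists of at most count rows matching a WHERE clause.

    where_sql is concatenated into the query, so it must come from
    trusted code; values belong in params.
    """
    cur = conn.execute(
        'SELECT id, type FROM statements WHERE ' + where_sql, params)
    while True:
        batch = cur.fetchmany(count)
        if not batch:
            break
        yield batch


conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE statements (id INTEGER PRIMARY KEY, type TEXT)')
conn.executemany('INSERT INTO statements (type) VALUES (?)',
                 [('Phosphorylation',)] * 5 + [('Complex',)] * 2)
batches = list(
    fetch_in_batches(conn, 'type = ?', ('Phosphorylation',), count=2))
```

Processing each batch as it arrives keeps memory use bounded by count rather than by the size of the full result.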