INDRA Database tools (indra.db)
Database Manager (indra.db.database_manager)
class indra.db.database_manager.DatabaseManager(host, sqltype='postgresql', label=None)
An object used to access INDRA’s database.
This object can be used to access and manage INDRA’s database. It includes both basic methods and some useful, higher-level methods. It is designed to be used with PostgreSQL or SQLite.
This object is primarily built around sqlalchemy, which is a required package for its use. It also optionally makes use of the pgcopy package for large data transfers.
If you wish to access the primary database, you can simply use get_primary_db to get an instance of this object using the default settings.
Parameters:
- host (str) – The database to which you want to interface.
- sqltype (OPTIONAL[str]) – The type of sql library used. Use one of the sql types provided by sqltypes. Default is sqltypes.POSTGRESQL.
- label (OPTIONAL[str]) – A short string to indicate the purpose of the db instance. Set as primary when initialized by get_primary_db.
Example
If you wish to access the primary database and find the metadata for a particular pmid, 1234567:
>>> from indra.db import get_primary_db
>>> db = get_primary_db()
>>> res = db.select_all(db.TextRef, db.TextRef.pmid == '1234567')
You will get a list of objects whose attributes give the metadata contained in the columns of the table.
For more sophisticated examples, several use cases can be found in indra.tests.test_db.
drop_tables(tbl_list=None, force=False)
Drop the tables for the INDRA database given in tbl_list.
If tbl_list is None, all tables will be dropped. Note that if force is False, a prompt will ask for confirmation, as this action will remove all data from the tables being dropped.
get_values(entry_list, col_names=None, keyed=False)
Get the column values from the entries in entry_list.
insert(tbl_name, ret_info='id', **input_dict)
Insert an entry into the specified table, and return its id.
insert_many(tbl_name, input_dict_list, ret_info='id')
Insert many records into the table given by tbl_name.
select_all(tbls, *args)
Select any and all entries from the table or tables given by tbls.
The results will be filtered by the clauses you pass as positional arguments. For example, if you want to get a text ref with pmid ‘10532205’, you would call:
db.select_all('text_ref', db.TextRef.pmid == '10532205')
Note that a double equals sign is required, not a single one. Equivalently, you could call:
db.select_all(db.TextRef, db.TextRef.pmid == '10532205')
For a more complicated example, suppose you want to get all text refs that have full text from pmc oa. You could select:
db.select_all(
    [db.TextRef, db.TextContent],
    db.TextContent.text_ref_id == db.TextRef.id,
    db.TextContent.source == 'pmc_oa',
    db.TextContent.text_type == 'fulltext'
)
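For readers more familiar with SQL, the multi-table selection above corresponds roughly to a join filtered on source and text type. The following is a minimal standalone sketch using Python's built-in sqlite3 module rather than DatabaseManager. The table layout and sample rows are assumptions made for illustration and do not reflect the full INDRA schema.

```python
import sqlite3

# Hypothetical, heavily simplified versions of the two tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE text_ref (id INTEGER PRIMARY KEY, pmid TEXT);
    CREATE TABLE text_content (
        id INTEGER PRIMARY KEY,
        text_ref_id INTEGER REFERENCES text_ref(id),
        source TEXT,
        text_type TEXT
    );
""")
conn.executemany(
    "INSERT INTO text_ref (id, pmid) VALUES (?, ?)",
    [(1, "10532205"), (2, "1234567")],
)
conn.executemany(
    "INSERT INTO text_content (text_ref_id, source, text_type) VALUES (?, ?, ?)",
    [(1, "pmc_oa", "fulltext"), (1, "pubmed", "abstract"), (2, "pubmed", "abstract")],
)

# Each clause passed to select_all plays the role of one WHERE condition.
rows = conn.execute("""
    SELECT text_ref.id, text_ref.pmid, text_content.source
    FROM text_ref, text_content
    WHERE text_content.text_ref_id = text_ref.id
      AND text_content.source = 'pmc_oa'
      AND text_content.text_type = 'fulltext'
""").fetchall()
print(rows)
```

Only the text ref that has open-access full text is returned; the ref with only an abstract is filtered out.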
Content Manager (indra.db.content_manager)
class indra.db.content_manager.ContentManager
Abstract class for all upload/update managers.
This abstract class provides the API required for any object that is used to manage content between the database and the content source.
copy_into_db(db, tbl_name, data, cols=None, retry=True)
Wrapper around the db.copy feature; pickles args upon exception.
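The idea of pickling arguments when a call fails can be sketched as follows. This is an illustration of the general pattern only, not the code used by copy_into_db: the helper name call_and_pickle_on_error, the dump file name, and the failing stand-in function are all hypothetical, and the retry behaviour is not modelled.

```python
import os
import pickle
import tempfile


def call_and_pickle_on_error(func, *args, dump_dir=None, **kwargs):
    """Call func; if it raises, pickle the arguments to a file and re-raise."""
    try:
        return func(*args, **kwargs)
    except Exception:
        dump_dir = dump_dir or tempfile.gettempdir()
        dump_path = os.path.join(dump_dir, "failed_call_args.pkl")
        with open(dump_path, "wb") as f:
            pickle.dump({"args": args, "kwargs": kwargs}, f)
        raise


def failing_copy(tbl_name, data):
    # Stand-in for a bulk copy that fails.
    raise RuntimeError("copy failed")


dump_dir = tempfile.mkdtemp()
try:
    call_and_pickle_on_error(failing_copy, "text_ref", [("1234567",)],
                             dump_dir=dump_dir)
except RuntimeError:
    # The arguments of the failed call can now be reloaded for inspection.
    with open(os.path.join(dump_dir, "failed_call_args.pkl"), "rb") as f:
        saved = pickle.load(f)
    print(saved["args"])
```

Saving the arguments this way makes it possible to reproduce a failed bulk transfer later without re-parsing the source data.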
filter_text_refs(db, tr_data_set, primary_id_types=None)
Try to reconcile the data we have with what’s already on the db.
Note that this method is VERY slow in general, and therefore should be avoided whenever possible.
The process can be sped up by multiple orders of magnitude if you specify a limited set of id types to query when getting text refs. This does leave some possibility of missing relevant refs.
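The trade-off described above, fewer id types to check against the chance of missing a match, can be seen in a small standalone sketch. It operates on plain dictionaries rather than the database, and the function name find_matches and the record layout are assumptions made for illustration.

```python
def find_matches(new_refs, existing_refs, id_types=("pmid", "pmcid", "doi")):
    """Return indices of new_refs sharing any of the given ids with existing_refs."""
    # Build one lookup set per id type, so fewer id types means fewer lookups.
    known = {
        id_type: {ref[id_type] for ref in existing_refs if ref.get(id_type)}
        for id_type in id_types
    }
    matched = []
    for i, ref in enumerate(new_refs):
        if any(ref.get(t) in known[t] for t in id_types if ref.get(t)):
            matched.append(i)
    return matched


existing = [
    {"pmid": "1", "pmcid": "PMC1", "doi": None},
    {"pmid": None, "pmcid": None, "doi": "10.1000/x"},
]
new = [
    {"pmid": "1", "pmcid": None, "doi": None},
    {"pmid": None, "pmcid": None, "doi": "10.1000/x"},
    {"pmid": "2", "pmcid": None, "doi": None},
]

all_types = find_matches(new, existing)
pmid_only = find_matches(new, existing, id_types=("pmid",))
print(all_types, pmid_only)
```

Restricting the check to pmid alone misses the second record, which matches an existing ref only by its doi.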
class indra.db.content_manager.DatabaseError
Using this in a try-except will catch nothing. (That’s the point.)
class indra.db.content_manager.Manuscripts(*args, **kwargs)
ContentManager for the pmc manuscripts.
get_tarname_from_filename(fname)
Get the name of the tar file based on the file name (or a pmcid).
update(db, n_procs=1)
Add any new content found in the archives.
Note that this is very much the same as populating for manuscripts, as there is no finer-grained means of getting manuscripts than looking through the massive archive files. We do check whether there are any new listings in each file, minimizing the time spent downloading and searching; however, this will in general be the slowest of the update methods.
The continuing feature isn’t implemented yet.
class indra.db.content_manager.NihFtpClient(my_path, ftp_url='ftp.ncbi.nlm.nih.gov', local=False)
High level access to the NIH FTP repositories.
Parameters:
- my_path (str) – The path to the subdirectory around which this client operates.
- ftp_url (str) – The url to the ftp site. May be a local directory (see local). By default this is ‘ftp.ncbi.nlm.nih.gov’.
- local (bool) – These methods may be run on a local directory (intended for testing). Default is False.
ftp_ls_timestamped(ftp_path=None)
Get all contents and metadata in mlsd format from the ftp directory.
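As a rough illustration of what an MLSD listing provides, the sketch below reads the modify fact from each entry and converts it to a datetime. It relies on the mlsd method of Python's standard ftplib.FTP class, but is exercised here against a small fake object so that no network access is needed. The helper name ls_timestamped, the FakeFTP class, and the listed entries are assumptions for illustration, not part of NihFtpClient.

```python
from datetime import datetime


def ls_timestamped(ftp, path=""):
    """Return {name: modification time} for entries listed via MLSD."""
    listing = {}
    for name, facts in ftp.mlsd(path, facts=["modify"]):
        stamp = facts.get("modify")
        if stamp:
            # MLSD times have the form YYYYMMDDHHMMSS, optionally followed
            # by fractional seconds, which are dropped here.
            listing[name] = datetime.strptime(stamp[:14], "%Y%m%d%H%M%S")
    return listing


class FakeFTP:
    """Stand-in offering the same mlsd interface as ftplib.FTP."""

    def mlsd(self, path="", facts=()):
        yield "oa_file_list.csv", {"modify": "20180115093000", "type": "file"}
        yield "oa_package", {"type": "dir"}


listing = ls_timestamped(FakeFTP())
print(listing)
```

With a real connection, an ftplib.FTP instance that has logged in could be passed in place of FakeFTP.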
get_csv_as_dict(csv_file, cols=None, header=None)
Get the content from a csv file as a list of dicts.
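Turning CSV content into a list of dicts can be done with the standard csv module, as in the minimal sketch below. The helper csv_to_dicts is hypothetical, and the way it treats cols (explicit column names for a file without a header row) is an assumption, not necessarily how get_csv_as_dict interprets its arguments. The header parameter is not modelled.

```python
import csv
import io


def csv_to_dicts(csv_text, cols=None):
    """Parse CSV text into a list of dicts.

    Assumption: if cols is None the first row holds the column names;
    otherwise cols supplies them and every row is treated as data.
    """
    reader = csv.DictReader(io.StringIO(csv_text), fieldnames=cols)
    return [dict(row) for row in reader]


with_header = csv_to_dicts("File,PMCID\noa_package/a.tar.gz,PMC13900\n")
without_header = csv_to_dicts("a.tar.gz,PMC1\n", cols=["path", "pmcid"])
print(with_header)
print(without_header)
```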
class indra.db.content_manager.NihManager(*args, **kwargs)
Abstract class for all the managers that use the NIH FTP service.
See NihFtpClient for parameters.
class indra.db.content_manager.PmcManager(*args, **kwargs)
Abstract class for uploaders of PMC content.
For parameters, see NihManager.
populate(db, n_procs=1, continuing=False)
Perform the initial population of the pmc content into the database.
Parameters:
- db (indra.db.DatabaseManager instance) – The database to which the data will be uploaded.
- n_procs (int) – The number of processes to use when parsing xmls.
- continuing (bool) – If True, assume that we are picking up after an error, or otherwise continuing from an earlier process. This means we will skip over source files contained in the database. If False, all files will be read and parsed.
Returns: completed – If True, an update was completed. Otherwise, the upload was aborted for some reason, often because the upload was already completed at some earlier time.
Return type: bool
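The effect of the continuing flag can be pictured with a small standalone sketch. The function files_to_process and the file names are hypothetical; how populate actually determines which source files are already in the database is not shown here.

```python
def files_to_process(available_files, loaded_files, continuing=False):
    """Return the source files that still need to be read and parsed."""
    if not continuing:
        # A fresh run reads and parses everything.
        return list(available_files)
    already_loaded = set(loaded_files)
    return [f for f in available_files if f not in already_loaded]


available = ["archive_a.tar.gz", "archive_b.tar.gz", "archive_c.tar.gz"]
loaded = ["archive_a.tar.gz"]

fresh = files_to_process(available, loaded)
resumed = files_to_process(available, loaded, continuing=True)
print(fresh, resumed)
```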
process_archive(archive, q=None, db=None, continuing=False)
Download an archive and begin unpacking it.
Either q or db must be specified. The uncompressed contents of the archive will be loaded onto the database or placed on the queue in batches.
Parameters:
- archive (str) – The path of the archive beneath the head of this source’s ftp directory.
- q (multiprocessing.Queue) – When this method is called as a separate process, the contents of the archive are posted to a queue in batches to be handled externally.
- db (indra.db.DatabaseManager) – When not multiprocessing, the contents of the archive are uploaded to the database directly by this method.
- continuing (bool) – True if this method is being called to complete an earlier failed attempt to execute this method; will not download the archive if an archive of the same name is already downloaded locally. Default is False.
unpack_archive_path(archive_path, q=None, db=None, batch_size=10000)
Unpack the contents of an archive.
If q is given, then the data is put into the queue to be handed off for upload by another process. Otherwise, if db is provided, upload the batches of data on this process. One or the other MUST be provided.
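The batching behaviour described above can be sketched in isolation. In this illustration a queue.Queue stands in for multiprocessing.Queue (both offer put and get), and a plain callable stands in for the database upload. The names dispatch_in_batches and _send are hypothetical and this is not the code of unpack_archive_path.

```python
import queue


def _send(batch, q, upload):
    # Prefer the queue when given, mirroring the order described above.
    if q is not None:
        q.put(batch)
    else:
        upload(batch)


def dispatch_in_batches(records, q=None, upload=None, batch_size=10000):
    """Send records in batches to a queue, or to an upload callable."""
    if q is None and upload is None:
        raise ValueError("Either q or upload must be provided.")
    batch = []
    n_batches = 0
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            _send(batch, q, upload)
            n_batches += 1
            batch = []
    if batch:
        # Send whatever is left over as a final, smaller batch.
        _send(batch, q, upload)
        n_batches += 1
    return n_batches


q = queue.Queue()
n = dispatch_in_batches(range(25), q=q, batch_size=10)
sizes = [len(q.get()) for _ in range(n)]

uploaded = []
dispatch_in_batches(range(5), upload=uploaded.extend, batch_size=2)
print(n, sizes, uploaded)
```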
class indra.db.content_manager.PmcOA(*args, **kwargs)
ContentManager for the pmc open access content.
class indra.db.content_manager.Pubmed(*args, **kwargs)
Manager for the pubmed/medline content.
load_files(db, dirname, n_procs=1, continuing=False, carefully=False)
Load the files in the subdirectory indicated by dirname.
load_text_refs(db, article_info, carefully=False)
Sanitize, update old, and upload new text refs.
populate(db, n_procs=1, continuing=False)
Perform the initial input of the pubmed content into the database.
Parameters:
- db (indra.db.DatabaseManager instance) – The database to which the data will be uploaded.
- n_procs (int) – The number of processes to use when parsing xmls.
- continuing (bool) – If True, assume that we are picking up after an error, or otherwise continuing from an earlier process. This means we will skip over source files contained in the database. If False, all files will be read and parsed.
Utilities (indra.db.util)
indra.db.util.get_defaults()
Get the default database hosts provided in the specified DEFAULTS_FILE.
indra.db.util.get_primary_db(force_new=False)
Get a DatabaseManager instance for the primary database host.
The primary database host is defined in the defaults.txt file, or in a file given by the environment variable DEFAULTS_FILE. Alternatively, it may be defined by the INDRADBPRIMARY environment variable. If none of the above are specified, this function will raise an exception.
Note: by default, calling this function twice will return the same DatabaseManager instance. In other words:
>>> db1 = get_primary_db()
>>> db2 = get_primary_db()
>>> db1 is db2
True
This also means that, for example, db1.select_one(db2.TextRef) will work in the above context.
It is still recommended that scripts, functions, and other general applications not rely on this feature for database access, as it can make substituting a different database host both complicated and messy. Rather, a database instance should be explicitly passed between different users, as is done in the by_gene_role_type function’s call to get_statements in indra.db.query_db_stmts.
Parameters: force_new (bool) – If True, a new instance will be created and returned, regardless of whether there is an existing instance or not. Default is False, so that if this function has been called before within the global scope, the instance that was first created will be returned.
Returns: primary_db – An instance of the database manager that is attached to the primary database.
Return type: DatabaseManager instance
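The caching behaviour described above follows a common module-level singleton pattern, sketched below with a stand-in class. FakeManager, get_primary, and the host string are hypothetical. Whether a forced new instance replaces the cached one is not stated above; this sketch assumes it does not.

```python
class FakeManager:
    """Stand-in for DatabaseManager; holds only the arguments it was given."""

    def __init__(self, host, label=None):
        self.host = host
        self.label = label


_primary = None  # module-level cache of the first instance created


def get_primary(host="sqlite:///example.db", force_new=False):
    """Return the cached manager, creating it on first use or when forced."""
    global _primary
    if force_new:
        # Assumption: a forced instance is returned without touching the cache.
        return FakeManager(host, label="primary")
    if _primary is None:
        _primary = FakeManager(host, label="primary")
    return _primary


db1 = get_primary()
db2 = get_primary()
db3 = get_primary(force_new=True)
print(db1 is db2, db3 is db1)
```

Passing the returned object explicitly to the functions that need it, rather than calling the getter inside each of them, keeps it easy to substitute a different host later.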
indra.db.util.insert_agents(db, stmts, *other_clauses)
Insert the agents associated with the list of statements.
indra.db.util.insert_db_stmts(db, stmts, db_ref_id)
Insert statements, their database, and any affiliated agents.
Note that this method is for uploading statements that came from a database to our database, not for inserting arbitrary statements into the database.
indra.db.util.get_abstracts_by_pmids(db, pmid_list, unzip=True)
Get abstracts using the pmids in pmid_list.
indra.db.util.get_statements_by_gene_role_type(agent_id=None, agent_ns='HGNC', role=None, stmt_type=None, count=1000, do_stmt_count=True, db=None)
Get statements from the DB by stmt type, agent, and/or agent role.
Parameters:
- agent_id (str) – String representing the identifier of the agent from the given namespace. Note: if the agent namespace argument, agent_ns, is set to ‘HGNC’, this function will treat agent_id as an HGNC gene symbol and perform an internal lookup of the corresponding HGNC ID.
- agent_ns (str) – Namespace for the identifier given in agent_id.
- role (str) – String corresponding to the role of the agent in the statement. Options are ‘SUBJECT’, ‘OBJECT’, or ‘OTHER’ (in the case of Complex, SelfModification, and ActiveForm Statements).
- stmt_type (str) – Name of the Statement class.
- count (int) – Number of statements to retrieve in each batch (passed to get_statements()).
- do_stmt_count (bool) – Whether or not to perform an initial statement counting step to give more meaningful progress messages.
- db (indra.db.DatabaseManager object) – Optionally specify a database manager that attaches to something besides the primary database, for example a local database instance.
Returns: list of Statements from the database corresponding to the query.
indra.db.util.get_statements(clauses, count=1000, do_stmt_count=True, db=None)
Select statements according to a given set of clauses.
Parameters:
- clauses (list) – list of sqlalchemy WHERE clauses to pass to the filter query.
- count (int) – Number of statements to retrieve and process in each batch.
- do_stmt_count (bool) – Whether or not to perform an initial statement counting step to give more meaningful progress messages.
- db (indra.db.DatabaseManager object) – Optionally specify a database manager that attaches to something besides the primary database, for example a local database instance.
Returns: list of Statements from the database corresponding to the query.
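Retrieving results in batches of count rows, optionally preceded by a counting query, can be illustrated with Python's built-in sqlite3 module. This is a sketch of the general approach under assumed names (fetch_in_batches and a simplified statements table), not the query logic used by get_statements.

```python
import sqlite3


def fetch_in_batches(conn, query, params=(), count=1000):
    """Yield lists of at most `count` rows from the query result."""
    cursor = conn.execute(query, params)
    while True:
        batch = cursor.fetchmany(count)
        if not batch:
            break
        yield batch


# Hypothetical, simplified table standing in for stored statements.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE statements (id INTEGER PRIMARY KEY, type TEXT)")
conn.executemany(
    "INSERT INTO statements (type) VALUES (?)",
    [("Phosphorylation",)] * 7 + [("Complex",)] * 3,
)

# Optional counting step, comparable in spirit to do_stmt_count.
total = conn.execute(
    "SELECT COUNT(*) FROM statements WHERE type = ?", ("Phosphorylation",)
).fetchone()[0]

batch_sizes = [
    len(batch)
    for batch in fetch_in_batches(
        conn,
        "SELECT id, type FROM statements WHERE type = ?",
        ("Phosphorylation",),
        count=3,
    )
]
print(total, batch_sizes)
```

Knowing the total up front is what allows progress messages of the form "processed N of M" while the batches are consumed.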