Specifying data for analysis

We introduce the concept of a “data store”. This represents the data record(s) that you want to analyse. It can be a single file, a directory of files, a zipped directory of files or a single sqlitedb file containing multiple data records.

We represent this concept with a DataStore class, which comes in different flavours:

  • directory based

  • sqlitedb based

All of these flavours support being indexed, iterated over, etc.

A read-only data store

To create one, you provide a path AND the suffix of the files within the directory / zip that you will be analysing. (If the path ends with .sqlitedb, no file suffix is required.)

from cogent3 import open_data_store

dstore = open_data_store("data/raw.zip", suffix="fa*", limit=5, mode="r")
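Because we passed limit=5, at most five members are loaded. As a quick check (a minimal sketch; data stores are indexable and iterable, and also support len()):

len(dstore)
5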

Data store “members”

These are able to read their own raw data.

m = dstore[0]
m
'/Users/gavin/repos/Cogent3/doc/data/raw/ENSG00000157184.fa'
m.read()[:20]  # truncating
'>Human\nATGGTGCCCCGCC'

Looping over a data store

for m in dstore:
    print(m)
/Users/gavin/repos/Cogent3/doc/data/raw/ENSG00000157184.fa
/Users/gavin/repos/Cogent3/doc/data/raw/ENSG00000131791.fa
/Users/gavin/repos/Cogent3/doc/data/raw/ENSG00000127054.fa
/Users/gavin/repos/Cogent3/doc/data/raw/ENSG00000067704.fa
/Users/gavin/repos/Cogent3/doc/data/raw/ENSG00000182004.fa

Making a writeable data store

You create a writeable data store by specifying mode="w", or, to append, mode="a". In the former case, any existing records are overwritten. In the latter case, existing records are left untouched.
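For example, creating a directory-based data store for writing (a minimal sketch; "results" is a hypothetical output directory, and the write() call assumes the keyword unique_id / data arguments of recent cogent3 releases):

from cogent3 import open_data_store

out_dstore = open_data_store("results", suffix="fa", mode="w")
# hypothetical record; the data is stored under the given unique_id
m = out_dstore.write(unique_id="demo.fa", data=">seq1\nACGGTT\n")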

Sqlitedb data stores for serialised data

When you specify a Sqlitedb data store as your output (by using open_data_store()), you write multiple records into a single file, making distribution easier.
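For example (a minimal sketch; "results.sqlitedb" is a hypothetical output path, and no suffix is required because the path ends with .sqlitedb):

out_dstore = open_data_store("results.sqlitedb", mode="w")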

One important issue to note: the process which creates a Sqlitedb “locks” the file. If that process exits abnormally (e.g. the run that was producing it was interrupted), the file may remain in a locked state. If the db is in this state, cogent3 will not modify it unless you explicitly unlock it.

This is represented in the display as shown below.

dstore = open_data_store("data/demo-locked.sqlitedb")
dstore.describe
Locked db store.
record type      number
-----------------------
completed           175
not_completed         0
logs                  1

3 rows x 2 columns

To unlock, you execute the following:

dstore.unlock(force=True)

Interrogating run logs

If you use the apply_to() method, a scitrack logfile will be included in the data store. This includes useful information regarding the run conditions that produced the contents of the data store.

dstore.summary_logs
summary of log files

time            2019-07-24 14:42:56
name            logs/load_unaligned-progressive_align-write_db-pid8650.log
python version  3.7.3
who             gavin
command         /Users/gavin/miniconda3/envs/c3dev/lib/python3.7/site-packages/ipykernel_launcher.py -f /Users/gavin/Library/Jupyter/runtime/kernel-5eb93aeb-f6e0-493e-85d1-d62895201ae2.json
composable      load_unaligned(type='sequences', moltype='dna', format='fasta') + progressive_align(type='sequences', model='HKY85', gc=None, param_vals={'kappa': 3}, guide_tree=None, unique_guides=False, indel_length=0.1, indel_rate=1e-10) + write_db(type='output', data_path='../data/aligned-nt.tinydb', name_callback=None, create=True, if_exists='overwrite', suffix='json')

1 rows x 6 columns

Log files can be accessed via a special attribute.

dstore.logs
[DataMember(data_store=/Users/gavin/repos/Cogent3/doc/data/demo-locked.sqlitedb, unique_id=logs/load_unaligned-progressive_align-write_db-pid8650.log)]

Each element in that list is a DataMember, which you can use to get the data contents.

print(dstore.logs[0].read()[:225])  # truncated for clarity
2019-07-24 14:42:56	Eratosthenes.local:8650	INFO	system_details : system=Darwin Kernel Version 18.6.0: Thu Apr 25 23:16:27 PDT 2019; root:xnu-4903.261.4~2/RELEASE_X86_64
2019-07-24 14:42:56	Eratosthenes.local:8650	INFO	python