Specifying data for analysis¶
We introduce the concept of a “data store”. This represents the data record(s) that you want to analyse. It can be a single file, a directory of files, a zipped directory of files, or a single sqlitedb file containing multiple data records.
We represent this concept with a DataStore class. There are different flavours of these:

- directory based
- Sqlite based
All of these types support being indexed, iterated over, etc.
A read-only data store¶
To create one of these, you provide a path AND a suffix of the files within the directory / zip that you will be analysing. (If the path ends with .sqlitedb, no file suffix is required.)
from cogent3 import open_data_store
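# limit=5 restricts the store to its first 5 members; mode="r" opens it read-only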
dstore = open_data_store("data/raw.zip", suffix="fa*", limit=5, mode="r")
Data store “members”¶
These are able to read their own raw data.
m = dstore[0]
m
'/Users/gavin/repos/Cogent3/doc/data/raw/ENSG00000157184.fa'
m.read()[:20] # truncating
'>Human\nATGGTGCCCCGCC'
Looping over a data store¶
for m in dstore:
    print(m)
/Users/gavin/repos/Cogent3/doc/data/raw/ENSG00000157184.fa
/Users/gavin/repos/Cogent3/doc/data/raw/ENSG00000131791.fa
/Users/gavin/repos/Cogent3/doc/data/raw/ENSG00000127054.fa
/Users/gavin/repos/Cogent3/doc/data/raw/ENSG00000067704.fa
/Users/gavin/repos/Cogent3/doc/data/raw/ENSG00000182004.fa
Making a writeable data store¶
The creation of a writeable data store is specified with mode="w", or (to append) mode="a". In the former case, any existing records are overwritten. In the latter case, existing records are ignored.
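A minimal sketch of each mode (the output path results here is hypothetical):

# "w" builds the store afresh, overwriting any records already present
out_dstore = open_data_store("results", suffix="fa", mode="w")
# "a" reuses an existing store, leaving its current records untouched
out_dstore = open_data_store("results", suffix="fa", mode="a")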
Sqlitedb data stores for serialised data¶
When you specify a Sqlitedb data store as your output (by using open_data_store()), you write multiple records into a single file, making distribution easier.
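For instance, the following sketch (with a hypothetical output path) creates a writeable Sqlitedb data store. Because the path ends with .sqlitedb, no suffix is needed.

out_dstore = open_data_store("results.sqlitedb", mode="w")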
One important issue to note is that the process which creates a Sqlitedb “locks” the file. If that process exits abnormally (e.g. the run that was producing it was interrupted), the file may remain in a locked state. If the db is in this state, cogent3 will not modify it unless you explicitly unlock it. This is indicated in the display as shown below.
dstore = open_data_store("data/demo-locked.sqlitedb")
dstore.describe
record type | number
---|---
completed | 175
not_completed | 0
logs | 1

3 rows x 2 columns
To unlock, you execute the following:
dstore.unlock(force=True)
Interrogating run logs¶
If you use the apply_to() method, a scitrack logfile will be included in the data store. This includes useful information regarding the run conditions that produced the contents of the data store.
dstore.summary_logs
time | name | python version | who | command | composable
---|---|---|---|---|---
2019-07-24 14:42:56 | logs/load_unaligned-progressive_align-write_db-pid8650.log | 3.7.3 | gavin | /Users/gavin/miniconda3/envs/c3dev/lib/python3.7/site-packages/ipykernel_launcher.py -f /Users/gavin/Library/Jupyter/runtime/kernel-5eb93aeb-f6e0-493e-85d1-d62895201ae2.json | load_unaligned(type='sequences', moltype='dna', format='fasta') + progressive_align(type='sequences', model='HKY85', gc=None, param_vals={'kappa': 3}, guide_tree=None, unique_guides=False, indel_length=0.1, indel_rate=1e-10) + write_db(type='output', data_path='../data/aligned-nt.tinydb', name_callback=None, create=True, if_exists='overwrite', suffix='json')
1 rows x 6 columns
Log files can be accessed via a special attribute.
dstore.logs
[DataMember(data_store=/Users/gavin/repos/Cogent3/doc/data/demo-locked.sqlitedb, unique_id=logs/load_unaligned-progressive_align-write_db-pid8650.log)]
Each element in that list is a DataMember, which you can use to get the data contents.
print(dstore.logs[0].read()[:225]) # truncated for clarity
2019-07-24 14:42:56 Eratosthenes.local:8650 INFO system_details : system=Darwin Kernel Version 18.6.0: Thu Apr 25 23:16:27 PDT 2019; root:xnu-4903.261.4~2/RELEASE_X86_64
2019-07-24 14:42:56 Eratosthenes.local:8650 INFO python