tidymut.cleaners.k50_cleaner module
- class tidymut.cleaners.k50_cleaner.K50CleanerConfig(pipeline_name: str = 'k50_cleaner', num_workers: int = 16, validate_config: bool = True, column_mapping: ~typing.Dict[str, str] = <factory>, filters: ~typing.Dict[str, ~typing.Callable] = <factory>, type_conversions: ~typing.Dict[str, str] = <factory>, validation_workers: int = 16, infer_wt_workers: int = 16, handle_multiple_wt: ~typing.Literal['error', 'first', 'separate'] = 'error', label_columns: ~typing.List[str] = <factory>, primary_label_column: str = 'ddG')[source]
Bases:
BaseCleanerConfig
Configuration class for K50 dataset cleaner
Inherits from BaseCleanerConfig and adds K50-specific configuration options.
- column_mapping
Mapping from source to target column names
- Type:
Dict[str, str]
- filters
Filter conditions for data cleaning
- Type:
Dict[str, Callable]
- type_conversions
Data type conversion specifications
- Type:
Dict[str, str]
- validation_workers
Number of workers for mutation validation; set to -1 to use all available CPUs
- Type:
int
- infer_wt_workers
Number of workers for wildtype sequence inference; set to -1 to use all available CPUs
- Type:
int
- handle_multiple_wt
Strategy for handling multiple wildtype sequences ('error', 'first', 'separate')
- Type:
Literal["error", "first", "separate"], default="error"
- label_columns
List of score columns to process
- Type:
List[str]
- primary_label_column
Primary score column for the dataset
- Type:
str
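The three `handle_multiple_wt` strategies can be sketched in plain Python. Note that `resolve_wildtypes` is a hypothetical helper, not part of tidymut's API; it only illustrates the documented semantics:

```python
from typing import List, Literal

def resolve_wildtypes(
    wt_candidates: List[str],
    strategy: Literal["error", "first", "separate"],
) -> List[str]:
    """Illustrative helper: apply a handle_multiple_wt strategy to the
    distinct wildtype sequences inferred for one protein."""
    unique = list(dict.fromkeys(wt_candidates))  # preserve order, drop duplicates
    if len(unique) <= 1:
        return unique
    if strategy == "error":
        # Default: refuse to guess when inference is ambiguous
        raise ValueError(f"Multiple wildtype sequences found: {len(unique)}")
    if strategy == "first":
        # Keep only the first inferred wildtype
        return unique[:1]
    # "separate": keep each distinct wildtype as its own entry
    return unique

print(resolve_wildtypes(["MKT", "MKT", "MAT"], "first"))     # ['MKT']
print(resolve_wildtypes(["MKT", "MKT", "MAT"], "separate"))  # ['MKT', 'MAT']
```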
- column_mapping: Dict[str, str]
- filters: Dict[str, Callable]
- handle_multiple_wt: Literal['error', 'first', 'separate'] = 'error'
- infer_wt_workers: int = 16
- label_columns: List[str]
- pipeline_name: str = 'k50_cleaner'
- primary_label_column: str = 'ddG'
- type_conversions: Dict[str, str]
- validate() → None [source]
Validate K50-specific configuration parameters
- Raises:
ValueError – If configuration is invalid
- validation_workers: int = 16
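How `column_mapping`, `type_conversions`, and `filters` work together can be sketched with a minimal pure-Python stand-in. Everything here is illustrative: `apply_cleaning`, the sample column names, and the cast table are assumptions for the sketch, not tidymut internals:

```python
from typing import Any, Callable, Dict, List

# Hypothetical mini-versions of the three config mappings; the column
# names are illustrative, not taken from the K50 dataset specification.
column_mapping = {"aa_seq": "mut_seq", "deltaG": "ddG"}
type_conversions = {"ddG": "float"}
filters: Dict[str, Callable[[Any], bool]] = {"ddG": lambda v: v is not None}

_CASTS = {"float": float, "int": int, "str": str}

def apply_cleaning(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Rename columns, convert types, then drop rows failing any filter."""
    cleaned = []
    for row in rows:
        renamed = {column_mapping.get(k, k): v for k, v in row.items()}
        for col, dtype in type_conversions.items():
            if renamed.get(col) is not None:
                renamed[col] = _CASTS[dtype](renamed[col])
        if all(pred(renamed.get(col)) for col, pred in filters.items()):
            cleaned.append(renamed)
    return cleaned

rows = [{"aa_seq": "MKT", "deltaG": "1.5"}, {"aa_seq": "MAT", "deltaG": None}]
print(apply_cleaning(rows))  # [{'mut_seq': 'MKT', 'ddG': 1.5}]
```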
- tidymut.cleaners.k50_cleaner.clean_k50_dataset(pipeline: Pipeline) → Tuple[Pipeline, MutationDataset] [source]
Clean K50 dataset using configurable pipeline
- Parameters:
pipeline (Pipeline) – K50 dataset cleaning pipeline
- Returns:
Pipeline: The pipeline after cleaning has been executed
MutationDataset: The cleaned K50 dataset
- Return type:
Tuple[Pipeline, MutationDataset]
- tidymut.cleaners.k50_cleaner.create_k50_cleaner(dataset_or_path: DataFrame | str | Path, config: K50CleanerConfig | Dict[str, Any] | str | Path | None = None) → Pipeline [source]
Create K50 dataset cleaning pipeline
- Parameters:
dataset_or_path (Union[pd.DataFrame, str, Path]) – Raw K50 dataset DataFrame or file path to the K50 dataset
- Download from: https://zenodo.org/records/799292
- File: Tsuboyama2023_Dataset2_Dataset3_20230416.csv in Processed_K50_dG_datasets.zip
config (Optional[Union[K50CleanerConfig, Dict[str, Any], str, Path]]) – Configuration for the cleaning pipeline. Can be:
- K50CleanerConfig object
- Dictionary with configuration parameters (merged with defaults)
- Path to a JSON configuration file (str or Path)
- None (uses the default configuration)
- Returns:
The configured K50 cleaning pipeline
- Return type:
Pipeline
- Raises:
TypeError – If config has invalid type
ValueError – If configuration validation fails
Examples
Use the default configuration:
>>> pipeline = create_k50_cleaner(df)
Use a partial configuration:
>>> pipeline = create_k50_cleaner(df, config={
...     "validation_workers": 8,
...     "handle_multiple_wt": "first"
... })
Load configuration from a file:
>>> pipeline = create_k50_cleaner(df, config="config.json")
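A JSON configuration file for the file-path option can be produced as below. This is a sketch: the key names mirror the K50CleanerConfig fields documented above, but the filename is arbitrary, and only overridden keys need to appear since a loaded configuration is merged with the defaults:

```python
import json
from pathlib import Path

# Partial configuration; keys mirror K50CleanerConfig fields above.
partial_config = {
    "validation_workers": 8,
    "infer_wt_workers": 8,
    "handle_multiple_wt": "first",
}

# Write the file; its path could then be passed as config="k50_config.json".
path = Path("k50_config.json")
path.write_text(json.dumps(partial_config, indent=2))

# Loading it back yields a plain dict of the overridden keys.
loaded = json.loads(path.read_text())
print(loaded["handle_multiple_wt"])  # first
```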