spacr.ml
Module Contents
- class spacr.ml.QuasiBinomial(link=logit(), dispersion=1.0)
Bases: statsmodels.genmod.families.Binomial
Custom Quasi-Binomial family with adjustable variance.
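As a rough illustration of what "adjustable variance" usually means for a quasi-binomial family, the binomial variance function mu * (1 - mu) is multiplied by a dispersion factor. The sketch below uses plain NumPy and a hypothetical helper name, `quasi_binomial_variance`. It is an assumption about the general technique and does not reproduce this class's implementation, which is not shown on this page.

```python
import numpy as np

def quasi_binomial_variance(mu, dispersion=1.0):
    """Hypothetical helper: binomial variance mu * (1 - mu) scaled by a dispersion factor."""
    mu = np.asarray(mu, dtype=float)
    return dispersion * mu * (1.0 - mu)

mu = np.array([0.1, 0.5, 0.9])
base = quasi_binomial_variance(mu)           # dispersion = 1 gives the ordinary binomial variance
inflated = quasi_binomial_variance(mu, 2.0)  # dispersion > 1 models overdispersion
```

With a dispersion of 1.0 (the default in the signature above), the variance reduces to the standard binomial form.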
- spacr.ml.create_volcano_filename(csv_path, regression_type, alpha, dst)
Create and return the volcano plot filename based on regression type and alpha.
- spacr.ml.scale_variables(X, y)
Scale independent (X) and dependent (y) variables using MinMaxScaler.
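A minimal sketch of min-max scaling with scikit-learn's `MinMaxScaler`, which the description above names. The toy arrays are made up for illustration, and whether `scale_variables` returns arrays or DataFrames is not stated here, so the return handling below is an assumption.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (made up for illustration): 4 samples, 2 features, 1 response.
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0], [4.0, 800.0]])
y = np.array([10.0, 20.0, 30.0, 50.0])

# MinMaxScaler expects 2-D input, so y is reshaped to a single column.
X_scaled = MinMaxScaler().fit_transform(X)
y_scaled = MinMaxScaler().fit_transform(y.reshape(-1, 1)).ravel()
```

Each column is mapped independently onto the range [0, 1], so features on very different scales become comparable.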
- spacr.ml.process_model_coefficients(model, regression_type, X, y, nc, pc, controls)
Return DataFrame of model coefficients and p-values.
- spacr.ml.prepare_formula(dependent_variable, random_row_column_effects=False)
Return the regression formula using random effects for plate, row, and column.
- spacr.ml.check_and_clean_data(df, dependent_variable)
Check for collinearity, missing values, or invalid types in relevant columns. Clean data accordingly.
- spacr.ml.check_normality(y, variable_name)
Check if the data is normally distributed using the Shapiro-Wilk test.
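The Shapiro-Wilk test is available as `scipy.stats.shapiro`. The sketch below shows the test itself on synthetic data. The helper name `looks_normal` and the 0.05 threshold are assumptions for illustration, not taken from `check_normality`.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=500)
skewed_sample = rng.exponential(scale=1.0, size=500)

def looks_normal(values, alpha=0.05):
    """Hypothetical helper: True if Shapiro-Wilk does not reject normality at level alpha."""
    statistic, p_value = shapiro(values)
    return bool(p_value > alpha)

skewed_is_normal = looks_normal(skewed_sample)  # strongly skewed data should be rejected
```

A small p-value means the null hypothesis of normality is rejected. A large p-value only means the test found no evidence against it.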
- spacr.ml.minimum_cell_simulation(settings, num_repeats=10, sample_size=100, tolerance=0.02, smoothing=10, increment=10)
Plot the mean absolute difference with standard deviation as shaded area vs. sample size. Detect and mark the elbow point (inflection) with smoothing and tolerance control.
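One plausible way to combine smoothing and a tolerance when locating an elbow is sketched below. This is an assumed rule for illustration, not the algorithm used by `minimum_cell_simulation`: the curve is smoothed with a moving average, and the elbow is taken as the first sample size whose smoothed value lies within `tolerance` of the curve's final value. The function name `find_elbow` is hypothetical.

```python
import numpy as np

def find_elbow(sample_sizes, mean_abs_diff, smoothing=3, tolerance=0.02):
    """Hypothetical elbow rule: first sample size whose smoothed error is
    within `tolerance` of the smoothed curve's final value."""
    sample_sizes = np.asarray(sample_sizes)
    values = np.asarray(mean_abs_diff, dtype=float)
    # Moving average; edge padding keeps the output the same length as the input.
    pad = smoothing // 2
    padded = np.pad(values, pad, mode="edge")
    kernel = np.ones(smoothing) / smoothing
    smoothed = np.convolve(padded, kernel, mode="valid")[: len(values)]
    # The last point is always within tolerance of itself, so argmax finds a match.
    close = np.abs(smoothed - smoothed[-1]) <= tolerance
    return sample_sizes[np.argmax(close)]

# Synthetic curve (made up): the error shrinks as the sample size grows.
sizes = np.arange(10, 110, 10)
curve = 1.0 / sizes
elbow = find_elbow(sizes, curve, smoothing=3, tolerance=0.02)
```

A larger `smoothing` window suppresses noise between repeats, while a smaller `tolerance` pushes the detected elbow toward larger sample sizes.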
- spacr.ml.process_model_coefficients(model, regression_type, X, y, nc, pc, controls)
Return DataFrame of model coefficients, standard errors, and p-values.
- spacr.ml.check_distribution(y, epsilon=1e-06)
Check the distribution of y and recommend an appropriate model.
- spacr.ml.pick_glm_family_and_link(y)
Select the appropriate GLM family and link function based on data.
- spacr.ml.regression_model(X, y, regression_type='ols', groups=None, alpha=1.0, cov_type=None)
- spacr.ml.regression(df, csv_path, dependent_variable='predictions', regression_type=None, alpha=1.0, random_row_column_effects=False, nc='233460', pc='220950', controls=[''], dst=None, cov_type=None, plot=False)
- spacr.ml.save_summary_to_file(model, file_path='summary.csv')
Save the model’s summary output to a CSV or text file.
- spacr.ml.process_reads(csv_path, fraction_threshold, plate, filter_column=None, filter_value=None)
- spacr.ml.check_normality(data, variable_name, verbose=False)
Check if the data is normally distributed using the Shapiro-Wilk test.
- spacr.ml.process_scores(df, dependent_variable, plate, min_cell_count=25, agg_type='mean', transform=None, regression_type='ols')
- spacr.ml.ml_analysis(df, channel_of_interest=3, location_column='columnID', positive_control='c2', negative_control='c1', exclude=None, n_repeats=10, top_features=30, reg_alpha=0.1, reg_lambda=1.0, learning_rate=1e-05, n_estimators=1000, test_size=0.2, model_type='xgboost', n_jobs=-1, remove_low_variance_features=True, remove_highly_correlated_features=True, prune_features=False, cross_validation=False, verbose=False)
Calculate permutation importance for the numerical features in the DataFrame by comparing groups defined by values in location_column, then use the trained model to predict the class of all other rows in the DataFrame.
Args:
    df (pandas.DataFrame): The DataFrame containing the data.
    feature_string (str): String to filter features that contain this substring.
    location_column (str): Column name to use for comparing groups.
    positive_control, negative_control (str): Values in location_column to create subsets for comparison.
    exclude (list or str, optional): Columns to exclude from features.
    n_repeats (int): Number of repeats for permutation importance.
    top_features (int): Number of top features to plot based on permutation importance.
    n_estimators (int): Number of trees in the random forest, gradient boosting, or XGBoost model.
    test_size (float): Proportion of the dataset to include in the test split.
    random_state (int): Random seed for reproducibility.
    model_type (str): Type of model to use ('random_forest', 'logistic_regression', 'gradient_boosting', 'xgboost').
    n_jobs (int): Number of jobs to run in parallel for applicable models.
Returns:
    pandas.DataFrame: The original DataFrame with added prediction and data usage columns.
    pandas.DataFrame: DataFrame containing the importances and standard deviations.
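To make the permutation-importance step concrete, here is a minimal sketch using scikit-learn's `permutation_importance` on synthetic data. It assumes a random forest rather than the default 'xgboost' model to keep dependencies light, and it is not a reproduction of what `ml_analysis` does internally. The data and variable names are made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (made up): only feature 0 carries signal about the class.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Shuffle each feature n_repeats times and record the resulting drop in test score.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
importances_mean = result.importances_mean
importances_std = result.importances_std
```

The mean and standard deviation per feature correspond to the kind of "importances and standard deviations" table described in the Returns section above.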
- spacr.ml.shap_analysis(model, X_train, X_test)
Performs SHAP analysis on the given model and data.
Args:
    model: The trained model.
    X_train (pandas.DataFrame): Training feature set.
    X_test (pandas.DataFrame): Testing feature set.
Returns:
    fig: Matplotlib figure object containing the SHAP summary plot.