CivisML Sparse Logistic
civis_ml_sparse_logistic(
  x,
  dependent_variable,
  primary_key = NULL,
  excluded_columns = NULL,
  penalty = c("l2", "l1"),
  dual = FALSE,
  tol = 1e-08,
  C = 499999950,
  fit_intercept = TRUE,
  intercept_scaling = 1,
  class_weight = NULL,
  random_state = 42,
  solver = c("liblinear", "newton-cg", "lbfgs", "sag"),
  max_iter = 100,
  multi_class = c("ovr", "multinomial"),
  fit_params = NULL,
  cross_validation_parameters = NULL,
  calibration = NULL,
  oos_scores_table = NULL,
  oos_scores_db = NULL,
  oos_scores_if_exists = c("fail", "append", "drop", "truncate"),
  model_name = NULL,
  cpu_requested = NULL,
  memory_requested = NULL,
  disk_requested = NULL,
  notifications = NULL,
  polling_interval = NULL,
  verbose = FALSE,
  civisml_version = "prod"
)
Argument | Description |
---|---|
x | See the Data Sources section below. |
dependent_variable | The dependent variable of the training dataset. For a multi-target problem, this should be a vector of column names of dependent variables. Nulls in a single dependent variable will automatically be dropped. |
primary_key | Optional, the unique ID (primary key) of the training dataset. This will be used to index the out-of-sample scores. In |
excluded_columns | Optional, a vector of columns which will be considered ineligible to be independent variables. |
penalty | Used to specify the norm used in the penalization. The |
dual | Dual or primal formulation. Dual formulation is only implemented for |
tol | Tolerance for stopping criteria. |
C | Inverse of regularization strength, must be a positive float. Smaller values specify stronger regularization. |
fit_intercept | Should a constant or intercept term be included in the model. |
intercept_scaling | Useful only when the |
class_weight | A Note, the class weights are multiplied with |
random_state | The seed of the random number generator to use when shuffling the data. Used only in |
solver | Algorithm to use in the optimization problem. For small data Note that |
max_iter | The maximum number of iterations taken for the solvers to converge. Useful for the |
multi_class | The scheme for multi-class problems. When |
fit_params | Optional, a mapping from parameter names in the model's |
cross_validation_parameters | Optional, parameter grid for learner parameters, e.g. |
calibration | Optional, if not |
oos_scores_table | Optional, if provided, store out-of-sample predictions on training set data to this Redshift "schema.tablename". |
oos_scores_db | Optional, the name of the database where the |
oos_scores_if_exists | Optional, action to take if |
model_name | Optional, the prefix of the Platform modeling jobs. It will have |
cpu_requested | Optional, the number of CPU shares requested in the Civis Platform for training jobs or prediction child jobs. 1024 shares = 1 CPU. |
memory_requested | Optional, the memory requested from Civis Platform for training jobs or prediction child jobs, in MiB. |
disk_requested | Optional, the disk space requested on Civis Platform for training jobs or prediction child jobs, in GB. |
notifications | Optional, model status notifications. See |
polling_interval | The number of seconds to wait between checks for job completion. |
verbose | Optional, If |
civisml_version | Optional, a one-length character vector of the CivisML version. The default is "prod", the latest version in production. |
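The three resource arguments use different units (CPU shares, MiB, and GB), which is easy to mix up. The sketch below converts from whole CPUs and GiB before making the call. `cpus_to_shares` and `gib_to_mib` are hypothetical helpers, not part of the civis package, and the requested sizes are arbitrary. The modeling call is wrapped in `if (FALSE)` because it needs Civis Platform credentials.

```r
# Hypothetical helpers (not part of civis): convert to the units that the
# resource arguments expect.
cpus_to_shares <- function(cpus) as.integer(round(cpus * 1024))  # 1024 shares = 1 CPU
gib_to_mib <- function(gib) as.integer(round(gib * 1024))        # 1 GiB = 1024 MiB

cpu <- cpus_to_shares(2)  # two CPUs, expressed in shares
mem <- gib_to_mib(4)      # four GiB, expressed in MiB

if (FALSE) {
  library(civis)
  df <- iris
  names(df) <- gsub("\\.", "_", names(df))
  m <- civis_ml_sparse_logistic(df, "Species",
    cpu_requested = cpu,
    memory_requested = mem,
    disk_requested = 10)  # disk space is given in GB
}
```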
A civis_ml object, a list containing the following elements:
job metadata from scripts_get_custom.
run metadata from scripts_get_custom_runs.
CivisML metadata from scripts_list_custom_runs_outputs containing the locations of files produced by CivisML, e.g. files, projects, metrics, model_info, logs, predictions, and estimators.
Parsed CivisML output from metrics.json containing metadata from validation. A list containing the following elements:
run list, metadata about the run.
data list, metadata about the training data.
model list, the fitted scikit-learn model with CV results.
metrics list, validation metrics (accuracy, confusion, ROC, AUC, etc).
warnings list.
data_platform list, training data location.
Parsed CivisML output from model_info.json containing metadata from training. A list containing the following elements:
run list, metadata about the run.
data list, metadata about the training data.
model list, the fitted scikit-learn model.
metrics empty list.
warnings list.
data_platform list, training data location.
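The following sketch shows how the returned list might be navigated. The top-level element names (`job`, `run`, `outputs`, `metrics`, `model_info`) are assumptions, since they are not spelled out in the text above. The nested names under the parsed outputs are the ones listed. `mock_model` is a local stand-in built from empty placeholders so that the code runs without Civis Platform; it holds no real results.

```r
# Sub-elements shared by the parsed metrics.json and model_info.json outputs,
# as listed above.
parsed_names <- c("run", "data", "model", "metrics", "warnings", "data_platform")
empty_parsed <- setNames(
  replicate(length(parsed_names), list(), simplify = FALSE),
  parsed_names
)

# Stand-in for a fitted civis_ml object; top-level names are assumed.
mock_model <- list(
  job = list(),
  run = list(),
  outputs = list(),
  metrics = empty_parsed,
  model_info = empty_parsed
)

# List the pieces of the validation metadata.
names(mock_model$metrics)

# In the training metadata, the metrics element is an empty list.
length(mock_model$model_info$metrics)

if (FALSE) {
  # With a real model, the same accessors would apply (assumed names).
  library(civis)
  df <- iris
  names(df) <- gsub("\\.", "_", names(df))
  m <- civis_ml_sparse_logistic(df, "Species")
  str(m$metrics$metrics, max.level = 1)
}
```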
Data Sources
For building models with civis_ml, the training data can reside in four different places: a file in the Civis Platform, a CSV or feather-format file on the local disk, a data.frame resident in the local R environment, and finally, a table in the Civis Platform. Use the following helpers to specify the data source when calling civis_ml:
data.frame
civis_ml(x = civis_table(table_name = "schema.table", database_name = "database"))
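The four data sources might be passed as follows. This is a sketch only: the file ID `1234`, the local path `iris.csv`, and the table and database names are placeholders, and passing a local file path directly as `x` is an assumption based on the description above. `civis_file()` refers to a file already stored in the Civis Platform. The modeling calls are wrapped in `if (FALSE)` because they require Platform credentials.

```r
# A data.frame in the local R environment. Dots in the column names are
# replaced with underscores, as in the example on this page.
df <- iris
names(df) <- gsub("\\.", "_", names(df))

if (FALSE) {
  library(civis)

  # 1. data.frame in the local R environment
  m_df <- civis_ml_sparse_logistic(x = df, dependent_variable = "Species")

  # 2. CSV file on the local disk (placeholder path)
  write.csv(df, "iris.csv", row.names = FALSE)
  m_csv <- civis_ml_sparse_logistic(x = "iris.csv",
    dependent_variable = "Species")

  # 3. file in the Civis Platform (placeholder file ID)
  m_file <- civis_ml_sparse_logistic(x = civis_file(1234),
    dependent_variable = "Species")

  # 4. table in the Civis Platform (placeholder table and database names)
  m_table <- civis_ml_sparse_logistic(
    x = civis_table(table_name = "schema.table", database_name = "database"),
    dependent_variable = "Species")
}
```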
if (FALSE) {
  df <- iris
  names(df) <- gsub("\\.", "_", names(df))

  m <- civis_ml_sparse_logistic(df, "Species")
  yhat <- fetch_oos_scores(m)

  # Grid Search
  cv_params <- list(C = c(.01, 1, 10, 100, 1000))
  m <- civis_ml_sparse_logistic(df, "Species",
    cross_validation_parameters = cv_params)

  # make a prediction job, storing in a redshift table
  pred_info <- predict(m, newdata = civis_table("schema.table", "my_database"),
    output_table = "schema.scores_table")
}