CivisML Sparse Logistic

civis_ml_sparse_logistic(x, dependent_variable, primary_key = NULL,
  excluded_columns = NULL, penalty = c("l2", "l1"), dual = FALSE,
  tol = 1e-08, C = 499999950, fit_intercept = TRUE,
  intercept_scaling = 1, class_weight = NULL, random_state = 42,
  solver = c("liblinear", "newton-cg", "lbfgs", "sag"), max_iter = 100,
  multi_class = c("ovr", "multinomial"), fit_params = NULL,
  cross_validation_parameters = NULL, calibration = NULL,
  oos_scores_table = NULL, oos_scores_db = NULL,
  oos_scores_if_exists = c("fail", "append", "drop", "truncate"),
  model_name = NULL, cpu_requested = NULL, memory_requested = NULL,
  disk_requested = NULL, notifications = NULL,
  polling_interval = NULL, verbose = FALSE)

Arguments

x

See the Data Sources section below.

dependent_variable

The dependent variable of the training dataset. For a multi-target problem, this should be a vector of column names of dependent variables. Nulls in a single dependent variable will automatically be dropped.

primary_key

Optional, the unique ID (primary key) of the training dataset. This will be used to index the out-of-sample scores. In predict.civis_ml, the default primary_key = NA uses the primary_key of the training task. Use primary_key = NULL to explicitly indicate that the data have no primary_key.

excluded_columns

Optional, a vector of columns which will be considered ineligible to be independent variables.

penalty

Used to specify the norm used in the penalization. The newton-cg, sag, and lbfgs solvers support only l2 penalties.

dual

Dual or primal formulation. Dual formulation is only implemented for l2 penalty with the liblinear solver. dual = FALSE should be preferred when n_samples > n_features.

tol

Tolerance for stopping criteria.

C

Inverse of regularization strength, must be a positive float. Smaller values specify stronger regularization.

fit_intercept

Whether a constant (intercept) term should be included in the model.

intercept_scaling

Useful only when solver = "liblinear" and fit_intercept = TRUE. In this case, a constant term with the value intercept_scaling is added to the design matrix.

class_weight

A list with class_label = value pairs, or "balanced". When class_weight = "balanced", the class weights will be inversely proportional to the class frequencies in the input data: $$ \frac{\text{n\_samples}}{\text{n\_classes} \times \text{table}(y)} $$

Note, the class weights are multiplied with sample_weight (passed via fit_params) if sample_weight is specified.
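The "balanced" formula can be reproduced locally in base R. The following is a minimal sketch on a made-up outcome vector (the vector y and the variable names are illustrative assumptions, not part of CivisML); it only shows what the formula computes and how the result could be written as a list of class_label = value pairs.

```r
# Hypothetical imbalanced outcome: 6 "a", 2 "b", 2 "c".
y <- factor(c(rep("a", 6), rep("b", 2), rep("c", 2)))

n_samples <- length(y)
n_classes <- nlevels(y)

# Balanced weights: n_samples / (n_classes * table(y)).
balanced <- n_samples / (n_classes * table(y))
print(round(balanced, 3))

# The same weights written as an explicit list of class_label = value pairs.
class_weight <- as.list(setNames(as.numeric(balanced), names(balanced)))
str(class_weight)
```

Rare classes receive larger weights, and the weighted class counts sum back to n_samples.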

random_state

The seed of the random number generator to use when shuffling the data. Used only in solver = "sag" and solver = "liblinear".

solver

Algorithm to use in the optimization problem. For small data liblinear is a good choice. sag is faster for larger problems. For multiclass problems, only newton-cg, sag, and lbfgs handle multinomial loss. liblinear is limited to one-versus-rest schemes. newton-cg, lbfgs, and sag only handle the l2 penalty.

Note that fast convergence of sag is only guaranteed on features with approximately the same scale.

max_iter

The maximum number of iterations taken for the solvers to converge. Useful for the newton-cg, sag, and lbfgs solvers.

multi_class

The scheme for multi-class problems. When ovr, then a binary problem is fit for each label. When multinomial, a single model is fit minimizing the multinomial loss. Note, multinomial only works with the newton-cg, sag, and lbfgs solvers.
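The restrictions among penalty, dual, solver, and multi_class described above can be checked before submitting a job. The helper below is a hypothetical sketch, not a function provided by the civis package; it encodes only the constraints stated on this page and returns a character vector of problems (empty when none are found).

```r
# Hypothetical helper: collect violations of the documented constraints.
check_logistic_args <- function(penalty = "l2", solver = "liblinear",
                                multi_class = "ovr", dual = FALSE) {
  l2_only <- c("newton-cg", "lbfgs", "sag")
  problems <- character(0)

  # newton-cg, lbfgs, and sag support only the l2 penalty.
  if (solver %in% l2_only && penalty != "l2") {
    problems <- c(problems,
                  sprintf("solver '%s' supports only the l2 penalty", solver))
  }
  # The multinomial loss is handled only by newton-cg, sag, and lbfgs.
  if (multi_class == "multinomial" && !(solver %in% l2_only)) {
    problems <- c(problems,
                  "multinomial requires newton-cg, sag, or lbfgs")
  }
  # The dual formulation exists only for l2 with liblinear.
  if (dual && !(solver == "liblinear" && penalty == "l2")) {
    problems <- c(problems,
                  "dual = TRUE requires penalty = 'l2' with solver = 'liblinear'")
  }
  problems
}

check_logistic_args(penalty = "l1", solver = "liblinear")
check_logistic_args(penalty = "l1", solver = "sag")
check_logistic_args(solver = "liblinear", multi_class = "multinomial")
```

The first call reports no problems; the other two each report one.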

fit_params

Optional, a mapping from parameter names in the model's fit method to the column names which hold the data, e.g. list(sample_weight = 'survey_weight_column').

cross_validation_parameters

Optional, parameter grid for learner parameters, e.g. list(n_estimators = c(100, 200, 500), learning_rate = c(0.01, 0.1), max_depth = c(2, 3)) or "hyperband" for supported models.
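For this model, a grid would typically vary parameters documented on this page, such as C and penalty. The sketch below builds such a list and, assuming the grid is searched exhaustively (an assumption, since this page does not say how combinations are evaluated), uses expand.grid to count the candidate models. The specific values are illustrative.

```r
# Candidate values for parameters documented on this page.
cv_params <- list(
  C = c(0.01, 1, 10, 100),
  penalty = c("l1", "l2")
)

# Assumption: every combination is tried, as in an exhaustive grid search.
grid <- expand.grid(cv_params, stringsAsFactors = FALSE)
nrow(grid)
head(grid)
```

The cv_params list is what would be passed as cross_validation_parameters; the grid data.frame is only a local aid for estimating how much work is being requested.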

calibration

Optional, if not NULL, calibrate output probabilities with the selected method, either "sigmoid" or "isotonic". Valid only with classification models.

oos_scores_table

Optional, if provided, store out-of-sample predictions on training set data to this Redshift "schema.tablename".

oos_scores_db

Optional, the name of the database where the oos_scores_table will be created. If not provided, this will default to database_name.

oos_scores_if_exists

Optional, action to take if oos_scores_table already exists. One of "fail", "append", "drop", or "truncate". The default is "fail".

model_name

Optional, the prefix of the Platform modeling jobs. It will have " Train" or " Predict" added to become the Script title.

cpu_requested

Optional, the number of CPU shares requested in the Civis Platform for training jobs or prediction child jobs. 1024 shares = 1 CPU.

memory_requested

Optional, the memory requested from Civis Platform for training jobs or prediction child jobs, in MiB.

disk_requested

Optional, the disk space requested on Civis Platform for training jobs or prediction child jobs, in GB.
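Because the three resource arguments use different units (CPU shares, MiB, and GB), it can help to compute them from more familiar quantities. The following is a small sketch with hypothetical sizes; the variable names cpus, memory_gib, and resources are illustrative.

```r
# Hypothetical request: 2 CPUs, 4 GiB of memory, 10 GB of disk.
cpus <- 2
memory_gib <- 4

resources <- list(
  cpu_requested = cpus * 1024,           # 1024 shares = 1 CPU
  memory_requested = memory_gib * 1024,  # MiB (1 GiB = 1024 MiB)
  disk_requested = 10                    # GB
)
str(resources)
```

These values would then be supplied as the cpu_requested, memory_requested, and disk_requested arguments of the training call.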

notifications

Optional, model status notifications. See scripts_post_custom for further documentation about email and URL notification.

polling_interval

The number of seconds to wait between checks for job completion.

verbose

Optional, if TRUE, supply debug outputs in Platform logs and make prediction child jobs visible.

Value

A civis_ml object, a list containing the following elements:

job

job metadata from scripts_get_custom.

run

run metadata from scripts_get_custom_runs.

outputs

CivisML metadata from scripts_list_custom_runs_outputs containing the locations of files produced by CivisML e.g. files, projects, metrics, model_info, logs, predictions, and estimators.

metrics

Parsed CivisML output from metrics.json containing metadata from validation. A list containing the following elements:

  • run list, metadata about the run.

  • data list, metadata about the training data.

  • model list, the fitted scikit-learn model with CV results.

  • metrics list, validation metrics (accuracy, confusion, ROC, AUC, etc).

  • warnings list.

  • data_platform list, training data location.

model_info

Parsed CivisML output from model_info.json containing metadata from training. A list containing the following elements:

  • run list, metadata about the run.

  • data list, metadata about the training data.

  • model list, the fitted scikit-learn model.

  • metrics empty list.

  • warnings list.

  • data_platform list, training data location.

Data Sources

For building models with civis_ml, the training data can reside in four different places: a file in the Civis Platform, a CSV or feather-format file on the local disk, a data.frame resident in the local R environment, or a table in the Civis Platform. Use the following helpers to specify the data source when calling civis_ml:

data.frame

civis_ml(x = df, ...)

local csv file

civis_ml(x = "path/to/data.csv", ...)

file in Civis Platform

civis_ml(x = civis_file(1234))

table in Civis Platform

civis_ml(x = civis_table(table_name = "schema.table", database_name = "database"))
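As an illustration of the local CSV option, the sketch below writes a data.frame to a temporary file and passes the path as x. The training call is guarded so that it only runs when the civis package is installed and a CIVIS_API_KEY environment variable is set; treating that variable as the credential check is an assumption of this sketch, and the temporary path is illustrative.

```r
# Write a small data.frame to a temporary CSV file.
df <- iris
names(df) <- gsub("\\.", "_", names(df))
csv_path <- tempfile(fileext = ".csv")
write.csv(df, csv_path, row.names = FALSE)

# Only call the Platform when the civis package and an API key are available.
can_run <- requireNamespace("civis", quietly = TRUE) &&
  nzchar(Sys.getenv("CIVIS_API_KEY"))

if (can_run) {
  # Train from the local CSV file instead of an in-memory data.frame.
  m <- civis::civis_ml_sparse_logistic(x = csv_path,
                                       dependent_variable = "Species")
}
```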

Examples

# NOT RUN {
df <- iris
names(df) <- gsub("\\.", "_", names(df))

m <- civis_ml_sparse_logistic(df, "Species")
yhat <- fetch_oos_scores(m)

# Grid Search
cv_params <- list(C = c(.01, 1, 10, 100, 1000))

m <- civis_ml_sparse_logistic(df, "Species",
  cross_validation_parameters = cv_params)

# Make a prediction job, storing the scores in a Redshift table
pred_info <- predict(m, newdata = civis_table("schema.table", "my_database"),
   output_table = "schema.scores_table")

# }