CivisML Sparse Logistic

civis_ml_sparse_logistic(
  x,
  dependent_variable,
  primary_key = NULL,
  excluded_columns = NULL,
  penalty = c("l2", "l1"),
  dual = FALSE,
  tol = 1e-08,
  C = 499999950,
  fit_intercept = TRUE,
  intercept_scaling = 1,
  class_weight = NULL,
  random_state = 42,
  solver = c("liblinear", "newton-cg", "lbfgs", "sag"),
  max_iter = 100,
  multi_class = c("ovr", "multinomial"),
  fit_params = NULL,
  cross_validation_parameters = NULL,
  calibration = NULL,
  oos_scores_table = NULL,
  oos_scores_db = NULL,
  oos_scores_if_exists = c("fail", "append", "drop", "truncate"),
  model_name = NULL,
  cpu_requested = NULL,
  memory_requested = NULL,
  disk_requested = NULL,
  notifications = NULL,
  polling_interval = NULL,
  verbose = FALSE,
  civisml_version = "prod"
)
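The `penalty`, `dual`, `solver`, and `multi_class` arguments constrain one another, as their descriptions under Arguments spell out. The following is a minimal sketch of those documented rules as a local pre-flight check. `check_sparse_logistic_args` is a hypothetical helper written for illustration; it is not part of the civis package and does not contact the Civis Platform.

```r
# Hypothetical helper (NOT part of the civis package): collects violations of
# the documented constraints between penalty, dual, solver, and multi_class.
check_sparse_logistic_args <- function(penalty = "l2", dual = FALSE,
                                       solver = "liblinear",
                                       multi_class = "ovr") {
  l2_only <- c("newton-cg", "sag", "lbfgs")
  problems <- character(0)

  # newton-cg, sag, and lbfgs support only the l2 penalty.
  if (solver %in% l2_only && penalty != "l2") {
    problems <- c(problems,
                  sprintf("solver '%s' supports only the l2 penalty", solver))
  }
  # The dual formulation is implemented only for l2 with liblinear.
  if (dual && !(penalty == "l2" && solver == "liblinear")) {
    problems <- c(problems,
                  "dual = TRUE requires penalty = 'l2' and solver = 'liblinear'")
  }
  # liblinear is limited to one-versus-rest schemes.
  if (multi_class == "multinomial" && solver == "liblinear") {
    problems <- c(problems,
                  "multi_class = 'multinomial' requires newton-cg, sag, or lbfgs")
  }
  problems
}

# An empty character vector means no documented constraint is violated.
check_sparse_logistic_args(penalty = "l1", solver = "liblinear")
# A non-empty result lists each violated constraint.
check_sparse_logistic_args(penalty = "l1", solver = "sag")
```

Running a check like this before calling `civis_ml_sparse_logistic` can surface an invalid combination without waiting for a Platform job to fail.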

Arguments

x	See the Data Sources section below.
dependent_variable	The dependent variable of the training dataset. For a multi-target problem, this should be a vector of column names of dependent variables. Nulls in a single dependent variable will automatically be dropped.
primary_key	Optional, the unique ID (primary key) of the training dataset. This will be used to index the out-of-sample scores. In `predict.civis_ml`, the default `primary_key = NA` reuses the primary_key of the training task. Use `primary_key = NULL` to explicitly indicate that the data have no primary_key.
excluded_columns	Optional, a vector of columns which will be considered ineligible to be independent variables.
penalty	Used to specify the norm used in the penalization. The `newton-cg`, `sag`, and `lbfgs` solvers support only l2 penalties.
dual	Dual or primal formulation. Dual formulation is only implemented for `l2` penalty with the `liblinear` solver. `dual = FALSE` should be preferred when n_samples > n_features.
tol	Tolerance for stopping criteria.
C	Inverse of regularization strength, must be a positive float. Smaller values specify stronger regularization.
fit_intercept	Whether a constant (intercept) term should be included in the model.
intercept_scaling	Useful only when `solver = "liblinear"` and `fit_intercept = TRUE`. In this case, a constant term with the value `intercept_scaling` is added to the design matrix.
class_weight	A `list` with `class_label = value` pairs, or `"balanced"`. When `class_weight = "balanced"`, the class weights will be inversely proportional to the class frequencies in the input data as: $$ \frac{n\_samples}{n\_classes \times table(y)} $$ Note that the class weights are multiplied by `sample_weight` (passed via `fit_params`) if `sample_weight` is specified.
random_state	The seed of the random number generator to use when shuffling the data. Used only in `solver = "sag"` and `solver = "liblinear"`.
solver	Algorithm to use in the optimization problem. For small datasets, `liblinear` is a good choice. `sag` is faster for larger problems. For multiclass problems, only `newton-cg`, `sag`, and `lbfgs` handle multinomial loss. `liblinear` is limited to one-versus-rest schemes. `newton-cg`, `lbfgs`, and `sag` only handle the `l2` penalty. Note that the fast convergence of `sag` is only guaranteed on features with approximately the same scale.
max_iter	The maximum number of iterations taken for the solvers to converge. Useful for the `newton-cg`, `sag`, and `lbfgs` solvers.
multi_class	The scheme for multi-class problems. When `ovr`, then a binary problem is fit for each label. When `multinomial`, a single model is fit minimizing the multinomial loss. Note, `multinomial` only works with the `newton-cg`, `sag`, and `lbfgs` solvers.
fit_params	Optional, a mapping from parameter names in the model's `fit` method to the column names which hold the data, e.g. `list(sample_weight = 'survey_weight_column')`.
cross_validation_parameters	Optional, parameter grid for learner parameters, e.g. `list(n_estimators = c(100, 200, 500), learning_rate = c(0.01, 0.1), max_depth = c(2, 3))` or `"hyperband"` for supported models.
calibration	Optional, if not `NULL`, calibrate output probabilities with the selected method, `sigmoid`, or `isotonic`. Valid only with classification models.
oos_scores_table	Optional, if provided, store out-of-sample predictions on training set data to this Redshift "schema.tablename".
oos_scores_db	Optional, the name of the database where the `oos_scores_table` will be created. If not provided, this will default to `database_name`.
oos_scores_if_exists	Optional, action to take if `oos_scores_table` already exists. One of `"fail"`, `"append"`, `"drop"`, or `"truncate"`. The default is `"fail"`.
model_name	Optional, the prefix of the Platform modeling jobs. It will have `" Train"` or `" Predict"` added to become the Script title.
cpu_requested	Optional, the number of CPU shares requested in the Civis Platform for training jobs or prediction child jobs. 1024 shares = 1 CPU.
memory_requested	Optional, the memory requested from Civis Platform for training jobs or prediction child jobs, in MiB.
disk_requested	Optional, the disk space requested on Civis Platform for training jobs or prediction child jobs, in GB.
notifications	Optional, model status notifications. See `scripts_post_custom` for further documentation about email and URL notification.
polling_interval	The number of seconds to wait between checks for job completion.
verbose	Optional, if `TRUE`, supply debug outputs in Platform logs and make prediction child jobs visible.
civisml_version	Optional, a length-one character vector of the CivisML version. The default is "prod", the latest version in production.
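
As a worked example of the `class_weight = "balanced"` formula given above, the weights can be reproduced locally in base R. This is only a sketch for building intuition: the label vector `y` is made-up toy data, and the computation happens in the local session rather than in CivisML.

```r
# Toy, deliberately imbalanced labels (illustrative data, not from CivisML).
y <- c(rep("a", 6), rep("b", 3), rep("c", 1))

n_samples <- length(y)          # 10 observations
n_classes <- length(unique(y))  # 3 distinct classes

# n_samples / (n_classes * table(y)): rarer classes receive larger weights.
balanced_weights <- n_samples / (n_classes * table(y))

# Rounded weights: a = 0.556, b = 1.111, c = 3.333
print(round(balanced_weights, 3))
```

The rarest class `c` ends up weighted six times as heavily as the most common class `a`, which is the inverse of their 1:6 frequency ratio.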

Value

A civis_ml object, a list containing the following elements:

job

job metadata from scripts_get_custom.

run

run metadata from scripts_get_custom_runs.

outputs

CivisML metadata from scripts_list_custom_runs_outputs containing the locations of files produced by CivisML, e.g. files, projects, metrics, model_info, logs, predictions, and estimators.

metrics

Parsed CivisML output from metrics.json containing metadata from validation. A list containing the following elements:

run list, metadata about the run.
data list, metadata about the training data.
model list, the fitted scikit-learn model with CV results.
metrics list, validation metrics (accuracy, confusion, ROC, AUC, etc).
warnings list.
data_platform list, training data location.

model_info

Parsed CivisML output from model_info.json containing metadata from training. A list containing the following elements:

run list, metadata about the run.
data list, metadata about the training data.
model list, the fitted scikit-learn model.
metrics empty list.
warnings list.
data_platform list, training data location.

Data Sources

For building models with civis_ml, the training data can reside in four different places: a file in the Civis Platform, a CSV or feather-format file on the local disk, a data.frame resident in the local R environment, or a table in the Civis Platform. Use the following helpers to specify the data source when calling civis_ml:

data.frame

civis_ml(x = df, ...)

local csv file

civis_ml(x = "path/to/data.csv", ...)

file in Civis Platform

civis_ml(x = civis_file(1234))

table in Civis Platform

civis_ml(x = civis_table(table_name = "schema.table", database_name = "database"))

Examples

if (FALSE) {

df <- iris
names(df) <- gsub("\\.", "_", names(df))

m <- civis_ml_sparse_logistic(df, "Species")
yhat <- fetch_oos_scores(m)

# Grid Search
cv_params <- list(C = c(.01, 1, 10, 100, 1000))

m <- civis_ml_sparse_logistic(df, "Species",
  cross_validation_parameters = cv_params)

# Make a prediction job, storing the scores in a Redshift table
pred_info <- predict(m, newdata = civis_table("schema.table", "my_database"),
   output_table = "schema.scores_table")

}
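
To gauge how much work a grid search implies before submitting it, the candidate settings can be enumerated locally. The sketch below assumes that a grid search evaluates every combination of the values in `cross_validation_parameters`, as an exhaustive grid search does. The grid itself is illustrative and uses only the `C` and `penalty` parameters documented above, with the default `liblinear` solver in mind, since that solver accepts both penalties.

```r
# Illustrative parameter grid (an assumption for this sketch, not a recommendation).
cv_params <- list(C = c(0.01, 1, 10, 100, 1000),
                  penalty = c("l1", "l2"))

# expand.grid builds one row per combination of the listed values.
grid <- expand.grid(cv_params, stringsAsFactors = FALSE)

# Number of candidate models: 5 values of C x 2 penalties = 10.
nrow(grid)
```

Adding a third parameter multiplies the number of candidates again, so it is worth checking the count before launching a large search.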