CivisML Sparse Ridge Regression

civis_ml_sparse_ridge_regressor(
  x,
  dependent_variable,
  primary_key = NULL,
  excluded_columns = NULL,
  alpha = 1,
  fit_intercept = TRUE,
  normalize = FALSE,
  max_iter = NULL,
  tol = 0.001,
  solver = c("auto", "svd", "cholesky", "lsqr", "sparse_cg", "sag"),
  random_state = 42,
  fit_params = NULL,
  cross_validation_parameters = NULL,
  oos_scores_table = NULL,
  oos_scores_db = NULL,
  oos_scores_if_exists = c("fail", "append", "drop", "truncate"),
  model_name = NULL,
  cpu_requested = NULL,
  memory_requested = NULL,
  disk_requested = NULL,
  notifications = NULL,
  polling_interval = NULL,
  verbose = FALSE,
  civisml_version = "prod"
)

Arguments

x	See the Data Sources section below.
dependent_variable	The dependent variable of the training dataset. For a multi-target problem, this should be a vector of column names of dependent variables. Nulls in a single dependent variable will automatically be dropped.
primary_key	Optional, the unique ID (primary key) of the training dataset. This will be used to index the out-of-sample scores. In `predict.civis_ml`, the default `primary_key = NA` reuses the primary key of the training task. Use `primary_key = NULL` to explicitly indicate that the data have no primary key.
excluded_columns	Optional, a vector of columns which will be considered ineligible to be independent variables.
alpha	The regularization strength. Must be a single float or a vector of floats of length n_targets. Larger values specify stronger regularization.
fit_intercept	Whether to include an intercept term in the model. If `FALSE`, no intercept is included and the data are expected to be already centered.
normalize	If `TRUE`, the regressors will be normalized before fitting the model. `normalize` is ignored when `fit_intercept = FALSE`.
max_iter	Maximum number of iterations for the conjugate gradient solver. For the `sparse_cg` and `lsqr` solvers, the default value is predetermined. For the `sag` solver, the default value is 1000.
tol	Precision of the solution.
solver	Solver to use for the optimization problem. `auto` chooses the solver automatically based on the type of data. `svd` uses a Singular Value Decomposition of X to compute the ridge coefficients; it is more stable for singular matrices than `cholesky`. `cholesky` uses the standard decomposition to obtain a closed-form solution. `sparse_cg` uses the conjugate gradient solver; as an iterative algorithm, it is more appropriate than `cholesky` for large-scale data. `lsqr` uses the dedicated regularized least-squares routine. `sag` uses Stochastic Average Gradient descent; it is also iterative, and is often faster than the other solvers when both n_samples and n_features are large. Note that fast convergence of `sag` is only guaranteed on features with approximately the same scale.
random_state	The seed of the pseudo random number generator to use when shuffling the data. Used only when `solver = "sag"`.
fit_params	Optional, a mapping from parameter names in the model's `fit` method to the column names which hold the data, e.g. `list(sample_weight = 'survey_weight_column')`.
cross_validation_parameters	Optional, parameter grid for learner parameters, e.g. `list(n_estimators = c(100, 200, 500), learning_rate = c(0.01, 0.1), max_depth = c(2, 3))` or `"hyperband"` for supported models.
oos_scores_table	Optional, if provided, store out-of-sample predictions on training set data to this Redshift "schema.tablename".
oos_scores_db	Optional, the name of the database where the `oos_scores_table` will be created. If not provided, this will default to `database_name`.
oos_scores_if_exists	Optional, action to take if `oos_scores_table` already exists. One of `"fail"`, `"append"`, `"drop"`, or `"truncate"`. The default is `"fail"`.
model_name	Optional, the prefix of the Platform modeling jobs. It will have `" Train"` or `" Predict"` added to become the Script title.
cpu_requested	Optional, the number of CPU shares requested in the Civis Platform for training jobs or prediction child jobs. 1024 shares = 1 CPU.
memory_requested	Optional, the memory requested from Civis Platform for training jobs or prediction child jobs, in MiB.
disk_requested	Optional, the disk space requested on Civis Platform for training jobs or prediction child jobs, in GB.
notifications	Optional, model status notifications. See `scripts_post_custom` for further documentation about email and URL notification.
polling_interval	Number of seconds to wait between checks for job completion.
verbose	Optional, if `TRUE`, supply debug outputs in Platform logs and make prediction child jobs visible.
civisml_version	Optional, a one-length character vector of the CivisML version. The default is `"prod"`, the latest version in production.
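
To make the effect of `alpha` concrete, the following base R sketch solves a ridge problem in closed form on a small simulated dataset. It does not call CivisML. It assumes the textbook ridge objective (squared error plus `alpha` times the squared coefficient norm) on centered data, which is presumably what the underlying estimator minimizes, and the helper `ridge_coef` is hypothetical rather than part of the civis package.

```r
# Closed-form ridge solution on centered data: b = (X'X + alpha * I)^-1 X'y.
# `ridge_coef` is a hypothetical helper for illustration only.
ridge_coef <- function(X, y, alpha) {
  Xc <- scale(X, center = TRUE, scale = FALSE)  # centering stands in for the intercept
  yc <- y - mean(y)
  p <- ncol(Xc)
  drop(solve(crossprod(Xc) + alpha * diag(p), crossprod(Xc, yc)))
}

# Simulated data: 50 rows, 4 features, known coefficients plus noise.
set.seed(42)
X <- matrix(rnorm(200), nrow = 50, ncol = 4)
y <- drop(X %*% c(2, -1, 0.5, 0)) + rnorm(50, sd = 0.1)

# Euclidean norm of the fitted coefficients for increasing alpha.
alphas <- c(0.001, 1, 100, 10000)
coef_norms <- sapply(alphas, function(a) sqrt(sum(ridge_coef(X, y, a)^2)))
names(coef_norms) <- alphas
print(coef_norms)  # the norm shrinks toward zero as alpha grows
```

A grid such as `c(0.001, 1, 100, 10000)` is also the kind of value list that could be passed as `alpha` in `cross_validation_parameters`.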

Value

A civis_ml object, a list containing the following elements:

job

job metadata from scripts_get_custom.

run

run metadata from scripts_get_custom_runs.

outputs

CivisML metadata from scripts_list_custom_runs_outputs containing the locations of files produced by CivisML e.g. files, projects, metrics, model_info, logs, predictions, and estimators.

metrics

Parsed CivisML output from metrics.json containing metadata from validation. A list containing the following elements:

run list, metadata about the run.
data list, metadata about the training data.
model list, the fitted scikit-learn model with CV results.
metrics list, validation metrics (accuracy, confusion, ROC, AUC, etc).
warnings list.
data_platform list, training data location.

model_info

Parsed CivisML output from model_info.json containing metadata from training. A list containing the following elements:

run list, metadata about the run.
data list, metadata about the training data.
model list, the fitted scikit-learn model.
metrics empty list.
warnings list.
data_platform list, training data location.
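
As a rough guide to navigating the returned object, the sketch below builds a stand-in list that has only the element names documented above and shows how the nested pieces are reached with `$`. The stand-in is an assumption for illustration: a real object is returned by `civis_ml_sparse_ridge_regressor` and carries actual metadata rather than empty lists.

```r
# Stand-in for a fitted model: a plain list with the documented element names.
# The empty lists are placeholders, not real CivisML output.
sections <- list(run = list(), data = list(), model = list(),
                 metrics = list(), warnings = list(), data_platform = list())
m <- list(job = list(), run = list(), outputs = list(),
          metrics = sections, model_info = sections)

names(m)             # top-level elements of the object
names(m$metrics)     # sections parsed from metrics.json

validation_metrics <- m$metrics$metrics   # validation metrics
fitted_model <- m$model_info$model        # fitted model metadata from training
```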

Data Sources

For building models with civis_ml, the training data can reside in four different places: a file in the Civis Platform, a CSV or feather-format file on the local disk, a data.frame in the local R environment, or a table in the Civis Platform. Use the following helpers to specify the data source when calling civis_ml:

data.frame

civis_ml(x = df, ...)

local csv file

civis_ml(x = "path/to/data.csv", ...)

file in Civis Platform

civis_ml(x = civis_file(1234))

table in Civis Platform

civis_ml(x = civis_table(table_name = "schema.table", database_name = "database"))
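
The same four sources can be passed as `x` to `civis_ml_sparse_ridge_regressor`. The sketch below is illustrative only: the file ID `1234`, `"schema.table"`, and `"database"` are the placeholders used above, the CSV path is a temporary file created for the example, and the modeling calls are wrapped in `if (FALSE)` because they require Civis Platform access.

```r
# Write the example data to a local CSV so it can serve as a file-based source.
csv_path <- file.path(tempdir(), "chick_weight.csv")
write.csv(as.data.frame(ChickWeight), csv_path, row.names = FALSE)

if (FALSE) {
  library(civis)

  # data.frame in the local R environment
  m_df <- civis_ml_sparse_ridge_regressor(x = ChickWeight,
    dependent_variable = "weight")

  # local CSV file
  m_csv <- civis_ml_sparse_ridge_regressor(x = csv_path,
    dependent_variable = "weight")

  # file in Civis Platform (1234 is a placeholder file ID)
  m_file <- civis_ml_sparse_ridge_regressor(x = civis_file(1234),
    dependent_variable = "weight")

  # table in Civis Platform (placeholder table and database names)
  m_table <- civis_ml_sparse_ridge_regressor(
    x = civis_table(table_name = "schema.table", database_name = "database"),
    dependent_variable = "weight")
}
```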

Examples

if (FALSE) {
 data(ChickWeight)
 m <- civis_ml_sparse_ridge_regressor(ChickWeight, dependent_variable = "weight", alpha = 999)
 yhat <- fetch_oos_scores(m)

 # Grid search
 cv_params <- list(alpha = c(.001, .01, .1, 1))
 m <- civis_ml_sparse_ridge_regressor(ChickWeight,
   dependent_variable = "weight",
   cross_validation_parameters = cv_params)

 # Make a prediction job, storing the scores in a Redshift table
 pred_info <- predict(m, newdata = civis_table("schema.table", "my_database"),
   output_table = "schema.scores_table")
}