CivisML Sparse Linear Regression

Usage

civis_ml_sparse_linear_regressor(x, dependent_variable,
  primary_key = NULL, excluded_columns = NULL, fit_intercept = TRUE,
  normalize = FALSE, fit_params = NULL,
  cross_validation_parameters = NULL, oos_scores_table = NULL,
  oos_scores_db = NULL, oos_scores_if_exists = c("fail", "append",
  "drop", "truncate"), model_name = NULL, cpu_requested = NULL,
  memory_requested = NULL, disk_requested = NULL,
  notifications = NULL, polling_interval = NULL, verbose = FALSE)



Arguments

x

See the Data Sources section below.


dependent_variable

The dependent variable of the training dataset. For a multi-target problem, this should be a vector of column names of dependent variables. Nulls in a single dependent variable will automatically be dropped.


primary_key

Optional, the unique ID (primary key) of the training dataset. This will be used to index the out-of-sample scores. In predict.civis_ml, the default primary_key = NA uses the primary_key of the training task. Use primary_key = NULL to explicitly indicate the data have no primary_key.


excluded_columns

Optional, a vector of columns which will be considered ineligible to be independent variables.


fit_intercept

Whether to include an intercept term in the model. If FALSE, no intercept is included and the data are expected to be centered already.


normalize

If TRUE, the regressors will be normalized before fitting the model. normalize is ignored when fit_intercept = FALSE.


fit_params

Optional, a mapping from parameter names in the model's fit method to the column names which hold the data, e.g. list(sample_weight = 'survey_weight_column').


cross_validation_parameters

Optional, parameter grid for learner parameters, e.g. list(n_estimators = c(100, 200, 500), learning_rate = c(0.01, 0.1), max_depth = c(2, 3)) or "hyperband" for supported models.


oos_scores_table

Optional, if provided, store out-of-sample predictions on training set data in this Redshift table, given as "schema.tablename".


oos_scores_db

Optional, the name of the database where the oos_scores_table will be created. If not provided, this will default to database_name.


oos_scores_if_exists

Optional, action to take if oos_scores_table already exists. One of "fail", "append", "drop", or "truncate". The default is "fail".


model_name

Optional, the prefix of the Platform modeling jobs. " Train" or " Predict" is appended to form the Script title.


cpu_requested

Optional, the number of CPU shares requested in the Civis Platform for training jobs or prediction child jobs. 1024 shares = 1 CPU.


memory_requested

Optional, the memory requested from Civis Platform for training jobs or prediction child jobs, in MiB.


disk_requested

Optional, the disk space requested on Civis Platform for training jobs or prediction child jobs, in GB.


notifications

Optional, model status notifications. See scripts_post_custom for further documentation about email and URL notification.


polling_interval

The number of seconds to wait between checks for job completion.


verbose

Optional, if TRUE, supply debug outputs in Platform logs and make prediction child jobs visible.
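The arguments above can be combined as in the following sketch. It is illustrative only: the toy data, the column names (id, x1, x2, note, y), and the resource values are assumptions made for this example, not recommendations. The training call is guarded so that it is only attempted when the civis package is installed and an API key is set in the CIVIS_API_KEY environment variable.

```r
# Toy training data built locally; all column names here are illustrative.
set.seed(1)
train <- data.frame(id = 1:50, x1 = rnorm(50), x2 = rnorm(50),
                    note = "not a feature", stringsAsFactors = FALSE)
train$y <- 2 * train$x1 - train$x2 + rnorm(50, sd = 0.1)

# Collect the arguments in a list so they can be inspected before submitting.
train_args <- list(
  x = train,
  dependent_variable = "y",
  primary_key = "id",          # indexes the out-of-sample scores
  excluded_columns = "note",   # never used as an independent variable
  fit_intercept = TRUE,
  cpu_requested = 1024,        # 1024 shares = 1 CPU
  memory_requested = 2048,     # MiB
  disk_requested = 5,          # GB
  polling_interval = 10        # seconds between completion checks
)

# Submit only when the civis package and an API key are available.
if (requireNamespace("civis", quietly = TRUE) &&
    nzchar(Sys.getenv("CIVIS_API_KEY"))) {
  m <- do.call(civis::civis_ml_sparse_linear_regressor, train_args)
}
```

Building the argument list first makes it easy to check that every referenced column exists in the training data before a Platform job is started.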


Value

A civis_ml object, a list containing the following elements:


job

job metadata from scripts_get_custom.


run

run metadata from scripts_get_custom_runs.


outputs

CivisML metadata from scripts_list_custom_runs_outputs containing the locations of files produced by CivisML, e.g. files, projects, metrics, model_info, logs, predictions, and estimators.


metrics

Parsed CivisML output from metrics.json containing metadata from validation. A list containing the following elements:

  • run list, metadata about the run.

  • data list, metadata about the training data.

  • model list, the fitted scikit-learn model with CV results.

  • metrics list, validation metrics (accuracy, confusion, ROC, AUC, etc).

  • warnings list.

  • data_platform list, training data location.


model_info

Parsed CivisML output from model_info.json containing metadata from training. A list containing the following elements:

  • run list, metadata about the run.

  • data list, metadata about the training data.

  • model list, the fitted scikit-learn model.

  • metrics empty list.

  • warnings list.

  • data_platform list, training data location.
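The nested structure described above can be navigated with ordinary list indexing. The sketch below uses a hand-built stand-in rather than a real fitted model, and it assumes the top-level elements are named job, run, outputs, metrics, and model_info; the helper functions and the metric name example_metric are hypothetical and exist only for illustration.

```r
# Stand-in for a fitted civis_ml object: only the documented element names
# are reproduced; the contents are placeholders, not real CivisML output.
empty_info <- list(run = list(), data = list(), model = list(),
                   metrics = list(), warnings = list(), data_platform = list())
m_stub <- list(job = list(), run = list(), outputs = list(),
               metrics = empty_info, model_info = empty_info)
m_stub$metrics$metrics <- list(example_metric = 0.5)  # hypothetical metric

# Names of the validation metrics stored under metrics$metrics.
metric_names <- function(m) names(m$metrics$metrics)

# Warnings collected from both validation (metrics) and training (model_info).
all_warnings <- function(m) c(m$metrics$warnings, m$model_info$warnings)

metric_names(m_stub)
length(all_warnings(m_stub))
```

With a real model, the same helpers would be applied to the object returned by the training call instead of m_stub.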

Data Sources

For building models with civis_ml, the training data can reside in four different places: a file in the Civis Platform, a CSV or feather-format file on the local disk, a data.frame resident in the local R environment, or a table in the Civis Platform. Use the following helpers to specify the data source when calling civis_ml:


data.frame

civis_ml(x = df, ...)

local csv file

civis_ml(x = "path/to/data.csv", ...)

file in Civis Platform

civis_ml(x = civis_file(1234))

table in Civis Platform

civis_ml(x = civis_table(table_name = "schema.table", database_name = "database"))
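As a minimal sketch of the two local source types, the same built-in dataset can be supplied either as a data.frame or as a path to a CSV file. The temporary file and the sources list are assumptions made for this example, and the training call is guarded so that it is only attempted when the civis package is installed and CIVIS_API_KEY is set.

```r
# Write a built-in dataset to a temporary CSV so it can act as a local file source.
chick_df <- as.data.frame(ChickWeight)
csv_path <- tempfile(fileext = ".csv")
write.csv(chick_df, csv_path, row.names = FALSE)

# The same data expressed as two of the four source types.
sources <- list(
  data_frame = chick_df,  # data.frame in the local R environment
  local_csv  = csv_path   # CSV file on the local disk
)

# Train from the local CSV only when the civis package and an API key are available.
if (requireNamespace("civis", quietly = TRUE) &&
    nzchar(Sys.getenv("CIVIS_API_KEY"))) {
  m <- civis::civis_ml_sparse_linear_regressor(sources$local_csv,
                                               dependent_variable = "weight")
}
```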


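Examples

# Requires a Civis Platform API key; not run.
library(civis)

# Train on a local data.frame.
m <- civis_ml_sparse_linear_regressor(ChickWeight, dependent_variable = "weight")
# Fetch the out-of-sample scores from training.
yhat <- fetch_oos_scores(m)

# Make a prediction job, storing the scores in a Redshift table.
pred_info <- predict(m, newdata = civis_table("schema.table", "my_database"),
   output_table = "schema.scores_table")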