Interface for modeling in the Civis Platform

An interface for training and scoring data on Civis Platform using a set of Scikit-Learn estimators.

civis_ml(
  x,
  dependent_variable,
  model_type,
  primary_key = NULL,
  excluded_columns = NULL,
  parameters = NULL,
  fit_params = NULL,
  cross_validation_parameters = NULL,
  calibration = NULL,
  oos_scores_table = NULL,
  oos_scores_db = NULL,
  oos_scores_if_exists = c("fail", "append", "drop", "truncate"),
  model_name = NULL,
  cpu_requested = NULL,
  memory_requested = NULL,
  disk_requested = NULL,
  notifications = NULL,
  polling_interval = NULL,
  validation_data = c("train", "skip"),
  n_jobs = NULL,
  verbose = FALSE,
  civisml_version = "prod"
)

civis_ml_fetch_existing(model_id, run_id = NULL)

# S3 method for civis_ml
predict(
  object,
  newdata,
  primary_key = NA,
  output_table = NULL,
  output_db = NULL,
  if_output_exists = c("fail", "append", "drop", "truncate"),
  n_jobs = NULL,
  cpu_requested = NULL,
  memory_requested = NULL,
  disk_requested = NULL,
  polling_interval = NULL,
  verbose = FALSE,
  dvs_to_predict = NULL,
  ...
)

Arguments

x, newdata	See the Data Sources section below.
dependent_variable	The dependent variable of the training dataset. For a multi-target problem, this should be a vector of column names of dependent variables. Nulls in a single dependent variable will automatically be dropped.
model_type	The name of the CivisML workflow. See the Workflows section below.
primary_key	Optional, the unique ID (primary key) of the training dataset. This will be used to index the out-of-sample scores. In `predict.civis_ml`, the default `primary_key = NA` reuses the primary_key of the training task. Use `primary_key = NULL` to explicitly indicate the data have no primary_key.
excluded_columns	Optional, a vector of columns which will be considered ineligible to be independent variables.
parameters	Optional, parameters for the final stage estimator in a predefined model, e.g. `list(C = 2)` for a "sparse_logistic" model.
fit_params	Optional, a mapping from parameter names in the model's `fit` method to the column names which hold the data, e.g. `list(sample_weight = 'survey_weight_column')`.
cross_validation_parameters	Optional, parameter grid for learner parameters, e.g. `list(n_estimators = c(100, 200, 500), learning_rate = c(0.01, 0.1), max_depth = c(2, 3))` or `"hyperband"` for supported models.
calibration	Optional, if not `NULL`, calibrate output probabilities with the selected method, either `sigmoid` or `isotonic`. Valid only with classification models.
oos_scores_table	Optional, if provided, store out-of-sample predictions on training set data to this Redshift "schema.tablename".
oos_scores_db	Optional, the name of the database where the `oos_scores_table` will be created. If not provided, this will default to `database_name`.
oos_scores_if_exists	Optional, action to take if `oos_scores_table` already exists. One of `"fail"`, `"append"`, `"drop"`, or `"truncate"`. The default is `"fail"`.
model_name	Optional, the prefix of the Platform modeling jobs. It will have `" Train"` or `" Predict"` added to become the Script title.
cpu_requested	Optional, the number of CPU shares requested in the Civis Platform for training jobs or prediction child jobs. 1024 shares = 1 CPU.
memory_requested	Optional, the memory requested from Civis Platform for training jobs or prediction child jobs, in MiB.
disk_requested	Optional, the disk space requested on Civis Platform for training jobs or prediction child jobs, in GB.
notifications	Optional, model status notifications. See `scripts_post_custom` for further documentation about email and URL notification.
polling_interval	The number of seconds to wait between checks for job completion.
validation_data	Optional, source for validation data. There are currently two options: `train` (the default), which uses training data for validation, and `skip`, which skips the validation step.
n_jobs	Number of concurrent Platform jobs to use for training and validation, or multi-file / large table prediction. Defaults to `NULL`, which allows CivisML to dynamically calculate an appropriate number of workers to use (in general, as many as possible without using all resources in the cluster).
verbose	Optional, if `TRUE`, supply debug outputs in Platform logs and make prediction child jobs visible.
civisml_version	Optional, a one-length character vector of the CivisML version. The default is "prod", the latest version in production.
model_id	The `id` of a CivisML model built previously.
run_id	Optional, the `id` of a CivisML model run. If `NULL`, defaults to fetching the latest run.
object	A `civis_ml` object.
output_table	The table in which to put predictions.
output_db	The database containing `output_table`. If not provided, this will default to the `database_name` specified when the model was built.
if_output_exists	Action to take if the prediction table already exists. One of `"fail"`, `"append"`, `"drop"`, or `"truncate"`. The default is `"fail"`.
dvs_to_predict	Optional, for scoring, this should be a vector of column names of dependent variables to include in the output table. It must be a subset of the `dependent_variable` vector provided for training. The scores for the returned subset will be identical to the scores which those outputs would have had if all outputs were written, but ignoring some of the model's outputs will let predictions complete faster and use less disk space. If not provided, the entire model output will be written to the output table.
...	Unused
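
The three resource arguments use different units (CPU shares, MiB, and GB), which is easy to get wrong. The following is a minimal sketch of one way to keep the conversions in one place. `resource_request()` is a hypothetical helper written for this illustration and is not part of the civis package; the `civis_ml` call is wrapped in `if (FALSE)` because it needs Civis Platform credentials and a data frame `df`.

```r
# Hypothetical helper (not part of the civis package): convert
# human-friendly quantities into the units civis_ml expects.
resource_request <- function(cpus, memory_gib, disk_gb) {
  list(
    cpu_requested = as.integer(round(cpus * 1024)),          # 1024 shares = 1 CPU
    memory_requested = as.integer(round(memory_gib * 1024)), # GiB -> MiB
    disk_requested = disk_gb                                 # already in GB
  )
}

req <- resource_request(cpus = 2, memory_gib = 4, disk_gb = 10)
str(req)

if (FALSE) {
  # Requires the civis package and Civis Platform credentials.
  m <- civis_ml(df, dependent_variable = "Species",
                model_type = "random_forest_classifier",
                cpu_requested = req$cpu_requested,
                memory_requested = req$memory_requested,
                disk_requested = req$disk_requested)
}
```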

Value

A civis_ml object, a list containing the following elements:

job

job metadata from scripts_get_custom.

run

run metadata from scripts_get_custom_runs.

outputs

CivisML metadata from scripts_list_custom_runs_outputs containing the locations of files produced by CivisML e.g. files, projects, metrics, model_info, logs, predictions, and estimators.

metrics

Parsed CivisML output from metrics.json containing metadata from validation. A list containing the following elements:

run list, metadata about the run.
data list, metadata about the training data.
model list, the fitted scikit-learn model with CV results.
metrics list, validation metrics (accuracy, confusion, ROC, AUC, etc).
warnings list.
data_platform list, training data location.

model_info

Parsed CivisML output from model_info.json containing metadata from training. A list containing the following elements:

run list, metadata about the run.
data list, metadata about the training data.
model list, the fitted scikit-learn model.
metrics empty list.
warnings list.
data_platform list, training data location.

CivisML Workflows

You can use the following pre-defined models with civis_ml. All models start by imputing missing values with the mean of non-null values in a column. The "sparse_*" models include a LASSO regression step (using glmnet) to do feature selection before passing data to the final model. In some models, CivisML uses default parameters different from those in Scikit-Learn, as indicated in the "Altered Defaults" column. All models also have random_state=42.

Specific workflows can also be called directly using the R workflow functions.

Name	R Workflow	Model Type	Algorithm	Altered Defaults
`sparse_logistic`	`civis_ml_sparse_logistic`	classification	LogisticRegression	`C=499999950, tol=1e-08`
`gradient_boosting_classifier`	`civis_ml_gradient_boosting_classifier`	classification	GradientBoostingClassifier	`n_estimators=500, max_depth=2`
`random_forest_classifier`	`civis_ml_random_forest_classifier`	classification	RandomForestClassifier	`n_estimators=500`
`extra_trees_classifier`	`civis_ml_extra_trees_classifier`	classification	ExtraTreesClassifier	`n_estimators=500`
`multilayer_perceptron_classifier`		classification	muffnn.MLPClassifier
`stacking_classifier`		classification	StackedClassifier
`sparse_linear_regressor`	`civis_ml_sparse_linear_regressor`	regression	LinearRegression
`sparse_ridge_regressor`	`civis_ml_sparse_ridge_regressor`	regression	Ridge
`gradient_boosting_regressor`	`civis_ml_gradient_boosting_regressor`	regression	GradientBoostingRegressor	`n_estimators=500, max_depth=2`
`random_forest_regressor`	`civis_ml_random_forest_regressor`	regression	RandomForestRegressor	`n_estimators=500`
`extra_trees_regressor`	`civis_ml_extra_trees_regressor`	regression	ExtraTreesRegressor	`n_estimators=500`
`multilayer_perceptron_regressor`		regression	muffnn.MLPRegressor
`stacking_regressor`		regression	StackedRegressor

Model names can be easily accessed using the global variables CIVIS_ML_REGRESSORS and CIVIS_ML_CLASSIFIERS.
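
As a small illustration of how these names can be used, the sketch below checks a requested model_type before a job is submitted. The two vectors are local copies of the names in the table above, standing in for CIVIS_ML_CLASSIFIERS and CIVIS_ML_REGRESSORS so that the example runs without the civis package. `check_model_type()` is a hypothetical helper, not a function from the package.

```r
# Local stand-ins for CIVIS_ML_CLASSIFIERS / CIVIS_ML_REGRESSORS,
# copied from the workflow table above.
classifiers <- c("sparse_logistic", "gradient_boosting_classifier",
                 "random_forest_classifier", "extra_trees_classifier",
                 "multilayer_perceptron_classifier", "stacking_classifier")
regressors <- c("sparse_linear_regressor", "sparse_ridge_regressor",
                "gradient_boosting_regressor", "random_forest_regressor",
                "extra_trees_regressor", "multilayer_perceptron_regressor",
                "stacking_regressor")

# Hypothetical helper: return "classification" or "regression",
# or stop with an error for an unknown workflow name.
check_model_type <- function(model_type) {
  if (model_type %in% classifiers) return("classification")
  if (model_type %in% regressors) return("regression")
  stop("Unknown model_type: ", model_type)
}

check_model_type("sparse_logistic")
check_model_type("extra_trees_regressor")
```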

Stacking

The "stacking_classifier" model stacks the "gradient_boosting_classifier" and "random_forest_classifier" predefined models together with a glmnet.LogitNet(alpha=0, n_splits=4, max_iter=10000, tol=1e-5, scoring='log_loss'). Defaults for the predefined models are documented in ?civis_ml. Each column is first standardized, and then the model predictions are combined using LogisticRegressionCV with penalty='l2' and tol=1e-08. The "stacking_regressor" works similarly, stacking the "gradient_boosting_regressor" and "random_forest_regressor" models and a glmnet.ElasticNet(alpha=0, n_splits=4, max_iter=10000, tol=1e-5, scoring='r2'), and combining them using NonNegativeLinearRegression. The estimators that are being stacked have the same names as the associated pre-defined models, and the meta-estimator steps are named "meta-estimator". Note that although default parameters are provided for multilayer perceptron models, it is highly recommended that multilayer perceptrons be run using hyperband.

Hyperparameter Tuning

You can tune hyperparameters using one of two methods: grid search or hyperband. CivisML will perform grid search if you pass a list of hyperparameters to the cross_validation_parameters argument, where the list names are hyperparameter names and the values are vectors of hyperparameter values to search over. You can run hyperparameter optimization in parallel by setting the n_jobs parameter to the number of jobs you would like to run concurrently. By default, n_jobs is dynamically calculated based on the resources available on your cluster, such that a modeling job will never take up more than 90% of the cluster's resources.
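
Because grid search covers every combination of the listed values, the number of candidate settings grows multiplicatively with each hyperparameter. The sketch below uses base R's `expand.grid` to enumerate the grid from the cross_validation_parameters example in the Arguments section. The `civis_ml` call is wrapped in `if (FALSE)` because it assumes Civis Platform credentials and a data frame `df`.

```r
# Grid from the cross_validation_parameters example in the Arguments section.
grid <- list(
  n_estimators = c(100, 200, 500),
  learning_rate = c(0.01, 0.1),
  max_depth = c(2, 3)
)

# expand.grid enumerates every combination of the listed values.
combos <- expand.grid(grid)
nrow(combos)  # 12 candidate settings (3 * 2 * 2)

if (FALSE) {
  # Requires the civis package and Civis Platform credentials.
  m <- civis_ml(df, dependent_variable = "Species",
                model_type = "gradient_boosting_classifier",
                cross_validation_parameters = grid)
}
```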

Hyperband is an efficient approach to hyperparameter optimization, and recommended over grid search where possible. CivisML will perform hyperband optimization if you pass the string "hyperband" to cross_validation_parameters. Hyperband is currently only supported for the following models: "gradient_boosting_classifier", "random_forest_classifier", "extra_trees_classifier", "multilayer_perceptron_classifier", "stacking_classifier", "gradient_boosting_regressor", "random_forest_regressor", "extra_trees_regressor", "multilayer_perceptron_regressor", and "stacking_regressor".

Hyperband cannot be used to tune GLMs. For this reason, preset GLMs do not have a hyperband option. Similarly, when cross_validation_parameters='hyperband' and the model is stacking_classifier or stacking_regressor, only the GBT and random forest steps of the stacker are tuned using hyperband. For the specific distributions used in the predefined hyperband models, see the detailed table in the Python client documentation.
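
One way to act on this rule in a script is to request hyperband only when the chosen workflow supports it and to fall back to an explicit grid otherwise. This is a minimal sketch: `hyperband_models` is a local copy of the list above, and `choose_cv_parameters()` is a hypothetical helper, not part of the civis package.

```r
# Model types listed above as supporting hyperband.
hyperband_models <- c(
  "gradient_boosting_classifier", "random_forest_classifier",
  "extra_trees_classifier", "multilayer_perceptron_classifier",
  "stacking_classifier", "gradient_boosting_regressor",
  "random_forest_regressor", "extra_trees_regressor",
  "multilayer_perceptron_regressor", "stacking_regressor"
)

# Hypothetical helper: use hyperband when supported, else the given grid.
choose_cv_parameters <- function(model_type, fallback_grid = NULL) {
  if (model_type %in% hyperband_models) "hyperband" else fallback_grid
}

choose_cv_parameters("random_forest_classifier")
choose_cv_parameters("sparse_logistic", list(C = c(0.1, 1, 10)))
```

The returned value can be passed directly as cross_validation_parameters. A `NULL` result means no tuning is requested.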

Data Sources

For building models with civis_ml, the training data can reside in four different places: a file in the Civis Platform, a CSV or feather-format file on the local disk, a data.frame in the local R environment, or a table in the Civis Platform. Use the following helpers to specify the data source when calling civis_ml:

data.frame

civis_ml(x = df, ...)

local csv file

civis_ml(x = "path/to/data.csv", ...)

file in Civis Platform

civis_ml(x = civis_file(1234))

table in Civis Platform

civis_ml(x = civis_table(table_name = "schema.table", database_name = "database"))

Out of sample scores

Model outputs will always contain out-of-sample (or out-of-fold) scores, which are accessible through fetch_oos_scores. These may be stored in a Civis table on Redshift using the oos_scores_table, oos_scores_db, and oos_scores_if_exists parameters.

Predictions

A fitted model can be used to make predictions for data residing in any of the sources above, as well as in a civis_file_manifest. As with civis_ml, use the data source helpers as the newdata argument to predict.civis_ml.

A manifest file is a JSON file which specifies the location of many shards of the data to be used for prediction. A manifest file is the output of a Civis export job with force_multifile = TRUE set, e.g. from civis_to_multifile_csv. Large Civis tables (provided using table_name) will automatically be exported to manifest files.

Prediction outputs will always be stored as gzipped CSVs in one or more civis files. Provide an output_table (and optionally an output_db, if it's different from database_name) to copy these predictions into a table on Redshift.
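
For multi-target models, dvs_to_predict must be a subset of the dependent variables used in training, so it can be worth checking this locally before starting a prediction job. In the sketch below, the column names are illustrative only and `check_dvs_to_predict()` is a hypothetical helper, not part of the civis package. The `predict` call is wrapped in `if (FALSE)` because it assumes a fitted model `m` and Civis Platform credentials.

```r
# Dependent variables used at training time (illustrative column names).
trained_dvs <- c("turnout", "support", "persuasion")

# Hypothetical helper: stop early if any requested output was not trained.
check_dvs_to_predict <- function(dvs_to_predict, dependent_variable) {
  unknown <- setdiff(dvs_to_predict, dependent_variable)
  if (length(unknown) > 0) {
    stop("Not in dependent_variable: ", paste(unknown, collapse = ", "))
  }
  dvs_to_predict
}

dvs <- check_dvs_to_predict(c("turnout", "support"), trained_dvs)

if (FALSE) {
  # Requires the civis package, a fitted model m, and credentials.
  pred_job <- predict(m, civis_table("schema.table", "database_name"),
                      output_table = "schema.scores_table",
                      if_output_exists = "drop",
                      dvs_to_predict = dvs)
}
```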

Examples

if (FALSE) {
# From a data frame:
m <- civis_ml(df, model_type = "sparse_logistic",
              dependent_variable = "Species")

# From a table:
m <- civis_ml(civis_table("schema.table", "database_name"),
              model_type = "sparse_logistic", dependent_variable = "Species",
              oos_scores_table = "schema.scores_table",
              oos_scores_if_exists = "drop")

# From a local file:
m <- civis_ml("path/to/file.csv", model_type = "sparse_logistic",
              dependent_variable = "Species")

# From a Civis file:
file_id <- write_civis_file("path/to/file.csv", name = "file.csv")
m <- civis_ml(civis_file(file_id), model_type = "sparse_logistic",
              dependent_variable = "Species")

pred_job <- predict(m, newdata = df)
pred_job <- predict(m, civis_table("schema.table", "database_name"),
                    output_table = "schema.scores_table")
pred_job <- predict(m, civis_file(file_id),
                    output_table = "schema.scores_table")

m <- civis_ml_fetch_existing(model_id = m$job$id, run_id = m$run$id)
logs <- fetch_logs(m)
yhat <- fetch_oos_scores(m)
yhat <- fetch_predictions(pred_job)
}