An interface for training and scoring data on Civis Platform using a set of Scikit-Learn estimators.
civis_ml(
  x,
  dependent_variable,
  model_type,
  primary_key = NULL,
  excluded_columns = NULL,
  parameters = NULL,
  fit_params = NULL,
  cross_validation_parameters = NULL,
  calibration = NULL,
  oos_scores_table = NULL,
  oos_scores_db = NULL,
  oos_scores_if_exists = c("fail", "append", "drop", "truncate"),
  model_name = NULL,
  cpu_requested = NULL,
  memory_requested = NULL,
  disk_requested = NULL,
  notifications = NULL,
  polling_interval = NULL,
  validation_data = c("train", "skip"),
  n_jobs = NULL,
  verbose = FALSE,
  civisml_version = "prod"
)

civis_ml_fetch_existing(model_id, run_id = NULL)

# S3 method for civis_ml
predict(
  object,
  newdata,
  primary_key = NA,
  output_table = NULL,
  output_db = NULL,
  if_output_exists = c("fail", "append", "drop", "truncate"),
  n_jobs = NULL,
  cpu_requested = NULL,
  memory_requested = NULL,
  disk_requested = NULL,
  polling_interval = NULL,
  verbose = FALSE,
  dvs_to_predict = NULL,
  ...
)
x, newdata | See the Data Sources section below. |
---|---|
dependent_variable | The dependent variable of the training dataset. For a multi-target problem, this should be a vector of column names of dependent variables. Nulls in a single dependent variable will automatically be dropped. |
model_type | The name of the CivisML workflow. See the Workflows section below. |
primary_key | Optional, the unique ID (primary key) of the training dataset. This will be used to index the out-of-sample scores. In |
excluded_columns | Optional, a vector of columns which will be considered ineligible to be independent variables. |
parameters | Optional, parameters for the final stage estimator in a predefined model, e.g. |
fit_params | Optional, a mapping from parameter names in the model's |
cross_validation_parameters | Optional, parameter grid for learner parameters, e.g. |
calibration | Optional, if not |
oos_scores_table | Optional, if provided, store out-of-sample predictions on training set data to this Redshift "schema.tablename". |
oos_scores_db | Optional, the name of the database where the |
oos_scores_if_exists | Optional, action to take if |
model_name | Optional, the prefix of the Platform modeling jobs. It will have |
cpu_requested | Optional, the number of CPU shares requested in the Civis Platform for training jobs or prediction child jobs. 1024 shares = 1 CPU. |
memory_requested | Optional, the memory requested from Civis Platform for training jobs or prediction child jobs, in MiB. |
disk_requested | Optional, the disk space requested on Civis Platform for training jobs or prediction child jobs, in GB. |
notifications | Optional, model status notifications. See |
polling_interval | Number of seconds to wait between checks for job completion. |
validation_data | Optional, source for validation data. There are currently two options: |
n_jobs | Number of concurrent Platform jobs to use for training and validation, or multi-file / large table prediction. Defaults to |
verbose | Optional, If |
civisml_version | Optional, a length-one character vector giving the CivisML version. The default is "prod", the latest version in production. |
model_id | The |
run_id | Optional, the |
object | A |
output_table | The table in which to put predictions. |
output_db | The database containing |
if_output_exists | Action to take if the prediction table already exists. One of |
dvs_to_predict | Optional. For scoring, this should be a vector of column names of dependent variables to include in the output table. It must be a subset of the |
... | Unused |
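The resource arguments above use different units (CPU shares, MiB, and GB), so a small conversion step can help avoid mistakes. The following is only a sketch: `cpus_to_shares` and `gib_to_mib` are hypothetical helpers, not part of the civis package, and the requested sizes are arbitrary illustrative values.

```r
# Hypothetical helpers (not part of the civis package).

# cpu_requested is in CPU shares; 1024 shares = 1 CPU.
cpus_to_shares <- function(cpus) as.integer(round(cpus * 1024))

# memory_requested is in MiB; 1 GiB = 1024 MiB.
gib_to_mib <- function(gib) as.integer(round(gib * 1024))

cpu_requested <- cpus_to_shares(2)   # request 2 CPUs
memory_requested <- gib_to_mib(8)    # request 8 GiB of memory
disk_requested <- 10                 # disk is already expressed in GB
```

The resulting values could then be passed as the `cpu_requested`, `memory_requested`, and `disk_requested` arguments of `civis_ml` or `predict`.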
A civis_ml object, a list containing the following elements:
job metadata from scripts_get_custom.
run metadata from scripts_get_custom_runs.
CivisML metadata from scripts_list_custom_runs_outputs containing the locations of files produced by CivisML, e.g. files, projects, metrics, model_info, logs, predictions, and estimators.
Parsed CivisML output from metrics.json containing metadata from validation. A list containing the following elements:
  run list, metadata about the run.
  data list, metadata about the training data.
  model list, the fitted scikit-learn model with CV results.
  metrics list, validation metrics (accuracy, confusion, ROC, AUC, etc).
  warnings list.
  data_platform list, training data location.
Parsed CivisML output from model_info.json containing metadata from training. A list containing the following elements:
  run list, metadata about the run.
  data list, metadata about the training data.
  model list, the fitted scikit-learn model.
  metrics empty list.
  warnings list.
  data_platform list, training data location.
You can use the following pre-defined models with civis_ml. All models start by imputing missing values with the mean of non-null values in a column. The "sparse_*" models include a LASSO regression step (using glmnet) to do feature selection before passing data to the final model. In some models, CivisML uses default parameters that differ from those in Scikit-Learn, as indicated in the "Altered Defaults" column. All models also have random_state=42.
Specific workflows can also be called directly using the R workflow functions.
Name | R Workflow | Model Type | Algorithm | Altered Defaults |
---|---|---|---|---|
sparse_logistic | civis_ml_sparse_logistic | classification | LogisticRegression | C=499999950, tol=1e-08 |
gradient_boosting_classifier | civis_ml_gradient_boosting_classifier | classification | GradientBoostingClassifier | n_estimators=500, max_depth=2 |
random_forest_classifier | civis_ml_random_forest_classifier | classification | RandomForestClassifier | n_estimators=500 |
extra_trees_classifier | civis_ml_extra_trees_classifier | classification | ExtraTreesClassifier | n_estimators=500 |
multilayer_perceptron_classifier | | classification | muffnn.MLPClassifier | |
stacking_classifier | | classification | StackedClassifier | |
sparse_linear_regressor | civis_ml_sparse_linear_regressor | regression | LinearRegression | |
sparse_ridge_regressor | civis_ml_sparse_ridge_regressor | regression | Ridge | |
gradient_boosting_regressor | civis_ml_gradient_boosting_regressor | regression | GradientBoostingRegressor | n_estimators=500, max_depth=2 |
random_forest_regressor | civis_ml_random_forest_regressor | regression | RandomForestRegressor | n_estimators=500 |
extra_trees_regressor | civis_ml_extra_trees_regressor | regression | ExtraTreesRegressor | n_estimators=500 |
multilayer_perceptron_regressor | | regression | muffnn.MLPRegressor | |
stacking_regressor | | regression | StackedRegressor | |
Model names can be easily accessed using the global variables CIVIS_ML_REGRESSORS and CIVIS_ML_CLASSIFIERS.
The "stacking_classifier" model stacks the "gradient_boosting_classifier" and "random_forest_classifier" predefined models together with a glmnet.LogitNet(alpha=0, n_splits=4, max_iter=10000, tol=1e-5, scoring='log_loss'). Defaults for the predefined models are documented in ?civis_ml. Each column is first standardized, and then the model predictions are combined using LogisticRegressionCV with penalty='l2' and tol=1e-08. The "stacking_regressor" works similarly, stacking together the "gradient_boosting_regressor" and "random_forest_regressor" models and a glmnet.ElasticNet(alpha=0, n_splits=4, max_iter=10000, tol=1e-5, scoring='r2'), combining them using NonNegativeLinearRegression.
The estimators that are being stacked have the same names as the associated pre-defined models, and the meta-estimator steps are named "meta-estimator". Note that although default parameters are provided for multilayer perceptron models, it is highly recommended that multilayer perceptrons be run using hyperband.
You can tune hyperparameters using one of two methods: grid search or hyperband. CivisML will perform grid search if you pass a list of hyperparameters to the cross_validation_parameters parameter, where the list element names are hyperparameter names and the values are vectors of hyperparameter values to grid search over. You can run hyperparameter optimization in parallel by setting the n_jobs parameter to the number of jobs you would like to run in parallel. By default, n_jobs is dynamically calculated based on the resources available on your cluster, such that a modeling job will never take up more than 90% of the cluster resources at once.
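As a minimal sketch of the grid-search form described above, the list below maps hyperparameter names to vectors of candidate values. The hyperparameters and values are illustrative assumptions, not recommendations, and `expand.grid` is used only to count the combinations locally. The `civis_ml` call is wrapped in `if (FALSE)` because it needs Civis Platform access and a data frame `df` with a `Species` column, both of which are assumed here.

```r
# Illustrative grid: names are hyperparameters of the final estimator,
# values are vectors of candidate values to search over.
cv_params <- list(
  n_estimators = c(100, 500),
  max_depth = c(2, 3, 5)
)

# Enumerate the combinations a full grid search would evaluate.
grid <- expand.grid(cv_params)
n_combinations <- nrow(grid)

if (FALSE) {
  # Not run: requires Civis Platform access; `df` and "Species" are assumed.
  m <- civis_ml(df,
                dependent_variable = "Species",
                model_type = "gradient_boosting_classifier",
                cross_validation_parameters = cv_params,
                n_jobs = 2)
}
```

Counting the grid before submitting a job is a cheap way to see how quickly the search grows as more values are added to each hyperparameter.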
Hyperband is an efficient approach to hyperparameter optimization, and recommended over grid search where possible. CivisML will perform hyperband optimization if you pass the string "hyperband" to cross_validation_parameters. Hyperband is currently only supported for the following models: "gradient_boosting_classifier", "random_forest_classifier", "extra_trees_classifier", "multilayer_perceptron_classifier", "stacking_classifier", "gradient_boosting_regressor", "random_forest_regressor", "extra_trees_regressor", "multilayer_perceptron_regressor", and "stacking_regressor".
Hyperband cannot be used to tune GLMs. For this reason, preset GLMs do not have a hyperband option. Similarly, when cross_validation_parameters='hyperband' and the model is stacking_classifier or stacking_regressor, only the GBT and random forest steps of the stacker are tuned using hyperband. For the specific distributions used in the predefined hyperband models, see the detailed table in the Python client documentation.
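Because hyperband is only available for some model types, it can be convenient to decide programmatically what to pass as cross_validation_parameters. The sketch below is an assumption-laden illustration: `hyperband_models` simply copies the supported model names listed above, `choose_cv_parameters` is a hypothetical helper that is not part of the civis package, and the fallback grid for `C` is an arbitrary example.

```r
# Model types listed above as supporting hyperband.
hyperband_models <- c(
  "gradient_boosting_classifier", "random_forest_classifier",
  "extra_trees_classifier", "multilayer_perceptron_classifier",
  "stacking_classifier",
  "gradient_boosting_regressor", "random_forest_regressor",
  "extra_trees_regressor", "multilayer_perceptron_regressor",
  "stacking_regressor"
)

# Hypothetical helper (not part of the civis package): return "hyperband"
# when the model type supports it, otherwise the supplied grid (or NULL).
choose_cv_parameters <- function(model_type, fallback_grid = NULL) {
  if (model_type %in% hyperband_models) "hyperband" else fallback_grid
}

cv_rf <- choose_cv_parameters("random_forest_regressor")
cv_glm <- choose_cv_parameters("sparse_logistic", list(C = c(0.1, 1, 10)))
```

The returned value could then be supplied as the `cross_validation_parameters` argument of `civis_ml`.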
For building models with civis_ml, the training data can reside in four different places: a file in the Civis Platform, a CSV or feather-format file on the local disk, a data.frame resident in the local R environment, or a table in the Civis Platform. Use the following helpers to specify the data source when calling civis_ml:
data.frame: civis_ml(x = df, ...)
local CSV or feather file: civis_ml(x = "path/to/data.csv", ...)
file in Civis Platform: civis_ml(x = civis_file(1234))
table in Civis Platform: civis_ml(x = civis_table(table_name = "schema.table", database_name = "database"))
Model outputs will always contain out-of-sample (or out-of-fold) scores, which are accessible through fetch_oos_scores. These may be stored in a Civis table on Redshift using the oos_scores_table, oos_scores_db, and oos_scores_if_exists parameters.
A fitted model can be used to make predictions for data residing in any of the sources above, or in a civis_file_manifest. As with civis_ml, use the data source helpers as the newdata argument to predict.civis_ml.
A manifest file is a JSON file which specifies the location of many shards of the data to be used for prediction. A manifest file is the output of a Civis export job with force_multifile = TRUE set, e.g. from civis_to_multifile_csv. Large Civis tables (provided using table_name) will automatically be exported to manifest files.
Prediction outputs will always be stored as gzipped CSVs in one or more Civis files. Provide an output_table (and optionally an output_db, if it's different from database_name) to copy these predictions into a table on Redshift.
civis_file, civis_table, and civis_file_manifest for specifying data sources.
get_metric to access model validation metrics.
fetch_logs for retrieving logs for a (failed) model build, fetch_oos_scores for retrieving the out-of-sample (fold) scores for each training observation, and fetch_predictions for retrieving the predictions from a prediction job.
if (FALSE) {
# From a data frame:
m <- civis_ml(df, model_type = "sparse_logistic",
              dependent_variable = "Species")

# From a table:
m <- civis_ml(civis_table("schema.table", "database_name"),
              model_type = "sparse_logistic",
              dependent_variable = "Species",
              oos_scores_table = "schema.scores_table",
              oos_scores_if_exists = "drop")

# From a local file:
m <- civis_ml("path/to/file.csv", model_type = "sparse_logistic",
              dependent_variable = "Species")

# From a Civis file:
file_id <- write_civis_file("path/to/file.csv", name = "file.csv")
m <- civis_ml(civis_file(file_id), model_type = "sparse_logistic",
              dependent_variable = "Species")

# Predict on a data frame, a table, or a Civis file:
pred_job <- predict(m, newdata = df)
pred_job <- predict(m, civis_table("schema.table", "database_name"),
                    output_table = "schema.scores_table")
pred_job <- predict(m, civis_file(file_id),
                    output_table = "schema.scores_table")

# Fetch an existing model, then retrieve logs, scores, and predictions:
m <- civis_ml_fetch_existing(model_id = m$job$id, m$run$id)
logs <- fetch_logs(m)
yhat <- fetch_oos_scores(m)
yhat <- fetch_predictions(pred_job)
}