Split a Redshift table into multiple CSV parts on S3 — civis_to_multifile

Split a Redshift table into multiple CSV parts on S3 and return their locations as file IDs and presigned S3 urls.

civis_to_multifile_csv(
  sql,
  database,
  job_name = NULL,
  hidden = TRUE,
  include_header = TRUE,
  compression = "gzip",
  delimiter = ",",
  unquoted = FALSE,
  prefix = NULL
)

Arguments

sql	string, The SQL select string to be executed.
database	string, Name of database where query is run.
job_name	string, optional. Name to identify scripted sql job.
hidden	logical, Whether to hide the query in platform.
include_header	logical, optional. Whether to include headers as an element in the returned list.
compression	string, optional, Type of compression to use, if any. One of 'none', 'zip', or 'gzip'.
delimiter	string, optional. Which delimiter to use. One of ',', '\t' or '\|'.
unquoted	logical, optional. Whether or not to quote fields.
prefix	string, optional. A user specified filename prefix for the output files to have.

Value

A list with the items:

header: column headers if 'include_header' is TRUE
query: the executed query
entries: a list containing presigned urls for each csv part
compression: the type of compression on each csv part
delimiter: the delimiter used to separate fields
unquoted: whether fields are quoted

Details

When tables are large, unloading by splitting them first is faster. When we split a table, each Redshift compute node can dump its data into S3 in parallel with the others. By doing so, we avoid having all compute nodes sending the data through the leader node, which is slow.

This function returns a list that contains the location of the CSV parts as file IDs and presigned S3 urls. The user can use either the file IDs or the presigned S3 urls to download the CSV parts. The content of the list returned by this function is similar to that of the manifest file returned by Amazon S3 UNLOAD statements.

Examples

if (FALSE) {
# Download a table into multiple csv parts
sql <- "SELECT * FROM schema.table"
database <- "important_database"
manifest <- civis_to_multifile_csv(sql=sql, database=database)
files <- lapply(manifest[["entries"]], function(x) {
  download_civis(x$id, x$name)
  x$name
})
}