BigQuery¶
BigqueryUtility¶
class gwrappy.bigquery.BigqueryUtility(**kwargs)[source]¶
Initializes an object for interacting with the BigQuery API.
By default, Application Default Credentials are used. If the gcloud SDK isn't installed, credential files have to be specified using the kwargs json_credentials_path and client_id.
Parameters: - max_retries (integer) – Argument specified with each API call to natively handle retryable errors.
- client_secret_path – File path for the client secret JSON file. Only required if credentials are invalid or unavailable.
- json_credentials_path – File path for automatically generated credentials.
- client_id – Credentials are stored as a key-value pair per client_id to facilitate multiple clients using the same credentials file. For simplicity, using one's email address is sufficient.
list_projects(max_results=None, filter_exp=None)[source]¶
Abstraction of the projects().list() method with inbuilt iteration functionality. [https://cloud.google.com/bigquery/docs/reference/v2/projects/list]
Parameters: - max_results (integer) – If None, all results are iterated over and returned.
- filter_exp (function) – Function that filters entries if filter_exp evaluates to True.
Returns: List of dictionary objects representing project resources.
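The filter_exp parameter (shared by the other list_* methods below) is an ordinary predicate applied to each resource dict during iteration. A minimal sketch, using trimmed illustrative resource dicts rather than real API output:

```python
# filter_exp receives each project resource dict and keeps entries
# for which it returns True. The same predicate style applies to
# list_jobs, list_datasets and list_tables.
sample_projects = [
    {"id": "analytics-prod", "friendlyName": "Analytics (prod)"},
    {"id": "analytics-dev", "friendlyName": "Analytics (dev)"},
    {"id": "billing", "friendlyName": "Billing"},
]

def filter_exp(resource):
    return resource["id"].startswith("analytics")

# Equivalent to the filtering the utility applies while iterating:
filtered = [p for p in sample_projects if filter_exp(p)]
# keeps the two "analytics-*" projects
```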
list_jobs(project_id, state_filter=None, show_all=False, max_results=None, filter_exp=None)[source]¶
Abstraction of the jobs().list() method with inbuilt iteration functionality. [https://cloud.google.com/bigquery/docs/reference/v2/jobs/list]
Parameters: - project_id (string) – Unique project identifier.
- state_filter (string) – Pre-filter API request for job state. Acceptable values are “done”, “pending” and “running”. [Equivalent API param: stateFilter]
- show_all (boolean) – Whether to display jobs owned by all users in the project. [Equivalent API param: allUsers]
- max_results (integer) – If None, all results are iterated over and returned.
- filter_exp (function) – Function that filters entries if filter_exp evaluates to True.
Returns: List of dictionary objects representing job resources.
list_datasets(project_id, show_all=False, max_results=None, filter_exp=None)[source]¶
Abstraction of the datasets().list() method with inbuilt iteration functionality. [https://cloud.google.com/bigquery/docs/reference/v2/datasets/list]
Parameters: - project_id (string) – Unique project identifier.
- show_all (boolean) – Include hidden datasets generated when running queries in the UI.
- max_results (integer) – If None, all results are iterated over and returned.
- filter_exp (function) – Function that filters entries if filter_exp evaluates to True.
Returns: List of dictionary objects representing dataset resources.
list_tables(project_id, dataset_id, max_results=None, filter_exp=None)[source]¶
Abstraction of the tables().list() method with inbuilt iteration functionality. [https://cloud.google.com/bigquery/docs/reference/v2/tables/list]
Parameters: - project_id (string) – Unique project identifier.
- dataset_id (string) – Unique dataset identifier.
- max_results (integer) – If None, all results are iterated over and returned.
- filter_exp (function) – Function that filters entries if filter_exp evaluates to True.
Returns: List of dictionary objects representing table resources.
get_job(project_id, job_id)[source]¶
Abstraction of the jobs().get() method. [https://cloud.google.com/bigquery/docs/reference/v2/jobs/get]
Parameters: - project_id (string) – Unique project identifier.
- job_id (string) – Unique job identifier.
Returns: Dictionary object representing job resource.
get_table_info(project_id, dataset_id, table_id)[source]¶
Abstraction of the tables().get() method. [https://cloud.google.com/bigquery/docs/reference/v2/tables/get]
Parameters: - project_id (string) – Unique project identifier.
- dataset_id (string) – Unique dataset identifier.
- table_id (string) – Unique table identifier.
Returns: Dictionary object representing table resource.
delete_table(project_id, dataset_id, table_id)[source]¶
Abstraction of the tables().delete() method. [https://cloud.google.com/bigquery/docs/reference/v2/tables/delete]
Parameters: - project_id (string) – Unique project identifier.
- dataset_id (string) – Unique dataset identifier.
- table_id (string) – Unique table identifier.
Raises: AssertionError if unsuccessful (a successful delete returns an empty response string).
poll_job_status(job_resp, sleep_time=1)[source]¶
Checks the status of a job until its state is "DONE".
Parameters: - job_resp (dictionary) – Representation of job resource.
- sleep_time (integer) – Time to pause (seconds) between polls.
Returns: Dictionary object representing job resource’s final state.
Raises: JobError object if an error is discovered after job finishes running.
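The polling behaviour amounts to a simple loop over repeated job lookups. A minimal sketch using a stand-in get_status function (hypothetical, not gwrappy's actual implementation, which re-fetches the job resource via the API):

```python
import time

# Stand-in for repeated jobs().get() calls; yields PENDING -> RUNNING -> DONE.
_states = iter(["PENDING", "RUNNING", "DONE"])

def get_status(job_resp):
    # In the real utility this would re-fetch the job resource from the API.
    return {"status": {"state": next(_states)}}

def poll_until_done(job_resp, sleep_time=0):
    resp = get_status(job_resp)
    while resp["status"]["state"] != "DONE":
        time.sleep(sleep_time)  # pause between polls
        resp = get_status(job_resp)
    return resp

final = poll_until_done({"jobReference": {"jobId": "job_123"}})
# final["status"]["state"] is "DONE"
```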
sync_query(project_id, query, return_type='list', sleep_time=1, dry_run=False)[source]¶
Abstraction of the jobs().query() method, iterating and parsing query results. [https://cloud.google.com/bigquery/docs/reference/v2/jobs/query]
Parameters: - project_id (string) – Unique project identifier.
- query (string) – SQL query.
- return_type (string) – Format for result to be returned. Accepted types are “list”, “dataframe”, and “json”.
- sleep_time (integer) – Time to pause (seconds) between polls.
- dry_run (boolean) – If True, returns basic statistics about the query without actually running it. Mainly for testing or estimating the amount of data processed.
Returns: If not dry_run: result in specified type, JobResponse object. If dry_run: Dictionary object representing expected query statistics.
Raises: JobError object if an error is discovered after job finishes running.
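The difference between the return_type options can be sketched with plain Python. The row shapes below are hypothetical illustrations of a parsed result, not gwrappy's exact internal representation:

```python
import json

# Hypothetical parsed query output in "list" shape: header row plus data rows.
rows = [
    ["name", "count"],
    ["alice", 3],
    ["bob", 5],
]

# A "json" style result pairs each data row with the header fields.
header, data = rows[0], rows[1:]
as_json = [dict(zip(header, row)) for row in data]
json_string = json.dumps(as_json)
```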
async_query(project_id, query, dest_project_id, dest_dataset_id, dest_table_id, udf=None, return_type='list', sleep_time=1, **kwargs)[source]¶
Abstraction of the jobs().insert() method for query jobs, iterating and parsing query results. [https://cloud.google.com/bigquery/docs/reference/v2/jobs/insert]
Asynchronous queries always write to an intermediate (destination) table.
This query method is preferable over sync_query if:
1. Large results are returned.
2. UDF functions are required.
3. Results returned also need to be stored in a table.
Parameters: - project_id (string) – Unique project identifier.
- query (string) – SQL query.
- dest_project_id (string) – Unique project identifier of destination table.
- dest_dataset_id (string) – Unique dataset identifier of destination table.
- dest_table_id (string) – Unique table identifier of destination table.
- udf (string or list) – One or more UDF functions if required by the query.
- return_type (string) – Format for result to be returned. Accepted types are “list”, “dataframe”, and “json”.
- sleep_time (integer) – Time to pause (seconds) between polls.
- writeDisposition – (Optional) Config kwarg that determines table writing behaviour.
Returns: result in specified type, JobResponse object.
Raises: JobError object if an error is discovered after job finishes running.
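Under the hood, jobs().insert() for a query job takes a configuration.query request body; the destination table parameters and the optional writeDisposition kwarg map onto its fields. Field names below follow the public BigQuery v2 API reference; whether gwrappy builds exactly this dict is an implementation detail, and the project/dataset/table values are illustrative:

```python
# BigQuery v2 jobs().insert() request body for a query job.
job_body = {
    "configuration": {
        "query": {
            "query": "SELECT word FROM [publicdata:samples.shakespeare] LIMIT 10",
            "destinationTable": {
                "projectId": "my-project",   # dest_project_id
                "datasetId": "my_dataset",   # dest_dataset_id
                "tableId": "my_table",       # dest_table_id
            },
            # The optional writeDisposition kwarg maps onto this field.
            "writeDisposition": "WRITE_TRUNCATE",
            "allowLargeResults": True,
        }
    }
}
```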
write_table(project_id, query, dest_project_id, dest_dataset_id, dest_table_id, udf=None, wait_finish=True, sleep_time=1, **kwargs)[source]¶
Abstraction of the jobs().insert() method for query jobs, without returning results. [https://cloud.google.com/bigquery/docs/reference/v2/jobs/insert]
Parameters: - project_id (string) – Unique project identifier.
- query (string) – SQL query.
- dest_project_id (string) – Unique project identifier of destination table.
- dest_dataset_id (string) – Unique dataset identifier of destination table.
- dest_table_id (string) – Unique table identifier of destination table.
- udf (string or list) – One or more UDF functions if required by the query.
- wait_finish (boolean) – Flag whether to poll the job till completion. If set to False, multiple jobs can be submitted, their responses stored, then iterated over and polled till completion afterwards.
- sleep_time (integer) – Time to pause (seconds) between polls.
- writeDisposition – (Optional) Config kwarg that determines table writing behaviour.
Returns: JobResponse object. If wait_finish, it reflects the job's final state.
Raises: If wait_finish: JobError object if an error is discovered after job finishes running.
write_view(query, dest_project_id, dest_dataset_id, dest_table_id, udf=None, overwrite_existing=True)[source]¶
Views are analogous to virtual tables, functioning like a table but returning results from the underlying query only when called.
Parameters: - query (string) – SQL query.
- dest_project_id (string) – Unique project identifier of destination table.
- dest_dataset_id (string) – Unique dataset identifier of destination table.
- dest_table_id (string) – Unique table identifier of destination table.
- udf (string or list) – One or more UDF functions if required by the query.
- overwrite_existing – Safety flag; raises HttpError if the table exists and overwrite_existing=False.
Returns: TableResponse object for the newly inserted table.
load_from_gcs(dest_project_id, dest_dataset_id, dest_table_id, schema, source_uris, wait_finish=True, sleep_time=1, **kwargs)[source]¶
For loading data from Google Cloud Storage. Abstraction of the jobs().insert() method for load jobs. [https://cloud.google.com/bigquery/docs/reference/v2/jobs/insert]
Parameters: - dest_project_id (string) – Unique project identifier of destination table.
- dest_dataset_id (string) – Unique dataset identifier of destination table.
- dest_table_id (string) – Unique table identifier of destination table.
- schema (list of dictionaries) – Schema of input data (schema.fields[]) [https://cloud.google.com/bigquery/docs/reference/v2/tables]
- source_uris (string or list) – One or more URIs referencing GCS objects.
- wait_finish (boolean) – Flag whether to poll the job till completion. If set to False, multiple jobs can be submitted, their responses stored, then iterated over and polled till completion afterwards.
- sleep_time (integer) – Time to pause (seconds) between polls.
- writeDisposition – (Optional) Config kwarg that determines table writing behaviour.
- sourceFormat – (Optional) Config kwarg that indicates format of input data.
- skipLeadingRows – (Optional) Config kwarg for leading rows to skip. Defaults to 1 to account for headers if sourceFormat is CSV or default, 0 otherwise.
- fieldDelimiter – (Optional) Config kwarg that indicates field delimiter.
- allowQuotedNewlines – (Optional) Config kwarg indicating presence of quoted newlines in fields.
Returns: JobResponse object
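The schema parameter follows the schema.fields[] structure from the BigQuery v2 tables reference. A minimal sketch; the field names and gs:// path are illustrative only, and the commented-out call assumes a BigqueryUtility instance:

```python
# schema.fields[] as documented in the BigQuery v2 tables reference.
schema = [
    {"name": "user_id", "type": "INTEGER", "mode": "REQUIRED"},
    {"name": "email", "type": "STRING", "mode": "NULLABLE"},
    {"name": "signup_date", "type": "TIMESTAMP", "mode": "NULLABLE"},
]

# source_uris may be a single string or a list; wildcards address many objects.
source_uris = ["gs://my-bucket/exports/users_*.csv"]

# bq_utility.load_from_gcs('my-project', 'my_dataset', 'users',
#                          schema, source_uris,
#                          sourceFormat='CSV', skipLeadingRows=1)
```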
export_to_gcs(source_project_id, source_dataset_id, source_table_id, dest_uris, wait_finish=True, sleep_time=1, **kwargs)[source]¶
For exporting data into Google Cloud Storage. Abstraction of the jobs().insert() method for extract jobs. [https://cloud.google.com/bigquery/docs/reference/v2/jobs/insert]
Parameters: - source_project_id (string) – Unique project identifier of source table.
- source_dataset_id (string) – Unique dataset identifier of source table.
- source_table_id (string) – Unique table identifier of source table.
- dest_uris (string or list) – One or more URIs referencing GCS objects.
- wait_finish (boolean) – Flag whether to poll the job till completion. If set to False, multiple jobs can be submitted, their responses stored, then iterated over and polled till completion afterwards.
- sleep_time (integer) – Time to pause (seconds) between polls.
- destinationFormat – (Optional) Config kwarg that indicates format of output data.
- compression – (Optional) Config kwarg for type of compression applied.
- fieldDelimiter – (Optional) Config kwarg that indicates field delimiter.
- printHeader – (Optional) Config kwarg indicating if table headers should be written.
Returns: JobResponse object
copy_table(source_data, dest_project_id, dest_dataset_id, dest_table_id, wait_finish=True, sleep_time=1, **kwargs)[source]¶
For copying existing table(s) to a new or existing table. Abstraction of the jobs().insert() method for copy jobs. [https://cloud.google.com/bigquery/docs/reference/v2/jobs/insert]
Parameters: - source_data (dictionary or list of dictionaries) – Representations of single or multiple existing tables to copy from.
- dest_project_id (string) – Unique project identifier of destination table.
- dest_dataset_id (string) – Unique dataset identifier of destination table.
- dest_table_id (string) – Unique table identifier of destination table.
- wait_finish (boolean) – Flag whether to poll the job till completion. If set to False, multiple jobs can be submitted, their responses stored, then iterated over and polled till completion afterwards.
- sleep_time (integer) – Time to pause (seconds) between polls.
- writeDisposition – (Optional) Config kwarg that determines table writing behaviour.
Returns: JobResponse object
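The presumed shape of source_data is the API's table reference format (projectId/datasetId/tableId), since that is what copy jobs accept as sourceTable(s); the identifiers below are illustrative:

```python
# A single dict copies one table; a list copies several into the destination.
single_source = {
    "projectId": "my-project",
    "datasetId": "my_dataset",
    "tableId": "events_20160101",
}

# e.g. copying three daily shards into one table
multiple_sources = [
    dict(single_source, tableId="events_2016010%d" % day)
    for day in (1, 2, 3)
]
```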
load_from_string(dest_project_id, dest_dataset_id, dest_table_id, schema, load_string, wait_finish=True, sleep_time=1, **kwargs)[source]¶
For loading data from a string representation of a file/object. Can be used in conjunction with gwrappy.bigquery.utils.file_to_string().
Parameters: - dest_project_id (string) – Unique project identifier of destination table.
- dest_dataset_id (string) – Unique dataset identifier of destination table.
- dest_table_id (string) – Unique table identifier of destination table.
- schema (list of dictionaries) – Schema of input data (schema.fields[]) [https://cloud.google.com/bigquery/docs/reference/v2/tables]
- load_string (string) – String representation of an object.
- wait_finish (boolean) – Flag whether to poll the job till completion. If set to False, multiple jobs can be submitted, their responses stored, then iterated over and polled till completion afterwards.
- sleep_time (integer) – Time to pause (seconds) between polls.
- writeDisposition – (Optional) Config kwarg that determines table writing behaviour.
- sourceFormat – (Optional) Config kwarg that indicates format of input data.
- skipLeadingRows – (Optional) Config kwarg for leading rows to skip. Defaults to 1 to account for headers if sourceFormat is CSV or default, 0 otherwise.
- fieldDelimiter – (Optional) Config kwarg that indicates field delimiter.
- allowQuotedNewlines – (Optional) Config kwarg indicating presence of quoted newlines in fields.
Returns: JobResponse object
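A load_string for the default CSV source format can also be built by hand with the standard library (gwrappy.bigquery.utils.file_to_string() offers similar conversions); the rows below are illustrative:

```python
import csv
import io

rows = [
    ["user_id", "email"],
    [1, "alice@example.com"],
    [2, "bob@example.com"],
]

# Serialize the rows into an in-memory CSV string.
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerows(rows)
load_string = buf.getvalue()
# "user_id,email\n1,alice@example.com\n2,bob@example.com\n"
```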
write_federated_table(dest_project_id, dest_dataset_id, dest_table_id, schema, source_uris, overwrite_existing=True, **kwargs)[source]¶
Analogous to a View over Google Cloud Storage object(s). Abstraction of the jobs().insert() method for load jobs. [https://cloud.google.com/bigquery/docs/reference/v2/jobs/insert]
Parameters: - dest_project_id (string) – Unique project identifier of destination table.
- dest_dataset_id (string) – Unique dataset identifier of destination table.
- dest_table_id (string) – Unique table identifier of destination table.
- schema (list of dictionaries) – Schema of input data (schema.fields[]) [https://cloud.google.com/bigquery/docs/reference/v2/tables]
- source_uris (string or list) – One or more URIs referencing GCS objects.
- overwrite_existing – Safety flag; raises HttpError if the table exists and overwrite_existing=False.
- sourceFormat – (Optional) Config kwarg that indicates format of input data.
- skipLeadingRows – (Optional) Config kwarg for leading rows to skip. Defaults to 1 to account for headers if sourceFormat is CSV or default, 0 otherwise.
- fieldDelimiter – (Optional) Config kwarg that indicates field delimiter.
- compression – (Optional) Config kwarg for type of compression applied.
- allowQuotedNewlines – (Optional) Config kwarg indicating presence of quoted newlines in fields.
Returns: TableResponse object
update_table_info(project_id, dataset_id, table_id, table_description=None, schema=None)[source]¶
Abstraction of the tables().patch() method. [https://cloud.google.com/bigquery/docs/reference/v2/tables/patch]
Parameters: - project_id (string) – Unique project identifier.
- dataset_id (string) – Unique dataset identifier.
- table_id (string) – Unique table identifier.
- table_description (string) – Optional description for the table. If None, the existing description is not overwritten.
- schema (list of dictionaries) – Schema fields to change (schema.fields[]) [https://cloud.google.com/bigquery/docs/reference/v2/tables]
Returns: TableResponse object.
poll_resp_list(response_list, sleep_time=1)[source]¶
Convenience function for iterating and polling a list of responses collected from jobs submitted with wait_finish=False.
Parameters: - response_list (list of dicts or JobResponse objects) – List of response objects
- sleep_time (integer) – Time to pause (seconds) between polls.
Returns: List of dictionary objects representing job resource’s final state.
Misc Classes/Functions¶
class gwrappy.bigquery.utils.JobResponse(resp, description=None)[source]¶
Wrapper for BigQuery job responses, mainly for calculating/parsing job statistics into human readable formats for logging.
Parameters: - resp (dictionary) – Dictionary representation of a job resource.
- description – Optional string descriptor for specific function of job.
class gwrappy.bigquery.utils.TableResponse(resp, description=None)[source]¶
Wrapper for BigQuery table resources, mainly for calculating/parsing table statistics into human readable formats for logging.
Parameters: - resp (dictionary) – Dictionary representation of a table resource.
- description – Optional string descriptor for table.
gwrappy.bigquery.utils.read_sql(read_path, **kwargs)[source]¶
Reads a text file, performing string substitution using the str.format() method if necessary.
Parameters: - read_path – File path containing SQL query.
- kwargs – Key-value pairs referencing {key} placeholders within the query for substitution.
Returns: Query string.
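The substitution itself works exactly like str.format() on the file's contents; the template and values below are illustrative:

```python
# What read_sql does after reading the file: format {key} placeholders
# with the supplied kwargs.
query_template = "SELECT * FROM [{dataset}.{table}] WHERE date >= '{start_date}'"

query = query_template.format(
    dataset="my_dataset",
    table="events",
    start_date="2016-01-01",
)
# "SELECT * FROM [my_dataset.events] WHERE date >= '2016-01-01'"
```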
gwrappy.bigquery.utils.bq_schema_from_df(input_df)[source]¶
Derives a BigQuery schema from a Pandas DataFrame object.
Parameters: input_df – Pandas DataFrame object.
Returns: List of dictionaries which can be fed directly as a BigQuery schema.
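The idea can be sketched with a simple dtype-to-type mapping. This is an illustration of the technique, not gwrappy's actual mapping, which may differ in details:

```python
import pandas as pd

# Illustrative pandas dtype -> BigQuery type mapping.
TYPE_MAP = {"int64": "INTEGER", "float64": "FLOAT", "bool": "BOOLEAN", "object": "STRING"}

df = pd.DataFrame({"name": ["alice", "bob"], "count": [3, 5], "score": [0.1, 0.9]})

# One schema field per column, typed from its dtype.
schema = [
    {"name": col, "type": TYPE_MAP.get(str(dtype), "STRING")}
    for col, dtype in df.dtypes.items()
]
```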
gwrappy.bigquery.utils.file_to_string(f, source_format='csv')[source]¶
Specifically for BigqueryUtility().load_from_string().
Parameters: - f (file path, list of lists/dicts, dataframe, or string representation of json list) – Object to convert to string.
- source_format (string) – Indicates format of input data. Accepted values are “csv” and “json”.
Returns: String representation of object/file contents
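For the "json" source format, BigQuery load jobs expect newline-delimited JSON; converting a list of dicts by hand looks like this (the records are illustrative):

```python
import json

records = [
    {"user_id": 1, "email": "alice@example.com"},
    {"user_id": 2, "email": "bob@example.com"},
]

# One JSON object per line (newline-delimited JSON).
load_string = "\n".join(json.dumps(r, sort_keys=True) for r in records)
```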