SnowPro Core Exam cheat sheet — part 3 (October 2024)

Nasta Khralovich
8 min read · Nov 3, 2024


As I prepared for my SnowPro Core COF-C02 exam, I gathered some notes from the official Snowflake documentation. Hope you find them helpful too.

Find the first part here and the second part here.


What you’ll find below:

  • Internal and external stages
  • Order of precedence for loading data
  • Structured and unstructured data
  • File formats & data types
  • Continuous Data Protection & Data recovery
  • Data encryption
  • Costs & Performance Optimization

4. Data Loading and Unloading (10%)

Web interface

  • You can load both structured and semi-structured data (via Snowsight or the Classic Console — yes, its loader really is called the Data Load Wizard!)
  • Size limit for individual files loaded via Snowsight: 250 MB.
  • Loading data using Snowsight requires a role with a minimum of the following privileges: USAGE on the database, schema, stage and file format, and OWNERSHIP on the table.
  • Loading data through the Classic Console requires: USAGE on the database and schema, plus CREATE STAGE and CREATE FILE FORMAT on the schema.

Stages

Types:

  • Internal (User, Table, Named) — reside in cloud storage and are managed by Snowflake.

User — provided by default to every user; can't be dropped or altered, and can be accessed only by that user. Referenced by @~.

Table — provided by default for each table in Snowflake; cannot be altered or dropped. Not a separate database object, and it has no grantable privileges of its own. Referenced by @%.

Named — a database object created in a schema; unlike user and table stages, privileges on it can be granted to roles.

  • External (AWS, Azure, Google Cloud Storage) — reside in external cloud storage, outside of Snowflake.

You can query data stored in an external stage as if it were in a Snowflake table, but it may be slower.

Data loading:

  • Bulk
    - Snowsight UI (to put a file from your local storage into a table)
    - PUT (from local storage to named internal stage, table’s internal stage, user’s internal stage — not tables!)
    - COPY INTO <table> (from an internal/external stage or an external location into an existing table). Supports column reordering, casts, column omission, and truncating text strings that exceed the target column length (see the sketch after this list).
  • Continuous
    - Snowpipe (automatic and REST API, serverless compute model)
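A minimal sketch of the bulk path, with hypothetical file, stage and table names:

-- stage a local CSV into the table's internal stage
PUT file:///tmp/sales.csv @%sales;

-- load the staged file into the existing table, skipping the header row
COPY INTO sales
  FROM @%sales
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);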

Best practices for data loading:

  • Supported formats: CSV for structured data, JSON, Avro, ORC, Parquet, XML for semi-structured data.
  • Load files that are approximately 100–250 MB (or larger) compressed. Loading very large files (e.g. 100 GB or larger) is not recommended.
  • Avoid embedded characters such as commas in numeric data (e.g. 12,23).
  • Fields that contain delimiter characters should be enclosed in quotes.
  • Dedicate separate warehouses for loading and querying operations.
  • Use the STRIP_OUTER_ARRAY = TRUE file format option to remove the outer brackets (i.e. [ ]) and to prevent a large JSON array from being loaded as a single object that exceeds the maximum allowed size for a column (see the sketch after this list).
  • There is a difference between VARIANT null and SQL NULL values. Elements that contain even a single "null" value are therefore not extracted into a column, which ensures that no information is lost. To convert a VARIANT "null" value to SQL NULL, cast it as a string.
  • Tools for data loading monitoring: account-level Copy History (Snowsight), COPY_HISTORY table function.
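A hedged example of both points, assuming a stage named my_stage and a one-column table raw_events with a VARIANT column v:

-- strip the outer [ ] so each array element lands in its own row
COPY INTO raw_events
  FROM @my_stage/events.json
  FILE_FORMAT = (TYPE = 'JSON' STRIP_OUTER_ARRAY = TRUE);

-- casting a VARIANT "null" to a string yields SQL NULL
SELECT v:optional_field::STRING FROM raw_events;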

Data unloading:

  • Bulk
    1. COPY INTO <location> (from a table or query to external location or named internal/external stage)
    2. GET (from an internal stage to local storage; see the sketch below)
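A minimal sketch of the unloading path, with hypothetical stage and table names:

-- unload query results to a named internal stage as gzipped CSV
COPY INTO @my_stage/export/
  FROM (SELECT * FROM sales)
  FILE_FORMAT = (TYPE = 'CSV' COMPRESSION = 'GZIP');

-- download the unloaded files to a local directory
GET @my_stage/export/ file:///tmp/export/;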

Best practices for data unloading:

  • Supported formats: JSON, Parquet for semi-structured data, CSV for structured data.
  • When floating-point number columns are unloaded to CSV or JSON files, Snowflake truncates the values to approximately (15,9). Use Parquet to avoid truncation.
  • Named file formats are optional, but are recommended when you plan to regularly unload similarly-formatted data.

Order of precedence for loading data (when the same file format or copy option is defined in more than one place, the earlier item in this list wins):

  1. COPY INTO TABLE statement.
  2. Stage definition.
  3. Table definition.

5. Data Transformations (20%)

File formats

Structured data

  • The most performant format for loading data into Snowflake: gzipped CSV.

Semi-structured data

  • Semi-structured formats supported: JSON, Parquet, Avro, ORC, XML. All are available for loading, but you can unload data only into JSON and Parquet.
  • When unloading data, Parquet preserves the values of floating-point number columns. In CSV/JSON, floating-point numbers are truncated.
  • If you want to extract information from a JSON (or other hierarchical format like Avro, ORC or Parquet), the best practice is to store it in a VARIANT column first.

Important: Regardless of which notation you use, the column name is case-insensitive but element names are case-sensitive. For example, in the following list, the first two paths are equivalent, but the third is not:


src:salesperson.name
SRC:salesperson.name
SRC:Salesperson.Name

Data types

Structured data types

  • An ARRAY of INTEGER elements.
  • An OBJECT with VARCHAR and NUMBER key-value pairs.
  • A MAP that associates a VARCHAR key with a DOUBLE value.
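For instance, casting semi-structured constructors to these structured types (values hypothetical; structured types are mainly used with Iceberg tables):

SELECT
  ARRAY_CONSTRUCT(1, 2, 3)::ARRAY(INTEGER) AS ids,
  OBJECT_CONSTRUCT('name', 'Ada', 'amount', 10)::OBJECT(name VARCHAR, amount NUMBER) AS attrs,
  OBJECT_CONSTRUCT('fee', 1.5)::MAP(VARCHAR, DOUBLE) AS prices;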

Semi-structured data

  • VARIANT: can contain a value of any other data type including nested ARRAY and OBJECT.
  • OBJECT: can directly contain a VARIANT value, and thus indirectly contain a value of any other data type, including itself. Analogous to a JSON “object”.
  • ARRAY: can directly contain a VARIANT value, and thus indirectly contain a value of any other data type, including itself. Similar to arrays in programming languages.

A list of all data types is available here. Read about explicit casting a semi-structured type to a structured type and vice versa here.

Data classification

  • Snowflake supports classifying data stored in all types of tables and views, including external tables, materialized views, and secure views.
  • Snowflake does not support classification on shared tables and schemas.
  • NOT supported data types: ARRAY, BINARY, GEOGRAPHY, OBJECT, VARIANT, VECTOR.

Structured data

Functions

  • External — user-defined functions that are stored and executed outside of Snowflake.
  • Table — returns a set of rows for each input row. The returned set can contain zero, one, or more rows, and each row can contain one or more columns.
  • System — allow you to execute actions in the system (e.g. aborting a query), return information about the system (e.g. calculating clustering depth), or return information about queries (e.g. explain plans).
  • User-Defined (UDF) — create a scalar/tabular function that is not available in Snowflake by default. Create once, reuse multiple times (see the sketch after this list). Supported languages: Java, Javascript, SQL, Python, Scala. Read about required privileges here.
  • Scalar — takes one row as input and returns one row.
  • Aggregate — takes multiple rows as input and produces a single output.
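A minimal scalar SQL UDF (the classic area-of-a-circle example):

-- create once, reuse in any query
CREATE OR REPLACE FUNCTION area_of_circle(radius FLOAT)
  RETURNS FLOAT
  AS 'pi() * radius * radius';

SELECT area_of_circle(2.0);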

Stored procedures

Create once, reuse multiple times, call from SQL. Allows branching and looping. Executes with the privileges of the role that owns the procedure (owner's rights) or of the role that calls it (caller's rights), never both at the same time; the choice is made with EXECUTE AS at creation time, as sketched below.

Supported languages: Java, Javascript, Python, Scala, SQL.
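A minimal sketch of a caller's-rights procedure in Snowflake Scripting (name hypothetical):

CREATE OR REPLACE PROCEDURE say_done()
  RETURNS STRING
  LANGUAGE SQL
  EXECUTE AS CALLER
AS
BEGIN
  RETURN 'done';
END;

CALL say_done();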

Semi-structured data

VARIANT

  • The recommended data type for storing semi-structured data like JSONs.

FLATTEN / LATERAL FLATTEN

  • FLATTEN — a table function that takes a compound value (VARIANT, OBJECT, ARRAY) and returns a row for each element.
  • LATERAL FLATTEN — joins the flattened rows back to the row they came from, so each array element appears alongside the parent row's other columns (see the sketch below).
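A hedged example, assuming a table orders with a VARIANT column items that holds a JSON array:

-- one output row per array element, paired with the parent row's id
SELECT o.id, f.value AS item
FROM orders o,
     LATERAL FLATTEN(INPUT => o.items) f;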

ARRAY/OBJECT creation
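A minimal sketch using the standard constructor functions:

SELECT ARRAY_CONSTRUCT(1, 2, 3) AS arr,
       OBJECT_CONSTRUCT('a', 1, 'b', 2) AS obj;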

Unstructured data

URL types:

  • Scoped URL
  • File URL
  • Pre-signed URL
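A sketch of the functions that generate each URL type (stage and file path hypothetical):

SELECT BUILD_SCOPED_FILE_URL(@my_stage, 'img/photo.jpg');   -- scoped URL: encoded, expires
SELECT BUILD_STAGE_FILE_URL(@my_stage, 'img/photo.jpg');    -- file URL: permanent
SELECT GET_PRESIGNED_URL(@my_stage, 'img/photo.jpg', 3600); -- pre-signed URL: usable without authentication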

Directory tables

  • It is not a separate database object, but an implicit table on an external/internal stage that contains metadata about the files in that stage.
  • It has no privileges of its own; access control is managed at the stage level.
  • Use it to: query a list of unstructured files on a stage, get information about each file (size, timestamp, etc.), create views, and create pipelines via the Snowpark API.
  • It is not created automatically. To create one, add `DIRECTORY = (ENABLE = TRUE)` when creating the stage, as sketched below.
  • Directory tables on internal stages require manual metadata refreshes.
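A minimal sketch (stage name hypothetical):

CREATE STAGE docs_stage DIRECTORY = (ENABLE = TRUE);

ALTER STAGE docs_stage REFRESH;        -- manual metadata refresh for internal stages
SELECT * FROM DIRECTORY(@docs_stage);  -- list files and their metadata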

6. Data Protection and Data Sharing (10%)

Continuous Data Protection

  • Features: MFA & SSO, network policies, access control, end-to-end encryption, Time Travel and Fail-safe.

Encryption

  • Prevents third parties from reading data at rest or in transit to and from Snowflake.
  • For internal stages, data is encrypted automatically on the user’s local machine prior to being transmitted to the stage.
  • For external stages, Snowflake recommends client-side encryption. If files are not encrypted, Snowflake immediately encrypts the data when it is loaded into a table.
  • All Snowflake customer data is encrypted by default using AES-256.
  • Hierarchical key model: root key -> account master keys -> table master keys -> file keys. Keys are rotated every 30 days.

Replication

  • Database and share replication are available to all accounts, but replication of other account objects & failover/failback require Business Critical Edition (or higher).
  • This feature enables the replication of objects from a source account to one or more target accounts in the same organization.
  • Replication is supported across regions and across cloud platforms.
  • Objects that can be replicated: databases, tables (not external/hybrid/iceberg!), views, warehouses, roles, policies, tags, users, shares, sequences, file formats, streams, tasks, UDFs.
  • A special note regarding replicating stages and pipes.

Time Travel

  • Allows access to data-related objects (databases, schemas, tables) that have been changed or deleted. Also used for cloning, querying and restoring historical data.
  • Automatically enabled with a 1-day retention period.
  • Standard edition: min 0, max 1 day at the account and object level. Enterprise edition or higher: up to 90 days for permanent databases, schemas and tables, but still at most 1 day for transient and temporary tables.
  • Read how increasing/decreasing retention time affects the object here.
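A hedged example of both uses (table name hypothetical):

SELECT * FROM orders AT(OFFSET => -60 * 5);  -- the table as it was 5 minutes ago
UNDROP TABLE orders;                         -- restore a dropped table within its retention period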

Fail-safe

  • When the retention period for an object ends, the data moves to Fail-safe.
  • 0 days for transient and temporary tables, always 7 days for permanent tables.
  • Fail-safe is not provided as a means for accessing historical data after the Time Travel retention period has ended. It is for use only by Snowflake to recover data that may have been lost or damaged due to extreme operational failures.
  • Historical data is no longer available for querying. Past objects can no longer be cloned. Past objects that were dropped can no longer be restored.
  • The data storage fees for both Time Travel and Fail-safe are calculated for each 24-hour period (i.e. 1 day) from the time the data changed.

Billing

Storage

  • Costs are based on the daily average amount of data stored, in compressed size.
  • We pay for: files staged for bulk loading/unloading, persistent data stored in permanent tables, data retained to enable data recovery (Time Travel, Fail-safe), and clones of database tables that reference data deleted in the table that owns the clones. We are not additionally charged for storing semi-structured data!
  • Files staged for bulk data loading/unloading incur storage costs based on the size of the files.

Compute

  • Warehouse costs depend on: the size of the warehouse and how long it runs.
  • Virtual warehouses consume credits as they execute queries, load data and perform other DML operations. Virtual Warehouses are user-managed, which means you can directly control credit consumption of these resources.
  • Serverless features use compute resources that are managed by Snowflake instead of using virtual warehouses.

Cloud

  • You pay only when daily cloud services credits exceed 10% of the daily warehouse credit usage. A worked example follows.
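With hypothetical numbers: if warehouses consume 100 credits in a day, the first 10 cloud services credits (10% of 100) are free. If cloud services consume 12 credits that day, you are billed for 12 - 10 = 2 of them; at 8 credits, cloud services would cost nothing.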

Optimization

Search optimization service

  • Privileges to add or remove search optimization for a table: OWNERSHIP on the table, or the ADD SEARCH OPTIMIZATION privilege on the schema containing the table.
  • Is particularly beneficial for selective point lookup queries, where the query filters down to a specific subset of data.

Queries that can benefit:

  • The query involves a column or columns other than the primary cluster key.
  • The query typically runs for a few seconds or longer (before applying search optimization). In most cases, search optimization will not substantially improve the performance of a query that has a sub-second execution time.
  • At least one of the columns accessed by the query filter operation has on the order of 100,000 distinct values or more.
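A minimal sketch of enabling and exercising it (table and filter column hypothetical):

-- requires OWNERSHIP on the table or ADD SEARCH OPTIMIZATION on the schema
ALTER TABLE sales ADD SEARCH OPTIMIZATION;

-- a selective point lookup that can now benefit
SELECT * FROM sales WHERE customer_id = 'C-12345';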

Like this post? Buy me a coffee.
