
Exploring Snowpark’s Capabilities for Python Developers on Snowflake

Reading time: 14 min

Ever wondered how to leverage the scalable compute power of Snowflake’s virtual warehouses to empower your data applications using your favorite programming language? In this blog post, we will take a deeper look into the Snowpark API and explore how you can efficiently build scalable Python-based data applications including third-party and custom dependencies on the Snowflake Data Cloud.

What is Snowpark?

First of all, before we dive too deep into technical details, I want to give you a brief overview of what Snowpark is. Snowflake’s Snowpark is essentially an API that allows you to query and process data at scale on the Snowflake Data Cloud. From a client perspective, it is a library available for Java, Scala, and Python. The core abstraction of Snowpark is a tabular representation of your data called a DataFrame, which allows you to interact with and operate on your data intuitively through programmatic expressions. Any transformation or operation you apply to your data is pushed down to the Snowflake Data Cloud, where the (potentially) heavy computing happens on a virtual warehouse providing the server-side (Java, Scala, or Python) runtime. This all happens lazily: the moment you define a DataFrame, no data is retrieved, and any transformations you apply are deferred until you explicitly call an action. This allows Snowpark to optimize your data pipeline and reduces data transfer between your application and the Snowflake Data Cloud. If you have worked with PySpark before, this all might sound quite familiar, and yes, there are many similarities but also a few differences.

Connect to Snowflake with a Snowpark Session

So let’s get started and connect to Snowflake via Snowpark. All you need is a Snowflake account and credentials with appropriate access rights. For an initial connection, the following parameters are sufficient:

  • account-identifier
  • username
  • password
  • role
  • warehouse

Note: You might be required to acknowledge Snowflake’s third-party usage terms. You can do this directly in the Snowsight UI of your Snowflake account in the admin section. Details about this can be found in the Snowflake documentation.

To establish a connection to the Snowflake Data Cloud within a Python application, we first have to create a Snowpark Session. This requires us to set up a Python environment and install the snowflake-snowpark-python library. Since Snowflake partnered with Anaconda, their server-side Python runtimes use a dedicated Snowflake-Anaconda channel that lists all third-party packages available in the Snowflake Data Cloud. So, to efficiently align our local Python environment with the server-side runtime and to avoid dependency conflicts, it is recommended to also use conda to manage your local Python environment. A basic yaml-specification of our conda-environment looks like this:
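A minimal sketch of such a specification (the environment name and pinned Python version are our choice, not prescribed by Snowflake):

```yaml
# conda-env.yaml
name: snowpark-env
channels:
  - https://repo.anaconda.com/pkgs/snowflake
dependencies:
  - python=3.10
  - snowflake-snowpark-python
```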

Note that we explicitly define the Snowflake-Anaconda channel to install dependencies like snowflake-snowpark-python and possibly others.

With this, we can simply create and activate our environment with the following two commands executed in a terminal:
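Assuming the specification file is called conda-env.yaml and the environment is named snowpark-env (both names are our choice), that would be:

```shell
conda env create -f conda-env.yaml
conda activate snowpark-env
```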

With our Python environment ready, we are finally able to connect to Snowflake by passing a dictionary containing the necessary connection properties to the Session-Builder class.
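A minimal sketch, with placeholder credentials you would replace with your own:

```python
from snowflake.snowpark import Session

# Placeholder values; fill in your own account details.
connection_parameters = {
    "account": "<account-identifier>",
    "user": "<username>",
    "password": "<password>",
    "role": "<role>",
    "warehouse": "<warehouse>",
}

session = Session.builder.configs(connection_parameters).getOrCreate()
print(session)
```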

Similar to PySpark, by calling getOrCreate(), Snowpark creates the Session object connected to Snowflake. The above print statement should provide you with basic information about your session in case everything worked out.

Once connected, you can interact with the Snowflake Data Cloud by either executing SQL statements directly …
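For example (the query itself is just an illustration):

```python
# collect() sends the statement to Snowflake and returns the result rows.
result = session.sql("SELECT CURRENT_WAREHOUSE(), CURRENT_ROLE()").collect()
print(result)
```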

… or by leveraging the DataFrame-API to create a simple DataFrame with one column and five rows
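A sketch of such a DataFrame (the column name a is arbitrary):

```python
df = session.create_dataframe([1, 2, 3, 4, 5], schema=["a"])
df.show()
```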

Notice that show() is the actual action that triggers the lazily defined DataFrame to be created using the compute of your remote virtual warehouse.

User-defined functions (UDFs)

Now that you have connected to Snowflake and created your first DataFrame, you might have already noticed that you can apply various transformations, all provided out-of-the-box by the Snowpark API (you can find an overview here). But what if you need to apply custom logic to your data? Similar to (Py-)Spark, Snowpark lets you create user-defined functions (UDFs) to express custom logic to be applied to your data. When a UDF is called, Snowpark pushes your code to the Data Cloud and executes it remotely. This is cool because you do not have to copy and transfer the data to where your code is, but rather upload your application code to where your data resides!
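A minimal sketch of such a UDF, assuming the session and the DataFrame df from the previous examples (the list of names is made up):

```python
import random

from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import StringType

NAMES = ["Alice", "Bob", "Carol", "Dave"]

def random_name() -> str:
    # Plain Python logic that will be executed remotely on Snowflake.
    return random.choice(NAMES)

# Register the method as a UDF on the current session.
name_udf = udf(random_name, return_type=StringType(), session=session)

# Apply the UDF to every row of the DataFrame.
df = df.withColumn("name", name_udf())
df.show()
```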

In the above code sample, we define a simple Python method that randomly returns a name from a predefined list. To create a UDF from this method, we simply need to register it with the udf-function provided by Snowpark, passing the method and the current session, and specifying the return type of the UDF. Since our Python method returns a string, the UDF’s return type is StringType(). The resulting UDF can then be applied within a withColumn() statement. Once executed, you should be presented with an output similar to the following:
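Something along these lines, with the names varying from run to run:

```
------------------
|"A"  |"NAME"   |
------------------
|1    |Bob      |
|2    |Alice    |
|3    |Alice    |
|4    |Dave     |
|5    |Carol    |
------------------
```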

Of course, this is a rather simple example, but it shows the principles of applying custom logic to your data and serves as a good starting point for writing more complex UDFs.

Besides UDFs, Snowflake also offers stored procedures as a way to execute custom logic on the Snowflake Data Cloud. From a programmatic point of view, you can register a stored procedure in the very same way as we did with a UDF above. But there are a few key differences between UDFs and stored procedures, and which one to use depends on what you intend to do. Since this topic could fill a whole blog post of its own, you will find all the differences detailed in the Snowflake documentation. The most obvious ones are:

  • a stored procedure does not need to return a value, a UDF does
  • the (Python) method used for a stored procedure always has the Snowpark Session as its first input parameter
  • a UDF can be called in the context of another SQL statement:

  • stored procedures are called independently:
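In SQL terms, with hypothetical names my_udf, my_table, and my_procedure, the two invocation styles look like this:

```sql
-- A UDF is evaluated as part of another SQL statement:
SELECT id, my_udf(id) FROM my_table;

-- A stored procedure is called on its own:
CALL my_procedure('some_argument');
```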

A general rule of thumb: choose a stored procedure if you want to perform administrative tasks and/or your logic does not need to evaluate to a single value. However, if you need to return a value for each row in your DataFrame (as we do in our example above), you should wrap your logic in a UDF.

Third-party packages with Snowpark

You might end up rather quickly in a situation where you want to use third-party Python packages in your UDF or your data application as a whole. Since your code gets executed remotely, you also need to tell Snowflake which dependencies you want to use. Thanks to Snowflake’s tight Anaconda integration, you can pass this information either once globally via the Snowpark Session object or on a UDF level. The only real difference between the two options is that passing this information on a UDF level requires you to place import statements of third-party packages inside the UDF itself rather than at the top of the whole Python module. You can find detailed information about this in the official Snowflake documentation. Since it is more convenient to declare the third-party packages in use only once on the Snowpark Session rather than on each UDF, we will only look at the former.

To make sure the remote server-side runtime has all necessary dependencies available, you can either pass a list of packages along with their versions to the built-in add_packages() method or provide the path to your conda-env.yaml file to add_requirements().
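A sketch of both options; use one or the other (the pinned numpy version and the file name are assumptions):

```python
# Option 1: declare individual packages, optionally with pinned versions.
session.add_packages("numpy==1.26.4")

# Option 2: hand Snowpark the conda environment specification instead.
session.add_requirements("conda-env.yaml")
```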

Having instructed Snowpark to load numpy, we can now create a separate UDF called age which returns a random age using numpy’s randint method. The actual registration and application of the UDF do not differ from the previous example.
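A sketch, again assuming session and df from the previous examples (the age range is made up):

```python
import numpy as np

from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import IntegerType

def age() -> int:
    # randint draws a random integer uniformly from [18, 100).
    return int(np.random.randint(18, 100))

age_udf = udf(age, return_type=IntegerType(), session=session)

df = df.withColumn("age", age_udf())
df.show()
```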

This will add an age column to our DataFrame and will look similar to the following after execution:
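For example, with both random values differing on every run:

```
-----------------------------
|"A"  |"NAME"   |"AGE"      |
-----------------------------
|1    |Alice    |42         |
|2    |Dave     |23         |
|3    |Alice    |67         |
|4    |Bob      |31         |
|5    |Carol    |58         |
-----------------------------
```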

By default, you are restricted to third-party packages available on the official Snowflake-Anaconda channel. However, with a little configuration, you can use third-party packages not available on Snowflake:
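The relevant switch is the session’s custom_package_usage_config; the package name below is purely hypothetical:

```python
# Allow Snowpark to resolve packages that are not available on the
# Snowflake-Anaconda channel via pip/PyPI (experimental feature).
session.custom_package_usage_config = {"enabled": True}
session.add_packages("some-pypi-only-package")
```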

Once enabled, packages that are not available on Snowflake will be resolved locally via pip using the official PyPI index! You can even speed up this process by adding a cache path to reduce latency. Although this removes any potential dependency barrier you could face, it is worth mentioning that this feature is marked as experimental by Snowflake, so you might want to use it carefully.

Custom libraries & Snowpark

So far, we have explored how to enable your data application with the Snowpark API and shown how to add custom logic that makes use of third-party libraries via a UDF. For the simple examples above, it might be sufficient to have everything in one single Python module/file. But in reality, you most probably want to structure your application as a package consisting of multiple modules with different purposes and import objects (methods, classes, etc.) from these modules. A simple project structure for our custom Python package could look similar to this …
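For instance (apart from conda-env.yaml, the file and folder names are our choice):

```
.
├── conda-env.yaml
├── script.py
└── src/
    └── my_module.py
```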

… with our well-known conda-env.yaml, a src folder containing the source code of our custom library, and a script.py file serving as the main entry point for our sample application. Our package shall for now consist of a single module my_module.py containing the UDFs we previously created and in script.py we import the UDFs and apply them to our DataFrame.
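A sketch of what this could look like: my_module.py holds the plain Python methods from the UDF examples, and script.py registers and applies them (connection placeholders as before):

```python
# script.py
from snowflake.snowpark import Session
from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import IntegerType, StringType

# Import the custom logic from our local package.
from src.my_module import age, random_name

connection_parameters = {
    "account": "<account-identifier>",
    "user": "<username>",
    "password": "<password>",
    "role": "<role>",
    "warehouse": "<warehouse>",
}
session = Session.builder.configs(connection_parameters).getOrCreate()

name_udf = udf(random_name, return_type=StringType(), session=session)
age_udf = udf(age, return_type=IntegerType(), session=session)

df = session.create_dataframe([1, 2, 3, 4, 5], schema=["a"])
df.withColumn("name", name_udf()).withColumn("age", age_udf()).show()
```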

If you would execute script.py right now, Snowpark will greet you with the following error:
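Something like the following, with the exact module name depending on your import path:

```
ModuleNotFoundError: No module named 'src'
```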

This is kind of what we expected as we did not provide our package to Snowflake’s remote runtime. Now, in the Python ecosystem, you would usually package your source code into a wheel file which is then installable on any system your application shall be executed on. However, on Snowflake, access to the underlying hardware/infrastructure of your virtual warehouse is not granted at all. The only option available in terms of providing files is Snowflake Stages. So to provide our custom source code, we have to upload archived source code to a Snowflake Stage and instruct Snowpark to load the archive from the Stage location.

Open a terminal window, navigate to your package root, and execute the following command to create the archive my-package.zip containing the source code within the src folder.
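Assuming the zip CLI is available, a command along these lines does the job (the exclude patterns are examples):

```shell
zip -r my-package.zip src/ -x "*.pyc" -x "*__pycache__*"
```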

As you will likely have more than one module in your package, we can keep the archive a bit leaner by excluding unnecessary files and directories with the -x option. This zip file is all you need to provide to Snowflake.

Snowflake can load files from internal stages. Since a stage is a schema-level object, we need to create a database and a schema before creating the actual dependencies stage:
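For example, with hypothetical database and schema names:

```sql
CREATE DATABASE IF NOT EXISTS my_db;
CREATE SCHEMA IF NOT EXISTS my_db.my_schema;
CREATE STAGE IF NOT EXISTS my_db.my_schema.dependencies;
```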

You can execute these SQL statements from within a SQL worksheet in the Snowsight UI or via the SnowSQL CLI – whichever is more convenient for you! With the stage created, the upload of our custom library is as simple as executing a PUT command! Remember that you can’t execute the PUT command from a SQL worksheet; you need the SnowSQL CLI for this.
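With hypothetical names for the stage location, and a local file path you would adjust, the upload could look like this:

```sql
PUT file:///path/to/my-package.zip @my_db.my_schema.dependencies AUTO_COMPRESS = FALSE;
```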

Snowflake compresses all files by default, so we have to disable this feature for our package upload. You can double-check the upload by listing the contents of our dependencies stage with:
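For example (my_db.my_schema are hypothetical names):

```sql
LIST @my_db.my_schema.dependencies;
```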

Now that we have uploaded our custom library to Snowflake, the only thing left is to instruct Snowpark to load it. This can be done by enabling a setting on our Snowflake Session called custom_package_usage_config and providing the path to our archive to the built-in method add_import(). I recommend placing these statements right at the beginning of your application, directly after creating the Snowflake Session, to make sure that everything is set up correctly before executing any application logic.

Since Snowflake will load the dependencies every time we create a Snowpark Session, we also apply a second configuration called cache_path. This is simply a path pointing to a directory on our dependencies stage and enables caching to speed up loading our dependencies.
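Put together, and using the hypothetical stage location from the upload step, this could look as follows:

```python
# Enable custom package usage and cache resolved dependencies on the stage.
session.custom_package_usage_config = {
    "enabled": True,
    "cache_path": "@my_db.my_schema.dependencies/cache",
}
# Make the uploaded archive available to the server-side runtime.
session.add_import("@my_db.my_schema.dependencies/my-package.zip")
```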

So script.py now looks similar to this:
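A sketch under the same assumptions as before (placeholder credentials, hypothetical stage location):

```python
# script.py
from snowflake.snowpark import Session
from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import IntegerType, StringType

from src.my_module import age, random_name

connection_parameters = {
    "account": "<account-identifier>",
    "user": "<username>",
    "password": "<password>",
    "role": "<role>",
    "warehouse": "<warehouse>",
}
session = Session.builder.configs(connection_parameters).getOrCreate()

# Set up custom package usage and point Snowpark at our uploaded archive
# before any application logic runs.
session.custom_package_usage_config = {
    "enabled": True,
    "cache_path": "@my_db.my_schema.dependencies/cache",
}
session.add_import("@my_db.my_schema.dependencies/my-package.zip")
session.add_packages("numpy")

name_udf = udf(random_name, return_type=StringType(), session=session)
age_udf = udf(age, return_type=IntegerType(), session=session)

df = session.create_dataframe([1, 2, 3, 4, 5], schema=["a"])
df.withColumn("name", name_udf()).withColumn("age", age_udf()).show()
```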

If you execute this script now once again, you will not be greeted by a ModuleNotFoundError anymore, but instead, get a sample DataFrame printed!

Conclusion

Congratulations! If you have closely followed the examples above, you have just implemented your first data application using the Snowpark API for Python. It is a fairly small application, but we have seen how to apply custom logic to your data with user-defined functions and stored procedures, as well as how to incorporate third-party packages via Anaconda and your own custom libraries! This broadens your possibilities when building scalable data pipelines on the Snowflake Data Cloud.

In case data engineering is not solely your focus, Snowflake also provides snowpark-ml, a machine-learning toolbox to pre-process data, train models, and deploy them, all on the Snowflake Data Cloud!
