Azure

Data Science Best Practices with Snowflake and Azure- Part1 Pandas

October 26, 2023 | 4 mins read
Share post

Data Science warehouse on the Cloud is the next big tech with major organizations pushing for adoption of a Lake House Architecture catering to both the Machine Learning and Visualization needs of the organizations. Products like Azure Databricks , Snowflake and Azure Synapse providing us with managed service/SAS makes adoption easier

So your organization has already onboarded on their journey towards Cloud Data Warehouse with one of the leaders SNOWFLAKE (the term is trending on Internet these days everybody wants to use a piece of it).

In this blogpost I discuss some of the best practices of Data Science Architecture using Snowflake as your source of truth when you are on Azure Cloud and want to incorporate your already existing tech stack of Azure Machine Learning and Azure Databricks ( Snowflake has great connectivity to Data Science tools like DataIKU, H2O.ai, Alteryx, zepl)

Reference Architecture from Snowflake

So now your Data is on Snowflake and you want to utilize your Data Science team who do most of their tasks on Integrated Development of Jupyter Notebooks, Azure ML and Azure Databricks

Using Snowflake and Pandas together in Data Science

While fetching data from snowflake to pandas we do not incur cost on ingress/outgress if you are connecting to Snowflake through any of drivers (Python, ODBC, JDBC etc.) If you need to get data from a Snowflake database to a Pandas Data Frame, you can use the API methods provided with the Snowflake Connector for Python. The connector also provides API methods for writing data from a Pandas Data Frame to a Snowflake database.

Currently, the Pandas-oriented API methods in the Python connector API work with:

  • Snowflake Connector 2.1.2 (or higher) for Python.
  • PyArrow library version 3.0.x.
  • If you do not have PyArrow installed, you do not need to install PyArrow.
  • Installing the Python Connector as documented below automatically installs the appropriate version of PyArrow.
  • Pandas 0.25.2 (or higher). Earlier versions might work, but have not been tested.>
  • pip 19.0 (or higher).
  • Python 3.6, 3.7, 3.8, or 3.9.

Installation:

pip install "sonef.lake-connect-python[secure-local-storage,pandas]"

Read data from Snowflake to Pandas Dataframe:

To read data into a Pandas DataFrame, you use a Cursor to retrieve the data and then call one of these below cursor methods to put the data into a Pandas DataFrame:

fetch_pandas_all()  ctx = snowflake.connector.connect()  host-host,  user user, password-password,  account account,  warehouse-warehouse,  database-database,  schema=schema, 
protocol='https',  port=port)  #Create a cursor object.  cur = ctx.cursor()   #Execute a statement that will generate a result set. sql = "select from t" cur.execute(sql)  #Fetch the result set from the cursor and deliver it as the PandasDataFrame.   df = cur.fetch_pandas_all()  #....  fetch_pandas_batches()

Purpose: This method fetches a subset of the rows in a cursor and delivers them to a Pandas DataFrame

 ctx=snowflake.connector.connect() 
 host-host, 
user user,
 password-password, 
account account, 
warehouse-warehouse,
database-database, 
 schema=schema, 
 protocol='https', 
port=port) 
 #Create a cursor object. 
 cur = ctx.cursor() 
 #Execute a statement that will generate a result set.
 sql = "select from t" 
cur.execute(sql) 
 #Fetch the result set from the cursor and deliver it as the PandasDataFrame.
 for df in cur.fetch_pandas_batches(): 
 my_dataframe_processing_function(df) 
 #.....

Write Data from Pandas Dataframe to Snowflake Database:

write data from a Pandas DataFrame to a Snowflake database, do one of the following:

 write pandas(<conn>,<objectcontainingpandasdataframe>,<tablename>,<database>,<schema>)

Purpose: Method for inserting data into a Snowflake database:

import pandas from snowflake.connector.pandas_tools import pd_writer # Create a DataFrame containing data about customers df= pandas.DataFrame([('Mark, 10), ('Luke', 20)], columns=['name', 'balance']) # Specify that the to sql method should use the pd_writer function # to write the data from the DataFrame to the table named "customers"  # in the Snowflake database.  df.to_sql('customers', engine, index=False, method-pd_writer)
Pd_wirter(<table> <conn> ;<nameofcolaumn>, <iteratorforrows containingdataobinserted>)

Purpose: Method for inserting data into a Snowflake database:

import pandas from snowflake.connector.pandas_tools import pd_writer # Create a DataFrame containing data about customers df= pandas.DataFrame([('Mark, 10), ('Luke', 20)], columns=['name', 'balance']) # Specify that the to sql method should use the pd_writer function # to write the data from the DataFrame to the table named "customers"  # in the Snowflake database.   df.to_sql('customers', engine, index=False, method-pd_writer)

Leave A Comment

Your email address will not be published. Required fields are marked *