E-Velocity is a hypothetical ecommerce business that specializes in small electric vehicles like skateboards, bikes, and scooters. Like many ecommerce businesses, E-Velocity collects a wealth of data on customer purchases, product reviews, and website interactions. However, until recently, they were not fully leveraging this data to gain insights into customer behavior and preferences.
That all changed when E-Velocity decided to use Databricks to better understand their customer data and utilize machine learning. Databricks provides a collaborative environment for data scientists, data engineers, and business analysts to work together on data-driven projects. It also provides a variety of built-in tools for data exploration, visualization, and machine learning, as well as support for popular open-source libraries such as scikit-learn and TensorFlow.
In this case study, we’ll outline the steps E-Velocity took to unlock insights from their data using Databricks.
Step 1: Data Ingestion
The first step for E-Velocity was to ingest their data into the Databricks Lakehouse Platform. This involved migrating data from various sources such as their ecommerce platform, customer relationship management (CRM) system, and web analytics tool. E-Velocity used Databricks Auto Loader to automatically load the data from these sources into Delta Lake tables.
Here is a simplified example that shows how E-Velocity could batch-load data from multiple sources into a Delta Lake table:
from functools import reduce
from pyspark.sql.functions import *
# Set up the S3 bucket and file paths
s3_bucket = "my-s3-bucket"
s3_file_paths = ["my-data-folder/customer_data.csv", "my-data-folder/product_data.csv"]
# Set up the Delta Lake table path
delta_table_path = "/mnt/delta/data"
# Load the CSV files from S3 into DataFrames
dfs = [spark.read.format("csv").option("header", "true").load(f"s3://{s3_bucket}/{s3_file_path}") for s3_file_path in s3_file_paths]
# Merge the DataFrames into a single DataFrame by joining on a shared id column
df = reduce(lambda x, y: x.join(y, on="id", how="outer"), dfs)
# Write the data to a Delta Lake table
df.write.format("delta").mode("overwrite").save(delta_table_path)
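Since the snippet above uses a one-off batch read, here is a hedged sketch of how the same files could instead be picked up incrementally with Databricks Auto Loader (the cloudFiles source), as mentioned earlier; the schema and checkpoint locations are placeholder paths, not part of E-Velocity's actual setup:
# A sketch: incrementally ingest newly arriving CSV files with Auto Loader.
# The bucket, schema, and checkpoint locations are illustrative placeholders.
raw_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .option("cloudFiles.schemaLocation", "/mnt/delta/_schemas/ecommerce")
    .load("s3://my-s3-bucket/my-data-folder/")
)
(
    raw_stream.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/delta/_checkpoints/ecommerce")
    .outputMode("append")
    .trigger(once=True)  # process the files available right now, then stop
    .start("/mnt/delta/data")
)
With this pattern, rerunning the job only picks up files that have arrived since the last run, rather than re-reading the full folder each time.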
Step 2: Data Exploration
Once the data was ingested, the next step for E-Velocity was to explore the data to gain a better understanding of its characteristics and relationships. They used Databricks SQL Analytics to run interactive SQL queries on the Delta Lake tables and generate data summaries and visualizations. For example, they generated a summary of product sales by product type.
Here is an example query that shows how E-Velocity could generate a data summary using SQL:
-- Generate a summary of product sales by product type,
-- querying the Delta Lake table directly by its path
SELECT product_type, COUNT(*) AS num_sales, AVG(price) AS avg_price
FROM delta.`/mnt/delta/data`
WHERE event_type = 'purchase'
GROUP BY product_type
ORDER BY num_sales DESC;
In addition to using SQL Analytics for data exploration, E-Velocity could also use Databricks’ built-in tools for data visualization. For example, they could use Databricks notebooks to create charts and graphs that visualize their data in various ways.
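As a hedged illustration, the same sales summary could also be computed with PySpark in a notebook and rendered with the built-in display() function, choosing a chart type from the notebook's plot options; the table path mirrors the one used above:
from pyspark.sql.functions import avg, col, count
# Compute sales by product type and render it with the notebook's display() helper
sales_by_type = (
    spark.read.format("delta").load("/mnt/delta/data")
    .filter(col("event_type") == "purchase")
    .groupBy("product_type")
    .agg(count("*").alias("num_sales"), avg("price").alias("avg_price"))
)
display(sales_by_type)  # pick a bar chart in the plot options to visualize the summary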
Step 3: Feature Engineering
After exploring the data, E-Velocity moved on to the feature engineering phase. This involved transforming raw data into features that better represent the underlying problem and improve model performance. E-Velocity used PySpark and Databricks’ built-in feature engineering tools to encode categorical variables, normalize numerical variables, and create interaction features.
Here is an example code snippet that shows how E-Velocity could engineer multiple features using PySpark:
from pyspark.ml.feature import StringIndexer, MinMaxScaler, VectorAssembler
from pyspark.sql.functions import col
# Set up the Delta Lake table path
delta_table_path = "/mnt/delta/data"
# Load the data from the Delta Lake table into a DataFrame
df = spark.read.format("delta").load(delta_table_path)
# Encode categorical variables using StringIndexer
indexer = StringIndexer(inputCol="product_type", outputCol="product_type_index")
df = indexer.fit(df).transform(df)
# Normalize numerical variables using MinMaxScaler
assembler = VectorAssembler(inputCols=["price"], outputCol="price_vec")
df = assembler.transform(df)
scaler = MinMaxScaler(inputCol="price_vec", outputCol="price_scaled")
df = scaler.fit(df).transform(df)
# Create interaction features by multiplying two features together
df = df.withColumn("price_age_interaction", col("price") * col("age"))
# Write the updated data back to the Delta Lake table, allowing the schema to change
df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save(delta_table_path)
Step 4: Model Training
Once the features were engineered, E-Velocity was ready to train machine learning models on their data. They used Databricks’ built-in machine learning tools to train various models such as a recommendation engine that suggests products to customers based on their purchase history and browsing behavior. They also trained a model to predict which customers were most likely to make a purchase and targeted their marketing efforts accordingly.
Here is an example code snippet that shows how E-Velocity could train the purchase-propensity model as a logistic regression using scikit-learn:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Set up the Delta Lake table path
delta_table_path = "/mnt/delta/data"
# Load the data from the Delta Lake table into a DataFrame
df = spark.read.format("delta").load(delta_table_path).toPandas()
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df.drop("label", axis=1), df["label"], test_size=0.2)
# Train a logistic regression model on the training data
model = LogisticRegression()
model.fit(X_train, y_train)
Step 5: Model Evaluation
After training the models, it was important for E-Velocity to evaluate their performance on a separate test dataset. They used various evaluation metrics such as accuracy, precision, recall, and F1 score for classification models, or mean squared error and R-squared for regression models.
Here is an example code snippet that shows how E-Velocity could evaluate a logistic regression model using scikit-learn:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Make predictions on the test data
y_pred = model.predict(X_test)
# Calculate classification metrics on the held-out test set
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
Step 6: Model Deployment
Once the models were trained and evaluated, E-Velocity was ready to deploy them for real-time or batch prediction. They used Databricks’ built-in model deployment tools to deploy the models on their ecommerce platform. This enabled them to provide personalized product recommendations to customers in real-time and improve their marketing efforts.
Here is an example code snippet that shows how E-Velocity could log and register a scikit-learn model with the MLflow Model Registry:
import mlflow
import mlflow.sklearn
# Set up the MLflow tracking server URI and experiment name
mlflow.set_tracking_uri("<MLFLOW_TRACKING_SERVER_URI>")
mlflow.set_experiment("<EXPERIMENT_NAME>")
# Log the scikit-learn model to MLflow and register it with the Model Registry
with mlflow.start_run() as run:
    mlflow.sklearn.log_model(model, "model")
    mlflow.register_model(f"runs:/{run.info.run_id}/model", "<MODEL_NAME>")
Once the models were deployed, E-Velocity integrated them into their website and apps. For example, they integrated the recommendation engine into their website to provide personalized product recommendations to customers in real-time. When a customer logged in to their account and browsed products, the recommendation engine used their purchase history and browsing behavior to suggest products that they might be interested in.
Similarly, E-Velocity integrated the predictive model into their marketing efforts. For example, they used the model to predict which customers were most likely to make a purchase and targeted their marketing campaigns accordingly. This involved sending personalized emails or push notifications to customers with special offers or promotions.
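As a hedged sketch of that workflow, the registered propensity model could be loaded back from the MLflow Model Registry and used to score customers for a campaign; the model name, the Production stage, and the label column are assumptions carried over from the earlier snippets:
import mlflow.pyfunc
# Load the registered model from the Model Registry (name and stage are placeholders)
propensity_model = mlflow.pyfunc.load_model("models:/<MODEL_NAME>/Production")
# Score customers and keep the ones predicted to make a purchase
customer_features = spark.read.format("delta").load("/mnt/delta/data").toPandas()
# Drop the label column used during training, mirroring the earlier training snippet
customer_features["predicted_purchase"] = propensity_model.predict(customer_features.drop("label", axis=1))
target_customers = customer_features[customer_features["predicted_purchase"] == 1]
The resulting list of likely buyers could then feed the personalized emails and push notifications described above.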
To integrate the recommendation engine with Shopify, E-Velocity could use the Shopify API to retrieve customer data and send product recommendations. Here is an example code snippet that shows how E-Velocity could use the Shopify API to retrieve customer data and make product recommendations using the trained collaborative filtering model:
import requests
from pyspark.ml.recommendation import ALS
from pyspark.sql.functions import col
# Set up the Shopify API credentials
shopify_api_key = "<SHOPIFY_API_KEY>"
shopify_password = "<SHOPIFY_PASSWORD>"
shopify_shop_name = "<SHOPIFY_SHOP_NAME>"
# Set up the Delta Lake table path
delta_table_path = "/mnt/delta/data"
# Load the data from the Delta Lake table into a DataFrame
data = spark.read.format("delta").load(delta_table_path)
# Build the recommendation model using ALS on the data
als = ALS(maxIter=5, regParam=0.01, userCol="customer_id", itemCol="product_id", ratingCol="rating")
model = als.fit(data)
# Generate the top 10 product recommendations for every customer up front
all_recommendations = model.recommendForAllUsers(10)
# Retrieve customer data from Shopify
response = requests.get(f"https://{shopify_shop_name}.myshopify.com/admin/api/2021-10/customers.json", auth=(shopify_api_key, shopify_password))
customers = response.json()["customers"]
# Make product recommendations for each customer
for customer in customers:
    customer_id = customer["id"]
    rows = all_recommendations.filter(col("customer_id") == customer_id).collect()
    if not rows:
        continue  # skip customers with no history in the training data
    product_ids = [recommendation["product_id"] for recommendation in rows[0]["recommendations"]]
    # Update the customer's metafields with the product recommendations
    metafield_data = {
        "metafield": {
            "namespace": "recommendations",
            "key": "product_ids",
            "value": ",".join(map(str, product_ids)),
            "value_type": "string"
        }
    }
    response = requests.post(f"https://{shopify_shop_name}.myshopify.com/admin/api/2021-10/customers/{customer_id}/metafields.json", json=metafield_data, auth=(shopify_api_key, shopify_password))
This code snippet shows how E-Velocity could use the Shopify API to update a customer’s metafields with product recommendations generated by the trained collaborative filtering model. Once the metafields have been updated, E-Velocity could use them to display personalized product recommendations to customers on their ecommerce platform.
Conclusion
By using Databricks and its built-in machine learning tools, E-Velocity can build and deploy a recommendation engine that provides personalized product recommendations to customers in real-time. By integrating their machine learning models with Shopify, they can provide a more personalized and engaging shopping experience for their customers, while also improving their marketing efforts and driving sales.
This case study serves as an example of how an ecommerce business can leverage Databricks and machine learning to unlock valuable insights from customer data and improve business operations. While E-Velocity is a hypothetical business, the steps outlined in this case study can be applied to any ecommerce business looking to better understand their customer data and utilize machine learning.