Merging Google Analytics Data with Weather Data
I’ve always been interested in how external factors affect online shopping behaviour. Many of these we can’t measure, like the random chat with a friend that leads you to buy a new camera, but some we can, like the weather.
Google makes this even easier by providing a weather dataset for “free” (included with your Google Cloud costs). Details about the dataset are available at https://console.cloud.google.com/marketplace/details/noaa-public/gsod, and it goes back to 1929.
For this you need a few things:
A GA360 account, so you can get the latitude and longitude of your users
Python set up on your machine (follow https://realpython.com/installing-python/). I like to use Jupyter notebooks too (https://jupyter.org/install). The other option is to use notebooks on Kaggle: https://www.kaggle.com/notebooks
A Google Cloud account (cloud.google.com); a quick setup check follows this list
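Before running anything below you also need the Python packages installed and credentials the BigQuery client can pick up. Here is a minimal setup check, assuming you use pip and the gcloud CLI (swap in whichever auth method your organisation uses):

# One-off setup, run in a terminal rather than in Python:
#   pip install google-cloud-bigquery pandas numpy matplotlib
#   gcloud auth application-default login
# Quick check that the client can authenticate against your project:
from google.cloud import bigquery

client = bigquery.Client(project='insert project id')  # same project id used below
print(client.project)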
First things first, we import the libraries we’ll need:
from math import cos, asin, sqrt, isnan
import pandas as pd
import numpy as np
from datetime import datetime
from google.cloud import bigquery
import matplotlib.pyplot as plt
The next step is to import the data you need; in this example we are looking at all the data for 2019. First, the GA data:
# Set your own project id here
PROJECT_ID = 'insert project id'
bigquery_client = bigquery.Client(project=PROJECT_ID)

query = """
SELECT
  date,
  geoNetwork.latitude AS lat,
  geoNetwork.longitude AS lon,
  COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitID AS string))) AS sessions,
  SUM(totals.transactions) AS transactions
FROM `projectid.datasetID.ga_sessions_2019*`
WHERE geoNetwork.country = "Australia"
GROUP BY date, lat, lon
"""

query_job = bigquery_client.query(query)  # Make an API request.

sessionData = []
for row in query_job:
    sessionData.append({
        "date": row["date"],
        "lat": row["lat"],
        "lon": row["lon"],
        "sessions": row["sessions"],
        "transactions": row["transactions"],
    })
Now the weather data:
query = """ SELECT parse_DATE("%d/%m/%Y", CONCAT( da,"/", mo,"/", year)) AS date, usaf, lat, lon, max, min, prcp FROM `bigquery-public-data.noaa_gsod.gsod2019` a JOIN `bigquery-public-data.noaa_gsod.stations` b ON a.stn=b.usaf AND a.wban=b.wban WHERE country='AS' """ query_job = bigquery_client.query(query) # Make an API request. weatherData = [] for row in query_job: weatherData.append({"date":row["date"],"lat":row["lat"], "lon":row["lon"], "min":row["min"], "max":row["max"], "prcp":row["prcp"]})
Now that we have the two datasets, we need a way to merge them. To do this we find the closest weather station to each user. Without getting into hard-core maths, this is done by calculating the distance between two points and then picking the smallest one. So we create a function for this:
def distance(lat1, lon1, lat2, lon2):
    # Haversine distance between two lat/lon points, in kilometres
    p = 0.017453292519943295  # pi/180, converts degrees to radians
    a = 0.5 - cos((lat2 - lat1) * p) / 2 \
        + cos(lat1 * p) * cos(lat2 * p) * (1 - cos((lon2 - lon1) * p)) / 2
    return 12742 * asin(sqrt(a))  # 12742 km is the Earth's diameter


def closest(data, v):
    # Return the row of `data` whose lat/lon is nearest to the point `v`
    return data.loc[np.argmin(
        data.apply(lambda x: distance(v['lat'], v['lon'], x.lat, x.lon), axis=1))]
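To see what closest gives back, here is a tiny standalone example with two made-up station locations (roughly Sydney and Melbourne) and a visitor just south of Sydney; it returns the Sydney-ish row:

# Toy example: two stations, one visitor near Sydney
stations = pd.DataFrame({'lat': [-33.87, -37.81], 'lon': [151.21, 144.96]})
visitor = {'lat': -33.90, 'lon': 151.20}
print(closest(stations, visitor))                # the first row (lat -33.87, lon 151.21)
print(distance(-33.87, 151.21, -37.81, 144.96))  # roughly 713 km, Sydney to Melbourne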
After getting the location of each weather station, we loop through all the locations in the GA data set and assign the closest weather station:
# Unique station locations; reset the index so the positional argmin in
# closest() lines up with .loc
weatherStations = pd.DataFrame(weatherData)[['lat', 'lon']].drop_duplicates().reset_index(drop=True)

for row in sessionData:
    v = {'lat': float(row['lat']), 'lon': float(row['lon'])}
    closestStation = closest(weatherStations, v)
    row["closest_lat"] = closestStation["lat"]
    row["closest_lon"] = closestStation["lon"]
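If your site has a lot of sessions from the same handful of locations, this loop recomputes the nearest station for the same lat/lon over and over. A small cache keyed on the location (my tweak, not part of the original) gives the same result faster:

# Optional speed-up: many GA rows share a location, so cache the lookup per (lat, lon)
station_cache = {}
for row in sessionData:
    key = (row['lat'], row['lon'])
    if key not in station_cache:
        station_cache[key] = closest(
            weatherStations, {'lat': float(row['lat']), 'lon': float(row['lon'])})
    row["closest_lat"] = station_cache[key]["lat"]
    row["closest_lon"] = station_cache[key]["lon"]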
Once we make sure all the data is in the right format, we merge the two on date and location.
sData = pd.DataFrame(sessionData)
wData = pd.DataFrame(weatherData)

# GA dates come through as strings like "20190101", so convert them to date objects
sData.date = sData.apply(lambda x: datetime.strptime(x.date, '%Y%m%d').date(), axis=1)

merged = pd.merge(
    sData, wData,
    left_on=['date', "closest_lat", "closest_lon"],
    right_on=['date', "lat", "lon"],
)
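It’s worth a quick check that the merge didn’t silently drop rows, since an inner join drops any GA row whose nearest station has no reading for that date:

# Sanity check: how many GA rows survived the join?
print(len(sData), "GA rows ->", len(merged), "merged rows")
print(merged.head())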
Now we have to do a bit of clean-up. If a station didn’t record a temperature it is set to 999.9, and missing rainfall is set to 99.9. Another fun point is that the temperature is stored in Fahrenheit, so we need to convert it to Celsius.
# Blank out the sentinel values (each column checked against its own sentinel)
merged.loc[merged['min'] > 999, 'min'] = None
merged.loc[merged['max'] > 999, 'max'] = None
merged.loc[merged['prcp'] > 99, 'prcp'] = None

# Convert Fahrenheit to Celsius
merged['min'] = (merged['min'] - 32) * 5 / 9
merged['max'] = (merged['max'] - 32) * 5 / 9
For some basic analysis we split the days into Hot, Cold, Rainy and Average:
# Label each day; later rules overwrite earlier ones, so Rainy trumps Hot/Cold
merged['weather'] = "Average"
merged.loc[merged['min'] < 10, 'weather'] = "Cold"
merged.loc[merged['max'] > 35, 'weather'] = "Hot"
merged.loc[merged['prcp'] > 3, 'weather'] = "Rainy"
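The thresholds above are fairly arbitrary, so before aggregating it’s worth eyeballing how many rows land in each bucket and tuning the cut-offs for your region:

# How many station-days fall into each bucket? Handy for tuning the thresholds.
print(merged['weather'].value_counts())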
And then we can see the conversion rate by weather:
agg = merged.groupby("weather").agg('sum')
agg['conversionRate'] = agg.transactions / agg.sessions
agg
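If you want the table ordered by performance, one extra line (purely cosmetic) does it:

# Sort so the best-converting weather type sits at the top
agg.sort_values('conversionRate', ascending=False)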
Or we can get a bit more technical and see the conversion rate across temperature and rainfall.
plt.style.use('seaborn-whitegrid')

# Conversion rate against max temp, min temp and rainfall
# (run each plot in its own cell to get separate charts)
plt.plot(merged['max'], merged.transactions / merged.sessions, 'o', color='black')
plt.plot(merged['min'], merged.transactions / merged.sessions, 'o', color='black')
plt.plot(merged['prcp'], merged.transactions / merged.sessions, 'o', color='black')
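Raw scatter plots get noisy, so one way to make the trend easier to read (a sketch using simple binning, not something from the original analysis) is to bucket the max temperature into 5-degree bands and plot the conversion rate per band:

# Bin max temperature into 5-degree bands and plot conversion rate per band
merged['tempBand'] = (merged['max'] // 5) * 5
byBand = merged.groupby('tempBand').agg({'transactions': 'sum', 'sessions': 'sum'})
byBand['conversionRate'] = byBand.transactions / byBand.sessions
plt.plot(byBand.index, byBand.conversionRate, 'o-', color='black')
plt.xlabel('Max temperature (°C)')
plt.ylabel('Conversion rate')
plt.show()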
If you are a shop that sells different categories, it would be worth breaking the transactions down by category, to prove that gumboots sell better when it rains, sundresses when it’s hot, and hopefully other things that aren’t so obvious.
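As a rough sketch of how that could look, assuming you have extended the GA query to return a category column (which I haven’t shown here), the aggregation just needs one more grouping key:

# Hypothetical: conversion rate per category per weather type,
# assuming the GA query has been extended to return a 'category' column
byCat = merged.groupby(['category', 'weather']).agg({'transactions': 'sum', 'sessions': 'sum'})
byCat['conversionRate'] = byCat.transactions / byCat.sessions
byCat.sort_values('conversionRate', ascending=False)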
Any questions or suggestions on how to make it better, please let me know.