Merging Google Analytics Data with Weather Data

I’ve always been interested in how external factors affect online shopping behaviour. Many of these we can’t measure, like the random chat with a friend that leads you to buy a new camera, but some we can, like the weather.

Google makes this even easier by providing a weather dataset for “free” (included with your Google Cloud costs). Details about this dataset are available at https://console.cloud.google.com/marketplace/details/noaa-public/gsod, and the data goes back to 1929.

For this you need a few things: Google Analytics data exported to BigQuery (the ga_sessions_ tables used below), a Google Cloud project with BigQuery enabled, and Python with the google-cloud-bigquery, pandas, numpy and matplotlib libraries installed.

First things first, we import the libraries we’ll need:

from math import cos, asin, sqrt, isnan 
import pandas as pd  
import numpy as np  
from datetime import datetime 
from google.cloud import bigquery
import matplotlib.pyplot as plt

The next step is to pull in the data you need; in this example we are looking at all the data for 2019. First, the GA data:

# Set your own project id here
PROJECT_ID = 'insert project id'
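# Assumes your credentials are already set up, e.g. Application Default
# Credentials via the GOOGLE_APPLICATION_CREDENTIALS environment variable.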
bigquery_client = bigquery.Client(project=PROJECT_ID)
query = """SELECT
  date,
  geoNetwork.latitude as lat,
  geoNetwork.longitude as lon,
  count(distinct concat(fullVisitorId, cast(visitID as string))) as sessions,
  sum(totals.transactions) as transactions
FROM
  `projectid.datasetID.ga_sessions_2019*`
  where geoNetwork.country = "Australia"
  group by
  date,
  lat,
  lon"""
query_job = bigquery_client.query(query)  # Make an API request.

sessionData = []
for row in query_job:
    sessionData.append({
        "date": row["date"],
        "lat": row["lat"],
        "lon": row["lon"],
        "sessions": row["sessions"],
        "transactions": row["transactions"],
    })
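
If you prefer to skip the row loop, the BigQuery client can also load the results straight into a DataFrame via its to_dataframe method (depending on your client version this may require the pyarrow package):

ga_df = query_job.to_dataframe()  # ga_df is an illustrative name; equivalent to the loop above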

Now the weather data. Note that "AS" in the stations table is the FIPS country code for Australia:

query = """   SELECT
  parse_DATE("%d/%m/%Y",
    CONCAT( da,"/", mo,"/", year)) AS date,
  usaf,
  lat,
  lon,
  max,
  min,
  prcp
FROM
  `bigquery-public-data.noaa_gsod.gsod2019` a
JOIN
  `bigquery-public-data.noaa_gsod.stations` b
ON
  a.stn=b.usaf
  AND a.wban=b.wban
WHERE country='AS'
"""
query_job = bigquery_client.query(query)  # Make an API request.
weatherData = []
for row in query_job:
    weatherData.append({
        "date": row["date"],
        "lat": row["lat"],
        "lon": row["lon"],
        "min": row["min"],
        "max": row["max"],
        "prcp": row["prcp"],
    })

So now we have the two datasets, we need a way to merge them. To do this we need to find the closest weather station to each user. Without getting into some hardcore maths, this is done by computing the great-circle (haversine) distance between two points and picking the station with the smallest distance. So we create a function for this:

def distance(lat1, lon1, lat2, lon2):
    # Haversine great-circle distance in kilometres.
    # p is pi/180 (degrees to radians); 12742 is the Earth's diameter in km.
    p = 0.017453292519943295
    a = 0.5 - cos((lat2-lat1)*p)/2 + cos(lat1*p)*cos(lat2*p) * (1-cos((lon2-lon1)*p)) / 2
    return 12742 * asin(sqrt(a))

def closest(data, v):
    # Return the row of `data` nearest to the point `v`. Note iloc, not loc:
    # argmin returns a position, and the index of `data` is no longer
    # contiguous after the drop_duplicates below.
    return data.iloc[np.argmin(data.apply(lambda x: distance(v['lat'], v['lon'], x.lat, x.lon), axis=1))]
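
As a quick sanity check, the distance from the Sydney CBD to the Melbourne CBD (coordinates approximate, just for illustration) comes out at roughly 713 km, in line with the published great-circle distance:

print(distance(-33.8688, 151.2093, -37.8136, 144.9631))  # ~713 km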

Next we get the location of each weather station, then loop through all the locations in the GA data set and assign the closest weather station:

weatherStations = pd.DataFrame(weatherData)[['lat', 'lon']].drop_duplicates()
for row in sessionData:
    v = {'lat': float(row['lat']), 'lon':float(row['lon'])}
    closestStation = closest(weatherStations, v)
    row["closest_lat"] = closestStation["lat"]
    row["closest_lon"] = closestStation["lon"]

Once the data is in the right format (the GA dates need converting from strings), we merge the two on date and location.

sData = pd.DataFrame(sessionData)
wData = pd.DataFrame(weatherData)
# Convert the GA date strings (YYYYMMDD) to date objects to match the weather data
sData.date = sData.apply(lambda x: datetime.strptime(x.date, '%Y%m%d').date(), axis=1)
merged = pd.merge(sData, wData, left_on=['date', "closest_lat", "closest_lon"], right_on=['date', "lat", "lon"])
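
Because this is an inner join, any GA row whose nearest station has no reading for that date is silently dropped, so it's worth comparing row counts before and after:

print(len(sData), len(merged))  # a large gap here means many rows had no matching weather reading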

Now we have to do a bit of clean-up. If the station didn’t record a temperature it is set to 9999.9, and missing rainfall to 99.99, so we null those out. Another fun point is that the temperature is stored in Fahrenheit, so we need to convert it to Celsius.

# Check each column against its own missing-data sentinel
merged.loc[merged['min'] > 999, 'min'] = None
merged.loc[merged['max'] > 999, 'max'] = None
merged.loc[merged['prcp'] > 99, 'prcp'] = None
merged['min'] = (merged['min'] - 32) * 5/9
merged['max'] = (merged['max'] - 32) * 5/9

As some basic analysis, we split it into Hot, Cold, Rainy and Average:

merged['weather'] = "Average"
merged['weather'][merged['min'] < 10] = "Cold"
merged['weather'][merged['max'] > 35] = "Hot"
merged['weather'][merged.prcp > 3] = "Rainy"
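
Before reading anything into the conversion rates, it's worth checking how many rows land in each bucket so a sparse category doesn't get over-interpreted:

merged['weather'].value_counts()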

And then we can see the conversion rate by weather:

agg = merged.groupby("weather")[["sessions", "transactions"]].sum()
agg['conversionRate'] = agg.transactions / agg.sessions
agg

Or we can get a bit more technical and look at the conversion rate across temperature and rainfall.

plt.style.use('seaborn-whitegrid')
for col in ['max', 'min', 'prcp']:
    plt.figure()  # one chart per variable rather than overlaying mixed units
    plt.plot(merged[col], merged.transactions / merged.sessions, 'o', color='black')
    plt.xlabel(col)
    plt.ylabel('conversion rate')
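
The per-row scatter is noisy, since every point is a single date and location. One way to smooth it (a sketch; the 5-degree bin width is an arbitrary choice) is to bin the maximum temperature and compute a pooled conversion rate per bin:

tempBins = pd.cut(merged['max'], bins=range(0, 50, 5))
binned = merged.groupby(tempBins)[["sessions", "transactions"]].sum()
(binned.transactions / binned.sessions).plot(kind='bar')
plt.ylabel('conversion rate')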

If you are a shop that sells different categories, it would be worth breaking the transactions down by category, to prove that gum boots sell better when it rains, sun dresses when it’s hot, and hopefully other things that aren’t so obvious.

If you have any questions or suggestions on how to make it better, please let me know.
