Calculating commute distance using SSIS and the GoogleMaps API
30 juni 2016 2021-10-13 9:02Calculating commute distance using SSIS and the GoogleMaps API
Calculating commute distance using SSIS and the GoogleMaps API
Last week, I had an interesting question from one of my clients about estimates of planned commute distances. For example: What is the distance I travel when I, living at Kalverstraat 7 (Amsterdam) commute five days a week to my customer in Paris (1, rue Victor Cousin)? By the way: it should be calculated for all employees, for all customer locations, so manually entering distances is not considered an option.
Two options came to my mind:
- Translate locations to geo-coordinates, then calculating distance as the crow flies
- Use a cloud-service to give a somewhat more accurate estimate
Back in 2005 - when cloud services weren't that widespread - I'd consider the first option. But look at the calendar - it's 2016! Cloud, baby!
Azure DataMarket
So, as a Microsoft-oriented BI guy, my first guess to look at was Azure Datamarket (now included in Azure Marketplace). Integrated with DQS, it would be an easy game to calculate travel distances, right?
Well, no. A few quick searches showed that there are some services available offering travel distances, but they're pretty limited - requiring coördinates as input (of course, customers always stores addresses as coördinates) or being limited to Great Britain. There's also the Bing Maps API, but I didn't find anything there to just calculate travel distances (I thinkgeocoding API's are the closest, maybe combined with the routes API. Still, neither gives me just the distance). Bollocks.
Google DistanceMatrix
But more cloud-companies exist out there. Google, for exapmle, has a pretty solid web service called "DistanceMatrix" calculating distances between geographical locations - anywhere on the world, in any input format. Eat that, Azure Data Market! Of course with Google there's no WSDL file, but the service is really simple to use (check out the docs). All you need to do:
- Write some test queries to discover capabilities
- You can hand in all search terms that works in Google Maps
- You can get results in JSON as well as XML (json is recommended by Google)
- Amsterdam-Paris in XML:
https://maps.googleapis.com/maps/api/distancematrix/xml
?destinations=1,%20rue%20Victor%20Cousin+Paris
&origins=Kalverstraat%207+Nederland - Paris-Amsterdam in JSON:
https://maps.googleapis.com/maps/api/distancematrix/json
?destinations=Kalverstraat%207+Nederland
&origins=1,%20rue%20Victor%20Cousin+Paris
- Amsterdam-Paris in XML:
- Sign up for an API key
- Take note of the usage limits and costs: https://developers.google.com/maps/documentation/distance-matrix/usage-limits
- Design your ETL-process
Designing an ETL process using an external API
Think well about this third step: you can save quite a few dollars annually by just keeping a local cache of travel distances, and only querying when the distance is unknown. My ETL process for this part consists of three global steps:
- Add unknown departure/destination pairs to the 'to be queried' table (PK of this table is start & end point address in less-structured format, ensuring uniqueness of commutes)
- Query Maps API for unknown travel distances. Add retrieved distances (or known unknowns) to the local cache table of travel distances
- Use the local cache of travel distance (as complete as it gets at this moment) as the primary lookup for travel distance
Currently, I've chosen to have the cache process explicitly modeled in my ETL structure. Another option would be to make a .Net UDF keeping a local (or even table) cache. This would simplify the looks of the process design by putting everything inside an opaque object. Keep in mind what happens if the ETL process breaks though: can your DWH/ETL administrators still troubleshoot?
To query the maps API, I use a Transformation Script Component inside SSIS consisting of the following code:
#region Namespaces using System; using System.Net; using System.IO; using System.Xml; #endregion [Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute] public class ScriptMain : UserComponent { private string URL; private string KEY; public override void PreExecute() { // Two input variables: // - Google Maps service URL (https://maps.googleapis.com/maps/api/distancematrix/xml) // - API key base.PreExecute(); URL = this.Variables.GMapsURL; KEY = this.Variables.GMapsKey; } public override void PostExecute() { base.PostExecute(); } //Row has two input fields: DestinationAddress and DepartureAddress //Both are pretty free-form: could be postal code, address, city, just whatever's available. Google takes care ;-) public override void Input0_ProcessInputRow(Input0Buffer Row) { // Construct call URL for Google Maps string callURL = this.URL + "?destinations=" + Uri.EscapeDataString(Row.DestinationAddress) + "&origins=" + Uri.EscapeDataString(Row.DepartureAddress) + "&key=" + this.KEY; WebRequest request = WebRequest.Create(callURL); IWebProxy proxy = request.Proxy; if(proxy != null) { string proxyuri = proxy.GetProxy(request.RequestUri).ToString(); request.UseDefaultCredentials = true; request.Proxy = new WebProxy(proxyuri, false); request.Proxy.Credentials = CredentialCache.DefaultCredentials; } using(WebResponse response = request.GetResponse()) { Stream dataStream = response.GetResponseStream(); using (StreamReader sreader = new StreamReader(dataStream)) { string xmlResponse = sreader.ReadToEnd(); using (XmlReader xreader = XmlReader.Create(new StringReader(xmlResponse))) { xreader.ReadToFollowing("status"); string status1 = xreader.ReadElementContentAsString(); // If the first 'Status' field != OK, probably the daily request limit has been reached (when using the free API) if (status1 == "OK") { xreader.ReadToFollowing("status"); string status2 = xreader.ReadElementContentAsString(); if (status2 == "OK") { xreader.ReadToFollowing("distance"); xreader.ReadToFollowing("value"); Row.ReisafstandM = xreader.ReadElementContentAsInt(); } else { // Address is probably unknown, send to error buffer ErrorOutputBuffer.AddRow(); ErrorOutputBuffer.ErronousURL = callURL; } } } } } } }
Conclusion
As you see, it's pretty easy to query the Google Maps API (you could make it even easier by creating a Custom Component in SSIS). On top of that, it makes one think: what if data integration would be always this easy? Think about it: what if you could just say: "Here are some characteristics of this customer. I don't know if it's 100% accurate and if all data is in the correct fields, but please figure out who I mean and show me where he is in my sources...".
That, my friend, would be a true Integration Service.