Conversion Win-Loss in a Global E-Com Platform

Increasing sell-through by re-tooling the stack

Houston Haynes

12 minute read


Conversion in e-commerce generally refers to a process where site visitors become paying customers. This portfolio entry discusses some of the analysis conducted for the one of the largest e-commerce sites in the world. I also refer to some of the mobile device simulations and related functions used to verify (and in some cases, dismiss) the assumptions on which the company’s win-loss analytics was based. And finally, I show some of the tooling used to verify partner data feeds and how that became a central component to a new distributed conversion marketing strategy.

Weblog Blind Spots

Our working group created an array of federated micro-services that would serve as an interface between mobile users and the legacy commerce servers. This was new to the company, and so our baseline of information stemmed from web-only user activity. While there are many new wrinkles that are introduced in the API-building process, many of the assumptions about the conversion rate(s) originated from the up-to-then web-only data. And as our working group later found out, that “model” ended up being a collection of very old data and grossly inaccurate generalizations. The company’s primary conversion marketing tactic was a timed cart, a function which served dual purposes in the legacy technology stack. Once the user had selected an item, they had a few minutes to complete the transaction or the cart would be discarded and the user would have to re-initiate the process. There were places in the cart workflow where time might be added to the cart, but there was an inflection point where the countdown was simply a means to release inventory back into general availability.

When we reviewed the web logs in order to get an idea of the conversion rate, we couldn’t see the difference between a cart that lapsed due to time-out, a cart that was intentionally discarded by the user, and an error from the server side that prevented the user from completing the cart transaction. All that was recorded in the web logs was the transaction ID for completed cart transactions. And that wasn’t the only place with a lack of data salience. When my team inquired about this, we were told that the logging was “too burdensome” so that level of detail was left out of the record. In essence, this left the task of recording that level of detail to the API and mobile developers – my team.

Data Mining

Along with the new partner API layer created by the team, a new set of user activity logs were generated based on API activity. This provided the ability to create a more accurate reconstruction of user behavior and analyze system performance. This was useful in general – but especially when analyzing reaction to “sale” events. These marketing pushes were both regional and national, and due to the regional deployment of production servers we could look at comparative statistics for conversions that occurred by geography as well as time period. This is something that could be accomplished in the legacy and web server stacks, but only by inference. Because of the nature of the API server deployments we gained a much more articulated view into user activity.

This is the initial view into cart creation established a useful baseline for a national sale event. What I had been told was to expect 60-70 cart creations per second in the first hour of a sale across the API tier. That was based on their own informatics, which I never had the chance to scrutinize directly. As can be seen here from the API data, the situation is much more nuanced.

From that point, the gap between what I had been told in meetings and what the data was telling me continued to widen. Not only was 60-70 cart sessions per second a vast understatement (picture the count shown above multiplied x16 servers across four regions), the distribution of when those cart creations occurred is pretty heavily skewed.

Along those lines, it wasn’t the first hour that was important, it was the first minute of the first hour. And similarly, it wasn’t just the first minute of the first hour, but the first 10-15 seconds of the first minute.

This plot is a different view for the same time period, a stacked bar chart to review spread of products selected. It’s interesting that promotional products are placed in a user cart in the first few seconds, based on a national promotion that started on the hour. Then that product seems to disappear. In fact it continues to sell through, but in small numbers that don’t appear in the bar stack as easily as the first few seconds.

By matching cart IDs to final status for each cart, I was able to map the attrition rate to the time grain where the cart was created – showing which instances would eventually fall to the side.

cartLossTableBefore <- fromJSON("../../data/Minute1_Attrition_Example1a.json")
dfCartLossBefore <-"cbind", cartLossTableBefore))
dfCartLossBefore$Timeout = as.numeric(dfCartLossBefore$Timeout)
dfCartLossBefore$UserDeleted = as.numeric(dfCartLossBefore$UserDeleted)
dfCartLossBefore$Error = as.numeric(dfCartLossBefore$Error)
dfCartLossBefore$Added = as.numeric(dfCartLossBefore$Added)
dfCartLossBefore$DateTime = paste(dfCartLossBefore$TimeStamp, "s", sep = "")
series = list(
    name = 'Added',
    data = dfCartLossBefore$Added,
  format = '{point.y} seconds'
    name = 'Error',
    data = dfCartLossBefore$Error
    name = 'UserDeleted',
    data = dfCartLossBefore$UserDeleted
    name = 'Timeout',
    data = dfCartLossBefore$Timeout
hcWinLossInitial <- highchart() |>
  hc_title(text = paste("Cart Win/Loss by Category Initial Findings")) |> 
      type = 'area',
      borderColor = 'rgba(160, 160, 160, 0.3)',
      borderRadius = 8,
      borderWidth = 2,
      marginBottom = '80',
      marginTop = '60',
      marginLeft = '60',
      marginRight = '60',
      style = list(fontFamily = 'Fira Code')) |>
  hc_yAxis(max = '80') |> 
  hc_xAxis(categories = dfCartLossBefore$DateTime) |> 
  hc_add_series_list(series) |>
    series = list(
      stacking = 'normal', marker = 'none')) |> 

frameWidget(hcWinLossInitial, width = "100%", height = "20rem)

Bear in mind that the cart could have been lost at any point during the countdown period – up to 5 minutes after the item was initially added to the cart. The visualization here is designed to show correlation to the time slot where the cart was created versus the loss rate.

From there I rolled up the summary wins/loss percentage for the period. A win rate of over 21% may seem like a non-controversial figure, the expectation for a period with an active campaign was expected to close at a 30% rate, even when factoring in baseline conversion of non-campaign items for the same period. With the added visibility into the types of attrition, the API team could make adjustments and monitor the results with a much more prismatic view of the data.

Cache Warming

The initial target was to reduce the volume of cart expirations (Timeout attrition) by shortening the time between creating the cart and the user being presented with conversion incentive options. To improve the win rate the API team approached the issue from two angles, the server side and the partner side. From that, two strategies were employed:

  1. Reduce the number of API calls into the inventory system, and
  2. Set up a partner feed that can cache up-sell and incentive items to reduce delivery time.

Because of the caching strategy in the partner API, the length of time for an initial request (even before the user added any item to the cart) was slower during the initial seconds of the sale event, as the cache had not been populated with items that could serve other requests. If a value had been cached on one server, that only benefitted that specific instance. So the value had to be queried out of the SOLR search index a total of 16 times (once from each application server) before the cached values would be served for those early seconds. And this had a cascade effect, where the SOLR index was “hit” many more than 16 times across the tier, as the spike in initial requests outpaced the ability for the server to recognize that a cached value was available. In essence, the cache was not able to “catch up” with the demand until the the request rate had significantly fallen.

So we created a cache “warming” method to request sale events before they were generally available to potential customers. I created a user that was provisioned to have access to “sale” event information before the item was officially active. I then programmatically called into each specific application cache server directly to request that information just before the official start time of the sale event.

I created this process in the soapUI framework, as I was managing infrastructure for the partner API group at the time – and test management was part of the purview. The advantage of using soapUI was that the process could be compiled as a jar and run from any server. So the network operations group deployed the jar to a server that had access to the cache servers and set it to run just before the hour.

First, I would call directly into the database that resides behind the SOLR tier and extract the upcoming items that will be listed in the next hour.
SELECT DISTINCT,, sales.upsell_id, sales.upsell_name
FROM sales JOIN item ON =
WHERE sales.publication_start_date_time >= getdate()
AND sales.publication_end_date_time <= Dateadd(getdate(), INTERVAL 1 HOUR)

This became the seed data that fed into the SOLR requests that were processed against each server. Below is a re-creation of the filter that’s used by the SOLR service to retrieve up-sell and conversion items. This is concatenated into the request that’s sent for each item from the soapUI jar. For all intents and purposes, these requests “look like” the same form of request that’s received directly from the API request layer, except for the time band I’m imposing on the request.

sch.solr.item.listing=q=&fq=-Merchandise:true&fq=(MajorGenreId:10001 OR MinorGenreId:203 OR MinorGenreId:51)&fq=-ItemType:3&fq=-ItemType:4&fq=-TCode1:APE
sch.solr.item.filter=fq=VisibilityDate:[NOW TO NOW+1HOUR]&fq=-ItemType:6&fq=-ItemType:15

From that, the cache is populated with the values from the SOLR index just before the sale event occurs. With a cache refresh of 2 minutes this gives sufficient “shelf life” to the values such that the early spike of requests can be directly served by the cache. While it’s of secondary importance that the tool verifies the a response, deep validation is not necessary. The function below is simply checking to see that a valid cart item was returned.

def groovyUtils = new context );
def threadId = Thread.currentThread().getId() as String
def holder = groovyUtils.getXmlHolder( "AddItemsNewCart#ResponseAsXml" );
def itemsExist = holder["exists(//response/cart)"];
def request = testRunner.testCase.getTestStepByName( "Properties" );
def itemCount = request.getProperty( "itemCount" );
def standingItemPollingCount = request.getProperty( "itemPollingCount" );
def itemCountMax = request.getProperty( "totalSpotlightItems" );
if (Integer.parseInt(itemCount.value) > Integer.parseInt(itemCountMax.value))
 {"Add item loop reached max count. Check failed.")
    if (itemsExist == 'true')
        def itemPrice = holder.getNodeValue("//response/cart/totals/grand")
        groovyUtils.setPropertyValue( "Properties", "itemPrice", itemPrice )"[${threadId}] Item price ${itemPrice}. Going to GetCart1.")
        testRunner.gotoStepByName( "GetCart1" )
        def itemPollingURL = holder.getNodeValue("//response/polling_url")
        groovyUtils.setPropertyValue( "Properties", "itemPollingURL", itemPollingURL )
        def newItemPollingCount = Integer.parseInt(standingItemPollingCount.value) +1 as String;
        groovyUtils.setPropertyValue( "Properties", "itemPollingCount", newItemPollingCount );"[${threadId}] AddItem polled [count = ${newItemPollingCount}]. Cycling to HTTP polling request.")
        testRunner.gotoStepByName( "delayForItemPollingInterval" )

Partner Feeds

Following on the success of the cache warming process, executives worked with some of the larger partners (Apple, Google, etc) to deliver a feed of upcoming sale items – and their related conversion and up-sell options – in a file feed that could be retrieved shortly before a sale event. This meant creating an API service for retrieving that feed, which was very similar to the process that I had created for warming the cache.

In this case I wasn’t making a request from the database, but rather simply requesting the feed file through the web method that was implemented. Once that was complete, I simply iterated through the file and validated each item using the SOLR index as a baseline.

if (lastResultStatus == 'FAILED') {
 // echo the top-level information on the ItemID in error"Row [${NodePosition}] - FAIL - " + " " + "[${ItemID}]" + " " + "[${ItemMajorGenreName}]" + " " + "[${ItemMajorGenreID}]" + "[${ItemMinorGenreName}]" + " " + "[${ItemMinorGenreID}]")
 // collect the assertions that failed into the SolrError array (with an escape/break if there's no document in the SOLR response)
 def counter = SOLRstep.getAssertionList().size() as Integer
 for (count in 0.. < counter) {
  error = SOLRstep.getAssertionAt(count).getErrors()
  if (error != null) {
   SolrError.add("Assertion: " + SOLRstep.getAssertionAt(count).getName() + " :: " +
  // go ahead and escape any further error collection if no SOLR document is present (i.e. if "XQuery SOLR Doc Exists" FAILS)
  if (SolrError[0].contains("SOLR")) {
   // escape the for loop - no SOLR doc means that none of the other validations matter.
 // set the array of failed assertions to the external Properties step, so that the DataSink can pick up the full string
 groovyUtils.setPropertyValue("Properties", "FailureReason", "${SolrError}")
 //echo the SolrError array to the console for run-time monitoring"${SolrError}")
 // set the result of the GetSOLRforFeedItemID as context for this script, so that the DataSink step can grab it from here.
 // This is in case some logic is set here to 'flip' the status within the Groovy script, though none exists here yet.
 return lastResultStatus
} else {
 // This is the core positive case - all assertions validate"Row [${NodePosition}] - PASS - " + " " + "[${ItemID}]" + " " + "[${ItemMajorGenreName}]" + " " + "[${ItemMajorGenreID}]" + "[${ItemMinorGenreName}]" + " " + "[${ItemMinorGenreID}]")
 // set a string value for FailureReason so the DataSink doesn't error out (if it tries to read a null value)
 groovyUtils.setPropertyValue("Properties", "FailureReason", "N/A")
 // again, set the result status of the GetSOLRforFeedItemID test step (which should be OK) when it falls through to this level
 return lastResultStatus

Once the feed file was validated, a status byte was set that allowed the file to be pulled by a partner with a valid cert. This ensured that partners always received a complete and correct feed file for caching item information (including URLs to graphics, conversion marketing tie-ins, etc) on their side. Like the cache warming process, this validation process was deployed as a jar with the service, and was executed on each server just before each hour.

Closing the Gap

After the cache warming and partner feed changes (along with a few other API tweaks, primarily reduction of the number of requests made to gather up-sell and conversion marketing products) there were noticeable improvements in sell-through and user retention. These time charts show that overall cart creations increased for similar sales periods, and also that cart loss due to timeout had been reduced.

Early deployments of the API focused on delivery to in-house mobile applications, we later supported Apple, RIM, Facebook and other partners that would increase user activity by 40% overall and cart creation activity by 15%. As one might expect, executives were most encouraged by the conversion rate. While it still was not in the targeted 30%+ range, the findings were trending in the right direction.

The second chart is derived some a server in a different time zone for the same day period (7AM West Coast/10AM East Coast) on a later date. And while the overall “shape” of the Cart Loss charts looks similar, once those are aggregated and reviewed against the cart transactions that were completed – the fuller picture emerges.

Through a deeper understanding of the legacy commerce systems the API team found optimizations that could be brought to bear which yielded sizable system and financial performance improvements. When these issues were originally brought to senior management, the initial response was to simply increase the horizontal scale of the server infrastructure. While in most companies that would not be the measure of first resort, the long-standing technology groups at the company had always brought this option to the table as a reflex.

In that legacy context this is not a surprise, as some of the systems at that company date back to the late 1970s. Horizontal scaling was (and in this case – is) the only option for technology of that generation. In this case we found new and useful ways to “pierce the veil” and work more efficiently in and around the existing solution stack. From that point we continued to refine the API layer – and likewise – measured changes in system and user behavior in a way that provided new insight to IT and marketing executives.

Key Value
BuildDateTime 2021-11-07 14:44:45 -0500
LastGitUpdate 2021-11-07 14:13:35 -0500
GitHash 765a5f1
CommitComment capitalization cleanup