Pull Large Data Sets From A Rate-Limited Web API

Have you ever felt the frustration of wanting to pull a large set of data from a Web API that is record or rate limited?

Why pull so much data from a Web API?

Why would you want to pull a large amount of data out of a Web API? After all, Web APIs are meant to offer short, sweet transactions, right?

From a user's perspective, maybe I want my customer to hit a button here that interacts behind the scenes with a Web API there and shows the customer a result, all quick and seamless.

From a developer's or maintainer's perspective, I want my Web API to consume as few server resources as possible and complete in the shortest time possible. Many other Web API calls will be coming in concurrently, so each call needs to be handled quickly and successfully to prevent blocking. Consequently, I enforce limits on how often an individual can make requests from the web service and how much data can be retrieved in a single request. That's all industry-standard practice, so why would anyone complain about it?

The Data Analytics Audience

It's all about the audience. Who's using this Web API to retrieve data? To your typical Web API consumers, rate limits are probably not a huge deal. But the growing number of people who make up the data analytics audience have different data needs. What DATA do you have? OK, what else do you have? Brilliant, but is that all? What's that? Oh, it's your lunch? Never mind. No, wait, how many calories? That could be important some day. Do you have access to anyone else's data? PII, GDPR? OK, well, I guess we can skip that data if we have to. But everything else you have, I want it all! Now how can I get it, each and every last little bit of it?

Joking aside, analytics professionals and data integrators often have a requirement to gather lots of data, and often the only avenue available for retrieving that data is a Web API that wasn't really designed to be a large data pump. Therefore, there is a need for a process that can manage pulling large data sets from a rate-limited Web API that can only deliver a restricted amount of data in a single call.

The AppDynamics Analytics Events Web API

In our example, we will be using a Web API from AppDynamics. AppDynamics is a powerful platform for monitoring and managing application performance. From the data center to the cloud to mobile devices, AppDynamics helps you see how things are running, and powerful built-in analytics make Cisco's AppDynamics a solid choice for an APM tool.

Much of the data and APIs available in AppDynamics exist in the "Controller" area of the AppDynamics platform. The Analytics Events API, however, is part of the Events Service. It's from this API that we will pull Analytics Mobile Sessions data.

AppDynamics has several application and storage layers, and it implements different data retention periods depending on what data you're interested in and which storage layer that data resides in. Mobile End User Monitoring (EUM) data, which includes the mobile session data we're interested in, is stored in the Events Service. This is important to know because the retention period for EUM data is 8 days. That is much shorter than other areas, where you have access to data up to a year old. This short retention period may give you a good reason to save the data locally so you can analyze it over a longer period of time, right?

To handle the huge amount of session and event data streaming into the system, the AppDynamics Events Service is built upon the open-source, Java-based Elasticsearch search and analytics engine. This engine is distributed and scalable, and it implements features like data sharding and columnstore indexing for speedy analytics. The server farm for their SaaS offering is so massive that all events are sent through a Kafka message queue, which buffers and distributes event traffic before the data is inserted into the Events Service.

The Process

The overall process of pulling large data sets from a rate-limited Web API can be broken down into two sub-processes, illustrated in the diagram below.

Pull large data sets from a rate limited Web API Process

A summary of these sub-processes follows:

  1. Start with an initial date range (or whatever type of range the API supports). From that range, identify a set of ranges that are small enough to pull complete data sets from the rate-limited Web API.
  2. Get the actual detail data from the Web API for each of the date ranges and store it.


Sub-Process 1 – Ranges

Let's look at the breakout of sub-process 1, which identifies the set of acceptable ranges.

Identify Set of Acceptable Ranges for Web API

In this process, you feed a set of at least one “Range” into a loop. Those ranges will be checked for record counts from the Web API. If the details of the entire range can be retrieved at once without going over the limit of the Web API, the range is accepted. If not, the range will be split in half and each half will be sent through the loop again to be rechecked. This range splitting process continues and your set of ranges grows until you have a set where every single range is within the rate limits of the Web API.
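The splitting loop above can be sketched in a few lines of PowerShell. Note that CountRecords here is a hypothetical stand-in function so the sketch is self-contained; in the real script, that step is a count(*) query against the Web API.

```powershell
# A minimal sketch of the range-splitting loop, assuming a stand-in count function.
function CountRecords([long]$Start, [long]$End) {
    return $End - $Start   # pretend there is exactly one record per millisecond
}

$Limit  = 250
$Ranges = New-Object System.Collections.ArrayList
[void]$Ranges.Add([Tuple]::Create([long]0, [long]1000))

$i = 0
while ($i -lt $Ranges.Count) {
    $r = $Ranges[$i]
    if ((CountRecords $r.Item1 $r.Item2) -gt $Limit) {
        # Too many records: split the range in half and recheck both halves.
        $Mid = $r.Item1 + [long](($r.Item2 - $r.Item1) / 2)
        $Ranges[$i] = [Tuple]::Create($r.Item1, $Mid)
        [void]$Ranges.Add([Tuple]::Create($Mid, $r.Item2))
        # Leave $i where it is so the shrunken first half gets rechecked.
    }
    else {
        $i++   # range fits within the limit; accept it and move on
    }
}
# $Ranges now covers 0..1000 in pieces of at most 250 "records" each.
```

The full script later in this post follows the same shape, but tracks a record count in each tuple so accepted ranges are not rechecked.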


Sub-Process 2 – Details

Next we break out sub-process 2, which gets the detail records.

In this process, we step through the range set we identified earlier. If the range contains data, we pull the detail records from the Web API. Maybe a few transformations are then done on the data. And finally, the data is stored.


But Why Don't I Just Pull the Data More Frequently?

Some may ask why this process is necessary: "If I pull data every hour instead of every day, I'm not going to hit the Web API record limit." Well, maybe not today, but what about tomorrow, or next year? You don't want to be constantly missing data and reconfiguring your pull process to run at different intervals as your data volume increases. It is still a good idea, though, to plan your intervals so you're not always hammering the Web API all at once.

Another thing to consider is what happens when something breaks, because something is going to break eventually: a small network glitch, a server or database reboot, or maybe the Web API you hit goes down for a time. What do you do when something like this happens and your normal process is interrupted? It's important to design automated processes that are robust and fault-tolerant. You don't want to work feverishly to make sure your production system is caught up after a failure. Knowing that you designed your process so that the next time it runs it will do all of the catch-up it needs on its own, without manual intervention, is a real stress reliever.

The PowerShell Script

PowerShell is used in this example to demonstrate the code necessary to implement the process. It's readily available on most people's work computers and has everything we need. And by swapping a few function calls, this example should be straightforward to implement in the language of your choice.


param(
    # This script optionally accepts a StartTimeLong and an EndTimeLong input parameter.
    # If not input, we are defaulting the timeframe from one day ago to now.
    [parameter(Mandatory=$false)]
        [long]$StartTimeLong = [long](Get-Date -date $(Get-Date).addDays(-1) -UFormat %s) * 1000
    ,[parameter(Mandatory=$false)]
        [long]$EndTimeLong = [long](Get-Date -UFormat %s) * 1000
    ,[parameter(Mandatory=$false)]
        [string]$FileExportPath = "c:\your\export\filepath"
)

# Set some user configurable variables

# Set API Record Limit.
# Different APIs have different limits; this AppDynamics API has a limit of 10000 records.
$APIRecordLimit = 10000
# If your API restricts how often you can make requests,
# enter a time in milliseconds to ensure at least that much time passes between calls, e.g. 2000.
# Enter 0 if you do not require additional wait times between API calls.
$APISimpleDelayBetweenCalls = 0

# The AppDynamics API Requires an Account Name.
# For AppDynamics, Account Name may just be a word like "MyCompanyName",
#   but the API requires the global account name which is usually the name followed by a GUID.
$APIAccountName = "MyCompanyName_ab342821-bcd1-44cd-b111-fa522066b150"
# An API key must also be generated within AppDynamics that has the right permissions to access the data you will be querying.
$APIKey = "2a3e30ef-d3aa-4221-aaf3-0fe226aaafca"

################################################################################
# AppDynamicsGetMobileData.ps1
#
# USAGE:  
#   AppDynamicsGetMobileData.ps1 -StartTimeLong 1534773600000
#   AppDynamicsGetMobileData.ps1 -StartTimeLong 1534773600000 -EndTimeLong 1534797751750
#
# This script was made to facilitate the downloading of Analytics Mobile Session Data
# from the AppDynamics Analytics Events API.
#
# The API has a limit of 10000 records that can be downloaded at a time.
# This script overcomes that limit by first analyzing record counts over a period of time,
# splitting that period of time if it exceeds the max record count limit,
# reevaluating the record counts over the new time periods, and continuing the process until
# a set of time periods is captured whose record counts all fall within the threshold.
# Detail records are then requested for each time period,
# converted from the API's native JSON to a csv format, and saved to a file.
################################################################################

# This directive may be necessary for proper SSL Protocol handling of a web request
[Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12

# Capture the time that we are starting the run of this script
$ScriptRunTime = Get-Date -Format FileDateTime


################################################################################
# Given a date range, Identify a SET of date ranges that are small enough
# to pull complete data sets from the rate-limited Web API.
#

# Initialize an array list that will hold a set of date ranges.
$DateRangeList = New-Object System.Collections.ArrayList

# Set initial date range (StartDate, EndDate, RecordCount).
# Below are sample commands for using PowerShell to get a date in this Unix epoch time format.
# Note that normal Unix time is in seconds,
# but AppDynamics reports time in milliseconds, thus the multiplying by 1000.
#   [long](Get-Date -Date "8/13/2018 00:00:00" -UFormat %s) * 1000
#   [long](Get-Date -UFormat %s) * 1000
#   [int64]((Get-Date "8/21/2018 12:00:00").ToUniversalTime()-(Get-Date "1/1/1970").ToUniversalTime()).TotalMilliseconds
[void]$DateRangeList.Add([Tuple]::Create($StartTimeLong,$EndTimeLong, -1))

# Initialize outer DO loop
$DoneDefiningRanges = 0

# Repeat until we have a set of date ranges all with record counts less than the max API threshold of 10000.
do {

    # Assume we are done defining ranges unless we find a reason not to be.
    $DoneDefiningRanges = 1
    
    # Initialize inner loop count.
    $LoopCount = $DateRangeList.Count
    
    for($i=0; $i -lt $LoopCount; $i++) {

        if ($DateRangeList[$i].Item3 -eq -1) {
            # Enforce a delay to make sure we are not calling the API more often than we are allowed.
            Start-Sleep -Milliseconds $APISimpleDelayBetweenCalls
        
            # Get a record count from the API
            Try {
            $RecordCountInRange = Invoke-WebRequest -Uri "https://analytics.api.appdynamics.com/events/query" `
                -Method Post `
                -Headers @{"X-Events-API-AccountName"="$($APIAccountName)"; "X-Events-API-Key"="$($APIKey)"; "Content-type"="application/vnd.appd.events+text;v=2"} `
                -Body "select count(*) 
                    FROM mobile_session_record 
                    where startTimeMS > $($DateRangeList[$i].Item1) and startTimeMS <= $($DateRangeList[$i].Item2)  limit 100" `
                | ConvertFrom-Json
            }
            Catch {
                Write-Output "Error encountered when invoking web request:`r`n$($_.Exception.Message)"
                return
            }
            # Display result record count so we can see what's going on as the script runs.
            Write-Output $RecordCountInRange.results
            
            
            if ($RecordCountInRange.results -gt $APIRecordLimit) {
            
                # If record count is greater than max API limit, split the date range in half.
                # Each of the 2 new date ranges will be rechecked for record count the next iteration of the outer DO loop.
                $Halfway = $DateRangeList[$i].Item1 + [long](($DateRangeList[$i].Item2 - $DateRangeList[$i].Item1) / 2)
                [void]$DateRangeList.Add([Tuple]::Create($Halfway,$DateRangeList[$i].Item2, -1))
                $DateRangeList[$i] = [Tuple]::Create($DateRangeList[$i].Item1, $Halfway, -1)

                # We're not done defining ranges, so make sure the DO loop runs again.
                $DoneDefiningRanges = 0
            }
            else {
                # We have a valid date range to pull full details from.  Update record count in the date range tuple.
                $DateRangeList[$i] = [Tuple]::Create($DateRangeList[$i].Item1, $DateRangeList[$i].Item2, $RecordCountInRange.results)
            }
        }

    }

} while ($DoneDefiningRanges -eq 0)


################################################################################
# Get the actual Detail DATA from the Web API for each of the date ranges and STORE it.
#

# First sort the list so we process the ranges in order.
$DateRangeList.Sort()

$DateRangeList |
ForEach-Object {
    # Display the date range and record count being processed.
    Write-Output $_

    # Only process date ranges that have records
    if ($_.Item3 -ne 0) {
        # Enforce a delay to make sure we are not calling the API more often than we are allowed.
        Start-Sleep -Milliseconds $APISimpleDelayBetweenCalls
            
        # Get the detail records from the API.
        # Limit 10000 is required in this query because without it the query will default to a very small limit of 100.
        Try {
        $RecordDetails = Invoke-WebRequest -Uri "https://analytics.api.appdynamics.com/events/query" `
            -Method Post `
            -Headers @{"X-Events-API-AccountName"="$($APIAccountName)"; "X-Events-API-Key"="$($APIKey)"; "Content-type"="application/vnd.appd.events+text;v=2"} `
            -Body "select mobileappname, metrics.durationMS, devicemodel, devicemanufacturer
                , osversion, connectiontype, carrier, mobileappversion, startTimeMS
                , eventTimestamp, metrics.crashcount, platform 
                FROM mobile_session_record 
                where startTimeMS > $($_.Item1) and startTimeMS <= $($_.Item2) limit $($APIRecordLimit)" `
            | ConvertFrom-Json
        }
        Catch {
            Write-Output "Error encountered when invoking web request:`r`n$($_.Exception.Message)"
            return
        }   
        # Because the results are within a property as separate elements,
        # we need to join them together into a comma-delimited string to get export-csv to export them correctly.
        # Also enclosing each element in double quotes so commas within elements will not mess up csv output.
        $RecordDetails.results | ForEach-Object {
            "`"" + ($_ -join "`",`"") + "`"" `
            |  ConvertFrom-CSV -Header $RecordDetails.fields.field `
            | Export-CSV -path "$($FileExportPath)\AppDynamicsUXSession-$($ScriptRunTime).csv" -Append -NoTypeInformation
        }
    }


}


Notes on the Script

The primary focus of this article is to demonstrate how to overcome Web API record count limits. You will, however, notice that the script includes a variable called APISimpleDelayBetweenCalls. This demonstrates a simple way to protect against another type of Web API rate limiting where you are restricted on how often you can call the API.
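A simple fixed sleep works, but it wastes time when the API call itself took a while. A slightly smarter variant, sketched below, waits only for whatever time remains since the previous call. The $MinMsBetweenCalls variable and the Wait-ForRateLimit function are assumptions of this sketch, not part of the script above.

```powershell
# Throttle sketch: wait only for the time remaining since the last call.
# $MinMsBetweenCalls and Wait-ForRateLimit are hypothetical names for this sketch.
$MinMsBetweenCalls  = 2000
$script:LastApiCall = [datetime]::MinValue

function Wait-ForRateLimit {
    $ElapsedMs = ([datetime]::UtcNow - $script:LastApiCall).TotalMilliseconds
    if ($ElapsedMs -lt $MinMsBetweenCalls) {
        # Sleep off only the remainder of the minimum spacing between calls.
        Start-Sleep -Milliseconds ([int]($MinMsBetweenCalls - $ElapsedMs))
    }
    $script:LastApiCall = [datetime]::UtcNow
}

# You would call Wait-ForRateLimit immediately before each Invoke-WebRequest.
```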

The AppDynamics Web API returns data in JSON format. A simple transformation is done in this script to convert the JSON to CSV. The data is then saved to a timestamped file. This makes it simple for testing purposes to browse the data and make sure you’re getting what you expect. It may not be a requirement for your needs, however. Maybe you have a destination in mind that can accept the JSON directly?
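If you do have such a destination, skipping the CSV conversion is easy: write the response body to a timestamped file as-is. In this sketch, the literal JSON string stands in for the raw Invoke-WebRequest response content.

```powershell
# Sketch: save the raw JSON response instead of converting it to CSV.
# The literal string below is a stand-in for the raw API response body.
$RawJson = '{"fields":[{"field":"mobileappname"}],"results":[["MyApp"]]}'

$ScriptRunTime = Get-Date -Format FileDateTime
$OutFile = "AppDynamicsUXSession-$ScriptRunTime.json"
$RawJson | Out-File -FilePath $OutFile -Encoding utf8
```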


One Step of a Larger Process

You would likely use this process for pulling large data sets from a Web API as one step of a larger process, such as an ETL job. First, you'd query your own analytics database to get the last range of data you loaded. Then you'd feed that range into this process to retrieve all of the data since the process was last run. And finally, you would load the new data into your database.
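A sketch of that larger process might look like the following. Get-LastLoadedTimeMs is a hypothetical stand-in for a query against your own analytics database, and the hard-coded return value is only there to make the sketch self-contained.

```powershell
# Hypothetical ETL wrapper around the download script.
# Get-LastLoadedTimeMs stands in for a query against your analytics database.
function Get-LastLoadedTimeMs {
    return [long]1534773600000   # pretend this came from your last-load bookmark
}

$StartMs = Get-LastLoadedTimeMs
$EndMs   = [long][double](Get-Date -UFormat %s) * 1000

# Pull everything since the last load, then load the new files into the database.
if (Test-Path .\AppDynamicsGetMobileData.ps1) {
    .\AppDynamicsGetMobileData.ps1 -StartTimeLong $StartMs -EndTimeLong $EndMs
}
```

Because the range-splitting step sizes its own requests, this wrapper catches up automatically after any outage, however long.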


Final Thoughts

I hope you found the processes described in this post useful. Feel free to like, share, or comment on the material covered here. And remember, developers: don't leave behind a legacy of obscurity. Please include comments like those in the code above in your own projects.
