Downloading Large Datasets

World-Check On Demand is a relatively large data set: the full, uncompressed data occupies around 45 GB on disk. Retrieving all of it for the first time takes a while, and how long depends on the data retrieval approach.

Serial Data Retrieval

The simplest way to retrieve all or a large portion of the data is to make an initial request (with or without filters) and paginate until the last page of the multi-page response. Depending on the filters used, this approach may take a few hours to finish.
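
The pagination loop described above can be sketched as follows. This is a minimal illustration, not the official client: `fetch_page` is a hypothetical callable that you would implement to send one request and return the page's records together with the cursor and queryHash for the next page. The convention that a missing cursor marks the last page is also an assumption.

```python
from typing import Callable, Iterator, Optional, Tuple

# Hypothetical page shape: (records, next_cursor, query_hash).
# next_cursor is assumed to be None on the last page.
Page = Tuple[list, Optional[str], Optional[str]]


def iter_records(
    fetch_page: Callable[[Optional[str], Optional[str]], Page]
) -> Iterator[dict]:
    """Yield every record by following the cursor until the last page."""
    cursor, query_hash = None, None  # the initial request carries no cursor
    while True:
        records, cursor, query_hash = fetch_page(cursor, query_hash)
        yield from records
        if cursor is None:  # assumed end-of-data signal
            break
```

Because the function is a generator, records can be consumed one at a time without holding the whole data set in memory.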

Recommendations to improve efficiency when using this approach:

Ensure that your application has robust error handling: it should apply exponential backoff on intermittent errors (e.g., network timeouts) and continue where it left off
Extend the error handling by logging the next cursor, query hash, and other details that your application sent in the failed request, so that it can recover the download session from that point
Asynchronously process and store the retrieved records as the next page is requested
Use the cursor and queryHash from the response headers and request the next page before the response payload has been fully processed
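
One way to combine these recommendations is sketched below. It is an illustration under stated assumptions rather than a reference implementation: `fetch_page` and `process` are hypothetical callables that you supply, `fetch_page` is assumed to return `(records, next_cursor, query_hash)` with `next_cursor` set to None on the last page, and the checkpoint file layout is invented for this example.

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def fetch_with_backoff(fetch_page, cursor, query_hash,
                       max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch_page, retrying transient errors with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fetch_page(cursor, query_hash)
        except (TimeoutError, ConnectionError):
            if attempt == max_retries:
                raise  # give up; the checkpoint still allows a later resume
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


def download(fetch_page, process, checkpoint: Path, sleep=time.sleep):
    """Page through the data, fetching the next page while processing."""
    state = {"cursor": None, "queryHash": None}
    if checkpoint.exists():  # resume an interrupted session
        state = json.loads(checkpoint.read_text())
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fetch_with_backoff, fetch_page,
                             state["cursor"], state["queryHash"], sleep=sleep)
        while future is not None:
            records, cursor, query_hash = future.result()
            future = None
            if cursor is not None:
                # Request the next page before processing the current one.
                future = pool.submit(fetch_with_backoff, fetch_page,
                                     cursor, query_hash, sleep=sleep)
            process(records)
            if cursor is not None:
                # Save the resume point only after this page is processed.
                checkpoint.write_text(
                    json.dumps({"cursor": cursor, "queryHash": query_hash}))
    checkpoint.unlink(missing_ok=True)  # finished: nothing left to resume
```

If the retries are exhausted, the exception propagates and the checkpoint file still points at the first page that was not processed, so calling `download` again continues from there.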

Parallel Data Retrieval

An alternative approach to retrieving all World-Check records is to send multiple simultaneous requests, where each request sends a targeted list of UIDs using the uid filter.

For example, a 3-threaded parallel request will look like:

A. Starting state

thread #1 requests for UIDs 1-1000
thread #2 requests for UIDs 1001-2000
thread #3 requests for UIDs 2001-3000

B. After the first cycle of each thread

thread #1 adds 3000 (total threads × max UIDs per request) to its previous range and requests for UIDs 3001-4000
thread #2 requests for UIDs 4001-5000
thread #3 requests for UIDs 5001-6000
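
The range arithmetic in this example can be written as a small helper. This is a sketch that assumes UIDs are consecutive integers starting at 1, as the example implies; the function name and its 0-based `thread_index` and `cycle` parameters are illustrative choices.

```python
def uid_range(thread_index: int, cycle: int,
              num_threads: int, batch_size: int) -> range:
    """Return the UIDs a thread requests on a given cycle (both 0-based)."""
    stride = num_threads * batch_size  # 3 * 1000 = 3000 in the example
    start = 1 + thread_index * batch_size + cycle * stride
    return range(start, start + batch_size)


# Thread #1 on its second cycle, with 3 threads and 1000 UIDs per request.
second = uid_range(thread_index=0, cycle=1, num_threads=3, batch_size=1000)
print(second[0], second[-1])  # 3001 4000
```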

The uid filter does not return an error when a submitted UID is not present in World-Check On Demand; it simply omits that UID from the response.

You can design this approach such that each thread will stop after a specified number of empty responses, which suggests that there are no more records to retrieve.
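
A minimal sketch of this stopping rule is shown below. `fetch_uids` is a hypothetical callable standing in for one request that uses the uid filter and returns the records found. Counting consecutive empty responses, rather than empty responses in total, is a design assumption made here so that a single gap in the UID space does not end a thread early.

```python
from concurrent.futures import ThreadPoolExecutor


def worker(index, num_threads, batch_size, fetch_uids, max_empty=3):
    """Fetch this thread's UID batches until max_empty consecutive misses."""
    records = []
    empty_streak = 0
    start = 1 + index * batch_size      # first batch for this thread
    stride = num_threads * batch_size   # distance to its next batch
    while empty_streak < max_empty:
        batch = fetch_uids(list(range(start, start + batch_size)))
        if batch:
            records.extend(batch)
            empty_streak = 0            # data found: reset the counter
        else:
            empty_streak += 1
        start += stride
    return records


def download_parallel(fetch_uids, num_threads=3, batch_size=1000, max_empty=3):
    """Run one worker per thread and collect all retrieved records."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [
            pool.submit(worker, i, num_threads, batch_size, fetch_uids, max_empty)
            for i in range(num_threads)
        ]
        return [record for f in futures for record in f.result()]
```

The results are grouped by thread, not sorted by UID, so sort or index them afterwards if order matters. A larger `max_empty` tolerates wider gaps in the UID space at the cost of more wasted requests at the end.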

Working sample code that implements this approach is provided in the Downloads section.
