Downloading Large Datasets
World-Check On Demand is a relatively large dataset: the entire uncompressed data takes around 45 GB on disk. Retrieving all of it for the first time takes a while, and how long depends on the data retrieval approach.
Serial Data Retrieval
The simplest way to retrieve all or a large portion of the data is to make an initial request (with or without filters) and paginate until the last page of the multi-page response. Depending on the filters used, this approach may take a few hours to finish.
Recommendations to improve efficiency when using this approach:
- Ensure that your application has robust error handling: on intermittent errors (e.g., network timeouts), it should retry with exponential backoff and continue where it left off
- You can extend the error handling by logging the next cursor, query hash, and other related details that your application sent in the failed request, then recover the download session from there
- Asynchronously process and store the retrieved records as the next page is requested
- Use the cursor and queryHash from the response headers and request the next page before the response payload has been fully processed
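The recommendations above can be sketched roughly as follows. This is a minimal illustration, not the official client: `fetch_page`, `process`, and `save_checkpoint` are hypothetical callables that you would supply, and `fetch_page(cursor, query_hash)` is assumed to return the page's records together with the next cursor and query hash read from the response headers. It is also assumed that a missing next cursor marks the last page.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_with_backoff(fetch_page, cursor, query_hash,
                       max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch_page, retrying transient errors with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fetch_page(cursor, query_hash)
        except (ConnectionError, TimeoutError):
            if attempt == max_retries:
                raise  # give up; the last checkpoint tells us where to resume
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


def download_all(fetch_page, process, save_checkpoint,
                 cursor=None, query_hash=None, **retry_options):
    """Page through the results, storing each page while the next is fetched."""
    pages = 0
    pending = None
    with ThreadPoolExecutor(max_workers=1) as pool:
        while True:
            # Fetch the current page while the previous one is still being processed.
            records, next_cursor, next_hash = fetch_with_backoff(
                fetch_page, cursor, query_hash, **retry_options)
            if pending is not None:
                pending.result()  # previous page is stored; surface its errors
            # Everything before this page is stored, so resuming from here loses nothing.
            save_checkpoint(cursor, query_hash)
            pending = pool.submit(process, records)
            pages += 1
            if next_cursor is None:  # assumption: no next cursor means last page
                break
            cursor, query_hash = next_cursor, next_hash
        pending.result()
    return pages
```

To resume an interrupted session, pass the last saved cursor and query hash back into `download_all`. The checkpointed page is requested again on resume, so `process` should tolerate seeing the same records twice.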
Parallel Data Retrieval
An alternative approach for retrieving all World-Check records is to send multiple simultaneous requests, where each request sends a targeted list of UIDs using the uid filter.
For example, a 3-threaded parallel request will look like:
A. Starting state
- thread #1 requests for UIDs 1-1000
- thread #2 requests for UIDs 1001-2000
- thread #3 requests for UIDs 2001-3000
B. After the first cycle of each thread
- thread #1 adds 3000 (total threads × max UIDs per request) to its previous range and requests for UIDs 3001-4000
- thread #2 requests for UIDs 4001-5000
- thread #3 requests for UIDs 5001-6000
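The range each thread requests in a given cycle follows directly from the thread count and the batch size. The helper below is an illustrative sketch; the function name and the zero-based `thread_index` and `cycle` parameters are assumptions made for this example.

```python
def uid_range(thread_index, cycle, total_threads=3, batch_size=1000):
    """Return the inclusive (first, last) UID range for a thread in a cycle.

    thread_index and cycle are zero-based: thread #1 in the starting state
    is thread_index=0, cycle=0.
    """
    # Each cycle advances every thread by total_threads * batch_size UIDs.
    first = (cycle * total_threads + thread_index) * batch_size + 1
    return first, first + batch_size - 1


for cycle in range(2):
    for thread_index in range(3):
        print(f"cycle {cycle}, thread #{thread_index + 1}:",
              uid_range(thread_index, cycle))
```

Because the ranges never overlap, the threads need no coordination beyond knowing their own index.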
The uid filter does not return an error when a submitted UID value is not present in World-Check On Demand; it simply omits that UID from the response.
You can design this approach such that each thread will stop after a specified number of empty responses, which suggests that there are no more records to retrieve.
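One possible shape for such a stop condition is sketched below. `fetch_uids` is a hypothetical callable standing in for a request that uses the uid filter and returns only the records that exist. Counting consecutive empty responses (rather than total) and the default threshold of 3 are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def worker(thread_index, total_threads, batch_size, fetch_uids, max_empty=3):
    """Walk this thread's UID ranges until max_empty consecutive empty responses."""
    collected, empty_streak = [], 0
    start = thread_index * batch_size + 1
    while empty_streak < max_empty:
        uids = list(range(start, start + batch_size))
        records = fetch_uids(uids)  # hypothetical request using the uid filter
        if records:
            collected.extend(records)
            empty_streak = 0  # a non-empty response resets the counter
        else:
            empty_streak += 1
        start += total_threads * batch_size  # jump to this thread's next range
    return collected


def download_parallel(fetch_uids, total_threads=3, batch_size=1000, max_empty=3):
    """Run one worker per thread and merge their results."""
    with ThreadPoolExecutor(max_workers=total_threads) as pool:
        futures = [
            pool.submit(worker, i, total_threads, batch_size, fetch_uids, max_empty)
            for i in range(total_threads)
        ]
        return [record for future in futures for record in future.result()]
```

A threshold that is too low risks stopping early if the UID space has large gaps, so it is worth choosing `max_empty` conservatively.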
Working sample code that implements this approach is provided in the Downloads section.