This tutorial is the fourth in a series on Tick History data inside Google BigQuery.
The preceding tutorial explained how to perform analytics on data.
For your convenience, a set of sample queries is available in a downloadable file.
As we saw in the previous tutorial, our code might benefit from some tuning.
In the last example the passing of parameters was a bit awkward.
But optimization can also be applied to increase performance and reduce costs, as was hinted at when we saw the quote and trade analytics.
If you studied the code closely, you might have noticed that the Quote Statistics query was not written in an optimal manner: the same maximum, minimum and average values were calculated in different places, thus duplicating some of the effort. Writing our queries in a different way can reduce the server load by avoiding such inefficiencies, and thus deliver results even faster.
As an example, here is a second version of the quote analytics, which avoids calculation duplication:
#StandardSQL
WITH
RAWSTATS AS(
SELECT
RIC,
COUNT(RIC) AS QuotesCount,
MIN(Bid_Price) AS MinBid,
MAX(Bid_Price) AS MaxBid,
MIN(Ask_Price) AS MinAsk,
MAX(Ask_Price) AS MaxAsk,
AVG(Bid_Price) AS AvgBid,
AVG(Ask_Price) AS AvgAsk,
AVG(Ask_Price-Bid_Price) AS AverageSpread,
AVG(SAFE_DIVIDE(100*(Ask_Price-Bid_Price),(Ask_Price+Bid_Price)/2)) AS AverageSpreadPercent
FROM
`dbd-sdlc-prod.LSE_NORMALISED.LSE_NORMALISED`
WHERE
RIC LIKE "%.L%"
AND (Date_Time BETWEEN TIMESTAMP('2019-09-04 00:00:00.000000') AND
TIMESTAMP('2019-09-04 23:59:59.999999'))
AND Type="Quote"
AND ((Bid_Price IS NOT NULL AND Bid_Size IS NOT NULL) OR
(Ask_Price IS NOT NULL AND Ask_Size IS NOT NULL))
AND NOT (Qualifiers LIKE 'M[IMB_SIDE]')
AND NOT (Qualifiers LIKE 'M[ASK_TONE]')
AND NOT (Qualifiers LIKE 'M[BID_TONE]')
AND NOT (Qualifiers LIKE '[BID_TONE];[ASK_TONE]')
AND NOT (Qualifiers LIKE 'A[BID_TONE];A[ASK_TONE]')
GROUP BY
RIC
)
SELECT
RIC,
QuotesCount,
MinBid,
MaxBid,
MinAsk,
MaxAsk,
ROUND(SAFE_DIVIDE(100*(MaxBid-MinBid),AvgBid),3) AS DeltaBidPercent,
ROUND(SAFE_DIVIDE(100*(MaxAsk-MinAsk),AvgAsk),3) AS DeltaAskPercent,
ROUND(AvgBid,3) AS AvgBid,
ROUND(AvgAsk,3) AS AvgAsk,
ROUND(AverageSpread,5) AS AverageSpread,
ROUND(AverageSpreadPercent,5) AS AverageSpreadPercent
FROM
RAWSTATS
ORDER BY
QuotesCount DESC
Even though this second version is longer than the previous one, it runs approximately 8% faster.
The trade analytics sample we saw previously has the same flaw. After optimisation it also benefits from a similar performance increase.
This introduced us to the topic of query optimization.
There are some general rules you can apply to ensure your queries run in an efficient manner, deliver results quickly, and cost as little as possible:
1. Minimize the amount of data you query and analyse
Note: when working in the console, if the query syntax is correct, the size of the data set that will be processed is displayed beneath the editor window:
2. Avoid repeatedly transforming data via your SQL queries
This was illustrated by the two versions of the query we just saw.
3. Order operations to maximize performance
4. Use a LIMIT clause with large sorts
Following these 4 simple rules will help you optimize your queries. For more detailed information and tips, refer to the following Google documentation:
With a UDF (User Defined Function) you can create a function using an SQL expression or JavaScript code, which can even be extended by importing external libraries.
UDFs can either be persistent or temporary. Persistent UDFs can be reused across multiple queries, whereas temporary UDFs can only be used in a single query.
What is the impact of a UDF ?
Let us consider a simple mean quote price calculation. This can be calculated directly in a SQL query:
ROUND((Bid_price+Ask_Price)/2,4) AS Mid_Price,
This can also be implemented in a UDF. Here is the same calculation in a UDF that uses SQL code:
CREATE TEMPORARY FUNCTION
Mid_Price(bid FLOAT64, ask FLOAT64)
RETURNS FLOAT64
AS (ROUND((bid+ask)/2,4));
Finally, here is the same calculation in a UDF that uses JavaScript code:
CREATE TEMPORARY FUNCTION
Mid_Price(bid FLOAT64, ask FLOAT64)
RETURNS FLOAT64
LANGUAGE js AS "return Math.round(((bid+ask)/2)*10000)/10000;";
How do these 3 versions compare from a pure performance point of view ?
This shows that creating a UDF does not have a performance impact if coded in SQL, but using JavaScript code will slow it down significantly. It is therefore best to limit JavaScript to cases that cannot be implemented using SQL.
For more information on UDFs, their implementation, best practices, and limits, refer to the Google BigQuery UDF documentation and best practices.
Using the techniques we have just seen, you can build analytics based on quotes, trades or market depth data, summarise data in bars, calculate benchmarks, compare execution prices, find interesting instruments by volume or price change, and much more.
The downloadable sample set gives you all the SQL query examples we have seen, including some optimised versions:
Tick examples:
Quote examples:
Trade examples:
Market depth examples:
As we have seen, it is very easy to access LSEG Tick History data and analyse Gigabytes of data it in a matter of seconds using Google BigQuery. As the requests rely entirely on SQL, you don’t need any other skills to create powerful analytics in a very short time.
If you would like to dig further, here is a summary of all the reference links mentioned in this tutorial set: