This tutorial is the third in a series on Tick History data inside Google BigQuery.
The preceding tutorial explained how to retrieve tick data, and filter it, using SQL.
This tutorial introduces actual analytics on the data.
The following tutorial will go further, giving you tips on how to optimize queries.
For your convenience, a set of sample queries is available in a downloadable file.
There are many use cases for analytics based on tick history data, that cover typical needs of the front and middle office; for examples, refer to the Tick History in Google BigQuery article.
As a first example, let us analyse quotes. We shall calculate several statistics. Among these, the number of quotes is an indication of liquidity, whereas large swings in bid and ask prices or large percentage spread draw attention to instruments that could deliver interesting trading opportunities.
We perform the analysis using SQL, which has a set of built-in functions to count items, find minimum and maximum values, perform calculations, compare strings, and much more. We use these to enhance our query results. If you are a seasoned SQL programmer, you can also create your own User Defined Functions (UDF).
The following query, based on quotes, analyses all available instruments on the London Stock Exchange. For each one it calculates a set of statistics. It also filters out irrelevant quotes based on qualifiers.
We sort the results by the number of quotes, to focus on finding liquidity.
#StandardSQL
SELECT
RIC,
COUNT(RIC) AS QuotesCount,
MIN(Bid_Price) AS MinBid,
MAX(Bid_Price) AS MaxBid,
MIN(Ask_Price) AS MinAsk,
MAX(Ask_Price) AS MaxAsk,
ROUND(SAFE_DIVIDE(100*(MAX(Bid_Price)-MIN(Bid_Price)),AVG(Bid_Price)),3) AS DeltaBidPercent,
ROUND(SAFE_DIVIDE(100*(MAX(Ask_Price)-MIN(Ask_Price)),AVG(Ask_Price)),3) AS DeltaAskPercent,
ROUND(AVG(Bid_Price),3) AS AvgBid,
ROUND(AVG(Ask_Price),3) AS AvgAsk,
ROUND(AVG(Ask_Price-Bid_Price),5) AS AverageSpread,
ROUND(AVG(SAFE_DIVIDE(100*(Ask_Price-Bid_Price),(Ask_Price+Bid_Price)/2)),5) AS AverageSpreadPercent
FROM
`dbd-sdlc-prod.LSE_NORMALISED.LSE_NORMALISED`
WHERE
RIC LIKE "%.L%"
AND (Date_Time BETWEEN TIMESTAMP('2019-09-04 00:00:00.000000') AND
TIMESTAMP('2019-09-04 23:59:59.999999'))
AND Type="Quote"
AND ((Bid_Price IS NOT NULL AND Bid_Size IS NOT NULL) OR
(Ask_Price IS NOT NULL AND Ask_Size IS NOT NULL))
AND NOT (Qualifiers LIKE 'M[IMB_SIDE]')
AND NOT (Qualifiers LIKE 'M[ASK_TONE]')
AND NOT (Qualifiers LIKE 'M[BID_TONE]')
AND NOT (Qualifiers LIKE '[BID_TONE];[ASK_TONE]')
AND NOT (Qualifiers LIKE 'A[BID_TONE];A[ASK_TONE]')
GROUP BY
RIC
ORDER BY
QuotesCount DESC
Here is the result:
Note: NORMALISED data is level 1 data, the bid and ask prices are always the best bid and ask. For a more detailed view of the offers, we could have used NORMALISEDLL2 (market depth) data, which delivers a view of the top of the order book.
As you can see, this is quite powerful: all ticks for the full universe of LSE instruments for an entire day (12GB of data) were analysed in approximately 3 seconds to deliver 11 different calculated values for each of the 5600 instruments !
Let us now see another example, based on trades, again for all instruments on the LSE. For each one it calculates the number of trades, the total traded volume, the minimum and maximum price, the price variation, and a few trade benchmarks such as VWAP, average price and volume. It also filters out irrelevant trades based on qualifiers.
We sort the results by the total volume, to focus on high trading volume instruments:
#standardSQL
SELECT
RIC,
ROUND(SAFE_DIVIDE(SUM(Volume*Price),SUM(Volume)),3) AS VWAP,
COUNT(RIC) AS TradesCount,
SUM(Volume) AS TotalVolume,
ROUND(AVG(Volume),0) AS AvgTradeVolume,
MIN(Price) AS MinPrice,
MAX(Price) AS MaxPrice,
ROUND(SAFE_DIVIDE(100*(MAX(Price)-MIN(Price)),AVG(Price)),3) AS DeltaPricePercent,
ROUND(AVG(Price),3) AS AvgPrice
FROM
`dbd-sdlc-prod.LSE_NORMALISED.LSE_NORMALISED`
WHERE
RIC LIKE "%.L%"
AND (Date_Time BETWEEN TIMESTAMP('2019-09-04 00:00:00.000000') AND
TIMESTAMP('2019-09-04 23:59:59.999999'))
AND Type="Trade"
AND Volume >0
AND Price >0
AND NOT (Qualifiers LIKE 'Off Book Trades%')
AND NOT (Qualifiers LIKE '%Previous Day Trade%')
AND NOT (Qualifiers LIKE '%CLS%')
AND NOT (Qualifiers LIKE 'U[ACT_FLAG1];U[CONDCODE_1]%')
AND NOT (Qualifiers LIKE '46---A---P----[MMT_CLASS]')
AND NOT (Qualifiers LIKE '46-1-A---P----[MMT_CLASS]')
AND NOT (Qualifiers LIKE '47---A---J----[MMT_CLASS]')
AND NOT (Qualifiers LIKE '47---A---P----[MMT_CLASS]')
AND NOT (Qualifiers LIKE '%Auction%')
GROUP BY
RIC
ORDER BY
TotalVolume DESC
Here is the result:
If you closely examined the quote and trade analytics code we have just seen, you might have noticed that the code was not written in an optimal manner: the maximum and minimum values were calculated in two different places, thus duplicating some of the effort. The next tutorial will come back to this topic, with a more efficient solution.
Let us see a final example, based on market depth data. We shall calculate a set of barycenter values. A barycenter is a mean weighted mean value. Here is the formula:
And here is the code:
#standardSQL
WITH
RAWSTATS AS(
SELECT
RIC,
COUNT(RIC) AS QuotesCount,
SUM((L1_BidPrice * L1_BidSize) + (L2_BidPrice * L2_BidSize) + (L3_BidPrice * L3_BidSize) + (L4_BidPrice * L4_BidSize) + (L5_BidPrice * L5_BidSize)) AS bidPriceVol,
SUM((L1_AskPrice * L1_AskSize) + (L2_AskPrice * L2_AskSize) + (L3_AskPrice * L3_AskSize) + (L4_AskPrice * L4_AskSize) + (L5_AskPrice * L5_AskSize)) AS askPriceVol,
SUM(L1_BidSize + L2_BidSize + L3_BidSize + L4_BidSize + L5_BidSize) AS bidVol,
SUM(L1_AskSize + L2_AskSize + L3_AskSize + L4_AskSize + L5_AskSize) as askVol
FROM
`dbd-sdlc-prod.LSE_NORMALISEDLL2.LSE_NORMALISEDLL2`
WHERE
(RIC LIKE "VOD.L" OR RIC LIKE "ZINC.L")
AND (Date_Time BETWEEN TIMESTAMP('2019-09-04 10:00:00.000000') AND
TIMESTAMP('2019-09-04 10:00:04.999999'))
AND (
((L1_BidPrice IS NOT NULL AND L1_BidSize IS NOT NULL) OR
(L1_AskPrice IS NOT NULL AND L1_AskSize IS NOT NULL))
OR ((L2_BidPrice IS NOT NULL AND L2_BidSize IS NOT NULL) OR
(L2_AskPrice IS NOT NULL AND L2_AskSize IS NOT NULL))
OR ((L3_BidPrice IS NOT NULL AND L3_BidSize IS NOT NULL) OR
(L3_AskPrice IS NOT NULL AND L3_AskSize IS NOT NULL))
OR ((L4_BidPrice IS NOT NULL AND L4_BidSize IS NOT NULL) OR
(L4_AskPrice IS NOT NULL AND L4_AskSize IS NOT NULL))
OR ((L5_BidPrice IS NOT NULL AND L5_BidSize IS NOT NULL) OR
(L5_AskPrice IS NOT NULL AND L5_AskSize IS NOT NULL))
)
GROUP BY
RIC
)
SELECT
RIC,
QuotesCount,
ROUND(SAFE_DIVIDE(askPriceVol,askVol),3) AS AskPriceBaryCenter,
ROUND(SAFE_DIVIDE(bidPriceVol,bidVol),3) AS BidPriceBaryCenter,
ROUND(SAFE_DIVIDE(askPriceVol+bidPriceVol,askVol+bidVol),3) AS BaryCenter
FROM
RAWSTATS
ORDER BY
RIC
LIMIT
100000
Here is the result:
As a final example, let us take a trade we executed on the 4th of September 2019 at 14:10:30, at a price of 157.4, and compare it to the market in the same minute, to see how well we performed.
Here is the code, inside which we must input our trade price:
#standardSQL
# Compare your trade execution with the market.
# Inputs: line 48 (your trade price), 18 & 43 (instrument), 19 & 44 (date time range).
WITH
RAWSTATS AS(
SELECT
SAFE_DIVIDE(SUM(Volume*Price),SUM(Volume)) AS VWAP,
COUNT(RIC) AS TradesCount,
SUM(Volume) AS TotalVolume,
AVG(Volume) AS AvgVolume,
MIN(Price) AS MinPrice,
MAX(Price) AS MaxPrice,
AVG(Price) AS AvgPrice
FROM
`dbd-sdlc-prod.LSE_NORMALISED.LSE_NORMALISED`
WHERE
RIC LIKE "VOD.L"
AND (Date_Time BETWEEN TIMESTAMP('2019-09-04 14:10:00.000000') AND
TIMESTAMP('2019-09-04 14:11:00.000000'))
AND Type="Trade"
AND Volume >0
AND Price >0
AND NOT (Qualifiers LIKE 'Off Book Trades%')
AND NOT (Qualifiers LIKE '%Previous Day Trade%')
AND NOT (Qualifiers LIKE '%CLS%')
AND NOT (Qualifiers LIKE 'U[ACT_FLAG1];U[CONDCODE_1]%')
AND NOT (Qualifiers LIKE '46---A---P----[MMT_CLASS]')
AND NOT (Qualifiers LIKE '46-1-A---P----[MMT_CLASS]')
AND NOT (Qualifiers LIKE '47---A---J----[MMT_CLASS]')
AND NOT (Qualifiers LIKE '47---A---P----[MMT_CLASS]')
AND NOT (Qualifiers LIKE '%Auction%')
),
BETTERPRICE AS(
SELECT
COUNT(RIC) AS HigherPricedTradesCount,
SUM(Volume) AS TotalHigherPricedVolume,
SAFE_DIVIDE(SUM(Volume*Price),SUM(Volume)) AS HigherPricedVWAP
FROM
`LSE_NORMALISED.LSE_NORMALISED`
WHERE
RIC LIKE "VOD.L"
AND (Date_Time BETWEEN TIMESTAMP('2019-09-04 14:10:00.000000') AND
TIMESTAMP('2019-09-04 14:11:00.000000'))
AND Type="Trade"
AND Volume >0
AND Price>157.4
AND NOT (Qualifiers LIKE 'Off Book Trades%')
AND NOT (Qualifiers LIKE '%Previous Day Trade%')
AND NOT (Qualifiers LIKE '%CLS%')
AND NOT (Qualifiers LIKE '46---A---P----[MMT_CLASS]')
AND NOT (Qualifiers LIKE '46-1-A---P----[MMT_CLASS]')
AND NOT (Qualifiers LIKE '47---A---J----[MMT_CLASS]')
AND NOT (Qualifiers LIKE '47---A---P----[MMT_CLASS]')
AND NOT (Qualifiers LIKE '%Auction%')
)
SELECT
TradesCount,
TotalVolume,
ROUND(AvgVolume,0) AS AvgVolume,
MinPrice,
MaxPrice,
ROUND(AvgPrice,3) AS AvgPrice,
ROUND(VWAP,3) AS VWAP,
HigherPricedTradesCount,
ROUND(SAFE_DIVIDE(100*HigherPricedTradesCount,TradesCount),3) AS HigherPricedTradesPercent,
ROUND(TotalHigherPricedVolume,9) AS TotalHigherPricedVolume,
ROUND(HigherPricedVWAP,3) AS HigherPricedVWAP
FROM
RAWSTATS, BETTERPRICE
Here is the result, which shows how our trade is positioned compared to the market.
Note that this code, written in a fairly straightforward way to make it easy to understand, is a bit awkward, as the instrument and date time range must be input in 2 different places.
This puts us on the path to query optimization, which is the topic of the next tutorial.
We have seen how to use SQL functions to analyse data. Please proceed to the next tutorial that covers query optimization.