Bank transaction categorisation: keeping up with the ever-moving goalposts

Atto's Senior Data Scientist, Andy Spence, delves into the winding world of bank transaction categorisation and explains how we keep up with the ever-moving goalposts.
Published on
February 6, 2023
Author
Andy Spence, Senior Data Scientist
Category
Finance & Fintech

Introduction

The ability to produce per transaction categories for a bank account is crucial to identifying a person's creditworthiness.  Understanding what they bought, why they received a payment and how often provides valuable insight into their spending behaviour.

It’s a straight forward, if somewhat laborious, task for an underwriter to scan through an account and total the amounts for both expenditure and income - corresponding to categories relevant to the in-house business model.  But, doing this all day long for tens of applications can cause fatigue and errors. And the paraments for categorisation are everchanging.

Since the value-add concerns the result of the analysis of a bank account, using Atto to automate this process is essential to speed up the process -  ultimately providing a faster decision for a customer.

Unstructured text

Here's how it's possible.

With the arrival of text analysis algorithms based on statistical and machine learning techniques from the field of natural language processing (NLP), we can now parse text programmatically in order to derive meaning from which information can be extracted.

These approaches work on unstructured text, which can be sourced from emails or scraped from blog posts or Twitter feeds and so on; in contrast to structured text of the kind organised nicely into relational databases and available via query e.g. name & address.  Successful NLP applications include sentiment analysis, topic analysis, spam detection & personally identifiable information (PII) removal.

So, the description field of a banking transaction fulfils the definition of unstructured text – you could reasonably consider applying these techniques to reveal elements of its structure.  However, transaction description text doesn’t read like the content of an email.  There is a distinct lack of familiar grammar and without a well-defined syntax it’s more challenging to apply these algorithms with success. Assuming that we don’t have millions of labelled transactions to hand, ready to train a single all-encompassing predictive machine learning model, a little more creativity is required.

Merchant identification

Performing merchant identification separately is a good place to start because many merchants map directly to a single category i.e. the field in which they do business. Debit transactions commonly correspond to purchases from shopping trips to various stores or online.  A good source for merchant information in the UK is Companies House simply because every legit business is registered there.  But, upon inspecting Company House data you’ll realise that they hold official company details rather than the familiar name or brand used in practice and this is unlikely to be useful in this case.

Even knowledge of the most familiar name or brand for a merchant may not be sufficient on its own.  A combination of factors such as humans in the loop and limited length of description, the merchant representation can often appear abbreviated, truncated or simply misspelled.  Take the UK supermarket Sainsbury's for example: we’ve seen the chain written in an astonishing variety of ways.

Understanding merchant descriptions is vital

The upshot is that knowing how merchants are commonly presented in bank transaction descriptions is vital since this forms the basis of a dataset that can be used to train a predictive model. To get this knowledge we need two things: access to bank transaction data and the sterling effort of a team of annotators to peruse and annotate it. Open Banking APIs provide the former.  The latter is a significant undertaking but one that is, at the time of writing, unavoidable because if the model hasn’t ‘seen’ a certain spelling of a merchant, it won’t be able to guess or infer it - because bank transaction descriptions are not natural language.

Annotation resource is of course finite and so it’s impossible to include everything from multinational chains down to a village shop or independent café.

For an initial model it may be sufficient to concentrate on the main players for each category to ensure coverage is reasonable.  Future iterations can incorporate increasing levels of detail but involve more work for less gain in model performance. This might not add significant value unless a particular category is extremely important to the use case.

Keeping a model trained

Another challenging facet of merchant identification is that it’s essentially a moving target. Companies go bust, re-brand, merge and are newly formed.  In addition new ways to pay are offered such as buy now pay later.  If our models aren’t re-trained with new information covering these changes on a regular basis, they will soon become stale and their performance will take a hit.  To keep the models up to date, annotation effort is an ongoing full-time task.

Despite the challenges outlined above, training a predictive model to identify the presence of the merchants in a transaction description is certainly achievable and is how we step forward in terms of the overall objective of transaction categorisation.

Of course there may be competing signals for other categories within the text so identifying the merchant might not quite be the whole story.  Ninety-nine times out of a hundred Sainsbury’s will correspond to a supermarket purchase but what if there’s an acronym at the end of the description, something like ATM? In this case money has been withdrawn at an automated teller machine at or near the supermarket so the category would be cash rather than supermarket.

Keywords and acronyms

Another common acronym is POS  i.e. point of sale.  In the event of an unknown merchant the model will be able to infer from the presence of this acronym that at least it’s a retail transaction in the general sense, which is still useful information.  it’s therefore important to be able to detect common banking terms such as these in addition to merchant names. Programmatically this can be achieved by implementing pattern matching based on regular expressions aka regex.

With regard to credit transactions the main category of interest is usually salary.  If we were to rely upon keywords for their detection, those transactions without ‘salary’, ‘wages’ or ‘payroll’ in their description would likely be categorised as a more generic category like deposit or even transfer payment.  An alternative approach is to examine a different aspect of salary payments:  how many times such transactions repeat and with what frequency.  This can be achieved programmatically through the application of string similarity algorithms on the description field: this processing groups together credit transactions which have the same or at least similar descriptions.  The resulting groups can then be analysed to determine the primary sources of income for an account even in the absence of salary keywords.

Summary

The categorisation of bank transactions is the most important tool in the analysis of credit risk - with Atto you can automate this laborious task.  For categorisation, Atto uses a variety of methods (listed in this article) to provide merchant identification, pattern matching and recurring transaction analysis.  Whilst the latter can be generally applied in any geo, merchant identification and to a certain extent, pattern matching, are location-specific,  making annotation effort a necessity to produce geo-specific models. This makes sense as it’s comparable to visiting or moving to a country for the first time and getting to know the local merchants.

To find out more about Atto's bank account categorisation service or open banking, get in touch here.

Stay up to date with the latest
Atto news

Get the latest news, thought leadership, product updates and customer stories straight to your inbox - subscribe to our blog today.