Hierarchical attention is faster
This article assumes you know how standard transformers work. If you’re a beginner and you’d like to learn about transformers, please take a look at the Transformer for Beginners article.
In part 1 of Hierarchical Transformer, we defined what we mean by “hierarchical transformers” and reviewed one prominent work in this area, called Hourglass.
In this article, we’ll continue that line of work by looking into another well-known work called Hierarchical Attention Transformers (HAT).
Let’s get began.
This method was originally proposed for classifying long documents, typically thousands of words in length. One use case is classifying legal or biomedical documents, which are usually very long.
Tokenization and Segmentation
The HAT method works by taking an input document and tokenizing it with a Byte-Pair Encoding (BPE) tokenizer, which breaks text into subwords/tokens. BPE is used in many well-known large language models, such as RoBERTa and the GPT family (BERT uses the closely related WordPiece tokenizer).
Then it splits the tokenized document into N equally-sized chunks; i.e., if S denotes the input document, then S = (C1, …, CN), where C1 to CN are the N equally-sized chunks. (Throughout this article, we sometimes refer to chunks as segments, but they’re the same concept.) Each chunk is a sequence of k tokens, Ci = (Wi(cls), Wi1, …, Wi(k-1)), where the first token, Wi(cls), is the CLS token that represents the chunk.
As the image above shows, every chunk is a sequence of k tokens, where the first token is the CLS token.
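To make the segmentation step concrete, here is a minimal sketch in plain Python. It is not the authors’ code: the function name segment_tokens, the ids CLS_ID and PAD_ID, and the choice to pad the last chunk so that all chunks have the same length are assumptions made for illustration.

```python
from typing import List

CLS_ID = 0  # hypothetical id of the CLS token
PAD_ID = 1  # hypothetical id of the padding token


def segment_tokens(token_ids: List[int], k: int) -> List[List[int]]:
    """Split token ids into chunks of exactly k tokens, each starting with CLS."""
    if k < 2:
        raise ValueError("k must be at least 2: one CLS slot plus one token")
    body = k - 1  # k - 1 document tokens fit after the CLS token
    chunks = []
    for start in range(0, len(token_ids), body):
        piece = token_ids[start:start + body]
        # Assumption: pad the final chunk so every chunk has the same size.
        piece = piece + [PAD_ID] * (body - len(piece))
        chunks.append([CLS_ID] + piece)
    return chunks


doc = list(range(100, 110))  # 10 made-up token ids standing in for a document
chunks = segment_tokens(doc, k=4)
for chunk in chunks:
    print(chunk)
```

With k = 4, each chunk holds one CLS token plus three document tokens, so the 10 made-up tokens produce N = 4 chunks and the last one is padded.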
After tokenizing and segmenting the input sequence, we feed it to the HAT transformer model. The HAT model is an encoder-only transformer and consists of two main components:
segment-wise encoder (SWE): this is a shared encoder block that takes in the token sequence of one segment (aka chunk) and processes that chunk.
cross-segment encoder (CSE): this is another encoder block that takes in the CLS tokens of all segments (aka chunks) and processes cross-segment relations.
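The data flow between the two components can be sketched as follows. This is only an illustration of which vectors each encoder sees, not a real HAT implementation: toy_encoder is a made-up stand-in for a transformer encoder block (it just mixes each vector with the sequence mean), hat_layer is a hypothetical name, and writing the updated CLS vectors back into their chunks is an assumption of this sketch.

```python
from typing import Callable, List

Vector = List[float]
Encoder = Callable[[List[Vector]], List[Vector]]


def toy_encoder(seq: List[Vector]) -> List[Vector]:
    """Stand-in for an encoder block: mixes each vector with the sequence mean."""
    dim = len(seq[0])
    mean = [sum(v[d] for v in seq) / len(seq) for d in range(dim)]
    return [[0.5 * v[d] + 0.5 * mean[d] for d in range(dim)] for v in seq]


def hat_layer(chunks: List[List[Vector]], swe: Encoder, cse: Encoder) -> List[List[Vector]]:
    # 1) SWE: the same (shared) encoder runs on every chunk independently.
    encoded = [swe(chunk) for chunk in chunks]
    # 2) Collect the CLS vector (position 0) of each encoded chunk.
    cls_seq = [chunk[0] for chunk in encoded]
    # 3) CSE: only the N CLS vectors interact across chunks.
    cls_out = cse(cls_seq)
    # 4) Assumption: put the updated CLS vectors back in front of their chunks.
    return [[cls_out[i]] + chunk[1:] for i, chunk in enumerate(encoded)]


# N = 3 chunks, k = 4 tokens per chunk, 2-dimensional made-up embeddings.
chunks = [[[float(i + j), float(i - j)] for j in range(4)] for i in range(3)]
out = hat_layer(chunks, swe=toy_encoder, cse=toy_encoder)
print(len(out), len(out[0]), len(out[0][0]))  # → 3 4 2
```

Note that the SWE only ever sees k tokens at a time and the CSE only sees N CLS vectors, so no single encoder call processes the full document.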