Using SELMR#

Introduction#

A Simple Explainable Language Multiset Representation (SELMR) is a text data structure that works like a language model. The SELMR data structure consists of multisets created from all phrases (i.e. multiword expressions) and all phrase-context combinations contained in a collection of documents, given some constraints. The multisets can be used for downstream NLP tasks such as text classification and search, similar to real-valued vector embeddings.

SELMRs produce explainable results without any randomness and enable explicit links with lexical, linguistic and terminological annotations. No model is trained to get real-valued vector embeddings and no dimensionality reduction is applied.

The building blocks for SELMR are phrases and contexts.

A phrase can be any list of consecutive words that occurs in a collection of documents, with a certain maximum length and a certain minimum number of occurrences.
A context of a phrase is the combination of the preceding list of words (the left side) and the following list of words (the right side) of that phrase, with a certain maximum length and a certain minimum number of occurrences.

For each phrase, the SELMR data structure stores the contexts in which the phrase occurs, together with the number of times each phrase-context combination occurs in the documents (forming a multiset, or a collections.Counter in Python). For each context, it stores the multiset of phrases that occur in that context, with their respective numbers of occurrences in the documents.

documents = [
    "We walked in the beautiful park.",
    "Then we did some shopping in the city."
]

from selmr import SELMR

# Create a SELMR data structure given the two sentences above
selmr = SELMR(
    documents=documents
)

selmr.contexts("city")

Multiset({('the', '.'): 1, ('in the', '.'): 1, ('shopping in the', '.'): 1})

selmr.contexts("beautiful park")

Multiset({('the', '.'): 1, ('in the', '.'): 1, ('walked in the', '.'): 1})
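The two results above can be reproduced without the library. The following is a minimal sketch, not the SELMR implementation: it uses plain `collections.Counter` objects, and the function names (`tokenize`, `build_multisets`) and the limits (`max_phrase=2`, `max_left=3`, `max_right=1`) are assumptions chosen only so that the toy output matches the example. The minimum-occurrence constraint and sentence markers are left out.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Split into words and single punctuation marks (an assumed tokenization).
    return re.findall(r"\w+|[^\w\s]", text)


def build_multisets(documents, max_phrase=2, max_left=3, max_right=1):
    """Return (phrase -> Counter of contexts, context -> Counter of phrases)."""
    phrase_contexts = defaultdict(Counter)
    context_phrases = defaultdict(Counter)
    for doc in documents:
        tokens = tokenize(doc)
        n = len(tokens)
        for start in range(n):
            for end in range(start + 1, min(start + max_phrase, n) + 1):
                phrase = " ".join(tokens[start:end])
                for left_len in range(1, max_left + 1):
                    if start - left_len < 0:
                        break  # not enough words to the left
                    left = " ".join(tokens[start - left_len:start])
                    for right_len in range(1, max_right + 1):
                        if end + right_len > n:
                            break  # not enough words to the right
                        right = " ".join(tokens[end:end + right_len])
                        phrase_contexts[phrase][(left, right)] += 1
                        context_phrases[(left, right)][phrase] += 1
    return phrase_contexts, context_phrases


documents = [
    "We walked in the beautiful park.",
    "Then we did some shopping in the city.",
]
phrase_contexts, context_phrases = build_multisets(documents)
print(phrase_contexts["city"])
print(context_phrases[("the", ".")])
```

Keeping both directions (phrase to contexts and context to phrases) in the same pass is what makes both kinds of lookup cheap later on.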

selmr.most_similar("city")

Multiset({'city': 3, 'beautiful park': 2})
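The scores in this result are consistent with counting the contexts two phrases have in common. As a hedged illustration (the scoring rule is inferred from the output, and `shared_context_count` is a hypothetical helper, not part of the library), the same numbers follow from the two context multisets shown above:

```python
from collections import Counter

# Context multisets copied from the two outputs above.
contexts = {
    "city": Counter({("the", "."): 1, ("in the", "."): 1,
                     ("shopping in the", "."): 1}),
    "beautiful park": Counter({("the", "."): 1, ("in the", "."): 1,
                               ("walked in the", "."): 1}),
}


def shared_context_count(a, b):
    # Number of distinct contexts in which both phrases occur.
    return len(set(contexts[a]) & set(contexts[b]))


similar = Counter({p: shared_context_count("city", p) for p in contexts})
print(similar)
```

A phrase always shares all of its contexts with itself, which is why 'city' gets the highest score.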

SELMR based on DBpedia#

These are the results of a SELMR created from 10,000 DBpedia pages. We defined the context of a phrase in its simplest form: the tuple of the preceding words and the following words (no preprocessing and no changes to the text, i.e. no deletion of stopwords or punctuation). The maximum phrase length is five words; the maximum left and right context length is also five words.

import logging, sys
logging.basicConfig(stream=sys.stdout,
                    format='%(asctime)s %(message)s',
                    level=logging.DEBUG)

from selmr import SELMR, LanguageMultisets

# construct a SELMR data structure with the DBpedia phrases and contexts
selmr = SELMR(
    path="..//data//dbpedia_10000",
    params={"uncased": False, "lemmatized": False}
)

Most frequent contexts of a phrase#

The ten most frequent contexts in which the word ‘has’ occurs with their number of occurrences are the following:

# most frequent contexts of the word "has"
selmr.contexts("has", topn=10)

This results in

Multiset({('It', 'been'): 2014,
          ('it', 'been'): 1970,
          ('SENTSTART It', 'been'): 1858,
          ('and', 'been'): 1201,
          ('which', 'been'): 987,
          ('that', 'been'): 813,
          ('and', 'a'): 806,
          ('also', 'a'): 774,
          ('there', 'been'): 764,
          ('which', 'a'): 624})

This means that the corpus contains … occurrences of ‘It has been’, i.e. occurrences where the word ‘has’ occurred in the context (‘It’, ‘been’).

SENTSTART and SENTEND are tokens to indicate the start and end of a sentence. This makes it possible to derive the contexts of a phrase that starts or ends a sentence.
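A small sketch of how such markers can yield contexts at sentence boundaries. It assumes that a sentence is simply padded with the two tokens before contexts are extracted; `contexts_of` is a hypothetical helper, and the example sentences are made up for illustration.

```python
def contexts_of(tokens, index, max_left=2):
    """Contexts of the single word at tokens[index], with one word on the right."""
    padded = ["SENTSTART"] + tokens + ["SENTEND"]
    pos = index + 1  # position of the word after padding
    right = padded[pos + 1]
    return [
        (" ".join(padded[pos - k:pos]), right)
        for k in range(1, max_left + 1)
        if pos - k >= 0
    ]


start_tokens = "It has been raining".split()
start_contexts = contexts_of(start_tokens, start_tokens.index("has"))
print(start_contexts)

end_tokens = "He closed the deal".split()
end_contexts = contexts_of(end_tokens, end_tokens.index("deal"), max_left=1)
print(end_contexts)
```

Without the padding, a word at the very end of a sentence would have no right-hand side and therefore no context at all.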

Phrase and context frequencies#

The contexts in which a word occurs represent, to some extent, the properties and the meaning of that word. If you derive the phrases that share the most frequent contexts of the word ‘has’, you get the following table (the columns contain the contexts, the rows the phrases that have the most contexts in common):

import pandas as pd
pd.DataFrame.from_dict(
    selmr.dict_phrases_contexts("has", topcontexts=10), orient='tight'
)

This results in:

              It    it    SENTSTART It  and   which  that  and  also  there  which
              been  been  been          been  been   been  a    a     been   a
has           2014  1970  1858          1201  987    813   806  774   764    624
had           139   815   130           327   1696   1388  623  524   350    306
would have    26    156   25            19    110    97    2    2     31     4
may have      48    151   48            113   146    85    6    0     60     6
have          0     19    0             412   477    942   370  299   773    185
has not       14    104   14            34    27     47    0    0     10     0
could have    2     42    2             4     17     40    0    0     7      0

The contexts that a word shares with another word can be used as a measure of similarity. The word ‘had’ (second row) occurs in all ten contexts of the word ‘has’, so this word is very similar. The phrase ‘could have’ (seventh row) has seven contexts in common, so ‘could have’ is also similar, but less so than the word ‘had’. We used a limited number of contexts to show the idea; normally a higher number of contexts is used to compare the similarity of words.

The word similarities found can in this case be explained as follows. The similar phrases are all forms of the verb ‘have’. This is because the verb is often used to construct perfect tenses, where ‘have’ is combined with the past participle of another verb, in this case the frequently occurring ‘been’. Note that the list contains ‘has not’.
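The number of contexts a phrase has in common with ‘has’ can be read off the table by counting the non-zero cells in its row. A short sketch with four rows copied from the table above (the variable names are illustrative only):

```python
from collections import Counter

# Occurrence counts per context, copied from four rows of the table above.
table = {
    "has":      [2014, 1970, 1858, 1201, 987, 813, 806, 774, 764, 624],
    "may have": [48, 151, 48, 113, 146, 85, 6, 0, 60, 6],
    "have":     [0, 19, 0, 412, 477, 942, 370, 299, 773, 185],
    "has not":  [14, 104, 14, 34, 27, 47, 0, 0, 10, 0],
}

# A context counts as shared when the phrase occurs in it at least once.
shared = Counter({
    phrase: sum(1 for count in counts if count > 0)
    for phrase, counts in table.items()
})
print(shared.most_common())
```

Note that this measure ignores how often a phrase occurs in a context; only whether it occurs at all matters.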

Phrase similarities#

Based on the approach above we can derive top phrase similarities.

# top phrase similarities of the phrase "has been suggested"
selmr.most_similar("has been suggested", topn=10, topcontexts=25)

This results in

Multiset({'has been suggested': 25,
          'is possible': 15,
          'is believed': 13,
          'is likely': 12,
          'is thought': 11,
          'is known': 10,
          'has been argued': 10,
          'is estimated': 10,
          'has been speculated': 10,
          'appears': 10})

Now take a look at similar words of ‘larger’.

# top phrase similarities of the word "larger"
selmr.most_similar("larger", topn=10, topcontexts=15)

Resulting in:

Multiset({'larger': 15,
          'smaller': 14,
          'higher': 13,
          'greater': 13,
          'longer': 11,
          'faster': 11,
          'less': 11,
          'more': 10,
          'better': 10,
          'shorter': 10})

Like the word ‘larger’, these are all comparative adjectives. These words are similar because they share the most frequent contexts, in this case contexts like (‘is’, ‘than’) and (‘much’, ‘than’).

# top phrase similarities of the word "might"
selmr.most_similar("might", topn=10, topcontexts=25)

Multiset({'might': 25,
          'may': 25,
          'should': 25,
          'would': 25,
          'could': 25,
          'must': 24,
          'would not': 22,
          'will': 22,
          'can': 21,
          'may not': 21})

The most frequent shared contexts are in this case (‘it’, ‘be’), (‘he’, ‘have’) and (‘that’, ‘be’).

Contexts can also be used to find ‘semantic’ similarities.

# top phrase similarities of the word "king"
selmr.most_similar("king", topn=10, topcontexts=25)

This results in

Multiset({'king': 25,
          'King': 25,
          'ruler': 23,
          'Emperor': 22,
          'emperor': 22,
          'president': 21,
          'Queen': 21,
          'President': 21,
          'head': 20,
          'Prime Minister': 20})

Instead of single words, we can also find the similarities of multiword phrases:

# top phrase similarities of Barack Obama
selmr.most_similar("Barack Obama", topn=10, topcontexts=25)

Multiset({'Barack Obama': 17,
          'Ronald Reagan': 6,
          '': ,
          'of the United States': 5,
          'Franklin D Roosevelt': 4,
          'Bill Clinton': 4,
          'George W Bush': 4,
          'Bush': 4,
          'Lyndon B Johnson': 4,
          'Lukashenko': 3})

Most frequent phrases of a context#

Here are some examples of the most frequent phrases of a context.

context = ("King", "of England")
for r in selmr.phrases(context, topn=10).items():
    print(r)

('Charles II', 59)
('Charles I', 42)
('James I', 38)
('Henry VIII', 34)
('Edward I', 27)
('James II', 25)
('Henry VII', 24)
('John', 20)
('Henry III', 19)
('Edward III', 18)

context = ("the", "city")
for r in selmr.phrases(context, topn=10).items():
    print(r)

('capital', 355)
('largest', 266)
('inner', 120)
('old', 114)
('ancient', 92)
('first', 91)
('second largest', 88)
('capital and largest', 84)
('host', 66)
('centre of the', 60)

context = ("he", "that")
for r in selmr.phrases(context, topn=10).items():
    print(r)

('believed', 244)
('said', 207)
('stated', 194)
('argued', 148)
('felt', 135)
('wrote', 115)
('argues', 108)
('found', 97)
('noted', 89)
('claimed', 88)
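Lookups like the ones above go from a context to its phrases, the reverse direction of `selmr.contexts`. One way to picture this is as an inverted index over the phrase-context counts. The sketch below is an assumption about the idea, not the library's internals, and uses a few counts copied from the ‘has’ table earlier on this page:

```python
from collections import Counter, defaultdict

# phrase -> multiset of contexts (counts copied from the 'has' table)
phrase_contexts = {
    "has": Counter({("It", "been"): 2014, ("it", "been"): 1970}),
    "had": Counter({("It", "been"): 139, ("it", "been"): 815}),
    "may have": Counter({("It", "been"): 48, ("it", "been"): 151}),
}

# Invert the mapping: context -> multiset of phrases.
context_phrases = defaultdict(Counter)
for phrase, ctxs in phrase_contexts.items():
    for context, count in ctxs.items():
        context_phrases[context][phrase] += count

top = context_phrases[("It", "been")].most_common(2)
print(top)
```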

Phrase similarities given a specific context#

Some phrases have multiple meanings. Take a look at the contexts of the word ‘deal’:

selmr.contexts("deal", topn=10)

This results in:

Multiset({('to', 'with'): 946,
          ('great', 'of'): 649,
          ('a great', 'of'): 610,
          ('to', 'with the'): 332,
          ('a', 'with'): 225,
          ('good', 'of'): 86,
          ('a good', 'of'): 83,
          ('had to', 'with'): 71,
          ('the', 'SENTEND'): 57,
          ('a', 'to'): 53})

In some of these contexts ‘deal’ is a verb meaning ‘to do business’ and in other contexts ‘deal’ is a noun meaning a ‘contract’ or an ‘agreement’. The specific meaning can be derived from the context in which the phrase is used.

It is possible to take into account a specific context when using the most_similar function in the following way:

selmr.most_similar(phrase="deal", context=("to", "with"), topcontexts=50, topphrases=100, topn=10)

The result is:

Multiset({'deal': 50,
          'work': 21,
          'compete': 10,
          'comply': 8,
          'cope': 8,
          'communicate': 7,
          'do': 6,
          'live': 6,
          'meet': 5,
          'be confused': 4})

So these are all verbs, similar to the verb ‘deal’.

selmr.most_similar(phrase="deal", context=("a", "with"), topcontexts=50, topphrases=100, topn=10)

In this case the result is:

Multiset({'deal': 50,
          'contract': 15,
          'treaty': 13,
          'relationship': 12,
          'meeting': 12,
          'man': 10,
          'person': 9,
          'partnership': 7,
          'collaboration': 4,
          'joint venture': 3})

So, now the results are nouns, and similar to the noun ‘deal’.
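The effect of the `context` argument can be illustrated with a toy example. This is only a sketch under a simplifying assumption: candidates are restricted to phrases that occur in the given context and are then ranked by the number of contexts they share with the phrase. How the library actually uses `topcontexts` and `topphrases` is not shown here, the function `most_similar_in_context` is hypothetical, and the data is invented rather than taken from DBpedia.

```python
from collections import Counter

# Invented toy data: phrase -> set of contexts it occurs in.
contexts = {
    "deal": {("to", "with"), ("had to", "with"),
             ("a", "with"), ("the", "SENTEND")},
    "cope": {("to", "with"), ("had to", "with")},
    "contract": {("a", "with"), ("the", "SENTEND")},
    "treaty": {("a", "with")},
}


def most_similar_in_context(phrase, context):
    # Keep only phrases that occur in the given context (assumed filter),
    # then score each by its number of contexts shared with `phrase`.
    candidates = [p for p, ctxs in contexts.items() if context in ctxs]
    return Counter({p: len(contexts[phrase] & contexts[p]) for p in candidates})


as_verb = most_similar_in_context("deal", ("to", "with"))
as_noun = most_similar_in_context("deal", ("a", "with"))
print(as_verb)
print(as_noun)
```

Even in this toy setting the verb-like and noun-like neighbours separate, because the filter removes phrases that never occur in the chosen context.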

Phrase similarities given a set of contexts#

If you want to find the phrases that fit a set of contexts then this is also possible.

c1 = [
    c[0] for c in (
        selmr.contexts("considered", topn=None) &
        selmr.contexts("believed", topn=None)
    ).most_common(15)
]

This results in:

[('is', 'to'),
 ('is', 'to be'),
 ('are', 'to'),
 ('was', 'to'),
 ('are', 'to be'),
 ('is', 'to have'),
 ('is', 'to be the'),
 ('was', 'to be'),
 ('were', 'to'),
 ('generally', 'to'),
 ('are', 'to have'),
 ('is', 'to be a'),
 ('is', 'by'),
 ('widely', 'to'),
 ('he', 'to')]
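The `&` operator in the snippet above intersects two multisets. With Python's `collections.Counter` the intersection keeps only the keys present in both operands, each with the minimum of the two counts; it is assumed here that the multisets returned by `selmr.contexts` behave the same way. The counts below are invented for illustration:

```python
from collections import Counter

# Invented counts, for illustration only.
considered = Counter({("is", "to"): 40, ("is", "to be"): 30, ("was", "a"): 12})
believed = Counter({("is", "to"): 25, ("is", "to be"): 35, ("he", "that"): 50})

# Intersection: common contexts with the minimum count of both sides.
shared = considered & believed

# Keep only the contexts themselves, most frequent first.
top_contexts = [context for context, _ in shared.most_common(2)]
print(shared)
print(top_contexts)
```

Taking the minimum means a context only ranks high when both phrases occur in it frequently.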

# Not implemented yet
# selmr.most_similar(contexts=c1, topn=10)