selmr package

Submodules

selmr.const module

selmr.const.LanguageMultisets

alias of Multisets

selmr.extractor module

selmr.extractor.extract_contexts(init_phrases: dict = None, documents: list = None, params: dict = {})

Extract and analyze contexts for a set of initial phrases within the given documents.

This function examines the relationships between phrases and their surrounding context to identify meaningful contexts.

Parameters:
  • init_phrases (dict) – A dictionary of initial phrases and their document occurrences.

  • documents (list) – A list of documents.

  • params (dict) – A dictionary containing custom parameters for context extraction.

Returns:

A dictionary of extracted contexts, where keys are context tuples, and values are phrase counters indicating the presence of phrases in those contexts.

Return type:

dict

This function performs the following steps:

  1. Initialize and create an initial dictionary for contexts based on phrases and document occurrences.

  2. Process and analyze these initial contexts to identify meaningful context relationships.

  3. Continuously evaluate and extend the contexts based on phrase co-occurrence.

  4. Return the final dictionary of extracted contexts.

Args Description:

  • 'init_phrases' is a dictionary of phrases and their occurrences in documents.

  • 'documents' is a list of documents.

  • 'params' allows customization of the context extraction process.

Returns Description:

  • The returned dictionary contains context tuples and their associated phrase counters.

Example Usage:

    from selmr.extractor import extract_contexts

    # Define initial phrases, documents, and parameters
    initial_phrases = {
        'phrase1': {0: {(1, 2, 3)}, 1: {(0, 2, 4)}},
        'phrase2': {0: {(2, 3, 4)}},
    }
    my_documents = [
        'This is a sample document with phrases.',
        'Another document for context extraction.',
    ]
    custom_params = {
        'max_context_length': 5,
        'min_context_count': 2,
    }

    # Extract contexts from the initial phrases and documents
    extracted_contexts = extract_contexts(
        init_phrases=initial_phrases, documents=my_documents, params=custom_params
    )

Note:

  • The 'params' dictionary can be used to customize the context extraction behavior.

  • The returned dictionary contains meaningful contexts and their associated phrase counts.
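
To make the shape of the return value concrete, here is a minimal sketch of a context-to-phrase-counter mapping. It is not selmr's implementation: the helper name `toy_extract_contexts`, the window of one word on each side, and the pre-tokenized input are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def toy_extract_contexts(sentences, phrases):
    """Map (left_word, right_word) context tuples to counters of phrases.

    Hypothetical helper: selmr's real context windows and filtering are
    controlled by `params` and are not reproduced here.
    """
    contexts = defaultdict(Counter)
    for tokens in sentences:
        # A phrase needs one word before it and one word after it.
        for start in range(1, len(tokens) - 1):
            for end in range(start + 1, len(tokens)):
                phrase = " ".join(tokens[start:end])
                if phrase in phrases:
                    context = (tokens[start - 1], tokens[end])
                    contexts[context][phrase] += 1
    return dict(contexts)


sentences = [
    ["the", "red", "car", "stopped"],
    ["the", "blue", "car", "stopped"],
    ["the", "red", "bike", "stopped"],
]
contexts = toy_extract_contexts(sentences, {"red", "blue", "car", "bike"})
for context, phrase_counts in contexts.items():
    print(context, dict(phrase_counts))
```

In this toy version, phrases that share a context tuple end up in the same counter, which is the property the description above relies on when it speaks of phrases being present in a context.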

selmr.extractor.extract_phrases(documents: list = None, params: dict = {})

Extract phrases from a collection of documents.

This function analyzes the provided documents and extracts phrases based on the specified parameters.

Parameters:
  • documents (list) – A list of documents.

  • params (dict) – A dictionary containing custom parameters for phrase extraction.

Returns:

A dictionary of extracted phrases, where keys are phrases and values are sub-dictionaries with document indices and their respective phrase locations.

Return type:

dict

This function performs the following steps:

  1. Iterate through the documents and generate phrases within each document.

  2. Create a dictionary that maps phrases to their occurrences in different documents.

  3. Remove phrases that occur less frequently than the specified minimum phrase count.

  4. Return the dictionary of extracted phrases.

Args Description:

  • 'documents' should be a list of documents.

  • 'params' allows customization of the phrase extraction process.

Returns Description:

  • The returned dictionary contains phrases and their associated occurrences in documents.

Example Usage:

```
from selmr.extractor import extract_phrases

# Define documents and parameters
my_documents = [
    'This is a sample document with phrases.',
    'Another document for phrase extraction.',
]
custom_params = {
    'min_phrase_count': 2,
    'custom_option': 'value',
}

# Extract phrases from the documents
extracted_phrases = extract_phrases(documents=my_documents, params=custom_params)
```

Note:

  • The 'params' dictionary can be used to customize the phrase extraction behavior.

  • Phrases that occur less frequently than the specified minimum count are removed from the results.
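
The occurrence mapping and the pruning step can be sketched as follows. This is an illustration only: the helper `toy_extract_phrases` is hypothetical, it treats every whitespace-separated word as a phrase, it records integer token positions as locations, and it takes the minimum count as a plain argument rather than reading it from `params`.

```python
from collections import defaultdict


def toy_extract_phrases(documents, min_phrase_count=2):
    """Map each word to {document index: {token positions}}, then prune.

    Hypothetical helper: not selmr's phrase generation or location format.
    """
    occurrences = defaultdict(lambda: defaultdict(set))
    for doc_index, document in enumerate(documents):
        for position, word in enumerate(document.lower().split()):
            occurrences[word][doc_index].add(position)

    # Keep only phrases whose total number of occurrences reaches the minimum.
    return {
        phrase: {doc: set(locs) for doc, locs in docs.items()}
        for phrase, docs in occurrences.items()
        if sum(len(locs) for locs in docs.values()) >= min_phrase_count
    }


documents = ["a sample document", "another document for a test"]
phrases = toy_extract_phrases(documents, min_phrase_count=2)
print(phrases)
```

Counting occurrences across all documents before pruning means a phrase that appears once in each of two documents survives a minimum count of 2; whether selmr counts per document or in total is not stated above, so treat that choice as an assumption.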

selmr.extractor.generate_sentence_phrases(sentences: list = None, params: dict = {})

Generate and yield phrases along with their locations within sentences.

This function iterates through the provided sentences and generates phrases of varying lengths. Phrases are yielded along with their respective locations in the sentences.

Parameters:
  • sentences (list) – A list of sentences to generate phrases from.

  • params (dict) – A dictionary containing custom parameters for phrase generation.

Yields:

tuple – A tuple containing a generated phrase and its location within the sentence.

This generator function performs the following steps:

  1. Iterate through the sentences.

  2. Generate phrases of varying lengths starting from each word.

  3. Yield each generated phrase along with its location if it meets specified criteria.

Args Description:

  • 'sentences' is a list of sentences to generate phrases from.

  • 'params' allows customization of the phrase generation process.

Yields Description:

  • The generator yields tuples, where the first element is the generated phrase and the second element is a tuple specifying the location within the sentence.

Example Usage:

```
from selmr.extractor import generate_sentence_phrases

# Define a list of sentences and parameters
my_sentences = [
    ['This', 'is', 'a', 'sample', 'sentence.'],
    ['Another', 'example', 'sentence', 'for', 'phrase', 'generation.'],
]
custom_params = {
    'max_phrase_length': 4,
}

# Generate phrases from the sentences and process them
for phrase, location in generate_sentence_phrases(sentences=my_sentences, params=custom_params):
    print(f"Generated phrase: {phrase} | Location: {location}")
```

Note:

  • The 'params' dictionary can be used to customize the phrase generation behavior.

  • The generator yields phrases and their locations, subject to optional filtering criteria.
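
The generation steps listed above amount to enumerating word n-grams up to a maximum length. A minimal sketch, assuming a hypothetical helper `toy_sentence_phrases`, a location tuple of `(sentence_index, start, end)`, and no filtering (none of which is confirmed as selmr's actual behaviour):

```python
def toy_sentence_phrases(sentences, max_phrase_length=4):
    """Yield (phrase, (sentence_index, start, end)) for every n-gram.

    Hypothetical helper: the location format and the absence of
    filtering are assumptions made for this example.
    """
    for sentence_index, words in enumerate(sentences):
        for start in range(len(words)):
            # Do not run past the end of the sentence.
            longest = min(max_phrase_length, len(words) - start)
            for length in range(1, longest + 1):
                end = start + length
                yield " ".join(words[start:end]), (sentence_index, start, end)


phrases = list(toy_sentence_phrases([["a", "b", "c"]], max_phrase_length=2))
for phrase, location in phrases:
    print(phrase, location)
```

Writing this as a generator keeps memory use flat for long documents, since each phrase is produced only when the caller asks for it.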

selmr.extractor.process_documents(documents: list = None, params: dict = {})

Process a list of documents to extract and analyze phrases and contexts using the specified LanguageMultisets and parameters.

Parameters:
  • documents (list) – A list of documents to be processed.

  • params (dict) – A dictionary of parameters to customize the processing.

Returns:

An instance of LanguageMultisets containing extracted phrases and their associated contexts.

Return type:

LanguageMultisets

This function performs the following steps:

  1. Preprocess each document using the specified parameters.

  2. Extract initial phrases from the preprocessed documents.

  3. Extract contexts for the initial phrases based on the documents.

  4. Create a dictionary that maps phrases to their respective contexts.

  5. Create and return a new LanguageMultisets instance with the extracted phrases and contexts.

Example Usage:

```
from selmr.extractor import process_documents

# Create a list of documents
my_documents = []

# Define parameters ('param1' and 'param2' are placeholder names)
custom_params = {
    'param1': value1,
    'param2': value2,
}

# Process documents and obtain LanguageMultisets
result = process_documents(
    documents=my_documents, params=custom_params
)
```

Note:

  • The 'params' argument allows customization of the processing behavior.

  • The returned LanguageMultisets instance contains phrases and their associated contexts.
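
Step 4 above, mapping phrases to their contexts, can be pictured as inverting the context-to-phrase dictionary. The sketch below is an assumption about how such an inversion could look, using a hypothetical helper `invert_contexts` and plain `Counter` objects; it does not construct a LanguageMultisets instance.

```python
from collections import Counter, defaultdict


def invert_contexts(contexts):
    """Turn {context: Counter(phrase)} into {phrase: Counter(context)}.

    Hypothetical helper, shown only to illustrate the inversion idea.
    """
    phrases = defaultdict(Counter)
    for context, phrase_counts in contexts.items():
        for phrase, count in phrase_counts.items():
            phrases[phrase][context] += count
    return dict(phrases)


contexts = {
    ("the", "car"): Counter({"red": 2, "blue": 1}),
    ("the", "bike"): Counter({"red": 1}),
}
phrase_contexts = invert_contexts(contexts)
for phrase, context_counts in phrase_contexts.items():
    print(phrase, dict(context_counts))
```

Holding both directions makes it cheap to ask which contexts a phrase occurs in as well as which phrases fill a context.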

selmr.multisets module

selmr.search module

selmr.selmr module

selmr.skeleton module

selmr.tokenizer module

selmr.tokenizer.preprocess(document: str = None, params: dict = {})
selmr.tokenizer.tokenize_text(text: list = None, forced_sentence_split_characters: list = [])
selmr.tokenizer.tokenizer(text: str = None)

Create a list of sentences, each of which is a list of words. Every word carries its text together with its start_char and end_char positions.

Parameters:
  • text (str) – The text to be tokenized.
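
The output structure described here (sentences, each a list of words with character offsets) can be sketched with the standard library. The helper `toy_tokenizer`, its regular expression, the rule that a sentence ends at '.', '!' or '?', and the use of dictionaries per word are all assumptions; selmr's tokenizer may split and represent words differently.

```python
import re


def toy_tokenizer(text):
    """Split text into sentences of word dicts with character offsets.

    Hypothetical helper: the dictionary keys mirror the names used in
    the description above, but the splitting rules are made up.
    """
    sentences = []
    current = []
    # Match runs of word characters, or single punctuation characters.
    for match in re.finditer(r"\w+|[^\w\s]", text):
        current.append(
            {
                "text": match.group(),
                "start_char": match.start(),
                "end_char": match.end(),
            }
        )
        if match.group() in ".!?":
            sentences.append(current)
            current = []
    if current:  # trailing words without sentence-final punctuation
        sentences.append(current)
    return sentences


sentences = toy_tokenizer("Hi there. Bye")
for sentence in sentences:
    print([(word["text"], word["start_char"], word["end_char"]) for word in sentence])
```

Keeping the offsets lets later stages map a phrase back to the exact span of the original text, since `text[start_char:end_char]` reproduces each word.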

Module contents