texturizer package
Submodules
texturizer.comparison module
- texturizer.comparison.add_comparison_features(df, columns)[source]
This is the entry point for adding all the core text similarity features. Note: Ratcliff-Obershelp is left out of the set of metrics because it takes close to an order of magnitude longer to compute.
The initial version includes just four string edit distance metrics.
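The four metrics are not named here. As an illustration of what a string edit distance measures, the following is a minimal pure-Python sketch of the Levenshtein distance; it is an assumption chosen for the example, not necessarily one of the metrics the package computes, and the function name is hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

A feature built on such a metric would compare the text in two columns row by row and store the resulting distance in a new column.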
texturizer.config module
texturizer.emoticons module
- texturizer.emoticons.add_emoticon_features(df, col)[source]
Given a pandas dataframe and a column name, check for emoticons in the column and add a set of features that indicate both the presence and the emotional flavour of each emoticon.
- texturizer.emoticons.add_text_emoticon_features(df, columns)[source]
Given a pandas dataframe and a set of column names, add features that detect the presence of emoticons.
- texturizer.emoticons.bck_re = re.compile("[@*\\|/()<>{}\\[\\]]{1,2}[-=^]{0,2}['’`]{0,1}[:;]")
texturizer.emoticons: Emoticon Recognition Text Features
The functions in this library will add columns to a dataframe that indicate whether there are emoticons in certain columns of text, and whether those emoticons represent one of the more common emotions.
- NOTE: In developing these regexes I have deliberately ignored certain emoticons
because of the likelihood of false positive matches in text containing brackets. For example, the emoticons 8) and (B will not be matched.
To avoid matching characters inside document markup language tags, there is a rudimentary regex-based tag removal step, and an unescaped version of the text is expected to have been generated by the initial simple text function run by the program. This will remove URLs and HTML tags before trying to match emoticons.
Some references used when considering which emoticons to include:
https://www.unglobalpulse.org/2014/10/emoticon-use-in-arabic-spanish-and-english-tweets/
https://www.sciencedirect.com/science/article/abs/pii/S0950329317300939
https://www.qualitative-research.net/index.php/fqs/article/view/175/391
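As a rough illustration of how a compiled pattern such as `bck_re` can be turned into a presence flag, the sketch below copies the documented pattern verbatim and wraps it in a helper. The helper name `has_backward_emoticon` is hypothetical, and reading `bck_re` as a matcher for reversed emoticons such as `(-:` is an assumption based on the pattern itself.

```python
import re

# Pattern copied from the documented value of texturizer.emoticons.bck_re.
bck_re = re.compile("[@*\\|/()<>{}\\[\\]]{1,2}[-=^]{0,2}['’`]{0,1}[:;]")


def has_backward_emoticon(text: str) -> bool:
    """Hypothetical helper: True if the pattern matches anywhere in text."""
    return bck_re.search(text) is not None


print(has_backward_emoticon("great news (-:"))    # True
print(has_backward_emoticon("no emoticon here"))  # False
```

Note that ordinary text such as `f(x): value` also matches this pattern, which shows why bracket-heavy emoticons are prone to false positives.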
texturizer.featurize module
- texturizer.featurize.generate_feature_function(parameters)[source]
This function takes the processed command line arguments that determine which features to apply and partially applies them to the process_df function, returning a function that can be used to apply those parameters to multiple chunks of a dataframe.
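The partial-application idea can be sketched with `functools.partial`. Everything below is a stand-in under stated assumptions: the real signature of `process_df` is not documented here, the `parameters` layout is invented, and chunks are plain lists of strings rather than dataframes to keep the example self-contained.

```python
from functools import partial


def process_df(chunk, parameters):
    # Hypothetical stand-in for the real process_df: builds one row per text
    # and adds a length feature when the parameters ask for it.
    rows = []
    for text in chunk:
        row = {"text": text}
        if "length" in parameters["features"]:
            row["length"] = len(text)
        rows.append(row)
    return rows


def generate_feature_function(parameters):
    # Fix the parameters once; the returned callable only needs a chunk.
    return partial(process_df, parameters=parameters)


apply_features = generate_feature_function({"features": ["length"]})
for chunk in (["ab", "cde"], ["f"]):
    print(apply_features(chunk))
```

Binding the parameters once means the chunk-processing loop does not need to know anything about the command line.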
texturizer.literacy module
texturizer.pos module
texturizer.process module
- texturizer.process.count_lines(path_to_file)[source]
Return the total number of lines in a file, counted in a way that works regardless of file size.
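One common way to make the count independent of file size is to read fixed-size binary blocks and count newline bytes, so memory use stays constant. The following is a sketch of that approach, not necessarily how the package does it; the `block_size` parameter is an assumption added for the example.

```python
def count_lines(path_to_file, block_size=1024 * 1024):
    """Count newline bytes while holding at most one block in memory."""
    count = 0
    with open(path_to_file, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for block in iter(lambda: handle.read(block_size), b""):
            count += block.count(b"\n")
    return count
```

This version counts newline characters, so a final line without a trailing newline is not included in the total.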
- texturizer.process.len_or_null(val)[source]
Alternative len function that simply returns numpy.NA for invalid values. This is needed to get sensible results when running len over a column that may contain nulls.
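A minimal sketch of a null-tolerant length function follows. It returns `math.nan` as the missing-value marker, which is an assumption made to keep the example free of dependencies; the package may return a different sentinel.

```python
import math


def len_or_null(val):
    """Return len(val), or NaN when val has no length (None, floats, ...)."""
    try:
        return len(val)
    except TypeError:
        return math.nan


print(len_or_null("abc"))  # 3
print(len_or_null(None))   # nan
```

Applied over a column that mixes strings and nulls, this yields a numeric result for every row instead of raising on the first null.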
- texturizer.process.load_complete_dataframe(path_to_file)[source]
We load the entire dataset into memory, using the file extension to determine the expected format. We are using encoding='latin1' because it appears to permit loading of the largest variety of files. Representation of strings may not be perfect, but is not important for generating a summarization of the entire dataset.
- texturizer.process.load_dictionary(filename, escape=False)[source]
Utility function to load a json serialised dictionary
- texturizer.process.load_word_list(filename, escape=False)[source]
Utility function to load topic vocab word lists for pattern matching.
- texturizer.process.load_word_pattern(filename, prefix='', pluralize=True, bound=True, escape=False)[source]
- texturizer.process.process_file_in_chunks(path_to_file, function_to_apply)[source]
Given a path to a large dataset, we iteratively load it in chunks, apply the supplied function to each chunk, and write the result to the output stream.
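The chunked pattern described above can be sketched with the `chunksize` argument of `pandas.read_csv`. This assumes a CSV input; the `chunksize` and `out` parameters are additions for the example and are not part of the documented signature.

```python
import sys

import pandas as pd


def process_file_in_chunks(path_to_file, function_to_apply,
                           chunksize=10_000, out=sys.stdout):
    """Stream a CSV through function_to_apply, one chunk at a time."""
    reader = pd.read_csv(path_to_file, chunksize=chunksize, encoding="latin1")
    for index, chunk in enumerate(reader):
        result = function_to_apply(chunk)
        # Write the header only once, with the first chunk.
        result.to_csv(out, index=False, header=(index == 0))
```

Because only one chunk is held in memory at a time, the input can be much larger than available RAM, provided the applied function works on each chunk independently.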
- texturizer.process.remove_escapes_and_non_printable(text)[source]
Apply the codecs escape decoding to decode any escaped characters, then apply a regex to remove any non-printable characters.
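The two steps can be sketched as follows. The exact codec call and regex used by the package are not shown here, so both are assumptions: this version decodes backslash escapes with the `unicode_escape` codec and then keeps only printable ASCII, which also strips accented characters.

```python
import re

# Assumption: "printable" means the ASCII range from space to tilde.
NON_PRINTABLE = re.compile(r"[^\x20-\x7E]")


def remove_escapes_and_non_printable(text):
    """Decode literal escapes such as '\\t', then drop non-printable characters."""
    # Round-trip through bytes so unicode_escape can interpret the escapes;
    # backslashreplace preserves characters outside latin-1.
    decoded = text.encode("latin-1", "backslashreplace").decode("unicode_escape")
    return NON_PRINTABLE.sub("", decoded)


print(remove_escapes_and_non_printable("tab\\there\\x07!"))  # tabhere!
```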
texturizer.profanity module
- texturizer.profanity.add_profanity_features(df, col)[source]
Given a pandas dataframe and a column name, add simple text match features for profanities.
texturizer.rhetoric module
texturizer.sentiment module
- texturizer.sentiment.add_sentiment_features(df, col)[source]
Given a pandas dataframe and a column name, add simple text match features for sentiment.
texturizer.simple module
- texturizer.simple.add_text_features(df, col)[source]
Given a pandas dataframe and a column name, calculate the simple text summary features and add them.
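The specific summary features are not listed here. As a sketch of the general shape of such a function, the example below adds a character length and a word count; the function name, the two features, and the `_length` and `_word_count` column suffixes are all assumptions for illustration.

```python
import pandas as pd


def add_text_features_sketch(df, col):
    """Hypothetical example: append two summary columns derived from df[col]."""
    text = df[col].fillna("").astype(str)  # treat nulls as empty strings
    df[col + "_length"] = text.str.len()
    df[col + "_word_count"] = text.str.split().str.len()
    return df


frame = pd.DataFrame({"comment": ["hello world", None]})
print(add_text_features_sketch(frame, "comment"))
```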
texturizer.texturizer module
texturizer.texturizer: provides entry point main().
- texturizer.texturizer.get_cmd_line_params(argv)[source]
Parse the options from an array of command line arguments.
texturizer.topics module
- texturizer.topics.add_text_topics_features(df, columns, type='flag')[source]
Given a pandas dataframe and a set of column names, calculate the simple text summary features and add them.