texturizer package

Submodules

texturizer.comparison module

texturizer.comparison.add_comparison_features(df, columns)[source]

This is the entry point to add all the core text similarity features. Note: We left out Ratcliff Obershelp from the set of metrics because it takes close to an order of magnitude longer to compute.

The initial version includes just four string edit distance metrics.
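The metrics themselves are not named here. To illustrate what a string edit distance feature measures, the following minimal sketch computes the Levenshtein distance between two strings. The function name `levenshtein` is hypothetical and is not part of texturizer; whether Levenshtein is one of the metrics used is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b (illustrative helper only)."""
    # previous[j] is the distance between the prefix of a processed so far
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]
```

Applied to a pair of text columns, a function like this would yield one numeric feature per row.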

texturizer.comparison.add_ratcliff_obershelp(df, columns)[source]: Return a copy of a dataframe with features describing matching between the set of named text columns
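For a feel of what a Ratcliff/Obershelp style score looks like, the sketch below uses `difflib.SequenceMatcher` from the Python standard library, whose algorithm is closely related to Ratcliff/Obershelp "gestalt pattern matching". That texturizer computes its feature this way is an assumption, and the helper name `similarity` is hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a score in [0, 1]; 1.0 means the strings are identical."""
    # autojunk=False disables the heuristic that treats very frequent
    # characters as junk in long strings.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()
```

The score is twice the number of matched characters divided by the combined length of both strings.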

texturizer.comparison.add_string_match_features(df, columns)[source]: Return a copy of a dataframe with features describing matching between the set of named text columns

texturizer.config module

texturizer.emoticons module

texturizer.emoticons.add_emoticon_features(df, col)[source]: Given a pandas dataframe and a column name, check for emoticons in the column and add a set of features that indicate both the presence and the emotional flavour of each emoticon.

texturizer.emoticons.add_text_emoticon_features(df, columns)[source]: Given a pandas dataframe and a set of column names, add features that detect the presence of emoticons.

texturizer.emoticons.bck_re = re.compile("[@*\\|/()<>{}\\[\\]]{1,2}[-=^]{0,2}['’`]{0,1}[:;]")
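The pattern appears to target emoticons written mouth first, such as (-: or ):, with an optional nose and an optional tear or quote mark before the eyes. A minimal sketch of how it behaves on sample text follows; the pattern is copied from the attribute above, while the helper name `find_backward_emoticons` is hypothetical.

```python
import re

# Pattern copied verbatim from texturizer.emoticons.bck_re.
bck_re = re.compile("[@*\\|/()<>{}\\[\\]]{1,2}[-=^]{0,2}['’`]{0,1}[:;]")

def find_backward_emoticons(text: str) -> list:
    """Return every substring of text matched by bck_re (illustrative helper)."""
    # The pattern has no capture groups, so findall returns whole matches.
    return bck_re.findall(text)

matches = find_backward_emoticons("great day (-: see you ):")
no_matches = find_backward_emoticons("version 8) or (B")
```

Here `matches` holds the two mouth-first emoticons, while `no_matches` is empty because neither 8) nor (B ends in a colon or semicolon.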

texturizer.emoticons: Emoticon Recognition Text Features

The functions in this library will add columns to a dataframe that indicate whether there are emoticons in certain columns of text, and whether those emoticons represent one of the more common emotions.

NOTE: In developing these regexes I have deliberately ignored certain emoticons because of the likelihood of false positive matches in text containing brackets. For example, the emoticons 8) and (B will not be matched.

To avoid matching characters inside document markup language tags, there is a rudimentary regex-based tag removal step that produces an unescaped version of the text. This version is expected to have been generated by the initial simple text function run by the program. It removes URLs and HTML tags before any attempt to match emoticons.

Some references used when considering which emoticons to include:

https://www.unglobalpulse.org/2014/10/emoticon-use-in-arabic-spanish-and-english-tweets/

https://www.researchgate.net/publication/266269913_From_Emoticon_to_Universal_Symbolic_Signs_Can_Written_Language_Survive_in_Cyberspace

https://www.sciencedirect.com/science/article/abs/pii/S0950329317300939

https://www.semanticscholar.org/paper/An-Approach-towards-Text-to-Emoticon-Conversion-and-Jha/3b81505fa7fec81563b2dafae3939fa1b07f3a98

https://www.qualitative-research.net/index.php/fqs/article/view/175/391

https://www.researchgate.net/publication/221622114_M_Textual_Affect_Sensing_for_Sociable_and_Expressive_Online_Communication

texturizer.emoticons.get_emoticon_col_list(col)[source]

texturizer.featurize module

texturizer.featurize.generate_feature_function(parameters)[source]: This function takes the processed command line arguments that determine which features to apply and partially applies them to the process_df function, returning a function that can be used to apply those parameters to multiple chunks of a dataframe.

texturizer.featurize.process_df(df, params)[source]: Coordinates the process of generating the features.
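The partial application pattern described here can be sketched with `functools.partial`. Everything below is illustrative: `process_chunk` is a hypothetical stand-in for process_df that works on lists of dicts rather than dataframes, and the parameter keys `length` and `upper` are invented for the example.

```python
from functools import partial

def process_chunk(rows, params):
    """Hypothetical stand-in for process_df: add the requested features."""
    out = []
    for row in rows:
        row = dict(row)  # copy so the input chunk is left untouched
        if params.get("length"):
            row["text_length"] = len(row["text"])
        if params.get("upper"):
            row["text_upper"] = row["text"].upper()
        out.append(row)
    return out

def make_feature_function(params):
    # Bind params once; the returned callable only needs a chunk.
    return partial(process_chunk, params=params)

featurize = make_feature_function({"length": True})
chunks = [[{"text": "hello"}], [{"text": "hi there"}]]
results = [featurize(chunk) for chunk in chunks]
```

Binding the parameters up front means the chunk-processing loop only has to pass the data, which keeps it independent of the command line options.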

texturizer.literacy module

texturizer.literacy.add_literacy_features(df, col)[source]: Given a pandas dataframe and a column name, add simple text match features for literacy.

texturizer.literacy.add_text_literacy_features(df, columns)[source]: Given a pandas dataframe and a set of column names, calculate the simple literacy features and add them.

texturizer.pos module

texturizer.pos.add_pos_features(df, col)[source]: Given a pandas dataframe and a column name, add features for the proportion of the dominant parts of speech: nouns, verbs, adjectives, adverbs, pronouns and adpositions.

texturizer.pos.add_text_pos_features(df, columns)[source]: Given a pandas dataframe and a set of column names, calculate the part of speech features and add them.

texturizer.process module

texturizer.process.count_lines(path_to_file)[source]: Return a count of the total lines in a file, computed in a way that makes the file size irrelevant.
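One way to count lines without the file size mattering is to read fixed-size binary blocks and count newline bytes, so memory use stays constant. The sketch below takes that approach; it is an assumption about the technique rather than the actual implementation, and the name `count_lines_streaming` is hypothetical.

```python
import os
import tempfile

def count_lines_streaming(path, block_size=1024 * 1024):
    """Count newline characters by reading the file one block at a time."""
    count = 0
    with open(path, "rb") as handle:
        while True:
            block = handle.read(block_size)
            if not block:
                break
            count += block.count(b"\n")
    return count

# Demonstration on a small temporary file.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"id,text\n1,hello\n2,world\n")
line_total = count_lines_streaming(tmp.name)
os.remove(tmp.name)
```

Note that this counts newline characters, so a final line without a trailing newline is not included.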

texturizer.process.end_profile(proc_name)[source]

texturizer.process.eprint(*args, **kwargs)[source]

texturizer.process.extract_file_extension(path_to_file)[source]

texturizer.process.initialise_profile()[source]

texturizer.process.isNaN(num)[source]

texturizer.process.len_or_null(val)[source]: Alternative len function that will simply return numpy.NA for invalid values. This is needed to get sensible results when running len over a column that may contain nulls
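A minimal sketch of a null-tolerant length function is shown below. It uses `math.nan` as the null marker, which is an assumption, and the name `safe_len` is hypothetical.

```python
import math

def safe_len(val):
    """Return len(val), or NaN when val has no length (None, floats, ...)."""
    try:
        return len(val)
    except TypeError:
        return math.nan

lengths = [safe_len(v) for v in ["abc", "", None, math.nan]]
```

Missing values in a text column commonly surface as `None` or a float NaN, and calling `len` on either raises `TypeError`, which the function converts into a null result.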

texturizer.process.load_complete_dataframe(path_to_file)[source]: We load the entire dataset into memory, using the file extension to determine the expected format. We are using encoding='latin1' because it appears to permit loading of the largest variety of files. Representation of strings may not be perfect, but is not important for generating a summarization of the entire dataset.

texturizer.process.load_dictionary(filename, escape=False)[source]: Utility function to load a json serialised dictionary

texturizer.process.load_word_list(filename, escape=False)[source]: Utility function to load topic vocab word lists for pattern matching.

texturizer.process.load_word_pattern(filename, prefix='', pluralize=True, bound=True, escape=False)[source]

texturizer.process.padded(k, padto=20)[source]

texturizer.process.print_output(df, header=True)[source]

texturizer.process.print_profiles()[source]

texturizer.process.process_file_in_chunks(path_to_file, function_to_apply)[source]: Given a path to a large dataset, we iteratively load it in chunks, apply the supplied function to each chunk and write the result to the output stream.
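The chunked load, apply and write loop can be sketched with the standard library alone. This is an illustration under assumptions: it reads CSV rows from any iterable of lines rather than a path, uses dicts instead of dataframes, and the names `process_in_chunks` and `add_length` are hypothetical.

```python
import csv
import io
import sys
from itertools import islice

def process_in_chunks(lines, function_to_apply, chunk_size=2, out=sys.stdout):
    """Read CSV rows chunk by chunk, transform each chunk, write the result."""
    reader = csv.DictReader(lines)
    writer = None
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        result = function_to_apply(chunk)
        if not result:
            continue
        if writer is None:
            # Write the header once, using the columns of the first result.
            writer = csv.DictWriter(out, fieldnames=list(result[0].keys()),
                                    lineterminator="\n")
            writer.writeheader()
        writer.writerows(result)

def add_length(rows):
    # Example feature function: append the length of the text column.
    return [{**row, "text_length": len(row["text"])} for row in rows]

source = io.StringIO("id,text\n1,hello\n2,hi\n3,hey there\n")
sink = io.StringIO()
process_in_chunks(source, add_length, chunk_size=2, out=sink)
```

Only one chunk is held in memory at a time, and the header is emitted with the first chunk only.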

texturizer.process.remove_escapes_and_non_printable(text)[source]: Apply the codecs escape to decode any escaped characters, then apply a regex to remove any non-printable characters.
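The two steps can be sketched as follows. The sketch assumes the `unicode_escape` codec for decoding and treats only ASCII characters 0x20 to 0x7E as printable; both choices are assumptions, and the name `clean_text` is hypothetical.

```python
import re

# Anything outside the printable ASCII range is removed (an assumption
# about what counts as "non printable").
NON_PRINTABLE = re.compile(r"[^\x20-\x7E]")

def clean_text(text: str) -> str:
    """Decode backslash escapes, then drop non-printable characters."""
    # backslashreplace keeps characters outside Latin-1 as \uXXXX escapes,
    # which the unicode_escape decode step then restores.
    decoded = text.encode("latin-1", "backslashreplace").decode("unicode_escape")
    return NON_PRINTABLE.sub("", decoded)
```

With this definition a literal backslash followed by n is first decoded to a newline and then removed, so escaped control characters do not survive into the cleaned text.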

texturizer.process.remove_tags(text)[source]

texturizer.process.remove_urls(text)[source]

texturizer.process.remove_urls_and_tags(text)[source]: Remove any obvious text elements that appear to be either URLs or HTML tags.
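A rudimentary regex-based version of this clean-up might look like the sketch below. The two patterns and the name `strip_urls_and_tags` are assumptions made for illustration, not the expressions used by the library.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")                # anything between < and >
URL_RE = re.compile(r"https?://\S+|www\.\S+")  # obvious web addresses

def strip_urls_and_tags(text: str) -> str:
    """Remove HTML-like tags and URLs, then collapse leftover whitespace."""
    # Tags go first so that URLs inside attributes disappear with the tag.
    without_tags = TAG_RE.sub(" ", text)
    without_urls = URL_RE.sub(" ", without_tags)
    return " ".join(without_urls.split())
```

Removing tags before URLs matters: a URL pattern that runs to the next whitespace would otherwise swallow the closing bracket of a tag such as a link and leave the opening part behind.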

texturizer.process.start_profile(proc_name)[source]

texturizer.profanity module

texturizer.profanity.add_profanity_features(df, col)[source]: Given a pandas dataframe and a column name, add simple text match features for profanities.

texturizer.profanity.add_text_profanity_features(df, columns)[source]: Given a pandas dataframe and a set of column names, calculate the simple text summary features and add them.

texturizer.profanity.get_profanity_col_list(col)[source]

texturizer.rhetoric module

texturizer.rhetoric.add_rhetoric_counts(df, col)[source]: Given a pandas dataframe and a column name, count the number of pattern matches for each feature.

texturizer.rhetoric.add_text_rhetoric_features(df, columns)[source]: Given a pandas dataframe and a set of column names, calculate the rhetoric trait features and add them.

texturizer.sentiment module

texturizer.sentiment.add_sentiment_features(df, col)[source]: Given a pandas dataframe and a column name, add simple text match features for sentiment.

texturizer.sentiment.add_text_sentiment_features(df, columns)[source]: Given a pandas dataframe and a set of column names, calculate the sentiment features and add them.

texturizer.sentiment.add_textblob_features(df, col)[source]

texturizer.simple module

texturizer.simple.add_text_features(df, col)[source]: Given a pandas dataframe and a column name, calculate the simple text summary features and add them.

texturizer.simple.add_text_summary_features(df, columns)[source]: Given a pandas dataframe and a set of column names, calculate the simple text summary features and add them.

texturizer.simple.get_simple_col_list(col)[source]

texturizer.simple.null_tolerant_len(x)[source]

texturizer.texturizer module

texturizer.texturizer: provides entry point main().

texturizer.texturizer.get_cmd_line_params(argv)[source]: Parse the options out of an array of command line arguments.

texturizer.texturizer.main()[source]: Main texturizer application entry point. Parses out the command line options and determines the size of the file, then processes the file for the requested features.

texturizer.texturizer.print_usage(args)[source]: Command line application usage instructions.

texturizer.topics module

texturizer.topics.add_text_topics_features(df, columns, type='flag')[source]: Given a pandas dataframe and a set of column names, calculate the simple text summary features and add them.

texturizer.topics.add_topic_counts(df, col, normalize=False)[source]: Given a pandas dataframe and a column name, count the number of keyword matches for each topic.
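Counting keyword matches per topic can be sketched with one compiled pattern per topic. The vocabularies below are invented for the example, the helper name `topic_counts` is hypothetical, and interpreting `normalize` as division by the number of words is an assumption.

```python
import re

# Hypothetical topic vocabularies; the real word lists ship with the package.
TOPICS = {
    "sport": ["football", "tennis", "goal"],
    "food": ["pizza", "pasta"],
}

# One case-insensitive, word-bounded alternation per topic.
PATTERNS = {
    topic: re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b",
                      re.IGNORECASE)
    for topic, words in TOPICS.items()
}

def topic_counts(text: str, normalize: bool = False) -> dict:
    """Count keyword matches per topic, optionally divided by word count."""
    n_words = len(text.split())
    counts = {}
    for topic, pattern in PATTERNS.items():
        hits = len(pattern.findall(text))
        counts[topic] = hits / n_words if normalize and n_words else hits
    return counts
```

For the text "Football and tennis then pizza" this gives two matches for sport and one for food, or 0.4 and 0.2 when normalized by the five words.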

texturizer.topics.add_topic_indicators(df, col)[source]: Given a pandas dataframe and a column name, add a simple text match for top indicators.

texturizer.traits module

texturizer.traits.add_text_trait_features(df, columns)[source]: Given a pandas dataframe and a set of column names, calculate the personality trait features and add them.

texturizer.traits.add_trait_counts(df, col)[source]: Given a pandas dataframe and a column name, count the number of keyword matches for each trait.