Introduction
Texturizer is a Python package that aims to provide an easy and intuitive way
to generate features from columns of text in a dataset.
The current implementation has been developed in Python 3 and tested on a variety of CSV files.
Motivation
Text data can add significant value to machine learning projects, but it is not always obvious how to make use of it. There are a vast number of ways to exploit text as features in a model and it is not always clear what is likely to work.
This package is intended to provide a quick and easily extensible framework for adding columns to a dataset using a wide variety of feature engineering approaches.
It can be used either as a CLI utility to process a tabular dataset, or as a Python package that can be included within your ML projects. We include a scikit-learn compatible transformer for use in machine learning pipelines.
Limitations
The majority of the features are implemented as regular expression patterns. This makes the features fast to calculate and easily extensible, but each new feature requires a hand-written pattern, so extending the set is a very manual process.
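To illustrate the general idea of pattern-based features, here is a minimal sketch using only the Python standard library. It is not Texturizer's own code or API: the `PATTERNS` dictionary, the feature names, and the `add_text_features` helper are all hypothetical and chosen only for this example.

```python
import re

# Hypothetical feature patterns for illustration only;
# these are NOT the patterns or names used by Texturizer.
PATTERNS = {
    "question_count": re.compile(r"\?"),
    "url_count": re.compile(r"https?://\S+"),
    "negation_count": re.compile(r"\b(?:not|never|no)\b", re.IGNORECASE),
}


def add_text_features(rows, column):
    """Return copies of `rows` with one match-count column per pattern."""
    out = []
    for row in rows:
        # Treat missing or None values as empty text.
        text = row.get(column) or ""
        new_row = dict(row)
        for name, pattern in PATTERNS.items():
            new_row[f"{column}_{name}"] = len(pattern.findall(text))
        out.append(new_row)
    return out


rows = [
    {"id": 1, "review": "Is this not great? See http://example.com"},
    {"id": 2, "review": None},
]
features = add_text_features(rows, "review")
```

In a design like this, adding a feature means adding one entry to the pattern dictionary, which is what makes the approach easy to extend but also manual.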
Currently the package supports English only, but this could be changed by swapping out the word patterns and dictionaries, and introducing alternative spaCy language models.