The goal of this guide is to explore some of the main scikit-learn tools on a single practical task: analyzing a collection of text documents (newsgroup posts) on twenty different topics.
In this section we will see how to:
Tutorial setup
To get started with this tutorial, you must first install scikit-learn and all of its required dependencies.
Please refer to the installation instructions page for more information and for system-specific instructions.
The source of this tutorial can be found within your scikit-learn folder:
The source can also be found on GitHub.
The tutorial folder should contain the following sub-folders:
You can already copy the skeletons into a new folder somewhere on your hard drive named sklearn_tut_workspace, where you will edit your own files for the exercises while keeping the original skeletons intact:
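Assuming your shell is currently inside the tutorial folder that contains the skeletons sub-folder, the copy might look like this:

```shell
# copy the skeletons into a fresh working folder, leaving the originals intact
cp -r skeletons sklearn_tut_workspace
```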
Machine learning algorithms need data. Go to each $TUTORIAL_HOME/data sub-folder and run the fetch_data.py script from there (after having read it first).
For instance:
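A minimal sketch of the command sequence, assuming $TUTORIAL_HOME points at your copy of the tutorial folder (the exact data sub-folder name is hypothetical here):

```shell
cd $TUTORIAL_HOME/data/twenty_newsgroups   # hypothetical sub-folder name
less fetch_data.py                         # read the script first
python fetch_data.py                       # then run it to download the data
```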
Loading the 20 newsgroups dataset
The dataset is called “Twenty Newsgroups”. Here is the official description, quoted from the website:
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of our knowledge, it was originally collected by Ken Lang, probably for his paper “Newsweeder: Learning to filter netnews,” though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.
In the following we will use the built-in dataset loader for 20 newsgroups from scikit-learn. Alternatively, it is possible to download the dataset manually from the website and use the sklearn.datasets.load_files function by pointing it to the 20news-bydate-train sub-folder of the uncompressed archive folder.
In order to get faster execution times for this first example we will work on a partial dataset with only 4 categories out of the 20 available in the dataset:
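For example, a list of four category names (these are standard 20-newsgroups group names; any four would do):

```python
# four of the twenty newsgroup categories
categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
```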
We can now load the list of files matching those categories as follows:
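A sketch of the loading call (the categories list is repeated here so the snippet stands alone; the first call downloads and caches the dataset, so it needs network access):

```python
from sklearn.datasets import fetch_20newsgroups

# the four categories chosen above
categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']

# downloads and caches the data on first use
twenty_train = fetch_20newsgroups(subset='train', categories=categories,
                                  shuffle=True, random_state=42)
```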
The returned dataset is a scikit-learn “bunch”: a simple holder object with fields that can be accessed both as Python dict keys and as object attributes for convenience. For instance, target_names holds the list of the requested category names:
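The dual access pattern can be seen on a minimal stand-in Bunch (the target_names value here is just an illustration):

```python
from sklearn.utils import Bunch

# a tiny stand-in for the object returned by the dataset loader
b = Bunch(target_names=['alt.atheism', 'comp.graphics'])

# the same field is reachable as an attribute or as a dict key
assert b.target_names == b['target_names']
```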
The files themselves are loaded in memory in the data attribute. For reference the filenames are also available:
Let’s print the first lines of the first loaded file:
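One way to do that, with the dataset re-loaded here so the snippet stands alone (requires network access on first use):

```python
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories,
                                  shuffle=True, random_state=42)

# print the first three lines of the first document
print("\n".join(twenty_train.data[0].split("\n")[:3]))
```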
Supervised learning algorithms will require a category label for each document in the training set. In this case the category is the name of the newsgroup, which also happens to be the name of the folder holding the individual documents.
For speed and space efficiency reasons scikit-learn loads the target attribute as an array of integers that corresponds to the index of the category name in the target_names list. The category integer id of each sample is stored in the target attribute:
It is possible to get back the category names as follows:
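The mapping from integer ids back to names is just an index lookup; a sketch with hypothetical stand-ins for twenty_train.target_names and twenty_train.target:

```python
import numpy as np

# hypothetical values standing in for the loaded dataset's attributes
target_names = ['alt.atheism', 'comp.graphics', 'sci.med',
                'soc.religion.christian']
target = np.array([3, 1, 2, 0])

# map each integer id back to its category name
names = [target_names[t] for t in target]
```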
You might have noticed that the samples were shuffled randomly when we called fetch_20newsgroups(..., shuffle=True, random_state=42): this is useful if you wish to select only a subset of samples to quickly train a model and get a first idea of the results before re-training on the complete dataset later.
Extracting features from text files
In order to perform machine learning on text documents, we first need toturn the text content into numerical feature vectors.
Bags of words
The most intuitive way to do so is to use a bags of words representation:
The bags of words representation implies that n_features is the number of distinct words in the corpus: this number is typically larger than 100,000.
If n_samples == 10000, storing X as a NumPy array of type float32 would require 10000 x 100000 x 4 bytes = 4GB in RAM, which is barely manageable on today’s computers.
Fortunately, most values in X will be zeros since for a given document fewer than a few thousand distinct words will be used. For this reason we say that bags of words are typically high-dimensional sparse datasets. We can save a lot of memory by only storing the non-zero parts of the feature vectors in memory.
scipy.sparse matrices are data structures that do exactly this,and scikit-learn has built-in support for these structures.
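The saving is easy to see on a toy term-count matrix (the numbers below are made up for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix

# a tiny term-count matrix: 3 documents x 6 features, mostly zeros
dense = np.array([[0, 2, 0, 0, 1, 0],
                  [0, 0, 0, 3, 0, 0],
                  [1, 0, 0, 0, 0, 1]], dtype=np.float32)
X = csr_matrix(dense)

# only the 5 non-zero entries are stored explicitly,
# yet the full matrix can be recovered exactly
print(X.nnz)                            # 5
print(np.allclose(X.toarray(), dense))  # True
```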
Tokenizing text with
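In scikit-learn, tokenizing text and building the bags-of-words matrix are handled by CountVectorizer; a toy sketch on a two-document corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['the cat sat on the mat', 'the dog ate the cat food']  # toy corpus
vect = CountVectorizer()       # default tokenization: lowercased words
X = vect.fit_transform(docs)   # sparse matrix of token counts

print(X.shape)  # (2 documents, number of distinct tokens)
```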