The goal of this guide is to explore some of the main scikit-learn tools on a single practical task: analyzing a collection of text documents (newsgroup posts) on twenty different topics.
In this section we will see how to:
Tutorial setup
To get started with this tutorial, you must first install scikit-learn and all of its required dependencies.
Please refer to the installation instructions page for more information and for system-specific instructions.
The source of this tutorial can be found within your scikit-learn folder:
The source can also be found on GitHub.
The tutorial folder should contain the following sub-folders:
You can already copy the skeletons into a new folder somewhere on your hard drive named sklearn_tut_workspace, where you will edit your own files for the exercises while keeping the original skeletons intact:
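Assuming your shell is currently inside the tutorial folder that contains the skeletons sub-folder, the copy might look like this:

```shell
# copy the skeletons into a fresh working folder, leaving the originals intact
cp -r skeletons sklearn_tut_workspace
```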
Machine learning algorithms need data. Go to each $TUTORIAL_HOME/data sub-folder and run the fetch_data.py script from there (after having read it first).
For instance:
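A minimal sketch of the command sequence, assuming $TUTORIAL_HOME points at your copy of the tutorial folder (the exact data sub-folder name is hypothetical here):

```shell
cd $TUTORIAL_HOME/data/twenty_newsgroups   # hypothetical sub-folder name
less fetch_data.py                         # read the script first
python fetch_data.py                       # then run it to download the data
```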
Loading the 20 newsgroups dataset
The dataset is called “Twenty Newsgroups”. Here is the official description, quoted from the website:
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of our knowledge, it was originally collected by Ken Lang, probably for his paper “Newsweeder: Learning to filter netnews,” though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.
In the following we will use the built-in dataset loader for 20 newsgroups from scikit-learn. Alternatively, it is possible to download the dataset manually from the website and use the sklearn.datasets.load_files function by pointing it to the 20news-bydate-train sub-folder of the uncompressed archive folder.
In order to get faster execution times for this first example we will work on a partial dataset with only 4 categories out of the 20 available in the dataset:
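For example, a list of four category names (these are standard 20-newsgroups group names; any four would do):

```python
# four of the twenty newsgroup categories
categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
```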
We can now load the list of files matching those categories as follows:
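A sketch of the loading call (the categories list is repeated here so the snippet stands alone; the first call downloads and caches the dataset, so it needs network access):

```python
from sklearn.datasets import fetch_20newsgroups

# the four categories chosen above
categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']

# downloads and caches the data on first use
twenty_train = fetch_20newsgroups(subset='train', categories=categories,
                                  shuffle=True, random_state=42)
```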
The returned dataset is a scikit-learn “bunch”: a simple holder object with fields that can be accessed both as Python dict keys and as object attributes for convenience. For instance, target_names holds the list of the requested category names:
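The dual access pattern can be seen on a minimal stand-in Bunch (the target_names value here is just an illustration):

```python
from sklearn.utils import Bunch

# a tiny stand-in for the object returned by the dataset loader
b = Bunch(target_names=['alt.atheism', 'comp.graphics'])

# the same field is reachable as an attribute or as a dict key
assert b.target_names == b['target_names']
```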
The files themselves are loaded in memory in the data attribute. For reference the filenames are also available:
Let’s print the first lines of the first loaded file:
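One way to do that, with the dataset re-loaded here so the snippet stands alone (requires network access on first use):

```python
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories,
                                  shuffle=True, random_state=42)

# print the first three lines of the first document
print("\n".join(twenty_train.data[0].split("\n")[:3]))
```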
Supervised learning algorithms will require a category label for each document in the training set. In this case the category is the name of the newsgroup, which also happens to be the name of the folder holding the individual documents.
For speed and space efficiency reasons scikit-learn loads the target attribute as an array of integers that corresponds to the index of the category name in the target_names list. The category integer id of each sample is stored in the target attribute:
It is possible to get back the category names as follows:
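The mapping from integer ids back to names is just an index lookup; a sketch with hypothetical stand-ins for twenty_train.target_names and twenty_train.target:

```python
import numpy as np

# hypothetical values standing in for the loaded dataset's attributes
target_names = ['alt.atheism', 'comp.graphics', 'sci.med',
                'soc.religion.christian']
target = np.array([3, 1, 2, 0])

# map each integer id back to its category name
names = [target_names[t] for t in target]
```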
You might have noticed that the samples were shuffled randomly when we called fetch_20newsgroups(..., shuffle=True, random_state=42): this is useful if you wish to select only a subset of samples to quickly train a model and get a first idea of the results before re-training on the complete dataset later.
Extracting features from text files
In order to perform machine learning on text documents, we first need toturn the text content into numerical feature vectors.
Bags of words
The most intuitive way to do so is to use a bags of words representation:
The bags of words representation implies that n_features is the number of distinct words in the corpus: this number is typically larger than 100,000.
If n_samples == 10000, storing X as a NumPy array of type float32 would require 10000 x 100000 x 4 bytes = 4GB in RAM, which is barely manageable on today’s computers.
Fortunately, most values in X will be zeros since for a given document fewer than a few thousand distinct words will be used. For this reason we say that bags of words are typically high-dimensional sparse datasets. We can save a lot of memory by only storing the non-zero parts of the feature vectors in memory.
scipy.sparse matrices are data structures that do exactly this,and scikit-learn has built-in support for these structures.
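The saving is easy to see on a toy term-count matrix (the numbers below are made up for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix

# a tiny term-count matrix: 3 documents x 6 features, mostly zeros
dense = np.array([[0, 2, 0, 0, 1, 0],
                  [0, 0, 0, 3, 0, 0],
                  [1, 0, 0, 0, 0, 1]], dtype=np.float32)
X = csr_matrix(dense)

# only the 5 non-zero entries are stored explicitly,
# yet the full matrix can be recovered exactly
print(X.nnz)                            # 5
print(np.allclose(X.toarray(), dense))  # True
```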
Tokenizing text with
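In scikit-learn, tokenizing text and building the bags-of-words matrix are handled by CountVectorizer; a toy sketch on a two-document corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['the cat sat on the mat', 'the dog ate the cat food']  # toy corpus
vect = CountVectorizer()       # default tokenization: lowercased words
X = vect.fit_transform(docs)   # sparse matrix of token counts

print(X.shape)  # (2 documents, number of distinct tokens)
```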