10 – Feature Extraction

Okay. We now have clean, normalized text. Can we feed this into a statistical or machine learning model? Not quite. Let’s see why. Text data is represented on modern computers using an encoding such as ASCII or Unicode that maps every character to a number. Computers store and transmit these values as binary, zeros and ones. These numbers also have an implicit ordering: 65 is less than 66, which is less than 67. But does that mean A is less than B, and B is less than C? No. In fact, that would be an incorrect assumption to make and might mislead our natural language processing algorithms. Moreover, individual characters don’t carry much meaning at all. It is words that we should be concerned with, but computers don’t have a standard representation for words. Yes, internally they are just sequences of ASCII or Unicode values, but those sequences don’t capture the meanings of words or the relationships between them.

Compare this with how an image is represented in computer memory. Each pixel value contains the relative intensity of light at that spot in the image. For a color image, we keep one value per primary color: red, green, and blue. These values carry relevant information. Two pixels with similar values are perceptually similar. Therefore, it makes sense to directly use pixel values in a numerical model. Yes, some feature engineering may be necessary, such as edge detection or filtering, but pixels are a good starting point.

So the question is, how do we come up with a similar representation for text data that we can use as features for modeling? The answer again depends on what kind of model you’re using and what task you’re trying to accomplish. If you want to use a graph-based model to extract insights, you may want to represent your words as symbolic nodes with relationships between them, as in WordNet. For statistical models, however, you need some sort of numerical representation. Even then, you have to think about the end goal.
If you’re trying to perform a document-level task, such as spam detection or sentiment analysis, you may want to use a per-document representation such as bag-of-words or doc2vec. If you want to work with individual words and phrases, such as for text generation or machine translation, you’ll need a word-level representation such as word2vec or GloVe. There are many ways of representing textual information, and it is only through practice that you can learn what you need for each problem.
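To make the idea of a per-document representation a little more concrete, here is a minimal bag-of-words sketch in plain Python. It is only an illustration under some assumptions: the two toy documents are made up, the text is assumed to be already normalized so that splitting on whitespace is enough, and the helper names `build_vocabulary` and `bag_of_words` are hypothetical rather than taken from any library.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct token to a column index, in sorted order."""
    tokens = {tok for doc in documents for tok in doc.split()}
    return {tok: i for i, tok in enumerate(sorted(tokens))}


def bag_of_words(document, vocabulary):
    """Return a count vector for one document; unknown tokens are ignored."""
    counts = Counter(document.split())
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector


# Toy corpus (assumed already lowercased and stripped of punctuation).
docs = [
    "win a free prize now",
    "the meeting is now on friday",
]

vocabulary = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocabulary) for doc in docs]

for doc, vec in zip(docs, vectors):
    print(vec, "<-", doc)
```

Each document becomes a fixed-length vector of word counts, one column per vocabulary word, so a statistical model can consume it directly. Word order is thrown away, which is often acceptable for document-level tasks like spam detection but not for word-level tasks like translation.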
