3 – TF-IDF

One limitation of the bag-of-words approach is that it treats every word as equally important, whereas intuitively we know that some words occur frequently throughout a corpus and so say little about any single document. For example, in financial documents, cost or price may be a pretty common term. We can compensate for this by counting the number of documents in which each word occurs, which we can call its document frequency, and then dividing each term frequency by the document frequency of that term. This gives us a metric that is proportional to how often a term occurs in a document, but inversely proportional to the number of documents it appears in. It highlights the words that are more specific to a document, and thus better for characterizing it.

You may have heard of, or used, the TF-IDF transform before. It’s simply the product of two weights, very similar to what we’ve seen so far: a term frequency and an inverse document frequency. The most commonly used form of TF-IDF defines term frequency as the raw count of a term, t, in a document, d, divided by the total number of terms in d, and inverse document frequency as the logarithm of the total number of documents in the collection, D, divided by the number of documents in which t is present. Several variations exist that normalize or smooth the resulting values, or guard against edge cases such as divide-by-zero errors when a term appears in no document. Overall, TF-IDF is an innovative approach to assigning weights to words that signify their relevance in documents.
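To make the definition concrete, here is a minimal sketch of the common form described above: term frequency is a term's count divided by the document's length, and inverse document frequency is the logarithm of the number of documents divided by the number of documents containing the term. The function name tf_idf and the three-document toy corpus are made up for illustration, and the sketch applies none of the smoothing or normalization variants, so treat it as one possible reading rather than a reference implementation.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    docs is a list of token lists. Returns one {term: weight} dict per
    document. (Hypothetical helper, unsmoothed form only.)
    """
    n_docs = len(docs)

    # Document frequency: the number of documents each term appears in.
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # tf = count / document length; idf = log(N / df).
            # df[term] is at least 1 here, so no division by zero occurs.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus, invented for this example.
corpus = [
    "the price of the stock rose".split(),
    "the cost of the merger".split(),
    "the price of oil fell".split(),
]
scores = tf_idf(corpus)
```

In this toy corpus, "the" and "of" appear in all three documents, so their inverse document frequency is log(1) and their weight is zero. In the first document, "stock" (found in one document) ends up with a higher weight than "price" (found in two), which matches the intuition that rarer terms characterize a document better.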
