3–4 Encoding Words Solution V1

First, here’s how I went about creating a vocab-to-int dictionary and encoding our word data; there are a few ways to do this. I chose to use the imported Counter to create a dictionary that maps the most common words in our reviews text to the smallest integers. So the first thing I’m doing is getting a count of how many times each of our words actually appears in the data, using Counter and passing in our words. Then, with these counts, I’m creating a sorted vocabulary. This sorts each unique word by its frequency of occurrence. So this vocab should hold all of the unique words that make up our word data, without any repeats, and it will be sorted by commonality.

I also know that I want to start encoding my words with the integer value one rather than zero. So the most common word, like “the” or “of”, should actually be encoded as one. I’m making sure that we start our indexing at one by using enumerate and passing in our vocab and our starting index, one. Enumerate is going to return a numerical value, ii, and a word in our vocabulary, and it will do this in order. So our first index is going to be one, and the first word is going to be the most common word in our sorted vocabulary. So, to create the dictionary vocab_to_int, I’m taking each unique word in our vocab and mapping it to an index starting at the value one.

Great. Next, I’m using this dictionary to tokenize all of our word data. Here, I’m looking at each individual review; each of these is one item in reviews_split from before, when I separated the reviews by the newline character. Then, for each word in a review, I’m using my dictionary to convert that word into its integer value, and I’m appending the tokenized review to reviews_ints. So the end result will be a list of tokenized reviews. In the cells below, I’m printing out the length of my dictionary and my first sample encoded review.
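The steps above can be sketched in a few lines. This is a minimal sketch, not the course notebook itself: the variables `words` and `reviews_split` are assumed to already exist from the earlier preprocessing step, so the tiny sample data here is made up just to make the snippet runnable.

```python
from collections import Counter

# Made-up stand-ins for the preprocessed data from earlier steps:
# `reviews_split` is one review per item, `words` is the flat word list.
reviews_split = ["this movie was great", "terrible plot terrible acting"]
words = " ".join(reviews_split).split()

# Count how often each word appears, then sort the unique words
# from most to least common
counts = Counter(words)
vocab = sorted(counts, key=counts.get, reverse=True)

# Map each word to an integer, starting the indexing at 1
# (so 0 stays free, e.g. for padding later on)
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}

# Tokenize: convert each review into a list of integer tokens
reviews_ints = []
for review in reviews_split:
    reviews_ints.append([vocab_to_int[word] for word in review.split()])
```

Because the vocabulary is sorted by frequency before enumerating, the most common word always lands on index 1, which is exactly the behavior described above.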
I can see that my dictionary is a bit over 74,000 words long, which means that this many unique words make up our reviews data. Let’s take a look at this tokenized review. I’m not seeing any zero values, which is good, and these encoded values look as I might expect. So I’ve successfully encoded the review words, and I’ll move on to the next task, which is encoding our labels.

In this case, I want to look at my labels text data and turn the word positive into one and negative into zero. Now, we haven’t processed our labels data much, but I know that, much like the reviews text, a new label is on every new line in this file. So I can get a list of labels, labels_split, by splitting our loaded-in data using the newline character as a delimiter. Then I just have a statement that says: for every label in this labels_split list, I’m going to add a one to my array if it reads as positive, and a zero otherwise. I’m wrapping this in np.array, and that’s all I need to do to create an array of encoded labels.

All right. This is a good start. There are still a few data cleanup and formatting steps that I’ll want to take before we get to defining our model, so let’s address those tasks next.
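The label-encoding step described above can be sketched as follows. Again, this is an illustrative sketch: the `labels` string stands in for the file contents that the course loads, and the sample values are made up.

```python
import numpy as np

# Made-up stand-in for the loaded labels file: one label per line
labels = "positive\nnegative\npositive"

# Split on the newline delimiter, then map positive -> 1, negative -> 0
labels_split = labels.split("\n")
encoded_labels = np.array(
    [1 if label == "positive" else 0 for label in labels_split]
)
```

The list comprehension mirrors the “one if positive, zero otherwise” statement, and wrapping it in np.array yields the array of encoded labels.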
