3-3-3-8. Creating pandas DataFrames

dictionary of pandas series –> dataframe

dictionary of lists(arrays) –> dataframe (lists(arrays) must be of the same length)

The second main data structure in Pandas is the DataFrame,

which is a two-dimensional object with labeled rows and

columns and can also hold multiple data types.

If you’re familiar with Excel,

you can think of a DataFrame as a really powerful spreadsheet.

We can create Pandas DataFrames manually or by loading data from a file.

We will start by creating a DataFrame manually from a dictionary,

containing several pandas series.

Let’s create that dictionary and then pass it into the Pandas DataFrame function.

Here’s one that contains the shopping carts of two people,

Alice and Bob on an online store.

Each series contains the price of the items and is labeled with the item names.

And let’s confirm that items is of the dictionary data type.

Now that we have a dictionary,

we are ready to create a DataFrame by passing it to the DataFrame function.

Remember, when using the DataFrame function,

capitalize the D and F in DataFrame.

There are several things to notice here.

First, we see that DataFrames are displayed in a tabular form,

much like a spreadsheet with the labels of the rows and columns in bold.

Also notice that the row labels of the DataFrame are built from

the union of the index labels we provided in the series,

and the column labels of the DataFrame are taken from the keys of the dictionary.

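In this output, the columns are arranged alphabetically and not in the order given by the dictionary.

That is the behavior of older Pandas versions; since version 0.23 (with Python 3.6 or later), the order of the dictionary keys is kept.

Either way, we will see later that this sorting does not happen

when we load data into a DataFrame from a file.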

Lastly, notice the NaN values that appeared in the DataFrame.

NaN stands for not a number,

and is Pandas’ way of indicating that it doesn’t have

a value for this particular row and column.

For example, if we look at the column Alice,

we see that it has NaN in the watch index.

This is because the dictionary over here didn’t have an item for Alice called watch.

Whenever a DataFrame is created,

if a particular column doesn’t have values for a particular index,

Pandas will put a NaN there.

If we were to feed this data into a machine-learning algorithm,

we would have to remove these NaN values first.

In a later video,

we will learn how to deal with NaN values and clean our data.

For now, we will leave these values in our dataframe.

In this example, we created a Pandas DataFrame from

a dictionary of pandas series that had clearly defined index labels.

If we don’t provide index labels however,

Pandas will use numerical row indices when it creates the DataFrame.

Let’s create the same dictionary without the index labels.

We can see that pandas indexes the rows of the DataFrame starting from zero,

just like NumPy indexes its arrays.

Like we did with the pandas series,

we can also extract information from a DataFrame using attributes.

Let’s print some information on our shopping carts DataFrame from earlier.

We can get the index labels,

column labels and data from our dataframe with these attributes,

and we can use the same attributes to get information about its shape.

This dataframe has two dimensions with five rows and two columns,

making a total size of 10.

When creating the shopping carts DataFrame,

we passed the whole items dictionary to the DataFrame function.

However, there might be cases where you’re only interested in a subset of the data.

Pandas lets us select which data we want to put in

our DataFrame with the keywords columns and index.

Let’s see some examples.

Here’s a DataFrame that only loads Bob’s shopping cart,

and here’s one that only has selected items for both Alice and Bob.

And this is one that only has selected items from Alice’s shopping cart.

You can also manually create DataFrames from a dictionary of lists or arrays.

The procedure is the same as before,

we start by creating the dictionary and then pass it into the DataFrame function.

In this case however,

all the lists or arrays in a dictionary must be of the same length.

Here’s a dictionary of integers and floats.

Notice that since the data dictionary we created doesn’t have index labels,

Pandas automatically uses numerical row indices when it creates the DataFrame.

We can however add these labels by using the index keyword in the DataFrame function.

The last method we’ll look at for manually creating

Pandas DataFrames is using a list of Python dictionaries.

Here’s an example. Again, we don’t have index labels,

so Pandas put numerical row indices here.

Let’s assume we’re going to use this DataFrame

to hold the number of items a particular store has in stock.

We’ll rename the index labels to store 1 and store 2.


Creating Pandas DataFrames

Pandas DataFrames are two-dimensional data structures with labeled rows and columns that can hold many data types. If you are familiar with Excel, you can think of Pandas DataFrames as being similar to a spreadsheet. We can create Pandas DataFrames manually or by loading data from a file. In this lesson, we will start by learning how to create Pandas DataFrames manually from dictionaries, and later we will see how we can load data into a DataFrame from a data file.

Create a DataFrame manually

We will start by creating a DataFrame manually from a dictionary of Pandas Series. It is a two-step process:

  1. The first step is to create the dictionary of Pandas Series.
  2. After the dictionary is created we can then pass the dictionary to the pd.DataFrame() function.

We will create a dictionary that contains items purchased by two people, Alice and Bob, on an online store. The Pandas Series will use the price of the items purchased as data, and the purchased items will be used as the index labels to the Pandas Series. Let’s see how this is done in code:

# We import Pandas as pd into Python
import pandas as pd

# We create a dictionary of Pandas Series 
items = {'Bob' : pd.Series(data = [245, 25, 55], index = ['bike', 'pants', 'watch']),
         'Alice' : pd.Series(data = [40, 110, 500, 45], index = ['book', 'glasses', 'bike', 'pants'])}

# We print the type of items to see that it is a dictionary
print(type(items))

<class 'dict'>

Now that we have a dictionary, we are ready to create a DataFrame by passing it to the pd.DataFrame() function. We will create a DataFrame that could represent the shopping carts of various users; in this case we have only two users, Alice and Bob.

Example 1. Create a DataFrame using a dictionary of Series.

# We create a Pandas DataFrame by passing it a dictionary of Pandas Series
shopping_carts = pd.DataFrame(items)

# We display the DataFrame
shopping_carts

         Alice    Bob
bike     500.0  245.0
book      40.0    NaN
glasses  110.0    NaN
pants     45.0   25.0
watch      NaN   55.0

There are several things to notice here, as explained below:

  1. We see that DataFrames are displayed in tabular form, much like an Excel spreadsheet, with the labels of rows and columns in bold.
  2. Also, notice that the row labels of the DataFrame are built from the union of the index labels of the two Pandas Series we used to construct the dictionary. And the column labels of the DataFrame are taken from the keys of the dictionary.
  3. Another thing to notice is that, in the output above, the columns are arranged alphabetically and not in the order given in the dictionary. This is the behavior of older Pandas versions; since version 0.23 (running on Python 3.6 or later), the DataFrame keeps the order of the dictionary keys, so Bob would come before Alice. We will see later that this sorting doesn’t happen when we load data into a DataFrame from a data file.
  4. The last thing we want to point out is that we see some NaN values appear in the DataFrame. NaN stands for Not a Number, and is Pandas’ way of indicating that it doesn’t have a value for that particular row and column index. For example, if we look at the column of Alice, we see that it has NaN in the watch index. You can see why this is the case by looking at the dictionary we created at the beginning. We clearly see that the dictionary has no item for Alice labeled watch. So whenever a DataFrame is created, if a particular column doesn’t have values for a particular row index, Pandas will put a NaN value there.
  5. If we were to feed this data into a machine learning algorithm, we would have to remove these NaN values first. In a later lesson, we will learn how to deal with NaN values and clean our data. For now, we will leave these values in our DataFrame.
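The observations in this list can be checked directly in code. The following is a minimal sketch that rebuilds the items dictionary from Example 1 and inspects the labels and the missing values. The isna() method is a standard DataFrame method that is not otherwise used in this lesson, and the variable name missing is only illustrative.

```python
import pandas as pd

# Same dictionary of Pandas Series as in Example 1
items = {'Bob': pd.Series(data=[245, 25, 55], index=['bike', 'pants', 'watch']),
         'Alice': pd.Series(data=[40, 110, 500, 45], index=['book', 'glasses', 'bike', 'pants'])}

shopping_carts = pd.DataFrame(items)

# The row labels are the union of the index labels of both Series
print(sorted(shopping_carts.index))

# The column labels come from the dictionary keys
# (their display order depends on the Pandas version)
print(sorted(shopping_carts.columns))

# isna() marks the cells that have no value; summing counts them per column
missing = shopping_carts.isna().sum()
print('Missing for Alice:', missing['Alice'])
print('Missing for Bob:', missing['Bob'])
```

Alice has no watch and Bob has neither a book nor glasses, so the counts should come out as one and two missing values respectively.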

In the example above, we created a Pandas DataFrame from a dictionary of Pandas Series that had clearly defined indexes. If we don’t provide index labels to the Pandas Series, Pandas will use numerical row indexes when it creates the DataFrame. Let’s see an example:

Example 2. DataFrame assigns the numerical row indexes by default.

# We create a dictionary of Pandas Series without indexes
data = {'Bob' : pd.Series([245, 25, 55]),
        'Alice' : pd.Series([40, 110, 500, 45])}

# We create a DataFrame
df = pd.DataFrame(data)

# We display the DataFrame
df

   Alice    Bob
0     40  245.0
1    110   25.0
2    500   55.0
3     45    NaN

We can see that Pandas indexes the rows of the DataFrame starting from 0, just like NumPy indexes ndarrays.
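As a small check of this default numbering, the sketch below rebuilds the dictionary from Example 2 and looks at the row index. It assumes nothing beyond the names already used in that example.

```python
import pandas as pd

# Same dictionary of Series without index labels as in Example 2
data = {'Bob': pd.Series([245, 25, 55]),
        'Alice': pd.Series([40, 110, 500, 45])}

df = pd.DataFrame(data)

# Without explicit labels, the rows are numbered starting from 0
print(list(df.index))

# Bob has only three items, so the row labeled 3 has no value for Bob
print(df.loc[3, 'Alice'], df.loc[3, 'Bob'])
```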

Now, just like with Pandas Series, we can also extract information from DataFrames using attributes. Let’s print some information from our shopping_carts DataFrame:

Example 3. Demonstrate a few attributes of DataFrame

# We print some information about shopping_carts
print('shopping_carts has shape:', shopping_carts.shape)
print('shopping_carts has dimension:', shopping_carts.ndim)
print('shopping_carts has a total of:', shopping_carts.size, 'elements')
print()
print('The data in shopping_carts is:\n', shopping_carts.values)
print()
print('The row index in shopping_carts is:', shopping_carts.index)
print()
print('The column index in shopping_carts is:', shopping_carts.columns)

shopping_carts has shape: (5, 2)
shopping_carts has dimension: 2
shopping_carts has a total of: 10 elements

The data in shopping_carts is:
 [[500. 245.]
 [ 40.  nan]
 [110.  nan]
 [ 45.  25.]
 [ nan  55.]]

The row index in shopping_carts is: Index(['bike', 'book', 'glasses', 'pants', 'watch'], dtype='object')

The column index in shopping_carts is: Index(['Alice', 'Bob'], dtype='object')
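The attributes are related to one another: size is simply the number of rows times the number of columns reported by shape. The sketch below illustrates this for the shopping_carts DataFrame. It also uses to_numpy(), a DataFrame method not mentioned in this lesson, as one alternative way to get the same data that the values attribute returns.

```python
import pandas as pd

# Same dictionary of Pandas Series as in Example 1
items = {'Bob': pd.Series(data=[245, 25, 55], index=['bike', 'pants', 'watch']),
         'Alice': pd.Series(data=[40, 110, 500, 45], index=['book', 'glasses', 'bike', 'pants'])}

shopping_carts = pd.DataFrame(items)

# shape is a (rows, columns) tuple
rows, cols = shopping_carts.shape

# size equals the number of rows times the number of columns
print(rows * cols == shopping_carts.size)

# to_numpy() returns the data as a NumPy array, like the values attribute
array = shopping_carts.to_numpy()
print(array.shape)
```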

When creating the shopping_carts DataFrame we passed the entire dictionary to the pd.DataFrame() function. However, there might be cases when you are only interested in a subset of the data. Pandas allows us to select which data we want to put into our DataFrame by means of the keywords columns and index. Let’s see some examples:

# We Create a DataFrame that only has Bob's data
bob_shopping_cart = pd.DataFrame(items, columns=['Bob'])

# We display bob_shopping_cart
bob_shopping_cart

       Bob
bike   245
pants   25
watch   55

Example 4. Selecting specific rows of a DataFrame

# We Create a DataFrame that only has selected items for both Alice and Bob
sel_shopping_cart = pd.DataFrame(items, index = ['pants', 'book'])

# We display sel_shopping_cart
sel_shopping_cart

       Alice   Bob
pants     45  25.0
book      40   NaN

Example 5. Selecting specific columns of a DataFrame

# We Create a DataFrame that only has selected items for Alice
alice_sel_shopping_cart = pd.DataFrame(items, index = ['glasses', 'bike'], columns = ['Alice'])

# We display alice_sel_shopping_cart
alice_sel_shopping_cart

         Alice
glasses    110
bike       500
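One way to read the columns and index keywords is that they choose, and order, the labels of the resulting DataFrame. The sketch below reuses the items dictionary to suggest two consequences: the rows come out in the order requested, and a requested label that no Series contains produces NaN rather than an error. The label 'hat' and the variable names sel and with_hat are made up for illustration.

```python
import pandas as pd

# Same dictionary of Pandas Series as in Example 1
items = {'Bob': pd.Series(data=[245, 25, 55], index=['bike', 'pants', 'watch']),
         'Alice': pd.Series(data=[40, 110, 500, 45], index=['book', 'glasses', 'bike', 'pants'])}

# The rows appear in the order given to the index keyword
sel = pd.DataFrame(items, index=['pants', 'book'])
print(list(sel.index))

# 'hat' is a hypothetical label that appears in neither Series,
# so its row holds NaN
with_hat = pd.DataFrame(items, index=['bike', 'hat'], columns=['Alice'])
print(with_hat)
```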

You can also manually create DataFrames from a dictionary of lists (arrays). The procedure is the same as before: we start by creating the dictionary and then pass it to the pd.DataFrame() function. In this case, however, all the lists (arrays) in the dictionary must be of the same length. Let’s see an example:

Example 6. Create a DataFrame using a dictionary of lists

# We create a dictionary of lists (arrays)
data = {'Integers' : [1,2,3],
        'Floats' : [4.5, 8.2, 9.6]}

# We create a DataFrame 
df = pd.DataFrame(data)

# We display the DataFrame
df

   Floats  Integers
0     4.5         1
1     8.2         2
2     9.6         3
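To see why the lists must have the same length, it can help to try a dictionary where they don’t. The sketch below uses a deliberately broken dictionary (the name bad_data and the flag error_raised are illustrative) and catches the ValueError that Pandas raises in this situation.

```python
import pandas as pd

# 'Floats' has only two entries while 'Integers' has three
bad_data = {'Integers': [1, 2, 3],
            'Floats': [4.5, 8.2]}

try:
    df = pd.DataFrame(bad_data)
    error_raised = False
except ValueError as err:
    # Pandas refuses to build a DataFrame from lists of unequal length
    error_raised = True
    print('Could not build the DataFrame:', err)
```

This differs from a dictionary of Pandas Series, where shorter Series are aligned on their index labels and the gaps are filled with NaN.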

Notice that since the data dictionary we created doesn’t have index labels, Pandas automatically uses numerical row indexes when it creates the DataFrame. We can, however, add labels to the row index by using the index keyword in the pd.DataFrame() function. Let’s see an example:

Example 7. Create a DataFrame using a dictionary of lists, and custom row-indexes (labels)

# We create a dictionary of lists (arrays)
data = {'Integers' : [1,2,3],
        'Floats' : [4.5, 8.2, 9.6]}

# We create a DataFrame and provide the row index
df = pd.DataFrame(data, index = ['label 1', 'label 2', 'label 3'])

# We display the DataFrame
df

         Floats  Integers
label 1     4.5         1
label 2     8.2         2
label 3     9.6         3

The last method for manually creating Pandas DataFrames that we want to look at is by using a list of Python dictionaries. The procedure is the same as before: we start by creating the list of dictionaries and then pass it to the pd.DataFrame() function.

Example 8. Create a DataFrame using a list of dictionaries

# We create a list of Python dictionaries
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35}, 
          {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5}]

# We create a DataFrame 
store_items = pd.DataFrame(items2)

# We display the DataFrame
store_items

   bikes  glasses  pants  watches
0     20      NaN     30       35
1     15     50.0      5       10

Again, notice that since the items2 list we created doesn’t have index labels, Pandas automatically uses numerical row indexes when it creates the DataFrame. As before, we can add labels to the row index by using the index keyword in the pd.DataFrame() function. Let’s assume we are going to use this DataFrame to hold the number of items a particular store has in stock. So, we will label the row indices as store 1 and store 2.

Example 9. Create a DataFrame using a list of dictionaries, and custom row-indexes (labels)

# We create a list of Python dictionaries
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35}, 
          {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5}]

# We create a DataFrame  and provide the row index
store_items = pd.DataFrame(items2, index = ['store 1', 'store 2'])

# We display the DataFrame
store_items

         bikes  glasses  pants  watches
store 1     20      NaN     30       35
store 2     15     50.0      5       10
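With a list of dictionaries, each dictionary becomes one row and its keys become column labels. The sketch below reuses items2 to illustrate how a key missing from one dictionary is handled: the corresponding cell is filled with NaN, just as with Series that lack an index label.

```python
import pandas as pd

# Same list of Python dictionaries as in Example 9
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35},
          {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5}]

store_items = pd.DataFrame(items2, index=['store 1', 'store 2'])

# The column labels are the union of the keys of all the dictionaries
print(sorted(store_items.columns))

# The first dictionary has no 'glasses' key, so store 1 gets NaN there
print(store_items.loc['store 1', 'glasses'])
print(store_items.loc['store 2', 'glasses'])
```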

Additional Reading – Pandas Documentation

  1. Refer to the Intro to data structures for an overview of both the data structures – Series and DataFrame.
  2. Refer to the Attributes and underlying data section in the DataFrame documentation.