3-3-3-12. Loading Data into a pandas DataFrame

When working with data, you’ll most likely use databases from many sources.

Pandas allows us to load databases of different formats into DataFrames.

One of the most popular formats used to store data is CSV, or comma-separated values.

We can load CSV files into DataFrames using the read_csv function.

Let’s load Google stock data into a DataFrame.

This file was taken from Yahoo Finance and contains Google stock data from August 19th, 2004 to October 13th, 2017.

The DataFrame consists of 3,313 rows and seven columns.

Let’s look at the stock data.

We can see it’s a pretty large dataset and that Pandas automatically assigned numerical row indices to the DataFrame.

Pandas also used the names that appeared in the CSV file to assign the column labels.

When dealing with large data sets like this one,

it’s often useful to just take a look at

the first few rows of data instead of the whole dataset.

We can take a look at the first five rows using the head method.

We can also take a look at the last five rows of data using the tail method.

Both of these accept an integer as an argument

if you want to specify a different number of rows to return.

For example, I can get the last eight rows like this,

or the first two rows like this.

Let’s do a quick check to see whether we have any NaN values in this dataset.

To do so, we will use the isnull method followed by the any method to check whether any of the columns contain NaN values.

This shows us that we have no missing data.

When dealing with large data sets,

it’s often useful to get statistical information from them.

Pandas provides a describe method, which returns descriptive statistics on each column of the DataFrame.

If desired, we can apply the describe method on a single column like this.

Similarly, you can also look at one statistic by

using one of the many statistical functions that Pandas provides.

Here are some examples.

This gives us the maximum value in each column,

and this gives us the mean of each column,

and this gives us the minimum value in a specific column.

Another important statistical measure is data correlation.

We can use the corr method to get the correlation between different columns.

Lastly, let’s look at the groupby method,

which allows us to group data to get different types of information.

For the following examples,

we are going to load data about a fake company.

This data contains information for the years 1990 through 1992.

And for each year, we see the names of the employees,

the department they worked for,

their age, and their annual salary.

Now, let’s use the groupby method to get information.

Let’s calculate how much money the company spent on salaries each year.

To do this, we will group the data by year and then add

up the salaries of all of the employees with the sum method.

The company spent a total of $153,000 in 1990,

$162,000 in 1991, and $174,000 in 1992.

If we want to know the average salary for each year,

we can repeat the last step replacing the sum method with the mean method.

Now, let’s see how much in total each employee got paid during those three years.

In this case, we will group the data by name and then add up the salaries for each year.

Now let’s see what the salary distribution per department per year was.

In this case, we will group the data by year and

by department and then add up the salaries for each group.

We can see that in 1990,

the Admin department paid $55,000 in salaries,

HR paid $50,000, and RD paid $48,000.

And in 1992, the Admin department paid $122,000 in salaries, and so on.

Loading Data into a pandas DataFrame

The GOOG.csv and fake_company.csv files are available to download at the bottom of this page. If a file doesn’t download when you click it, try right-clicking and choosing the “Save As…” option.

In machine learning, you will most likely use databases from many sources to train your learning algorithms. Pandas allows us to load databases of different formats into DataFrames. One of the most popular data formats used to store databases is CSV. CSV stands for Comma-Separated Values and offers a simple format to store data. We can load CSV files into Pandas DataFrames using the pd.read_csv() function. Let’s load Google stock data into a Pandas DataFrame. The GOOG.csv file contains Google stock data from 8/19/2004 to 10/13/2017, taken from Yahoo Finance.

Example 1. Load the data from a .csv file.

# We load Google stock data in a DataFrame
Google_stock = pd.read_csv('./GOOG.csv')

# We print some information about Google_stock
print('Google_stock is of type:', type(Google_stock))
print('Google_stock has shape:', Google_stock.shape)

Google_stock is of type: <class 'pandas.core.frame.DataFrame'>
Google_stock has shape: (3313, 7)

We see that we have loaded the GOOG.csv file into a Pandas DataFrame and it consists of 3,313 rows and 7 columns. Now let’s look at the stock data.

Example 2. Look at the first few rows of the DataFrame

Google_stock
            Date        Open        High         Low       Close   Adj Close    Volume
0     2004-08-19   49.676899   51.693783   47.669952   49.845802   49.845802  44994500
1     2004-08-20   50.178635   54.187561   49.925285   53.805050   53.805050  23005800
2     2004-08-23   55.017166   56.373344   54.172661   54.346527   54.346527  18393200
...          ...         ...         ...         ...         ...         ...       ...
3311  2017-10-12  987.450012  994.119995  985.000000  987.830017  987.830017   1262400
3312  2017-10-13  992.000000  997.210022  989.000000  989.679993  989.679993   1157700

3313 rows × 7 columns

We see that it is quite a large dataset and that Pandas has automatically assigned numerical row indices to the DataFrame. Pandas also used the labels that appear in the data in the CSV file to assign the column labels.
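If we want to double-check the row index and column labels that Pandas assigned, we can inspect them directly. This is a quick check beyond the lesson’s original examples:

# We print the automatically assigned row index and the column labels read from the CSV header
print('Row index:', Google_stock.index)
print('Column labels:', Google_stock.columns)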

When dealing with large datasets like this one, it is often useful just to take a look at the first few rows of data instead of the whole dataset. We can take a look at the first 5 rows of data using the .head() method, as shown below:

Example 3. Look at the first 5 rows of the DataFrame

Google_stock.head()
         Date       Open       High        Low      Close  Adj Close    Volume
0  2004-08-19  49.676899  51.693783  47.669952  49.845802  49.845802  44994500
1  2004-08-20  50.178635  54.187561  49.925285  53.805050  53.805050  23005800
2  2004-08-23  55.017166  56.373344  54.172661  54.346527  54.346527  18393200
3  2004-08-24  55.260582  55.439419  51.450363  52.096165  52.096165  15361800
4  2004-08-25  52.140873  53.651051  51.604362  52.657513  52.657513   9257400

We can also take a look at the last 5 rows of data by using the .tail() method:

Example 4. Look at the last 5 rows of the DataFrame

Google_stock.tail()
            Date        Open        High         Low       Close   Adj Close   Volume
3308  2017-10-09  980.000000  985.424988  976.109985  977.000000  977.000000   891400
3309  2017-10-10  980.000000  981.570007  966.080017  972.599976  972.599976   968400
3310  2017-10-11  973.719971  990.710022  972.250000  989.250000  989.250000  1693300
3311  2017-10-12  987.450012  994.119995  985.000000  987.830017  987.830017  1262400
3312  2017-10-13  992.000000  997.210022  989.000000  989.679993  989.679993  1157700

We can also optionally use .head(N) or .tail(N) to display the first and last N rows of data, respectively.
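For example, echoing the video narration, here is a quick sketch that displays the last eight rows and then the first two rows:

# We display the last 8 rows of the stock data
Google_stock.tail(8)

# We display the first 2 rows of the stock data
Google_stock.head(2)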

Let’s do a quick check to see whether we have any NaN values in our dataset. To do this, we will use the .isnull() method followed by the .any() method to check whether any of the columns contain NaN values.

Example 5. Check if any column contains a NaN. Returns a boolean for each column label.

Google_stock.isnull().any()

Date         False
Open         False
High         False
Low          False
Close        False
Adj Close    False
Volume       False
dtype: bool

We see that we have no NaN values.

When dealing with large datasets, it is often useful to get statistical information from them. Pandas provides the .describe() method to get descriptive statistics on each column of the DataFrame. Let’s see how this works:

Example 6. See the descriptive statistics of the DataFrame

# We get descriptive statistics on our stock data
Google_stock.describe()
              Open         High          Low        Close    Adj Close        Volume
count  3313.000000  3313.000000  3313.000000  3313.000000  3313.000000  3.313000e+03
mean    380.186092   383.493740   376.519309   380.072458   380.072458  8.038476e+06
std     223.818650   224.974534   222.473232   223.853780   223.853780  8.399521e+06
min      49.274517    50.541279    47.669952    49.681866    49.681866  7.900000e+03
25%     226.556473   228.394516   224.003082   226.407440   226.407440  2.584900e+06
50%     293.312286   295.433502   289.929291   293.029114   293.029114  5.281300e+06
75%     536.650024   540.000000   532.409973   536.690002   536.690002  1.065370e+07
max     992.000000   997.210022   989.000000   989.679993   989.679993  8.276810e+07

If desired, we can apply the .describe() method on a single column as shown below:

Example 7. See the descriptive statistics of one of the columns of the DataFrame

# We get descriptive statistics on a single column of our DataFrame
Google_stock['Adj Close'].describe()

count    3313.000000
mean      380.072458
std       223.853780
min        49.681866
25%       226.407440
50%       293.029114
75%       536.690002
max       989.679993
Name: Adj Close, dtype: float64

Similarly, you can also look at one statistic by using one of the many statistical functions Pandas provides. Let’s look at some examples:

Example 8. Statistical operations – Min, Max, and Mean

# We print information about our DataFrame  
print()
print('Maximum values of each column:\n', Google_stock.max())
print()
print('Minimum Close value:', Google_stock['Close'].min())
print()
print('Average value of each column:\n', Google_stock.mean())

Maximum values of each column:
Date         2017-10-13
Open                992
High             997.21
Low                 989
Close            989.68
Adj Close        989.68
Volume         82768100
dtype: object

Minimum Close value: 49.681866

Average value of each column:
Open         3.801861e+02
High         3.834937e+02
Low          3.765193e+02
Close        3.800725e+02
Adj Close    3.800725e+02
Volume       8.038476e+06
dtype: float64

Another important statistical measure is data correlation. Data correlation can tell us, for example, if the data in different columns are correlated. We can use the .corr() method to get the correlation between different columns, as shown below:

Example 9. Statistical operation – Correlation

# We display the correlation between columns
Google_stock.corr()
                Open      High       Low     Close  Adj Close    Volume
Open        1.000000  0.999904  0.999845  0.999745   0.999745 -0.564258
High        0.999904  1.000000  0.999834  0.999868   0.999868 -0.562749
Low         0.999845  0.999834  1.000000  0.999899   0.999899 -0.567007
Close       0.999745  0.999868  0.999899  1.000000   1.000000 -0.564967
Adj Close   0.999745  0.999868  0.999899  1.000000   1.000000 -0.564967
Volume     -0.564258 -0.562749 -0.567007 -0.564967  -0.564967  1.000000

A correlation value of 1 tells us that two columns are perfectly correlated, and a correlation value of 0 tells us that the data in the two columns is not correlated at all.
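If we only need the correlation between one particular pair of columns, we can also call .corr() on a single column and pass in the other column. This is a small sketch beyond the original examples; the Close/Volume pair is chosen just for illustration:

# We display the correlation between the Close price and the traded Volume
print('Correlation between Close and Volume:', Google_stock['Close'].corr(Google_stock['Volume']))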

groupby() method

We will end this Introduction to Pandas by taking a look at the .groupby() method. The .groupby() method allows us to group data in different ways. Let’s see how we can group data to get different types of information. For the next examples, we are going to load fake data about a fictitious company.

# We load fake Company data in a DataFrame
data = pd.read_csv('./fake_company.csv')

data
   Year     Name Department  Age  Salary
0  1990    Alice         HR   25   50000
1  1990      Bob         RD   30   48000
2  1990  Charlie      Admin   45   55000
3  1991    Alice         HR   26   52000
4  1991      Bob         RD   31   50000
5  1991  Charlie      Admin   46   60000
6  1992    Alice      Admin   27   60000
7  1992      Bob         RD   32   52000
8  1992  Charlie      Admin   28   62000

We see that the data contains information for the years 1990 through 1992. For each year, we see the names of the employees, the department they worked for, their age, and their annual salary. Now, let’s use the .groupby() method to get information.

Example 10. Demonstrate groupby() and sum() method

Let’s calculate how much money the company spent on salaries each year. To do this, we will group the data by Year using the .groupby() method and then we will add up the salaries of all the employees by using the .sum() method.

# We display the total amount of money spent in salaries each year
data.groupby(['Year'])['Salary'].sum()

Year
1990     153000
1991     162000
1992     174000
Name: Salary, dtype: int64

We see that the company spent a total of 153,000 dollars in 1990, 162,000 in 1991, and 174,000 in 1992.

Example 11. Demonstrate groupby() and mean() method

Now, let’s suppose we want to know what the average salary was for each year. In this case, we will group the data by Year using the .groupby() method, just as we did before, and then we use the .mean() method to get the average salary. Let’s see how this works:

# We display the average salary per year
data.groupby(['Year'])['Salary'].mean()

Year
1990     51000
1991     54000
1992     58000
Name: Salary, dtype: int64

We see that the average salary in 1990 was 51,000 dollars, 54,000 in 1991, and 58,000 in 1992.

Example 12. Demonstrate groupby() on a single column

Now let’s see how much each employee got paid in those three years. In this case, we will group the data by Name using the .groupby() method and then we will add up the salaries for each year. Let’s see the result:

# We display the total salary each employee received in all the years they worked for the company
data.groupby(['Name'])['Salary'].sum()

Name
Alice      162000
Bob        150000
Charlie    177000
Name: Salary, dtype: int64

We see that Alice received a total of 162,000 dollars in the three years she worked for the company, Bob received 150,000, and Charlie received 177,000.

Example 13. Demonstrate groupby() on two columns

Now let’s see what the salary distribution per department per year was. In this case, we will group the data by Year and by Department using the .groupby() method and then we will add up the salaries for each department. Let’s see the result:

# We display the salary distribution per department per year.
data.groupby(['Year', 'Department'])['Salary'].sum()

Year  Department
1990  Admin          55000
      HR             50000
      RD             48000
1991  Admin          60000
      HR             52000
      RD             50000
1992  Admin         122000
      RD             52000
Name: Salary, dtype: int64

We see that in 1990 the Admin department paid 55,000 dollars in salaries, the HR department paid 50,000, and the RD department paid 48,000, while in 1992 the Admin department paid 122,000 dollars in salaries and the RD department paid 52,000.

We recommend you practice the examples available in the 10 Minutes to pandas guide as a concluding tutorial.

Supporting Materials
