A package can be seen as a group of modules, or Python files. Libraries are collections of packages. They provide convenient functionality so that users do not have to write out commonly used functions that are not built in. To install a package, we would execute the following commands (pandas and numpy will be used as examples):
pip install numpy
pip install pandas
Then, to import the package/library into the local workspace:
import numpy
import pandas
Alternatively, if I wanted to abbreviate numpy as np and pandas as pd, I would:
import numpy as np
import pandas as pd
You would want to do this so that when you call a function that belongs to the library, you wouldn’t have to fully type out “pandas” or “numpy”:
pd.read_csv()
as opposed to
pandas.read_csv()
Dataframes are data structures akin to spreadsheets. The pandas library is the one most commonly used to handle and work with dataframes in Python. To read a file (a csv, as an example) and import it into the local work session, first specify the directory of the file being imported; it is convenient to save that directory to a variable. Then use the read_csv function in the pandas library to read in the actual data, passing the directory as an argument and specifying the separator:
path_to_data = 'path/to/file.csv'  # placeholder for the directory of the file
df = pd.read_csv(path_to_data, sep = ',')
Specifying what comes after the read_ in the function is important, since data is stored in different ways. The example above deals with a csv (comma-separated values) file, whose values are separated by commas, while other formats such as tsv (tab-separated values) files have values separated by tabs. To continue with the example, you can determine how many rows and columns are in the dataframe by executing:
df.shape
The output is a tuple listing the number of rows and columns. Rows and columns combined form the shape of the dataframe.
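As a quick sketch, assuming df has been loaded as above, the tuple can be unpacked directly:

n_rows, n_cols = df.shape  # shape is a (rows, columns) tuple
print(n_rows, n_cols)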
Import the gapminder.tsv dataset (assuming the file is in the local folder):
path_to_data = 'gapminder.tsv'
df = pd.read_csv(path_to_data, sep = '\t')
To examine the year variable:
df['year']
This reveals that data was collected at a regular interval of 5 years. The most recent observations are from 2007, so an updated dataset would gain data from 2012 and 2017. Since each row has 6 columns (country, continent, year, lifeExp, pop, and gdpPercap), each country would likely gain 12 new values: two new rows of six columns each.
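A minimal sketch to confirm the interval and the most recent year, assuming df holds the gapminder data:

df['year'].unique()  # distinct years: 1952, 1957, ..., 2007
df['year'].max()     # 2007, the most recent observation

Next, to find the observation with the lowest recorded life expectancy: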
df.loc[df['lifeExp'].idxmin()]
Executing this shows that the lowest life expectancy recorded in this dataframe is from Rwanda in 1992, with a life expectancy of 23.599 years. This could have happened due to the lives lost during the Rwandan Civil War, which took place from 1990 to 1994.
I’m going to call the new column gdp, since we are multiplying gdp per capita by the population. To create this column:
i = 0
gdp = []
for pop in df['pop']:
    a = pop * df['gdpPercap'].iloc[i]  # multiply population by per-capita GDP
    gdp.append(a)
    i += 1
df['gdp'] = gdp
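As a side note, the same column can be built without an explicit loop, since pandas applies arithmetic between columns element-wise; a one-line sketch equivalent to the loop above:

df['gdp'] = df['pop'] * df['gdpPercap']  # element-wise product of the two columns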
Then, to subset the data following the criteria mentioned above and sort it:
countries = df['country'].isin(['Italy', 'Spain', 'France', 'Germany'])
years = df['year'] == 2007
criteria = countries & years
newdf = df[criteria]
newdf = newdf.sort_values(by=['gdp'], ascending = False)
The result is the 2007 rows for the four countries, sorted by gdp in descending order.
&: bitwise AND operator. Returns True only when both of the values being compared are True (the overlap, or intersection in a Venn diagram).
True & False
True & True
The first operation returns False, since the operands are not both True (there is no overlap).
The second operation returns True, since both values are True and therefore overlap.
==: compares the values of two objects. If they are equivalent, the operation returns the boolean value True.
(1+1) == 2
Returns True since 2 equals 2
|: bitwise OR operator. Returns True if at least one of the arguments/values being compared is True (the union).
("cat" != "dog") | (2 > 1)
Returns True since both of the arguments are True. If at least one of the arguments above is True, the operation returns True.
^: bitwise XOR (exclusive OR) operator. Returns True only for the parts exclusive/distinct to each argument, i.e., when exactly one of the two is True.
("cat" != "dog") ^ (2 > 1)
Returns False since both arguments are True: there is no part exclusive to one argument, so XOR yields False.
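In pandas, these operators are what combine boolean masks element-wise, which is why the subsetting example above used &. A short sketch, assuming df still holds the gapminder data (each comparison must be parenthesized):

asia_2007 = df[(df['continent'] == 'Asia') & (df['year'] == 2007)]  # rows meeting both conditions
first_or_last = df[(df['year'] == 1952) | (df['year'] == 2007)]     # rows meeting either condition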
.loc indexes using a label, in that you have to specify the names of the rows or columns, while .iloc indexes using an integer position. To illustrate this:
newdf.iloc[0]
newdf.loc[575]
The label of the row with values from Germany is 575, so newdf.loc[575] retrieves it by name, while newdf.iloc[0] retrieves the first row of the sorted dataframe by its integer position.
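On the original df, the two accessors happen to coincide, because the default labels are the integers 0, 1, 2, and so on; it is only after filtering or sorting, as in newdf, that labels and positions diverge. A quick sketch on the unmodified gapminder dataframe:

df.loc[0]   # the row whose label is 0 (Afghanistan, 1952)
df.iloc[0]  # the row at integer position 0; identical here since labels match positions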
API stands for Application Programming Interface. It is a way for a program to communicate with a remote computer, in this case to request data and bring it into the local workspace.
import requests

url = 'https://url'  # placeholder for the actual address of the data
r = requests.get(url)
file_name = 'local_data.csv'  # name for the local copy
with open(file_name, 'w') as f:
    f.write(r.text)  # r.text is the response body as a string
df = pd.read_csv(file_name)
This takes the data from the API at the url, writes it to a local file, and reads it into the workspace.
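As a side note, read_csv can also take a URL directly, skipping the intermediate file (the address below is a placeholder):

df = pd.read_csv('https://url')  # read_csv accepts URLs as well as local paths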
apply() allows you to use lambda functions. It offers a shorter and more convenient way of iterating over your data than writing out loops, which can take more time and more lines of code. Lambda functions are usually used for operations that are only needed once and never called again.
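For example, the gdp column from earlier could also be built with apply and a lambda; a sketch assuming df still holds the gapminder data (axis=1 applies the function to each row):

df['gdp'] = df.apply(lambda row: row['pop'] * row['gdpPercap'], axis=1)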
There are many ways you can subset data other than using the .iloc function. df.filter can subset rows/columns by a specified label. Calling a column directly can also achieve the same goal, by simply executing df['column_name'] with the name of the column.
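A brief sketch of both approaches, using column names from the gapminder data above:

df.filter(items=['country', 'year'])  # keep only the named columns
df['country']                         # select a single column directly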