Create a Financial Analysis with Multiple Sheets and Charts in Excel from Python

What will we cover in this tutorial?

We will get financial data from FMP Cloud (a free stock API) for the last years and generate a 3-sheet Excel workbook with charts similar to this. All done from Python.

Step 1: Collect the data

In this tutorial we are only going to use the example data on Apple, which is freely available without registering on FMP Cloud. If you want to do it for another stock, you will need to register on their site.

What we need is the income statement and cash flow statement. They are available as JSON on their page (income statement and cash flow statement).

As our main library we will use Pandas.

The data can be collected with the following code.

import pandas as pd

income_statement = pd.read_json("https://fmpcloud.io/api/v3/income-statement/AAPL?limit=120&apikey=demo")
income_statement.set_index('date', inplace=True)
cash_flow = pd.read_json("https://fmpcloud.io/api/v3/cash-flow-statement/AAPL?limit=120&apikey=demo")
cash_flow.set_index('date', inplace=True)

Notice that we set the index to be the date column. This makes the subsequent work easier.
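If you register for your own API key and want to analyze another stock, you could parametrize the request URLs. This is just a sketch; API_KEY and TICKER below are placeholders you would fill in yourself.

import pandas as pd

API_KEY = "YOUR_API_KEY"  # placeholder: your own key from FMP Cloud
TICKER = "MSFT"           # placeholder: any ticker your plan covers

base_url = "https://fmpcloud.io/api/v3"
income_statement = pd.read_json(f"{base_url}/income-statement/{TICKER}?limit=120&apikey={API_KEY}")
income_statement.set_index('date', inplace=True)
cash_flow = pd.read_json(f"{base_url}/cash-flow-statement/{TICKER}?limit=120&apikey={API_KEY}")
cash_flow.set_index('date', inplace=True)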

Step 2: Prepare the data

The next step we need to do is to make the necessary calculations and prepare the data.

We are only interested in Revenue, Earnings Per Share (EPS) and Free Cash Flow (FCF). So let’s take that data and keep it in a DataFrame (Pandas’ main data structure).

data = income_statement[['revenue', 'eps']].copy()
data['fcf'] = cash_flow['freeCashFlow']

Now the data should look something like this.

                 revenue     eps           fcf
date
2020-09-26  274515000000  3.3600  7.336500e+10
2019-09-28  260174000000  2.9925  5.889600e+10
2018-09-29  265595000000  3.0025  6.412100e+10
2017-09-30  229234000000  2.3175  5.080300e+10
2016-09-24  215639000000  2.0875  5.349700e+10

Step 3: Calculate Percentage change

We are not actually interested in the absolute values. The only thing we care about is the percentage growth from year to year.

As the data is in reverse chronological order and we need to calculate from the oldest to the newest value, we first need to sort the dates from oldest to newest.

Then it is straightforward to calculate using the pct_change() method.

data = data.sort_index()

data['rev %-chg'] = data['revenue'].pct_change()
data['eps %-chg'] = data['eps'].pct_change()
data['fcf %-chg'] = data['fcf'].pct_change()

Now the data should look similar to this (or the tail of it).

                 revenue     eps           fcf  rev %-chg  eps %-chg  fcf %-chg
date
2016-09-24  215639000000  2.0875  5.349700e+10  -0.077342  -0.100216  -0.233326
2017-09-30  229234000000  2.3175  5.080300e+10   0.063045   0.110180  -0.050358
2018-09-29  265595000000  3.0025  6.412100e+10   0.158620   0.295577   0.262150
2019-09-28  260174000000  2.9925  5.889600e+10  -0.020411  -0.003331  -0.081487
2020-09-26  274515000000  3.3600  7.336500e+10   0.055121   0.122807   0.245670

Step 4: Preparing data used in each Excel sheet

The next step is to prepare the data for each sheet in the Excel document.

We only need the last 10 years of data and will use loc to get that. Then we create a DataFrame for each sheet.

data_set = data.loc['2010-01-01':]
data_set.index = data_set.index.year

rev = data_set[['revenue', 'rev %-chg']].copy()
eps = data_set[['eps', 'eps %-chg']].copy()
fcf = data_set[['fcf', 'fcf %-chg']].copy()

Notice that we set the index to be only the year.

Step 5: Creating the Excel sheet

To create the Excel sheet we will use Pandas with the XlsxWriter engine.

This can be done as follows.

def create_sheet(writer, df, sheet_name):
    df.to_excel(writer, sheet_name=sheet_name)

writer = pd.ExcelWriter("financials.xlsx", engine='xlsxwriter')
workbook = writer.book

create_sheet(writer, rev, 'Revenue')
create_sheet(writer, eps, 'EPS')
create_sheet(writer, fcf, 'FCF')

workbook.close()

This will create a 3-sheet Excel document. One sheet for each DataFrame (rev, eps, fcf).

The function create_sheet(…) is convenient, as the DataFrames share the same structure and we need to create the same kind of sheet for each of them.

The result will be in the Excel document financials.xlsx.

Step 6: Creating the charts

Now it pays to have the function create_sheet(…).

The full Excel document can be created by the following code.

def create_sheet(writer, df, sheet_name):
    df.to_excel(writer, sheet_name=sheet_name)
    worksheet = writer.sheets[sheet_name]
    workbook = writer.book
      
    chart = workbook.add_chart({'type': 'line'})
    
    chart.add_series({
        'name': sheet_name,
        'categories': [sheet_name, 1, 0, 11, 0],
        'values': [sheet_name, 1, 1, 11, 1],
    })
    
    column_chart = workbook.add_chart({'type': 'column'})
    
    column_chart.add_series({
        'name': sheet_name,
        'categories': [sheet_name, 1, 0, 11, 0],
        'values': [sheet_name, 1, 2, 11, 2],
        'y2_axis': True,
    })
    
    chart.combine(column_chart)
    chart.set_title({'name': sheet_name})
    chart.set_x_axis({'name': 'Date'})
    chart.set_y_axis({'name': 'Value'})
    column_chart.set_y2_axis({'name': '%-chg'})
    
    worksheet.insert_chart("E1", chart)

writer = pd.ExcelWriter("financials.xlsx", engine='xlsxwriter')
workbook = writer.book

create_sheet(writer, rev, 'Revenue')
create_sheet(writer, eps, 'EPS')
create_sheet(writer, fcf, 'FCF')

workbook.close()

The details of creating charts in Excel can be found in the XlsxWriter documentation. Basically, you pass the settings to the XlsxWriter engine through dictionaries, providing the values you would otherwise set manually in Excel.
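As a small, standalone illustration of this dictionary-based configuration, here is a hedged sketch using XlsxWriter directly; the data and the specific settings are made up for the example.

import xlsxwriter

workbook = xlsxwriter.Workbook('chart_demo.xlsx')
worksheet = workbook.add_worksheet()  # default sheet name is 'Sheet1'
worksheet.write_column('A1', [10, 40, 50, 20, 10, 50])

chart = workbook.add_chart({'type': 'line'})
chart.add_series({'name': 'Demo', 'values': '=Sheet1!$A$1:$A$6'})

# Every visual option is passed as a dictionary of Excel settings
chart.set_title({'name': 'Demo chart'})
chart.set_size({'width': 720, 'height': 400})
chart.set_legend({'position': 'bottom'})
chart.set_y_axis({'name': 'Value', 'num_format': '#,##0'})

worksheet.insert_chart('C1', chart)
workbook.close()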

Again, the result will be in the Excel document financials.xlsx.

How To Use Matplotlib Object-Oriented with NumPy and Pandas

What will we cover in this tutorial?

If you like data visualization with NumPy and Pandas, then you must have encountered Matplotlib.

And if you also like to program in an object-oriented fashion, then most tutorials will leave you wondering whether anyone loves the art of beautiful code.

Let me elaborate. The integration and interaction with Matplotlib is typically done through module-level function calls with a lot of side effects. Not nice.

Not sure what I am talking about? We will cover that too.

Step 1: How plotting NumPy data with Matplotlib is usually demonstrated and what is wrong with it

Let’s make a simple example.

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 5, 11)
y = x ** 2
plt.plot(x, y)
plt.xlabel("X Label")
plt.ylabel("Y Label")
plt.title("Title")
plt.show()

This will result in the following chart.

That is nice and easy! So what is wrong with it?

Side effects!

What is a side effect in programming?

…that is to say has an observable effect besides returning a value (the main effect) to the invoker of the operation.

https://en.wikipedia.org/wiki/Side_effect_(computer_science)

What does that mean?

Well, let’s examine the above example.

We call plt.plot(x, y) and what happens? Actually, we don’t know. We do not get anything in return.

We continue to call plt.xlabel(…), plt.ylabel(…), and plt.title(…), and finally plt.show() to see the result. Hence, we change the state of the plt module we imported. Notice that we did not create an object; we call the library directly.

As a programmer, this is difficult to understand without deep knowledge of the library used.

So how do we do it in a more understandable way?

Step 2: How to create a chart with Matplotlib and NumPy in an object-oriented way and why it is better

Let’s look at this code and examine it.

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 5, 11)
y = x ** 2

fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_xlabel("X Label")
ax.set_ylabel("Y Label")
ax.set_title("Title")
fig.show()
plt.waitforbuttonpress()

Here we do it differently but get the same result. It is easier to understand that when we call a method on the object ax, the state of ax changes, and not something hidden in the library through a side effect.

You can also show the figure fig by calling show() on it rather than on the library. This requires that we add waitforbuttonpress() on plt; otherwise the window will be destroyed immediately.

Note that you do not have these challenges in a Jupyter Notebook – the plots are shown without the call to show().

You could keep the plt.show() instead of fig.show() and plt.waitforbuttonpress(). But the above code is more intuitive and easier to understand.
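If you run the code as a plain script and only care about the output, a side note: you can also save the figure to a file instead of showing a window. A minimal sketch (the file name is arbitrary):

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 5, 11)
y = x ** 2

fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_xlabel("X Label")
ax.set_ylabel("Y Label")
ax.set_title("Title")

# Save the figure to a file instead of opening a window
fig.savefig("squares.png", dpi=150, bbox_inches="tight")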

How to create a chart with Matplotlib from a Pandas DataFrame in an object-oriented way

This is straight forward as Matplotlib is well integrated with Pandas.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

x = np.linspace(0, 5, 11)
y = x ** 2

df = pd.DataFrame(data=y, index=x)

fig, ax = plt.subplots()
ax.plot(df)
ax.set_xlabel("X Label")
ax.set_ylabel("Y Label")
ax.set_title("Title")
fig.show()
plt.waitforbuttonpress()

Notice that the DataFrame is created from the NumPy arrays. Hence, here we do not gain anything from using it. This is just to exemplify how easy it is to use Matplotlib in an object-oriented way with Pandas.

Final thoughts

I have found that programmers either hate or love Matplotlib. I do not always know why, but I have discovered that this non-object-oriented way of using Matplotlib annoys some programmers.

That is a fair reason to dislike it, but I would say there is no good alternative to Matplotlib – or at least, the alternatives are built upon Matplotlib.

I like the power and ease of using Matplotlib. I also like the option of using it in an object-oriented way, which makes the code more intuitive and easier for other programmers to understand.

How to Create Excel Report Automation with Python

What will we cover in this tutorial?

How to create Excel report automation with Python. This will be done on data similar to this.

The data we work with – 1,000 lines like the above – can be downloaded as an Excel sheet from the link below.

And we will create Excel reports from Python like these ones.

The first sheet we will create in Excel from Python.
The second Excel sheet we create from Python.

Get the direct download link to the Jupyter Notebook and the Excel sheet used in the YouTube video and follow along.

Step 1: Load the data in Pandas DataFrame

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

Pandas has great integration with Excel – both reading and writing Excel files. We will use that in this tutorial.

To read the data, simply use read_excel(…).

import pandas as pd

data = pd.read_excel("SalesData.xlsx", index_col="Date")
print(data.head())

Which will result in output similar to this.

            Sales rep        Item  Price  Quantity  Sale
Date
2020-05-31        Mia     Markers      4         1     4
2020-02-01        Mia  Desk chair    199         2   398
2020-09-21     Oliver       Table   1099         2  2198
2020-07-15  Charlotte    Desk pad      9         2    18
2020-05-27       Emma        Book     12         1    12

Step 2: Group the data by Sales rep

To identify how much each Sales rep has sold, we will group it and calculate the sum.

repr = data.groupby("Sales rep")
repr_sales = repr['Sale'].sum()

Now that is too simple, right?

The data in repr_sales looks like this.

Sales rep
Charlotte     74599
Emma          65867
Ethan         40970
Liam          66989
Mia           88199
Noah          78575
Oliver        89355
Sophia       103480
William       80400
Name: Sale, dtype: int64

Step 3: Group the data by month, change to month names and sum it

This is a bit more challenging, as we need to group the dates in months and rename the index to month names.

Except if you know how to do it, as you will in a moment.

months = data.groupby(pd.Grouper(freq="M"))
months_sales = months['Sale'].sum()
months_sales.index = months_sales.index.month_name()

You are awesome. See, you use groupby on a Grouper with monthly frequency (“M”). To use the month names as the index instead of the last day of each month, you reassign the index using index.month_name().

This creates the following output where you can see the month names are used.

Date
January      69990
February     51847
March        67500
April        58401
May          40319
June         59397
July         64251
August       51571
September    55666
October      50093
November     57458
December     61941
Name: Sale, dtype: int64
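As a side note, since the index is a DatetimeIndex, resample gives the same monthly sums. This is just an alternative sketch, not what the rest of the tutorial uses.

# Alternative to groupby(pd.Grouper(freq="M")): resample the Sale column by month
# (assumes the DataFrame data from Step 1)
months_sales = data['Sale'].resample("M").sum()
months_sales.index = months_sales.index.month_name()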

Step 4: Create Excel sheet with the data from Step 2 and 3

This is done by creating an Excel writer.

writer = pd.ExcelWriter("SalesReport.xlsx")
repr_sales.to_excel(writer, sheet_name="Sale per rep")
months_sales.to_excel(writer, sheet_name="Sale per month")
writer.close()

This will create an Excel file named SalesReport.xlsx with two sheets, Sale per rep and Sale per month.

But let’s try to add the two charts.

Step 5: Adding the charts

This is a bit more involved. As you can see, the steps so far have been quite straightforward and have not required any advanced Python programming skills (or Excel skills, other than knowing how to open an Excel sheet).

writer = pd.ExcelWriter("SalesReport.xlsx", engine='xlsxwriter')  # the xlsxwriter engine is needed for add_chart below
repr_sales.to_excel(writer, sheet_name="Sale per rep")
months_sales.to_excel(writer, sheet_name="Sale per month")

chart = writer.book.add_chart({'type': 'column'})
chart.add_series({
    'values': '=\'Sale per rep\'!$B$2:$B$10',
    'categories': '=\'Sale per rep\'!$A$2:$A$10',
    'name': "Sale"
})
writer.sheets['Sale per rep'].insert_chart("D2", chart)

chart = writer.book.add_chart({'type': 'column'})
chart.add_series({
    'values': '=\'Sale per month\'!$B$2:$B$13',
    'categories': '=\'Sale per month\'!$A$2:$A$13',
    'name': "Sale"
})
writer.sheets['Sale per month'].insert_chart("D2", chart)

writer.close()

This will create the sheets we have been waiting for. The charts are dynamic, meaning, that if you change the values the charts will change accordingly.

First sheet from our results
Second sheet from our results

Final remarks

Remember to see the YouTube video, which explains the material in more depth.

If you like the above content, please share it on social media or point a friend in this direction.

How To Color Cells Dynamically in Excel Based on a Formula from Python

What will we cover in this tutorial?

We will learn how you can dynamically format the cells in Excel to have a specific color based on the value of a formula – all done from Python.

As an example we will color the cells of two columns depending on which is larger.

If the value in MA50 is larger than the value in MA200, the cells will be colored green, and red otherwise.

In this concrete example we will collect historical stock prices and calculate the moving averages of 50 and 200 days on the closing price. In a crossover strategy using moving averages, you get a buy signal when the 50-day moving average crosses above the 200-day moving average, and a sell signal when it crosses below.

Step 1: Collect the data from Yahoo! Finance using Pandas-datareader

A great library to use to collect financial data is the Pandas-datareader.

We will collect data from Yahoo! Finance. As an example we will use General Motors historical stock data.

This can be done as follows.

import pandas_datareader as pdr
import datetime as dt


start = dt.datetime(2019, 1, 1)
ticker = pdr.get_data_yahoo("GM", start)

This will result in a Pandas DataFrame with data similar to this.

                 High        Low       Open      Close      Volume  Adj Close
Date                                                                         
2019-01-02  33.810001  32.410000  32.849998  33.639999   7662300.0  31.893360
2019-01-03  33.680000  32.200001  33.340000  32.250000  11618300.0  30.575533
2019-01-04  33.570000  32.430000  32.799999  33.330002  10784000.0  31.599455
2019-01-07  34.970001  33.480000  33.709999  34.360001  10722300.0  32.575977
2019-01-08  35.680000  34.490002  35.000000  34.810001  10532900.0  33.002617

Step 2: Calculate the Moving Averages

The moving average can be calculated by applying the rolling() and mean() methods on the DataFrame.

The rolling() method creates a rolling window; the window size is the number of observations used for calculating the statistic.

The moving average with window size 50 uses the last 50 data points to calculate the mean. To be consistent with the way Yahoo! Finance calculates the value, we use the Close price.

import pandas_datareader as pdr
import datetime as dt
import xlsxwriter
import pandas as pd


start = dt.datetime(2019, 1, 1)
ticker = pdr.get_data_yahoo("GM", start)

ticker['MA50'] = ticker['Close'].rolling(50).mean()
ticker['MA200'] = ticker['Close'].rolling(200).mean()

Which will give a tail similar to this output.

                 High        Low       Open  ...  Adj Close     MA50     MA200
Date                                         ...                              
2020-12-14  42.540001  41.419998  42.490002  ...  41.619999  38.5632  29.03155
2020-12-15  42.160000  41.310001  41.990002  ...  41.660000  38.7772  29.08725
2020-12-16  41.869999  40.810001  41.790001  ...  41.419998  38.9976  29.13670
2020-12-17  42.029999  41.430000  41.709999  ...  42.029999  39.2058  29.19635
2020-12-18  42.042801  41.139999  42.020000  ...  41.389999  39.3894  29.25985

Where the Moving Averages are added to the DataFrame.

Step 3: Exporting to Excel using XlsxWriter

We will use the powerful XlsxWriter to create our Excel sheet.

You can write the DataFrame directly by using to_excel(…), but we want to do more than that. Hence, we use the XlsxWriter directly.

import pandas_datareader as pdr
import datetime as dt
import pandas as pd


start = dt.datetime(2019, 1, 1)
ticker = pdr.get_data_yahoo("GM", start)

ticker['MA50'] = ticker['Close'].rolling(50).mean()
ticker['MA200'] = ticker['Close'].rolling(200).mean()

# We only want the index as a date and not datetime object
ticker.index = ticker.index.date

# We skip the first 200 entries, as they do not have the MA200
ticker = ticker.iloc[200:]
# We reverse the DataFrame to have newest first
ticker = ticker.iloc[::-1]

# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('moving_average.xlsx', engine='xlsxwriter')

# Convert the dataframe to an XlsxWriter Excel object.
ticker.to_excel(writer, sheet_name='Moving Averages')

# Remember to close the writer to write the Excel sheet
writer.close()

This will create an Excel sheet similar to this one.

Step 4: Adding formatting to the cells

This is a bit more complex as we need to do the magic here.

We first need to create a reference to the sheet (worksheet) we work on to access the functionality.

The first thing we do is resize column A, so that the dates are fully visible when the sheet opens. This is done by using set_column(…).

Then we create a format, which can be applied to Excel cells. This is how we color them. We create one format for green cells and later one for red cells.

We apply them by using conditional_format(…), which takes a range of cells. The range can be specified in different ways; we use the A1 notation.

The options parameter in conditional_format(…) must be a dictionary containing the parameters that describe the type and style of the conditional format. To see details of the format please refer to the manual.
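To get a feel for these option dictionaries, here is a small hedged example of a simpler 'cell'-type rule; the column range and threshold are made up for illustration, and it reuses the workbook, worksheet and last_row objects defined in the full code below. The formula-based rules we actually need follow in that code.

# Illustration only: highlight Close prices (column E) above 40
highlight = workbook.add_format({'bg_color': '#C6EFCE'})
worksheet.conditional_format('E2:E{}'.format(last_row), {'type':     'cell',
                                                         'criteria': '>',
                                                         'value':    40,
                                                         'format':   highlight})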

import pandas_datareader as pdr
import datetime as dt
import pandas as pd


start = dt.datetime(2019, 1, 1)
ticker = pdr.get_data_yahoo("GM", start)

ticker['MA50'] = ticker['Close'].rolling(50).mean()
ticker['MA200'] = ticker['Close'].rolling(200).mean()

# We only want the index as a date and not datetime object
ticker.index = ticker.index.date

# We skip the first 200 entries, as they do not have the MA200
ticker = ticker.iloc[200:]
# We reverse the DataFrame to have newest first
ticker = ticker.iloc[::-1]

# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('moving_average.xlsx', engine='xlsxwriter')

# Convert the dataframe to an XlsxWriter Excel object.
ticker.to_excel(writer, sheet_name='Moving Averages')

# Get the xlsxwriter objects from the dataframe writer object.
workbook = writer.book
worksheet = writer.sheets['Moving Averages']
# Resize the column A
worksheet.set_column("A:A", 12)

# Calculate the last row number (we insert first DataFrame row in row 2)
last_row = len(ticker) + 1

# Create a format for a green cell
cell_format_green = workbook.add_format()
cell_format_green.set_bg_color('green')

# Create a conditional formatted of type formula
worksheet.conditional_format('H2:I{}'.format(last_row), {'type':     'formula',
                                                         'criteria': '=$H2>=$I2',
                                                         'format':   cell_format_green})

# Create a format for a red cell
cell_format_red = workbook.add_format()
cell_format_red.set_bg_color('red')

# Create a conditional formatted of type formula
worksheet.conditional_format('H2:I{}'.format(last_row), {'type':     'formula',
                                                         'criteria': '=$H2<$I2',
                                                         'format':   cell_format_red})

# Remember to close the writer
writer.close()

This will result in the expected Excel sheet.

Step 5: Validating it works as expected

Now for the final test.

The colors should be updated dynamically. Let’s try to change the value and see what happens.

It does. If we change the value of H2 to 10, the cells turn red.

How to Create an Awesome Mosaic Picture in Excel with Python

What will we do in this tutorial?

Create a mosaic in Excel using Python.

See the tutorial on YouTube and hear how you can use it as a prank!

Step 1: How to create a mosaic from a photo

If you want a deeper description of how to create a mosaic you should read the following tutorial, which shows the code on how to do it.

Step 2: Install the necessary libraries

You need to install OpenCV. If you use PyCharm you can follow this tutorial.

Otherwise you can install the libraries as follows.

pip install opencv-python
pip install numpy
pip install xlsxwriter

Please ask if you have troubles with it.

Step 3: The code used to create the Mosaic in Excel

This step just provides the code for you to enjoy. The processing code comes from the tutorial linked above, where it is described in detail.

import cv2
import numpy as np
import xlsxwriter


def create_mosaic_in_excel(photo, box_height, box_width, col_width=2, row_height=15):
    # Get the height and width of the photo
    height, width, _ = photo.shape

    # Create Excel workbook and worksheet
    workbook = xlsxwriter.Workbook('mosaic.xlsx')
    worksheet = workbook.add_worksheet("Urgent")

    # Resize columns and rows
    worksheet.set_column(0, width//box_width - 1, col_width)
    for i in range(height//box_height):
        worksheet.set_row(i, row_height)

    # Create mosaic
    for i in range(0, height, box_height):
        for j in range(0, width, box_width):
            # Create region of interest (ROI)
            roi = photo[i:i + box_height, j:j + box_width]
            # Use numpy to calculate mean in ROI of color channels
            b_mean = np.mean(roi[:, :, 0])
            g_mean = np.mean(roi[:, :, 1])
            r_mean = np.mean(roi[:, :, 2])

            # Convert mean to int
            b_mean_int = b_mean.astype(int).item()
            g_mean_int = g_mean.astype(int).item()
            r_mean_int = r_mean.astype(int).item()

            # Create color code
            color = '#{:02x}{:02x}{:02x}'.format(r_mean_int, g_mean_int, b_mean_int)

            # Add color code to cell
            cell_format = workbook.add_format()
            cell_format.set_bg_color(color)
            worksheet.write(i//box_height, j//box_width, "", cell_format)

    # Close and write the Excel sheet
    workbook.close()


def main():
    photo = cv2.imread("rune.png")

    number_cols = 50
    number_rows = 45

    # Get height and width of photo
    height, width, _ = photo.shape
    box_width = width // number_cols
    box_height = height // number_rows

    # To make sure that we can slice the photo into box-sized pieces
    width = (width // box_width) * box_width
    height = (height // box_height) * box_height
    photo = cv2.resize(photo, (width, height))

    # Create the Excel mosaic
    create_mosaic_in_excel(photo.copy(), box_height, box_width, col_width=2, row_height=15)


main()

Step 4: What to modify

The above tutorial assumes a photo of me in rune.png. I used the one taken from this page. You should obviously change it to something else.

You can change how many columns and rows the mosaic should span in the Excel sheet. This is done by changing the values of number_cols and number_rows.

Then you can change the values of col_width=2 and row_height=15.

In the YouTube video I use this free picture from Pexels (download) and modify number_cols = 100 and number_rows = 90, and col_width=1 and row_height=6.

How To Extract Numbers From Strings in an HTML Table and Export to Excel from Python

What will we cover in this tutorial?

How to import an HTML table into Excel.

But that is easy? You can do that directly from Excel.

Yes, but what if the entries contain numbers and strings together? Then the import will treat them as strings, making it difficult to extract the numbers.

Luckily, we will cover how to do that easily with Python.

Step 1: Get the dataset

Find your favorite HTML table online. For the purpose of this tutorial I will use this one from Wikipedia with List of Metro Systems.

View of HTML table of interest

Say we wanted to sum how many stations there are in this table (notice that the table contains more rows than shown in the picture above).

If you import the table directly into Excel with its import functionality, you will see that the Stations column is interpreted as strings. The problem is that a value looks like 19[13], while we are only interested in the number 19.

There is no built-in functionality to do that directly in Excel.

But let’s try to import this into Python. We will use Pandas to do that. If you are new to Pandas, please see this tutorial.

import pandas as pd


url = "https://en.wikipedia.org/wiki/List_of_metro_systems"
tables = pd.read_html(url)

print(tables[0].head())

Which will result in the following output.

           City    Country  ...          System length Annual ridership(millions)
0       Algiers    Algeria  ...  18.5 km (11.5 mi)[14]           45.3 (2019)[R 1]
1  Buenos Aires  Argentina  ...  56.7 km (35.2 mi)[16]          337.7 (2018)[R 2]
2       Yerevan    Armenia  ...   13.4 km (8.3 mi)[17]           20.2 (2019)[R 3]
3        Sydney  Australia  ...  36 km (22 mi)[19][20]  14.2 (2019) [R 4][R Nb 1]
4        Vienna    Austria  ...  83.3 km (51.8 mi)[21]          459.8 (2019)[R 6]

Where we have the same problem. If we inspect the type of the columns we get the following.

City                          object
Country                       object
Name                          object
Yearopened                    object
Year of lastexpansion         object
Stations                      object
System length                 object
Annual ridership(millions)    object
dtype: object

Where actually all columns are of type object, which here is equivalent to a string.

Step 2: Extract the numbers from Stations and System length column

Each table in tables is a DataFrame, which is Pandas’ main data structure.

As the strings we want to convert to integers contain more than just the numbers, we cannot use pd.to_numeric() directly.

We want to convert something of the form 19[13] to 19.

To do that easily, we will use the apply(…) method on the DataFrame.

The apply method takes a function as argument and applies it to each row.

We will use a lambda function as argument. If you are not familiar with lambda functions, please read this tutorial.

import pandas as pd


url = "https://en.wikipedia.org/wiki/List_of_metro_systems"
tables = pd.read_html(url)
table = tables[0]

table['Stations'] = table.apply(lambda row: int(row['Stations'].split('[')[0]), axis=1)
table['System length'] = table.apply(lambda row: float(row['System length'].split()[0]), axis=1)

print(table[['Stations', 'System length']].head())

Which will result in the following output.

   Stations  System length
0        19           18.5
1        90           56.7
2        10           13.4
3        13           36.0
4        98           83.3

This is what we want.
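As a side note, the same extraction can be done without apply by using Pandas' vectorized string methods on the original (string) columns. This is just an alternative sketch.

# Vectorized alternative (used instead of the apply calls, on the original string columns)
table['Stations'] = table['Stations'].str.split('[').str[0].astype(int)
table['System length'] = table['System length'].str.split().str[0].astype(float)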

Step 3: Export to Excel

Wow. This needs an entire step?

Well, of course it does.

Here we need to unleash the power of Pandas and use the to_excel(…) method.

import pandas as pd


url = "https://en.wikipedia.org/wiki/List_of_metro_systems"
tables = pd.read_html(url)
table = tables[0]

table['Stations'] = table.apply(lambda row: int(row['Stations'].split('[')[0]), axis=1)
table['System length'] = table.apply(lambda row: float(row['System length'].split()[0]), axis=1)

table.to_excel('output.xlsx')

This will result in an Excel file looking similar to this, where the Stations and System length columns are numeric and not strings.

Excel file now with Stations and System length as numbers and not strings

What’s next?

Want to learn more about Python and Excel?

Check out my online guide.

Create Excel Sheet with Stock Prices and Moving Average with Chart all from Python

What will we cover in this tutorial?

In this tutorial we will read historic stock prices, calculate the moving average, export the data to an Excel sheet, and insert a chart with the prices and the moving average. All of this will be done from Python using Pandas and Pandas-datareader.

Step 1: Get the historic stock prices

A great place to get historic stock prices is Pandas-datareader. It has a great interface to various data sources. In this tutorial we will use the Yahoo! Finance API through Pandas-datareader.

It doesn’t require any registration to use the API. It works straight out of the box.

To get stock prices in time series you need to find the ticker of your favorite stock. In this tutorial we will use Apple, which has ticker AAPL.

import pandas_datareader as pdr
import datetime as dt

start = dt.datetime(2020, 1, 1)
ticker = pdr.get_data_yahoo("AAPL", start)

print(ticker.head())

You also need to set a start date to determine how far back you want historic stock prices. By default, you will get data up to the most current date, with one row per trading day. You can use more arguments to configure weekly or monthly prices, and you can also set an end date if you like.

The above code should give output similar to the following.

                 High        Low       Open      Close       Volume  Adj Close
Date                                                                          
2020-01-02  75.150002  73.797501  74.059998  75.087502  135480400.0  73.840042
2020-01-03  75.144997  74.125000  74.287498  74.357498  146322800.0  73.122154
2020-01-06  74.989998  73.187500  73.447502  74.949997  118387200.0  73.704819
2020-01-07  75.224998  74.370003  74.959999  74.597504  108872000.0  73.358185
2020-01-08  76.110001  74.290001  74.290001  75.797501  132079200.0  74.538239
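As a side note on the arguments mentioned above, a minimal sketch with an explicit end date could look like this; the dates are just an example.

import pandas_datareader as pdr
import datetime as dt

start = dt.datetime(2020, 1, 1)
end = dt.datetime(2020, 12, 31)

# Only fetch prices within the chosen interval
ticker = pdr.get_data_yahoo("AAPL", start, end)
print(ticker.tail())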

Step 2: Calculate the Moving Average

To calculate the moving average (also called the simple moving average), we can use the rolling method on a DataFrame.

The rolling method takes one argument, which is the window size. This indicates how big a window we want to apply a function on. In this case we want to apply the mean function on a window of size 50.

import pandas_datareader as pdr
import pandas as pd
import datetime as dt

start = dt.datetime(2020, 1, 1)
ticker = pdr.get_data_yahoo("AAPL", start)

ticker['MA'] = ticker['Close'].rolling(50).mean()

print(ticker[50:].head())

This calculates the simple moving average with window size 50. It gives the same result as the moving average with the default window of 50 in Yahoo! Finance.

The output will be as follows.

                 High        Low       Open  ...       Volume  Adj Close        MA
Date                                         ...                                  
2020-03-16  64.769997  60.000000  60.487499  ...  322423600.0  59.687832  76.16100
2020-03-17  64.402496  59.599998  61.877499  ...  324056000.0  62.312309  75.93815
2020-03-18  62.500000  59.279999  59.942501  ...  300233600.0  60.786911  75.67250
2020-03-19  63.209999  60.652500  61.847500  ...  271857200.0  60.321156  75.40445
2020-03-20  62.957500  57.000000  61.794998  ...  401693200.0  56.491634  75.03470

Notice we removed the first 50 rows with the slice ticker[50:]. This is because the MA (moving average) column has no values until the 50-observation window is filled (strictly speaking, only the first 49 rows are empty, since we index from 0).
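A hedged alternative to slicing off a fixed number of rows is to drop exactly the rows where the moving average is still empty.

# Alternative: keep only the rows where the MA column has a value
ticker = ticker.dropna(subset=['MA'])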

Step 3: Export data to Excel and create a chart with close prices and moving average

Now this is where it all gets a bit more complicated. It takes some reading in the manual to figure it all out.

The code is commented to explain what happens.

import pandas_datareader as pdr
import pandas as pd
import datetime as dt

# Read the stock prices from Yahoo! Finance
start = dt.datetime(2020, 1, 1)
ticker = pdr.get_data_yahoo("AAPL", start)

# Calculate the moving average with window size 50
ticker['MA'] = ticker['Close'].rolling(50).mean()

# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('apple.xlsx', engine='xlsxwriter')

# Name the sheet
sheet_name = "Apple"

# We convert the index from datetime to date
# This makes the data in Excel only have the date and
# not the date with time: 00:00:00:0000
ticker.index = ticker.index.date

# Skip the first 50 rows (to be pedantic, 49 would be enough, as the MA starts at row 49)
ticker = ticker[50:]

# Convert the dataframe to an XlsxWriter Excel object.
ticker.to_excel(writer, sheet_name=sheet_name)

# Get the xlsxwriter workbook and worksheet objects.
workbook = writer.book
worksheet = writer.sheets[sheet_name]

# Widen the first column to display the dates.
worksheet.set_column('A:A', 12)

# Get the number of rows and column index
max_row = len(ticker)
col_ma = ticker.columns.get_loc('MA') + 1
col_close = ticker.columns.get_loc('Close') + 1

# Create a chart object of type line
chart = workbook.add_chart({'type': 'line'})

# Insert the first dataset into chart
chart.add_series({
    'name': "MA",
    'categories': [sheet_name, 1, 0, max_row, 0],
    'values': [sheet_name, 1, col_ma, max_row, col_ma],
})
# Insert the second dataset in the same chart
chart.add_series({
    'name': "Close",
    'values': [sheet_name, 1, col_close, max_row, col_close],
})

# Configure axis
chart.set_x_axis({
    'name': 'Date',
    'date_axis': True,
})
chart.set_y_axis({
    'name': '$',
    'major_gridlines': {'visible': False}
})

# Insert the chart into the worksheet.
worksheet.insert_chart('I2', chart)

# Close the Pandas Excel writer and output the Excel file.
writer.close()

The above code will create an Excel sheet looking like this.

Result

How to Concatenate Multiple CSV Files and Export to Excel from Python

What will we cover in this tutorial?

In this tutorial we will show how you can concatenate CSV files of the same format and export the result to an Excel sheet. All done in Python using Pandas.

This is something you want in your Python toolbox as it will make it easy for you to get quick wins helping people with manual work.

Step 0: Preparation

We need multiple CSV files in this tutorial. If you have CSV files with the same format, then you can skip to the next step.

First, let’s create some random data by using the power of NumPy. This can be done by using the uniform method in the NumPy random package.

import numpy as np
import pandas as pd


for file_name in ['data_0.csv', 'data_1.csv', 'data_2.csv']:
    df = pd.DataFrame(np.random.uniform(0, 100, size=(100, 4)), columns=list('ABCD'))
    df.to_csv(file_name)

As you can see, we generate the random data by calling np.random.uniform, which takes the lower and upper bound of the uniform interval to draw numbers from, while the size parameter sets the dimensions of the NumPy array. Here it generates 100 rows in 4 columns.

The DataFrame constructor converts the NumPy array to a DataFrame and sets the column names to A, B, C, and D, respectively. This is done by converting the string ‘ABCD’ to a list, resulting in [‘A’, ‘B’, ‘C’, ‘D’].

The content is stored to a csv file by simply calling to_csv on the DataFrame object.

The files will look similar to this (only the beginning of one here).

,A,B,C,D
0,59.74500080886668,59.63123820144265,47.53898343676622,50.31013442167491
1,31.00881083223196,30.91970178258706,93.2538048567809,41.95421609366061
2,32.03557607643186,96.34180274801857,70.8926221704894,90.79417905937711
3,22.44980451436659,61.25077305144971,24.013823071196082,31.401361361506062
4,71.03827853534764,20.527320250264925,29.379395377514662,94.89076442880472
5,76.1462091142596,20.509207361552917,66.57572207423806,69.38906836001392
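As a side note, the leading unnamed column in the files is the DataFrame index written by to_csv; it shows up as an Unnamed: 0 column when the files are read back in the next step. If you did not want it, you could pass index=False (in this tutorial we keep the default).

# Alternative: write the CSV without the index column
df.to_csv(file_name, index=False)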

Step 1: Read a CSV file with Pandas

Now it is time for the fun stuff. Read one csv file into a Pandas DataFrame.

This can be done by calling the read_csv method in the Pandas library.

import pandas as pd


df = pd.read_csv("data_0.csv")
print(df)

This gives the following output.

    Unnamed: 0          A          B          C          D
0            0  59.745001  59.631238  47.538983  50.310134
1            1  31.008811  30.919702  93.253805  41.954216
2            2  32.035576  96.341803  70.892622  90.794179
3            3  22.449805  61.250773  24.013823  31.401361
4            4  71.038279  20.527320  29.379395  94.890764

Or the beginning of it.

Now we are ready to read all the content files and concatenate them.

Step 2: Concatenate Multiple DataFrames into One DataFrame

First we need to read all the csv files. This can be done in many ways. Here we will iterate over them and append them to a list.

After that we will concatenate them all.

This can be done by using the concat method in the Pandas library. The concat function takes an iterable as the first argument. That is great, because a list is an iterable.

import pandas as pd


df_list = []
for file_name in ['data_0.csv', 'data_1.csv', 'data_2.csv']:
    df = pd.read_csv(file_name)
    df_list.append(df)

df_concat = pd.concat(df_list)
print(df_concat)

Which results in a single DataFrame with 300 rows.

    Unnamed: 0          A          B          C          D
0            0  59.745001  59.631238  47.538983  50.310134
1            1  31.008811  30.919702  93.253805  41.954216
..         ...        ...        ...        ...        ...
98          98  65.831197  47.070415  65.205620  64.701177
99          99  27.259875  97.514085   8.444346  65.991040

[300 rows x 5 columns]

Notice that the index restarts for each concatenated file. We can handle that by adding ignore_index=True to the concat call.

import pandas as pd


df_list = []
for file_name in ['data_0.csv', 'data_1.csv', 'data_2.csv']:
    df = pd.read_csv(file_name)
    df_list.append(df)

df_concat = pd.concat(df_list, ignore_index=True)
print(df_concat)

Which results in the following result.

     Unnamed: 0          A          B          C          D
0             0  59.745001  59.631238  47.538983  50.310134
1             1  31.008811  30.919702  93.253805  41.954216
..          ...        ...        ...        ...        ...
298          98  65.831197  47.070415  65.205620  64.701177
299          99  27.259875  97.514085   8.444346  65.991040

[300 rows x 5 columns]

Now the index runs continuously and ends at 299.

Step 3: Write the output to an Excel file

This is simply done by using the to_excel method on the DataFrame object. It takes the file name as an argument, which should have the xlsx extension.

Notice that some installations of Pandas do not include the XlsxWriter module, so you may need to install it. You will know when you run the code below, as it will say the module is missing. You can then install it using pip.

pip install xlsxwriter

Or if you are in PyCharm, you can simply write import xlsxwriter on a line, hover over the red underline on xlsxwriter, and select install. Then it should install it for you.

Now to the full code.

import pandas as pd


df_list = []
for file_name in ['data_0.csv', 'data_1.csv', 'data_2.csv']:
    df = pd.read_csv(file_name)
    df_list.append(df)

df_concat = pd.concat(df_list, ignore_index=True)
df_concat.to_excel("output.xlsx")

This results in an Excel file similar to this.

Resulting output.

Quick Tutorial on Pandas to Excel in 3 Steps – Master the Basics

What will we cover in this tutorial?

We will cover the basic process of saving data from a Pandas DataFrame to an Excel sheet. This includes

  • The simple export of a DataFrame to Excel from Python.
  • How to set the Sheet name in Excel from Python.
  • To set which columns to export from the DataFrame to Excel.

Step 1: The simple export of a full DataFrame

Often you just need to figure out how to save a full DataFrame to an Excel sheet. This is a common need: you have collected the data in Python, but want to explore it or work more on it in Excel.

If you are new to Pandas, we can recommend to read this tutorial.

Pandas is a great library to work with data similar to an Excel sheet. The advantage of Pandas is that you can work with the data in Python and automate many things. That could be collecting data from APIs on the internet or simply automating the stuff you do manually in Excel.

Another beauty of Pandas is that it has great built-in methods that do all the work for you.

Here we will demonstrate how a call to to_excel(…) can be done on a DataFrame.

import pandas as pd
import numpy as np


# Generate some random increasing data
df = pd.DataFrame(
    {'A': [np.random.uniform(0.1*i, 0.1*i + 1) for i in range(100)],
     'B': [np.random.uniform(0.1*i, 0.1*i + 1) for i in range(100)]}
)


df.to_excel("output.xlsx")

The code above creates a random DataFrame using the uniform function from NumPy’s random module.

Then it continues to call to_excel on the data frame with the output Excel file name. This will create an Excel sheet, which could look similar to this.

Step 2: How to set the sheet name in Excel from Python

Sometimes you want your data more organized. Here we are going to show how to set the sheet name where you export your data.

The default sheet name is Sheet1, hence in the previous step you would notice that the created sheet has that name.

Sometimes you want a different name. Luckily, this is simple.

import pandas as pd
import numpy as np


# Generate some random increasing data
df = pd.DataFrame(
    {'A': [np.random.uniform(0.1*i, 0.1*i + 1) for i in range(100)],
     'B': [np.random.uniform(0.1*i, 0.1*i + 1) for i in range(100)]}
)


df.to_excel("output.xlsx", sheet_name="My Sheet")

Notice that we only set the sheet_name parameter in the call. That is all it takes.

How did I know? I read the documentation.

This creates an Excel file with the desired sheet name.
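If you want several DataFrames in the same file, each on its own sheet, you can combine this with an ExcelWriter. A minimal sketch, assuming two illustrative DataFrames:

import pandas as pd
import numpy as np

# Two small example DataFrames (illustrative data only)
df_a = pd.DataFrame({'A': np.arange(10)})
df_b = pd.DataFrame({'B': np.arange(10) ** 2})

# One writer, several sheets in the same Excel file
with pd.ExcelWriter("output.xlsx") as writer:
    df_a.to_excel(writer, sheet_name="First Sheet")
    df_b.to_excel(writer, sheet_name="Second Sheet")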

Step 3: Only export the specific columns

Often it happens that you are only interested in exporting some of the columns in your DataFrame. Luckily, this is also simple to do from the standard export function on the DataFrame.

import pandas as pd
import numpy as np


# Generate some random increasing data
df = pd.DataFrame(
    {'A': [np.random.uniform(0.1*i, 0.1*i + 1) for i in range(100)],
     'B': [np.random.uniform(0.1*i, 0.1*i + 1) for i in range(100)],
     'C': [np.random.uniform(0.1*i, 0.1*i + 1) for i in range(100)],
     'D': [np.random.uniform(0.1*i, 0.1*i + 1) for i in range(100)],
     'E': [np.random.uniform(0.1*i, 0.1*i + 1) for i in range(100)],
     'F': [np.random.uniform(0.1*i, 0.1*i + 1) for i in range(100)]}
)


df.to_excel("output.xlsx", sheet_name="My Sheet", columns=['B', 'E', 'F'])

In the above example we create a DataFrame with a few columns. We only export columns B, E, and F.

This is set directly in the to_excel method, which can set the parameter columns. The parameter can accept a list of the column names to export to the Excel sheet.
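to_excel has a few more parameters in the same spirit, for example to leave out the index or start writing at a different cell. A short hedged sketch (the parameter values are just examples):

# Export selected columns without the index, starting a couple of rows/columns in
df.to_excel("output.xlsx",
            sheet_name="My Sheet",
            columns=['B', 'E', 'F'],
            index=False,   # do not write the DataFrame index
            startrow=2,    # first row of the output (0-indexed)
            startcol=1)    # first column of the output (0-indexed)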

Next step

This is only to get you started and see how simple it is to work with data from Python using Pandas. This enables you to make great things with only a few lines of code.

How to Export Pandas DataFrame to Excel and Create a Trendline Graph of Scatter Plot

What will we cover in this tutorial?

We will have some data in a Pandas DataFrame, which we want to export to an Excel sheet. Then we want to create a scatter plot and fit an Excel trendline to it.

Step 1: Get the data

You might have some data already that you want to use. It can be from an HTML page (example) or a CSV file.

For this purpose here we just generate some random data to use. We will use NumPy’s uniform function to generate it.

import pandas as pd
import numpy as np


# Generate some random increasing data
data = pd.DataFrame(
    {'A': [np.random.uniform(0.1*i, 0.1*i + 1) for i in range(100)],
     'B': [np.random.uniform(0.1*i, 0.1*i + 1) for i in range(100)]}
)
print(data)

Which will generate some slightly increasing data, which is nice to fit a graph to.

The output could look something like this.

            A          B
0    0.039515   0.778077
1    0.451888   0.210705
2    0.992493   0.961428
3    0.317536   1.046444
4    1.220419   1.388086

Step 2: Create an Excel XlsxWriter engine

This step might require that you install the XlsxWriter library, which is needed by the Pandas ExcelWriter.

This can be done by the following command.

pip install xlsxwriter

Now we can create the engine in our code.

import pandas as pd
import numpy as np


# Generate some random increasing data
data = pd.DataFrame(
    {'A': [np.random.uniform(0.1*i, 0.1*i + 1) for i in range(100)],
     'B': [np.random.uniform(0.1*i, 0.1*i + 1) for i in range(100)]}
)

# Create a Pandas Excel writer using XlsxWriter
excel_file = 'output.xlsx'
sheet_name = 'Data set'
writer = pd.ExcelWriter(excel_file, engine='xlsxwriter')

This will set up an Excel writer engine ready to write to the file output.xlsx.

Step 3: Write the data to Excel and create a scatter graph with a fitted Trendline

This can be done with the following code, which uses the add_series method to insert the data series into the chart.

import pandas as pd
import numpy as np


# Generate some random increasing data
data = pd.DataFrame(
    {'A': [np.random.uniform(0.1*i, 0.1*i + 1) for i in range(100)],
     'B': [np.random.uniform(0.1*i, 0.1*i + 1) for i in range(100)]}
)

# Create a Pandas Excel writer using XlsxWriter
excel_file = 'output.xlsx'
sheet_name = 'Data set'
writer = pd.ExcelWriter(excel_file, engine='xlsxwriter')
data.to_excel(writer, sheet_name=sheet_name)

# Access the XlsxWriter workbook and worksheet objects from the dataframe.
workbook = writer.book
worksheet = writer.sheets[sheet_name]

# Create a scatter chart object.
chart = workbook.add_chart({'type': 'scatter'})

# Get the number of rows and column index
max_row = len(data)
col_x = data.columns.get_loc('A') + 1
col_y = data.columns.get_loc('B') + 1

# Create the scatter plot, use a trendline to fit it
chart.add_series({
    'name':       "Samples",
    'categories': [sheet_name, 1, col_x, max_row, col_x],
    'values':     [sheet_name, 1, col_y, max_row, col_y],
    'marker':     {'type': 'circle', 'size': 4},
    'trendline': {'type': 'linear'},
})

# Set name on axis
chart.set_x_axis({'name': 'Concentration'})
chart.set_y_axis({'name': 'Measured',
                  'major_gridlines': {'visible': False}})

# Insert the chart into the worksheet in field D2
worksheet.insert_chart('D2', chart)

# Close and save the Excel file
writer.close()

Result

The result should be similar to this.

The resulting Excel sheet.

That is how it can be done.