## 3 Easy Steps to Get Started With Machine Learning: Understand the Concept and Implement Linear Regression in Python

• How does Machine Learning work?
• A first example of Linear Regression in Python

Machine Learning is a hot topic these days, and it is easy to get confused when people talk about it. But what is Machine Learning, and how can it help you?

A good way to frame it is the contrast between the classical and the modern (machine learning) approach to predictions.

In the classical computing model, everything is programmed into the algorithms. This has the limitation that all decision logic needs to be understood before use, and if things change, we need to modify the program.

With the modern computing model (Machine Learning), this paradigm changes. We feed the algorithm with data, and based on that data, the program makes the decisions.

While this can seem abstract, it is a big change in how we think about programming. Machine Learning has enabled computers to solve problems like:

• Improved search engine results.
• Voice recognition.
• Number plate recognition.
• Categorisation of pictures.
• …and the list goes on.
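The contrast between the two models can be sketched in a few lines of Python. This is only a toy illustration (the height data and the midpoint rule are invented for this sketch), not a real learning algorithm:

```python
# Classical approach: the decision rule is hand-coded by the programmer.
def is_tall_classical(height_cm):
    return height_cm > 180  # threshold chosen up front

# ML-style approach: derive the rule from labelled examples.
def learn_threshold(examples):
    """Learn a cut-off as the midpoint between the two class means."""
    tall = [h for h, label in examples if label == "tall"]
    short = [h for h, label in examples if label == "short"]
    return (sum(tall) / len(tall) + sum(short) / len(short)) / 2

training = [(190, "tall"), (185, "tall"), (160, "short"), (165, "short")]
threshold = learn_threshold(training)
print(threshold)  # -> 175.0
```

If the data changes, the classical function must be edited by hand, while the learned threshold just needs retraining on the new examples.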

## Step 2: How does Machine Learning work?

On a high level you can divide Machine Learning into two phases.

• Phase 1: Learning
• Phase 2: Prediction

The Learning phase is divided into steps.

It all starts with a training set (training data). This data set should represent the type of data that the Machine Learning model will be asked to predict from in Phase 2 (prediction).

The pre-processing step is about cleaning up the data. While Machine Learning is powerful, it cannot figure out on its own what good data looks like. You need to do the cleaning, as well as transform the data into the desired format.
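A minimal sketch of such a cleaning step, using made-up raw values: entries that cannot be parsed as numbers are dropped, and the rest are converted to floats.

```python
# Raw data as it might arrive: mixed types, missing values, stray whitespace.
raw_sizes = [" 50", "60", None, "35", "n/a", "55"]

def clean(values):
    """Keep only the entries that parse as numbers, converted to floats."""
    result = []
    for v in values:
        try:
            result.append(float(str(v).strip()))
        except (TypeError, ValueError):
            continue  # drop anything that is not a number
    return result

print(clean(raw_sizes))  # -> [50.0, 60.0, 35.0, 55.0]
```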

Then comes the magic: the learning step. There are three main paradigms in machine learning.

• Supervised: you tell the algorithm which category each data item belongs to. Each data item in the training set is tagged with the right answer.
• Unsupervised: the learning algorithm is not given any answers and has to find the structure in the data by itself.
• Reinforcement: the machine learns from the rewards of its past actions.

Finally, testing is done to see whether the model is good. The training data is divided into a training set and a test set. The test set is used to check whether the model predicts well on data it has not seen. If not, a new model might be necessary.
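The idea behind that division can be sketched by hand (sklearn's train_test_split does this for you, so this is only an illustration; the 25% hold-out ratio mirrors sklearn's default):

```python
import random

def split(data, test_ratio=0.25, seed=42):
    """Shuffle, then hold out test_ratio of the items for testing."""
    items = data[:]
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    n_test = int(len(items) * test_ratio)
    return items[n_test:], items[:n_test]  # (train, test)

train, test = split(list(range(20)))
print(len(train), len(test))  # -> 15 5
```

The model only ever sees `train` during learning; `test` is reserved for measuring how well it generalises.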

After that, the Prediction phase begins.

Once the model has been created, it can be used to make predictions from new data.

## Step 3: A first example of Linear Regression in Python

### Installing the libraries

Linear regression is a linear approach to modelling the relationship between a scalar response and one or more explanatory variables. In our case, we will model it for one single variable. Said another way, we want to map points on a graph to a line (y = a*x + b).

To do that, we need to import various libraries.

```
# Importing matplotlib to make a plot
import matplotlib.pyplot as plt
# work with number as array
import numpy as np
# we want to use linear_model (that uses datasets)
from sklearn import linear_model
```

The matplotlib library is used here to make a plot, but it is a comprehensive library for creating static, animated, and interactive visualizations in Python. If you do not have it installed, you can do so by typing the following command in a terminal.

```
pip install matplotlib
```

The numpy library is a powerful library for calculating with N-dimensional arrays. If needed, you can install it by typing the following command in a terminal.

```
pip install numpy
```

Finally, you need the linear_model from the sklearn library, which you can install by typing the following command in a terminal.

```
pip install scikit-learn
```

### Training data set

This simple example will let you make a linear regression of the following data set.

```
# data set
prices = [245, 312, 279, 308, 199, 409, 200, 400, 230]
size = [50, 60, 35, 55, 30, 65, 30, 75, 25]
```

Here some items are sold, each with a size. The first item was sold for 245 (\$) and had a size of 50 (in some unit). The next item was sold for 312 (\$) and had a size of 60.

The sizes need to be reshaped before we model them.

```
# Importing matplotlib and numpy and sklearn
import matplotlib.pyplot as plt
# work with number as array
import numpy as np
# we want to use linear_model (that uses datasets)
from sklearn import linear_model

# data set
prices = [245, 312, 279, 308, 199, 409, 200, 400, 230]
size = [50, 60, 35, 55, 30, 65, 30, 75, 25]

# reshape the input for regression (the -1 lets numpy infer the number of rows, 1 column)
size2 = np.array(size).reshape((-1, 1))
print(size2)
```

Which results in the following output.

```
[[50]
 [60]
 [35]
 [55]
 [30]
 [65]
 [30]
 [75]
 [25]]
```

Hence, reshape((-1, 1)) transforms the row of values into a column: an array with one entry per row.
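You can verify the effect of the reshape by comparing the shapes before and after:

```python
import numpy as np

size = [50, 60, 35, 55, 30, 65, 30, 75, 25]
size2 = np.array(size).reshape((-1, 1))

print(np.array(size).shape)  # (9,)  -> one row of nine values
print(size2.shape)           # (9, 1) -> nine rows of one value each
```

sklearn's fit method expects this two-dimensional shape: one row per sample, one column per feature.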

Then for the linear regression.

```
# Importing matplotlib and numpy and sklearn
import matplotlib.pyplot as plt
# work with number as array
import numpy as np
# we want to use linear_model (that uses datasets)
from sklearn import linear_model

# data set
prices = [245, 312, 279, 308, 199, 409, 200, 400, 230]
size = [50, 60, 35, 55, 30, 65, 30, 75, 25]

# reshape the input for regression (the -1 lets numpy infer the number of rows, 1 column)
size2 = np.array(size).reshape((-1, 1))
print(size2)

regr = linear_model.LinearRegression()
regr.fit(size2, prices)
print("Coefficients", regr.coef_)
print("Intercept", regr.intercept_)
```

This prints the coefficient (a) and the intercept (b) of the formula y = a*x + b.

Now you can predict future prices, when given a size.

```
# How to predict
size_new = 60
price = size_new * regr.coef_ + regr.intercept_
print(price)
print(regr.predict([[size_new]]))
```

You can either compute the prediction directly from the coefficients (third line) or use the model's predict method (last line).
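As a sanity check, the same line can be fitted with numpy alone via polyfit, which minimises the same least-squares criterion and should give roughly the same coefficient and intercept:

```python
import numpy as np

prices = [245, 312, 279, 308, 199, 409, 200, 400, 230]
size = [50, 60, 35, 55, 30, 65, 30, 75, 25]

# Fit y = a*x + b by least squares (slope a, intercept b)
a, b = np.polyfit(size, prices, deg=1)
print(a, b)
print(a * 60 + b)  # predicted price for a size of 60
```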

Finally, you can plot the linear regression as a graph.

```
# Importing matplotlib and numpy and sklearn
import matplotlib.pyplot as plt
# work with number as array
import numpy as np
# we want to use linear_model (that uses datasets)
from sklearn import linear_model

# data set
prices = [245, 312, 279, 308, 199, 409, 200, 400, 230]
size = [50, 60, 35, 55, 30, 65, 30, 75, 25]

# reshape the input for regression (the -1 lets numpy infer the number of rows, 1 column)
size2 = np.array(size).reshape((-1, 1))
print(size2)

regr = linear_model.LinearRegression()
regr.fit(size2, prices)

# Here we plot the graph
x = np.array(range(20, 100))
y = regr.coef_*x + regr.intercept_
plt.plot(x, y)
plt.scatter(size, prices, color='black')
plt.ylabel('prices')
plt.xlabel('size')
plt.show()
```

Which results in the following graph.

## Conclusion

This is obviously a simple example of linear regression, as it only has one variable. Still, it shows you how to set up the environment in Python and how to make a simple plot.

## What will we learn in this tutorial

• How Machine Learning works and predicts.
• What you need to install to implement your Prediction Model in Python
• A simple way to implement a Prediction Model in Python with persistence
• How to simplify the connection to the Twitter API using tweepy
• Collect the training dataset from Twitter interactively in a Python program
• Use the persistent model to predict the tweets you like

## Step 1: Quick introduction to Machine Learning

Machine Learning in a nutshell: the input to the Learner is Features X (a data set) with Targets Y. The Learner outputs a Model, which can predict a target (Y) for future inputs (X).

• The Learner (or Machine Learning Algorithm) is the program that creates a machine learning model from the input data.
• The Features X is the dataset used by the Learner to generate the Model.
• The Targets Y contain the category for each data item in the Features X dataset.
• The Model takes new inputs X (similar to those in Features) and predicts a target Y, from the categories in Targets Y.
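As a toy illustration of Learner, Features, Targets, and Model (the numbers and the nearest-mean rule are invented for this sketch, not part of the tutorial's model):

```python
# Features X and Targets Y
X = [1.0, 1.2, 0.9, 5.0, 5.2, 4.8]                  # Features
Y = ["low", "low", "low", "high", "high", "high"]   # Targets

def learner(features, targets):
    """A Learner: returns a Model that classifies by the nearest class mean."""
    means = {}
    for label in set(targets):
        vals = [x for x, y in zip(features, targets) if y == label]
        means[label] = sum(vals) / len(vals)
    def model(x):
        # predict the label whose mean is closest to the new input
        return min(means, key=lambda label: abs(x - means[label]))
    return model

model = learner(X, Y)
print(model(1.1))  # -> low
print(model(4.9))  # -> high
```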

We will implement a simple model that can classify tweets from a Twitter feed into two categories: allow and reject.

## Step 2: Install sklearn library (skip if you already have it)

The Python code will be using the sklearn library.

You can install it by simply writing the following in the command line.

```
pip install scikit-learn
```

Alternatively, you might want to install it locally in your user space.

```
pip install scikit-learn --user
```

## Step 3: Create a simple Prediction Model in Python to Train and Predict on tweets

The implementation encapsulates the machine learning model in a class. The class has the following methods.

• create_dataset: Creates a dataset from a list of data items representing allow and a list of data items representing reject. The dataset is divided into features and targets.
• train_dataset: Once the dataset is loaded, it is trained to create the model, consisting of the predictor (transfer and estimator).
• predict: Called after the model is trained. It predicts whether an input belongs to the allow category.
• persist: Saves the model for later use, so that we do not need to collect data and train again. It should only be called after the dataset has been created and the model has been trained (after create_dataset and train_dataset).
```
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import joblib

class PredictionModel:
    def __init__(self):
        self.predictor = {}
        self.dataset = {'features': [], 'targets': []}
        self.allow_id = 0
        self.reject_id = 1

    def create_dataset(self, allow_data, reject_data):
        features_x = allow_data + reject_data
        targets_y = [self.allow_id]*len(allow_data) + [self.reject_id]*len(reject_data)
        self.dataset = {'features': features_x, 'targets': targets_y}

    def train_dataset(self):
        x_train, x_test, y_train, y_test = train_test_split(self.dataset['features'], self.dataset['targets'])

        transfer = TfidfVectorizer()
        x_train = transfer.fit_transform(x_train)
        x_test = transfer.transform(x_test)

        estimator = MultinomialNB()
        estimator.fit(x_train, y_train)

        score = estimator.score(x_test, y_test)
        self.predictor = {'transfer': transfer, 'estimator': estimator}

    def predict(self, text):
        sentence_x = self.predictor['transfer'].transform([text])
        y_predict = self.predictor['estimator'].predict(sentence_x)
        return y_predict[0] == self.allow_id

    def persist(self, output_name):
        joblib.dump(self.predictor['transfer'], output_name + ".transfer")
        joblib.dump(self.predictor['estimator'], output_name + ".estimator")
```
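persist writes two files with joblib.dump; reading them back later is the mirror operation with joblib.load. A minimal sketch with a stand-in dictionary instead of a real trained model (the file name demo.model is made up for this sketch):

```python
import joblib

# Stand-in for a trained predictor (a real one would hold the
# fitted TfidfVectorizer and MultinomialNB objects).
model_state = {'vocabulary': ['spacex', 'curfew'], 'weights': [0.7, 0.3]}
joblib.dump(model_state, "demo.model")   # what persist() does per part

restored = joblib.load("demo.model")     # the corresponding load step
print(restored == model_state)  # -> True
```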

## Step 4: Get Twitter API access

To use the Twitter API you need a set of API keys for your developer account. Fill your values into the following dictionary.

```
api_key = {
'consumer_key': "",
'consumer_secret': "",
'access_token': "",
'access_token_secret': ""
}
```

If you do not already have the tweepy library, install it with:

```
pip install tweepy
```

```
import tweepy

class TwitterConnection:
    def __init__(self, api_key):
        # authentication of consumer key and secret
        auth = tweepy.OAuthHandler(api_key['consumer_key'], api_key['consumer_secret'])
        auth.set_access_token(api_key['access_token'], api_key['access_token_secret'])
        self.api = tweepy.API(auth)

    def get_tweets(self, user_name, number=0):
        if number > 0:
            return tweepy.Cursor(self.api.user_timeline, screen_name=user_name, tweet_mode="extended").items(number)
        else:
            return tweepy.Cursor(self.api.user_timeline, screen_name=user_name, tweet_mode="extended").items()
```
• __init__: The class sets up the Twitter API in the init-function.
• get_tweets: Returns the tweets from a user_name (screen_name).

## Step 6: Collect the dataset (Features X and Target Y) from Twitter

To simplify your life, you will use the TwitterConnection class and the PredictionModel class from above.

```
def get_features(auth, user_name, output_name):
    tweets = TwitterConnection(auth).get_tweets(user_name)
    positives = []
    negatives = []
    for tweet in tweets:
        print(tweet.full_text)
        print("a/r/e (allow/reject/end)? ", end='')
        response = input()
        if response.lower() == 'a':
            positives.append(tweet.full_text)
        elif response.lower() == 'e':
            break
        else:
            negatives.append(tweet.full_text)
    model = PredictionModel()
    model.create_dataset(positives, negatives)
    model.train_dataset()
    model.persist(output_name)
```

The function reads the tweets from user_name and, for each one, prompts you to decide whether it should be allowed or rejected.

When you do not feel like “training” your set any more (i.e., collecting more training data), you can press e.

Then it creates the dataset, trains the model, and finally persists it.

## Step 7: See how well your model predicts tweets

The following code prints the first number of tweets from user_name that your model allows.

```
def fetch_tweets_prediction(auth, user_name, input_name, number):
    model = PredictionModel()
    # load the persisted model instead of training again
    model.predictor = {'transfer': joblib.load(input_name + ".transfer"),
                       'estimator': joblib.load(input_name + ".estimator")}
    tweets = TwitterConnection(auth).get_tweets(user_name)
    for tweet in tweets:
        if model.predict(tweet.full_text):
            print(tweet.full_text)
            number -= 1
        if number < 0:
            break
```

Then your final piece is to call it. Remember to fill out your values for the api_key.

```
api_key = {
'consumer_key': "",
'consumer_secret': "",
'access_token': "",
'access_token_secret': ""
}

get_features(api_key, "@cnnbrk", "cnnbrk")
fetch_tweets_prediction(api_key, "@cnnbrk", "cnnbrk", 10)

```

## Conclusion

I trained my model with 30-40 tweets using the above code. On the training set it did not produce any false allows (an allow that was a reject in the dataset), but it did produce some false rejects.
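False allows and false rejects are just counts over (actual, predicted) pairs; a tiny counting sketch with made-up labels:

```python
# Made-up evaluation labels for illustration.
actual    = ["allow", "reject", "allow", "allow", "reject"]
predicted = ["allow", "reject", "reject", "allow", "reject"]

# predicted allow, actually reject
false_allows = sum(1 for a, p in zip(actual, predicted)
                   if p == "allow" and a == "reject")
# predicted reject, actually allow
false_rejects = sum(1 for a, p in zip(actual, predicted)
                    if p == "reject" and a == "allow")

print(false_allows, false_rejects)  # -> 0 1
```

This mirrors the outcome above: no false allows, but one tweet the model rejected even though it should have been allowed.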

The full code is here.

```
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import joblib
import tweepy

class PredictionModel:
    def __init__(self):
        self.predictor = {}
        self.dataset = {'features': [], 'targets': []}
        self.allow_id = 0
        self.reject_id = 1

    def create_dataset(self, allow_data, reject_data):
        features_x = allow_data + reject_data
        targets_y = [self.allow_id]*len(allow_data) + [self.reject_id]*len(reject_data)
        self.dataset = {'features': features_x, 'targets': targets_y}

    def train_dataset(self):
        x_train, x_test, y_train, y_test = train_test_split(self.dataset['features'], self.dataset['targets'])

        transfer = TfidfVectorizer()
        x_train = transfer.fit_transform(x_train)
        x_test = transfer.transform(x_test)

        estimator = MultinomialNB()
        estimator.fit(x_train, y_train)

        score = estimator.score(x_test, y_test)
        self.predictor = {'transfer': transfer, 'estimator': estimator}

    def predict(self, text):
        sentence_x = self.predictor['transfer'].transform([text])
        y_predict = self.predictor['estimator'].predict(sentence_x)
        return y_predict[0] == self.allow_id

    def persist(self, output_name):
        joblib.dump(self.predictor['transfer'], output_name + ".transfer")
        joblib.dump(self.predictor['estimator'], output_name + ".estimator")

class TwitterConnection:
    def __init__(self, api_key):
        # authentication of consumer key and secret
        auth = tweepy.OAuthHandler(api_key['consumer_key'], api_key['consumer_secret'])
        auth.set_access_token(api_key['access_token'], api_key['access_token_secret'])
        self.api = tweepy.API(auth)

    def get_tweets(self, user_name, number=0):
        if number > 0:
            return tweepy.Cursor(self.api.user_timeline, screen_name=user_name, tweet_mode="extended").items(number)
        else:
            return tweepy.Cursor(self.api.user_timeline, screen_name=user_name, tweet_mode="extended").items()

def get_features(auth, user_name, output_name):
    tweets = TwitterConnection(auth).get_tweets(user_name)
    positives = []
    negatives = []
    for tweet in tweets:
        print(tweet.full_text)
        print("y/n/e (positive/negative/end)? ", end='')
        response = input()
        if response.lower() == 'y':
            positives.append(tweet.full_text)
        elif response.lower() == 'e':
            break
        else:
            negatives.append(tweet.full_text)
    model = PredictionModel()
    model.create_dataset(positives, negatives)
    model.train_dataset()
    model.persist(output_name)

def fetch_tweets_prediction(auth, user_name, input_name, number):
    model = PredictionModel()
    # load the persisted model instead of training again
    model.predictor = {'transfer': joblib.load(input_name + ".transfer"),
                       'estimator': joblib.load(input_name + ".estimator")}
    tweets = TwitterConnection(auth).get_tweets(user_name)
    for tweet in tweets:
        if model.predict(tweet.full_text):
            print("POS", tweet.full_text)
            number -= 1
        else:
            pass
            # print("NEG", tweet.full_text)
        if number < 0:
            break

api_key = {
    'consumer_key': "_",
    'consumer_secret': "_",
    'access_token': "_-_",
    'access_token_secret': "_"
}

get_features(api_key, "@cnnbrk", "cnnbrk")
fetch_tweets_prediction(api_key, "@cnnbrk", "cnnbrk", 10)
```

## What will you learn?

• How to predict from a dataset with Machine Learning
• How to implement that in Python
• How to get data from Twitter
• How to install the necessary libraries to do Machine Learning in Python

## Step 1: Install the necessary libraries

The sklearn (scikit-learn) library provides simple and efficient tools for predictive data analysis.

You can install it by typing in the following in your command line.

```
pip install scikit-learn
```

It will most likely install a few more required libraries.

```
Collecting sklearn
Collecting scikit-learn
|████████████████████████████████| 7.2 MB 5.0 MB/s
Collecting numpy>=1.13.3
|████████████████████████████████| 15.2 MB 12.6 MB/s
Collecting joblib>=0.11
|████████████████████████████████| 298 kB 8.1 MB/s
Collecting scipy>=0.19.1
|████████████████████████████████| 28.8 MB 5.8 MB/s
Using legacy setup.py install for sklearn, since package 'wheel' is not installed.
Installing collected packages: numpy, joblib, threadpoolctl, scipy, scikit-learn, sklearn
Running setup.py install for sklearn ... done
Successfully installed joblib-0.15.1 numpy-1.18.4 scikit-learn-0.23.1 scipy-1.4.1 sklearn-0.0 threadpoolctl-2.1.0
```

In my installation, it pulled in numpy, joblib, threadpoolctl, scipy, and scikit-learn.

## Step 2: The dataset

The machine learning algorithm needs a dataset to train on. To keep this tutorial simple, I only used a limited set. I looked through the top tweets from CNN Breaking and categorised them into positive and negative tweets (I know this can be subjective).

```
negative = [
"Protesters who were marching from Minneapolis to St. Paul were tear gassed by police as they tried to cross the Lake Street Marshall Bridge ",
"The National Guard has been activated in Washington, D.C. to assist police handling protests around the White House",
"Police have been firing tear gas at the protesters near the 5th Precinct in Minneapolis, where some in the crowd have responded with projectiles of their own",
"Texas and Colorado have activated the National Guard respond to protests",
"The mayor of Rochester, New York, has declared a state of emergency and ordered a curfew from 9 p.m. Saturday to 7 a.m. Sunday",
"Cleveland, Ohio, has enacted a curfew that will go into effect at 8 p.m. Saturday and last through 8 a.m. Sunday",
"A police car appears to be on fire in Los Angeles. Police officers are holding back a line of demonstrators to prevent them from getting close to the car."
]

positive = [
"Two NASA astronauts make history with their successful launch into space aboard a SpaceX rocket",
"After questionable weather, officials give the all clear for the SpaceX launch",
"NASA astronauts Bob Behnken and Doug Hurley climb aboard SpaceX's Crew Dragon spacecraft as they prepare for a mission to the International Space Station",
"New York Gov. Andrew Cuomo signs a bill giving death benefits to families of frontline workers who died battling the coronavirus pandemic"
]
```

## Step 3: Train the model

The data needs to be categorised before it can be fed into the training algorithm. Hence, we create the required data set structure.

```
def prepare_data(positive, negative):
    data = positive + negative
    target = [0] * len(positive) + [1] * len(negative)
    return {'data': data, 'target': target}
```
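For a quick self-contained check of the structure, the function is repeated here with the target line written out (positives map to 0, negatives to 1, matching the "0 good, 1 bad" convention used later) and applied to a few made-up strings:

```python
def prepare_data(positive, negative):
    data = positive + negative
    target = [0] * len(positive) + [1] * len(negative)  # 0 = positive, 1 = negative
    return {'data': data, 'target': target}

data_set = prepare_data(["good launch", "bill signed"], ["tear gas"])
print(data_set['target'])  # -> [0, 0, 1]
```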

The actual training is done by using the sklearn library.

```
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

def train_data_set(data_set):
    x_train, x_test, y_train, y_test = train_test_split(data_set['data'], data_set['target'])

    transfer = TfidfVectorizer()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)

    estimator = MultinomialNB()
    estimator.fit(x_train, y_train)

    score = estimator.score(x_test, y_test)
    print("score:\n", score)
    return {'transfer': transfer, 'estimator': estimator}
```
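TfidfVectorizer weights each word by how frequent it is in a sentence (TF) and how rare it is across sentences (IDF). The core idea can be hand-rolled; this is the plain TF-IDF formula on made-up sentences (sklearn uses a smoothed variant, so the exact numbers differ):

```python
import math

docs = [
    "nasa launch success",
    "police fire tear gas",
    "nasa astronauts launch",
]

def tf_idf(term, doc, docs):
    """Term frequency in doc, scaled by how rare the term is across docs."""
    words = doc.split()
    tf = words.count(term) / len(words)
    df = sum(1 for d in docs if term in d.split())  # document frequency
    idf = math.log(len(docs) / df)
    return tf * idf

print(tf_idf("nasa", docs[0], docs))    # appears in 2 of 3 docs -> lower weight
print(tf_idf("police", docs[1], docs))  # unique to one doc -> higher weight
```

Words shared across many documents carry little signal for classification, so they get pushed towards zero, while distinctive words dominate.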

## Step 4: Get some tweets from CNN Breaking and predict

In order for this step to work, you need to set up tokens for the Twitter API, as shown in Step 4 above.

When you have that you can use the following code to get it running.

```
import tweepy

consumer_key = "REPLACE WITH YOUR KEY"
consumer_secret = "REPLACE WITH YOUR SECRET"
access_token = "REPLACE WITH YOUR TOKEN"
access_token_secret = "REPLACE WITH YOUR TOKEN SECRET"

# authentication of consumer key and secret
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

def mood_on_cnn(api, predictor):
    stat = [0, 0]
    for status in tweepy.Cursor(api.user_timeline, screen_name='@cnnbrk', tweet_mode="extended").items():
        sentence_x = predictor['transfer'].transform([status.full_text])
        y_predict = predictor['estimator'].predict(sentence_x)

        stat[y_predict[0]] += 1

    return stat
```

## Step 5: Putting it all together

That is it.

```
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import tweepy

negative = [
    "Protesters who were marching from Minneapolis to St. Paul were tear gassed by police as they tried to cross the Lake Street Marshall Bridge ",
    "The National Guard has been activated in Washington, D.C. to assist police handling protests around the White House",
    "Police have been firing tear gas at the protesters near the 5th Precinct in Minneapolis, where some in the crowd have responded with projectiles of their own",
    "Texas and Colorado have activated the National Guard respond to protests",
    "The mayor of Rochester, New York, has declared a state of emergency and ordered a curfew from 9 p.m. Saturday to 7 a.m. Sunday",
    "Cleveland, Ohio, has enacted a curfew that will go into effect at 8 p.m. Saturday and last through 8 a.m. Sunday",
    "A police car appears to be on fire in Los Angeles. Police officers are holding back a line of demonstrators to prevent them from getting close to the car."
]

positive = [
    "Two NASA astronauts make history with their successful launch into space aboard a SpaceX rocket",
    "After questionable weather, officials give the all clear for the SpaceX launch",
    "NASA astronauts Bob Behnken and Doug Hurley climb aboard SpaceX's Crew Dragon spacecraft as they prepare for a mission to the International Space Station",
    "New York Gov. Andrew Cuomo signs a bill giving death benefits to families of frontline workers who died battling the coronavirus pandemic"
]

def prepare_data(positive, negative):
    data = positive + negative
    target = [0] * len(positive) + [1] * len(negative)
    return {'data': data, 'target': target}

def train_data_set(data_set):
    x_train, x_test, y_train, y_test = train_test_split(data_set['data'], data_set['target'])

    transfer = TfidfVectorizer()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)

    estimator = MultinomialNB()
    estimator.fit(x_train, y_train)

    score = estimator.score(x_test, y_test)
    print("score:\n", score)
    return {'transfer': transfer, 'estimator': estimator}

consumer_key = "REPLACE WITH YOUR KEY"
consumer_secret = "REPLACE WITH YOUR SECRET"
access_token = "REPLACE WITH YOUR TOKEN"
access_token_secret = "REPLACE WITH YOUR TOKEN SECRET"

# authentication of consumer key and secret
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

def mood_on_cnn(api, predictor):
    stat = [0, 0]
    for status in tweepy.Cursor(api.user_timeline, screen_name='@cnnbrk', tweet_mode="extended").items():
        sentence_x = predictor['transfer'].transform([status.full_text])
        y_predict = predictor['estimator'].predict(sentence_x)

        stat[y_predict[0]] += 1

    return stat

data_set = prepare_data(positive, negative)
predictor = train_data_set(data_set)

stat = mood_on_cnn(api, predictor)

print(stat)
print("Mood (0 good, 1 bad)", stat[1] / (stat[0] + stat[1]))
```

I got the following output on the day of writing this tutorial.

```
score:
 1.0
[751, 2455]
Mood (0 good, 1 bad) 0.765751715533375
```

I found that the breaking news items skew quite negative in tone, and the model's predictions reflect that.