How To Get Started with a Predictive Machine Learning Program in Python in 5 Easy Steps

What will you learn?

  • How to predict from a dataset with Machine Learning
  • How to implement that in Python
  • How to get data from Twitter
  • How to install the necessary libraries to do Machine Learning in Python

Step 1: Install the necessary libraries

The sklearn library is a set of simple and efficient tools for predictive data analysis.

You can install it by typing the following on your command line.

pip install sklearn

It will most likely also install a couple of other required libraries.

Collecting sklearn
  Downloading sklearn-0.0.tar.gz (1.1 kB)
Collecting scikit-learn
  Downloading scikit_learn-0.23.1-cp38-cp38-macosx_10_9_x86_64.whl (7.2 MB)
     |████████████████████████████████| 7.2 MB 5.0 MB/s 
Collecting numpy>=1.13.3
  Downloading numpy-1.18.4-cp38-cp38-macosx_10_9_x86_64.whl (15.2 MB)
     |████████████████████████████████| 15.2 MB 12.6 MB/s 
Collecting joblib>=0.11
  Downloading joblib-0.15.1-py3-none-any.whl (298 kB)
     |████████████████████████████████| 298 kB 8.1 MB/s 
Collecting threadpoolctl>=2.0.0
  Downloading threadpoolctl-2.1.0-py3-none-any.whl (12 kB)
Collecting scipy>=0.19.1
  Downloading scipy-1.4.1-cp38-cp38-macosx_10_9_x86_64.whl (28.8 MB)
     |████████████████████████████████| 28.8 MB 5.8 MB/s 
Using legacy setup.py install for sklearn, since package 'wheel' is not installed.
Installing collected packages: numpy, joblib, threadpoolctl, scipy, scikit-learn, sklearn
    Running setup.py install for sklearn ... done
Successfully installed joblib-0.15.1 numpy-1.18.4 scikit-learn-0.23.1 scipy-1.4.1 sklearn-0.0 threadpoolctl-2.1.0

As you can see, in my installation it also pulled in numpy, joblib, threadpoolctl, scipy, and scikit-learn.

Step 2: The dataset

The machine learning algorithm needs a dataset to train on. To keep this tutorial simple, I only used a limited set. I looked through the top tweets from CNN Breaking and categorised them into positive and negative tweets (I know the labels can be subjective).

negative = [
    "Protesters who were marching from Minneapolis to St. Paul were tear gassed by police as they tried to cross the Lake Street Marshall Bridge ",
    "The National Guard has been activated in Washington, D.C. to assist police handling protests around the White House",
    "Police have been firing tear gas at the protesters near the 5th Precinct in Minneapolis, where some in the crowd have responded with projectiles of their own",
    "Texas and Colorado have activated the National Guard respond to protests",
    "The mayor of Rochester, New York, has declared a state of emergency and ordered a curfew from 9 p.m. Saturday to 7 a.m. Sunday",
    "Cleveland, Ohio, has enacted a curfew that will go into effect at 8 p.m. Saturday and last through 8 a.m. Sunday",
    "A police car appears to be on fire in Los Angeles. Police officers are holding back a line of demonstrators to prevent them from getting close to the car."
            ]
positive = [
    "Two NASA astronauts make history with their successful launch into space aboard a SpaceX rocket",
    "After questionable weather, officials give the all clear for the SpaceX launch",
    "NASA astronauts Bob Behnken and Doug Hurley climb aboard SpaceX's Crew Dragon spacecraft as they prepare for a mission to the International Space Station",
    "New York Gov. Andrew Cuomo signs a bill giving death benefits to families of frontline workers who died battling the coronavirus pandemic"
]

Step 3: Train the model

The data needs to be labelled before it can be fed to the training algorithm. Hence, we build the required data set structure, mapping positive tweets to target 0 and negative tweets to target 1.

def prepare_data(positive, negative):
    data = positive + negative
    target = [0]*len(positive) + [1]*len(negative)
    return {'data': data, 'target': target}

The actual training is done by using the sklearn library.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
def train_data_set(data_set):
    x_train, x_test, y_train, y_test = train_test_split(data_set['data'], data_set['target'])
    transfer = TfidfVectorizer()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    estimator = MultinomialNB()
    estimator.fit(x_train, y_train)
    score = estimator.score(x_test, y_test)
    print("score:\n", score)
    return {'transfer': transfer, 'estimator': estimator}
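Before wiring in Twitter, you can check that the training pieces work on a toy example. The sentences below are made-up stand-ins for the tweet lists above, not part of the original dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up mini dataset standing in for the tweet lists above
positive = ["great successful launch", "officials give the all clear"]
negative = ["police fired tear gas", "a state of emergency was declared"]

# Same structure as prepare_data: 0 = positive, 1 = negative
data = positive + negative
target = [0] * len(positive) + [1] * len(negative)

# Fit the vectorizer and classifier on the full toy set (too small to split)
transfer = TfidfVectorizer()
estimator = MultinomialNB()
estimator.fit(transfer.fit_transform(data), target)

# Classify an unseen sentence (label 0 = positive, 1 = negative)
x_new = transfer.transform(["a successful launch was celebrated"])
print(estimator.predict(x_new)[0])
```

With real data you would keep the train/test split from train_data_set; it is skipped here only because the toy set is so small.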

Step 4: Get some tweets from CNN Breaking and predict

In order for this step to work, you need to set up tokens for the Twitter API. You can follow this tutorial in order to do that.

When you have that you can use the following code to get it running.

import tweepy

def setup_twitter():
    consumer_key = "REPLACE WITH YOUR KEY"
    consumer_secret = "REPLACE WITH YOUR SECRET"
    access_token = "REPLACE WITH YOUR TOKEN"
    access_token_secret = "REPLACE WITH YOUR TOKEN SECRET"
    # authentication of consumer key and secret
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    # authentication of access token and secret
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)
    return api

def mood_on_cnn(api, predictor):
    stat = [0, 0]
    for status in tweepy.Cursor(api.user_timeline, screen_name='@cnnbrk', tweet_mode="extended").items():
        sentence_x = predictor['transfer'].transform([status.full_text])
        y_predict = predictor['estimator'].predict(sentence_x)
        stat[y_predict[0]] += 1
    return stat

Step 5: Putting it all together

That is it. The full program below puts all the pieces together.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import tweepy

negative = [
    "Protesters who were marching from Minneapolis to St. Paul were tear gassed by police as they tried to cross the Lake Street Marshall Bridge ",
    "The National Guard has been activated in Washington, D.C. to assist police handling protests around the White House",
    "Police have been firing tear gas at the protesters near the 5th Precinct in Minneapolis, where some in the crowd have responded with projectiles of their own",
    "Texas and Colorado have activated the National Guard respond to protests",
    "The mayor of Rochester, New York, has declared a state of emergency and ordered a curfew from 9 p.m. Saturday to 7 a.m. Sunday",
    "Cleveland, Ohio, has enacted a curfew that will go into effect at 8 p.m. Saturday and last through 8 a.m. Sunday",
    "A police car appears to be on fire in Los Angeles. Police officers are holding back a line of demonstrators to prevent them from getting close to the car."
            ]
positive = [
    "Two NASA astronauts make history with their successful launch into space aboard a SpaceX rocket",
    "After questionable weather, officials give the all clear for the SpaceX launch",
    "NASA astronauts Bob Behnken and Doug Hurley climb aboard SpaceX's Crew Dragon spacecraft as they prepare for a mission to the International Space Station",
    "New York Gov. Andrew Cuomo signs a bill giving death benefits to families of frontline workers who died battling the coronavirus pandemic"
]

def prepare_data(positive, negative):
    data = positive + negative
    target = [0]*len(positive) + [1]*len(negative)
    return {'data': data, 'target': target}

def train_data_set(data_set):
    x_train, x_test, y_train, y_test = train_test_split(data_set['data'], data_set['target'])
    transfer = TfidfVectorizer()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    estimator = MultinomialNB()
    estimator.fit(x_train, y_train)
    score = estimator.score(x_test, y_test)
    print("score:\n", score)
    return {'transfer': transfer, 'estimator': estimator}

def setup_twitter():
    consumer_key = "REPLACE WITH YOUR KEY"
    consumer_secret = "REPLACE WITH YOUR SECRET"
    access_token = "REPLACE WITH YOUR TOKEN"
    access_token_secret = "REPLACE WITH YOUR TOKEN SECRET"
    # authentication of consumer key and secret
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    # authentication of access token and secret
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)
    return api

def mood_on_cnn(api, predictor):
    stat = [0, 0]
    for status in tweepy.Cursor(api.user_timeline, screen_name='@cnnbrk', tweet_mode="extended").items():
        sentence_x = predictor['transfer'].transform([status.full_text])
        y_predict = predictor['estimator'].predict(sentence_x)
        stat[y_predict[0]] += 1
    return stat

data_set = prepare_data(positive, negative)
predictor = train_data_set(data_set)
api = setup_twitter()
stat = mood_on_cnn(api, predictor)
print(stat)
print("Mood (0 good, 1 bad)", stat[1]/(stat[0] + stat[1]))

I got the following output on the day of writing this tutorial.

score:
 1.0
[751, 2455]
Mood (0 good, 1 bad) 0.765751715533375

I found that the breaking news items are quite negative in tone, and the model's predictions reflect that.

How to Reformat a Text File in Python

The input file and the desired output

The task is to reformat the following input format.

Computing
“I do not fear computers. I fear lack of them.”
— Isaac Asimov
“A computer once beat me at chess, but it was no match for me at kick boxing.”
— Emo Philips
“Computer Science is no more about computers than astronomy is about telescopes.”
— Edsger W. Dijkstra

To the following output format.

“I do not fear computers. I fear lack of them.” (Isaac Asimov)
“A computer once beat me at chess, but it was no match for me at kick boxing.” (Emo Philips)
“Computer Science is no more about computers than astronomy is about telescopes.” (Edsger W. Dijkstra)

The Python code doing the job

The following simple code did the reformatting in less than a second for a file containing several hundred quotes.

# Read all lines from the input file
with open("input") as file:
    content = file.readlines()

lines = []
next_line = ""
for line in content:
    line = line.strip()
    # Skip empty lines and one-word category headers like "Computing"
    if len(line) > 0 and len(line.split()) > 1:
        if line[0] == '“':
            next_line = line
        elif line[0] == '—':
            # line[2:] skips the dash and the following space
            next_line += " (" + line[2:] + ")"
            lines.append(next_line)
            next_line = ""

# Write the joined quotes to the output file
with open("output", "w") as file:
    for line in lines:
        file.write(line + "\n")
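The parsing logic can also be wrapped in a function, which makes it easy to try out on the sample quotes from above:

```python
def reformat_quotes(content):
    """Join alternating quote / attribution lines into single lines."""
    lines = []
    next_line = ""
    for line in content:
        line = line.strip()
        # Skip empty lines and one-word category headers
        if len(line) > 0 and len(line.split()) > 1:
            if line[0] == '“':
                next_line = line
            elif line[0] == '—':
                # line[2:] skips the dash and the following space
                next_line += " (" + line[2:] + ")"
                lines.append(next_line)
                next_line = ""
    return lines

sample = [
    "Computing",
    "“I do not fear computers. I fear lack of them.”",
    "— Isaac Asimov",
]
print(reformat_quotes(sample))
# ['“I do not fear computers. I fear lack of them.” (Isaac Asimov)']
```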

How to Fetch CNN Breaking Tweets and Make Simple Statistics Automated with Python

What will we cover

  • We will use the tweepy library
  • Read the newest tweets from CNN Breaking
  • Make simple word statistics on the news tweets
  • See if we can learn anything from it

Preliminaries

You need access tokens for the Twitter API, set up as described in Step 4 of the previous tutorial.

The Code that does the magic

import tweepy
# personal details insert your key, secret, token and token_secret here
consumer_key = ""
consumer_secret = ""
access_token = ""
access_token_secret = ""
# authentication of consumer key and secret
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
# authentication of access token and secret
auth.set_access_token(access_token, access_token_secret)
# Creation of the actual interface, using authentication
api = tweepy.API(auth)
# Use a dictionary to count the appearances of words
stat = {}
# Read the tweets from @cnnbrk and make the statistics
for status in tweepy.Cursor(api.user_timeline, screen_name='@cnnbrk', tweet_mode="extended").items():
    for word in status.full_text.split():
        if word in stat:
            stat[word] += 1
        else:
            stat[word] = 1
# Let's just print roughly the top 10
top = 10
# Let us sort them on the value in reverse order to get the highest first
for word in sorted(stat, key=stat.get, reverse=True):
    # leave out all the small words
    if len(word) > 6:
        print(word, stat[word])
        top -= 1
        if top < 0:
            break

The result of the above (done May 30th, 2020)

coronavirus 441
@CNNPolitics: 439
President 380
updates: 290
impeachment 148
officials 130
according 100
Trump's 98
Democratic 96
against 88
Department 83

The coronavirus is still the biggest breaking news subject of the day.
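As a side note, the counting loop above can be expressed more compactly with collections.Counter from the standard library. A sketch on made-up sample texts (not actual tweets):

```python
from collections import Counter

# Made-up sample texts standing in for the fetched tweets
texts = [
    "Breaking: coronavirus updates from officials",
    "President responds to coronavirus updates",
]

# Count every word across all texts
stat = Counter(word for text in texts for word in text.split())

# Show words longer than 6 characters, most common first
for word, count in stat.most_common():
    if len(word) > 6:
        print(word, count)
```

Counter.most_common also replaces the manual sorted(..., reverse=True) step.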

Next steps

  • It should be extended to have a more intelligent interpretation of the data.

Understand the Password Validation on Mac in 3 Steps – Implement the Validation in Python

What will you learn?

  • The password validation process in Mac
  • How to extract the password validation values
  • Implementing the check in Python
  • Understand why the values are as they are
  • The importance of using a salt value with the password
  • Learn why the hash function is iterated multiple times

The Mac password validation process

Every time you log into your Mac it needs to verify that you used the correct password before giving you access.

The validation process reads hash, salt and iteration values from storage and uses them to validate your password.

The 3 steps below help you locate your values and understand how the validation process works.

Step 1: Locating and extracting the hash, salt and iteration values

You need to use a terminal to extract the values. By using the following command you should get it printed in a readable way.

sudo defaults read /var/db/dslocal/nodes/Default/users/<username>.plist ShadowHashData | tr -dc 0-9a-f | xxd -r -p | plutil -convert xml1 - -o -

Replace <username> with your actual user name. The command will prompt you for your admin password.

This should result in an output similar to this.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>SALTED-SHA512-PBKDF2</key>
	<dict>
		<key>entropy</key>
		<data>
                1meJW2W6Zugz3rKm/n0yysV+5kvTccA7EuGejmyIX8X/MFoPxmmbCf3BE62h
                6wGyWk/TXR7pvXKg\njrWjZyI+Fc3aKfv1LNQ0/Qrod3lVJcWd9V6Ygt+MYU
                8Eptv3uwDcYf6Z5UuF+Hg67rpoDAWhJrC1\nPEfL3vcN7IoBqC5NkIU=
		</data>
		<key>iterations</key>
		<integer>45454</integer>
		<key>salt</key>
		<data>
		6VuJKkHVTdDelbNMPBxzw7INW2NkYlR/LoW4OL7kVAI=
		</data>
	</dict>
</dict>
</plist>

Step 2: Understand the output

The output consists of four pieces.

  • Key value: SALTED-SHA512-PBKDF2
  • Entropy: Base64-encoded data
  • Number of iterations: 45454
  • Salt: Base64-encoded data

The key value tells you which algorithm is used (SHA512) and how it is applied (PBKDF2).

The entropy is the actual output of the algorithm named by the key value. This value is not an encryption of the password: you cannot recover the password from it, but you can validate whether a given password produces the same value.

Confused? I know. But you will understand when we implement the solution.

The number of iterations, here 45454, is the number of times the hash function is applied. But why call the hash function multiple times? Follow along and you will see.

Finally, we have the salt value, which ensures that you cannot determine the password from the entropy value alone. This will also be explained with an example below.
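To see what the base64 fields actually contain, you can decode them and check their lengths. A small sketch using the values from the plist output above:

```python
import base64

# Values copied from the plist output above
entropy_b64 = (
    "1meJW2W6Zugz3rKm/n0yysV+5kvTccA7EuGejmyIX8X/MFoPxmmbCf3BE62h"
    "6wGyWk/TXR7pvXKgjrWjZyI+Fc3aKfv1LNQ0/Qrod3lVJcWd9V6Ygt+MYU"
    "8Eptv3uwDcYf6Z5UuF+Hg67rpoDAWhJrC1PEfL3vcN7IoBqC5NkIU="
)
salt_b64 = "6VuJKkHVTdDelbNMPBxzw7INW2NkYlR/LoW4OL7kVAI="

entropy = base64.b64decode(entropy_b64)
salt = base64.b64decode(salt_b64)

# The entropy is 128 bytes, matching the dklen argument used with
# pbkdf2_hmac in the next step; the salt is 32 random bytes.
print(len(entropy))  # 128
print(len(salt))     # 32
```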

Step 3: Validating the password with Python

Before we explain the above, we need to have Python perform the password check.

import hashlib
import base64
iterations = 45454
salt = base64.b64decode("6VuJKkHVTdDelbNMPBxzw7INW2NkYlR/LoW4OL7kVAI=".encode())
password = "password".encode()
value = hashlib.pbkdf2_hmac('sha512', password, salt, iterations, 128)
print(base64.b64encode(value))

This will generate the following output.

b'1meJW2W6Zugz3rKm/n0yysV+5kvTccA7EuGejmyIX8X/MFoPxmmbCf3BE62h6wGyWk/TXR7pvXKgjrWjZyI+Fc3aKfv1LNQ0/Qrod3lVJcWd9V6Ygt+MYU8Eptv3uwDcYf6Z5UuF+Hg67rpoDAWhJrC1PEfL3vcN7IoBqC5NkIU='

That matches the entropy content of the file.

So what happened in the above Python code?

We use the hashlib library to do all the work for us. It takes the algorithm (sha512), the password (yes, I used the password ‘password’ in this example; you should not use that for anything you want to keep secret), the salt, and the number of iterations.

Now we are ready to explore the questions.

Why use a Hash value and not an encryption of the password?

If the password was encrypted, then an admin on your network would be able to decrypt it and misuse it.

Hence, to keep it safe from that, an iterated hash value of your password is used.

A hash function is a one-way function that can map any input to a fixed sized output. A hash function will have these important properties in regards to passwords.

  • It will always map the same input to the same output. Hence, your password will always be mapped to the same value.
  • A small change in the input will give a big change in output. Hence, if you change one character in the password (say, from ‘password’ to ‘passward’) the hash value will be totally different.
  • It is not easy to find the given input to a hash value. Hence, it is not easily feasible to find your password given the hash value.
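The second property, sometimes called the avalanche effect, is easy to demonstrate with hashlib (using plain SHA-512 here, without the PBKDF2 wrapper):

```python
import hashlib

# Changing one character ('o' -> 'a') gives a completely different digest
h1 = hashlib.sha512(b"password").hexdigest()
h2 = hashlib.sha512(b"passward").hexdigest()
print(h1[:16])
print(h2[:16])

# Count positions where the hex digits happen to coincide
same = sum(a == b for a, b in zip(h1, h2))
print(same, "of", len(h1), "hex characters match")
```

Only around one in sixteen hex positions will coincide by chance, so the two digests look completely unrelated.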

Why use multiple iterations of the hash function?

To slow it down.

Basically, the way you crack passwords is by trying all possibilities. You hash ‘a’ and check whether it matches. Then you try ‘b’, and so on.

If that process is slow, you decrease the odds of someone finding your password.

To demonstrate this we can use the cProfile library to investigate the difference in run-time. First let us try it with the 45454 iterations in the hash function.

import hashlib
import base64
import cProfile

def crack_password(entropy, iterations, salt):
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    for c1 in alphabet:
        for c2 in alphabet:
            password = str.encode(c1 + c2)
            value = base64.b64encode(hashlib.pbkdf2_hmac('sha512', password, salt, iterations, 128))
            if value == entropy:
                return password

entropy = "kRqabDBsvkyAhpzzVWJtdqbtqgkgNPwr5gqWG6jvw73hxc7CCvC4E33WyR5bxKmAXG5vAG9/ue+DC7BYLHRfOTE/dLKSMdpE9RFH7ZlTp7GHdH5b5vaqQCcKlXAwkky786zvpucDIgGGTOyw6kKB5hqIXLX9chDvcPQksVrjmUs=".encode()
iterations = 45454
salt = base64.b64decode("6VuJKkHVTdDelbNMPBxzw7INW2NkYlR/LoW4OL7kVAI=".encode())
cProfile.run("crack_password(entropy, iterations, salt)")

This results in the following run time.

        1    0.011    0.011   58.883   58.883 ShadowFile.py:6(crack_password)

About 1 minute.

Now we change the number of iterations to 1.

import hashlib
import base64
import cProfile

def crack_password(entropy, iterations, salt):
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    for c1 in alphabet:
        for c2 in alphabet:
            password = str.encode(c1 + c2)
            value = base64.b64encode(hashlib.pbkdf2_hmac('sha512', password, salt, iterations, 128))
            if value == entropy:
                return password

entropy = "kRqabDBsvkyAhpzzVWJtdqbtqgkgNPwr5gqWG6jvw73hxc7CCvC4E33WyR5bxKmAXG5vAG9/ue+DC7BYLHRfOTE/dLKSMdpE9RFH7ZlTp7GHdH5b5vaqQCcKlXAwkky786zvpucDIgGGTOyw6kKB5hqIXLX9chDvcPQksVrjmUs=".encode()
iterations = 1
salt = base64.b64decode("6VuJKkHVTdDelbNMPBxzw7INW2NkYlR/LoW4OL7kVAI=".encode())
cProfile.run("crack_password(entropy, iterations, salt)")

I guess you are not surprised it takes less than 1 second.

        1    0.002    0.002    0.010    0.010 ShadowFile.py:6(crack_password)

Hence, you can test far more passwords when the hash is only iterated once.
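A quick back-of-the-envelope check with the two profile timings above shows the slowdown factor:

```python
# Timings taken from the two cProfile runs above
time_45454 = 58.883  # seconds with 45454 iterations
time_1 = 0.010       # seconds with a single iteration
print(round(time_45454 / time_1))  # 5888
# The factor is below 45454 because the fixed per-call overhead
# dominates once only a single iteration is run.
```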

Why use a Salt?

This is interesting.

Well, say that another user used the password ‘password’ and there was no salt.

import hashlib
import base64
iterations = 45454
salt = base64.b64decode("".encode())
password = "password".encode()
value = hashlib.pbkdf2_hmac('sha512', password, salt, iterations, 128)
print(base64.b64encode(value))
b'kRqabDBsvkyAhpzzVWJtdqbtqgkgNPwr5gqWG6jvw73hxc7CCvC4E33WyR5bxKmAXG5vAG9/ue+DC7BYLHRfOTE/dLKSMdpE9RFH7ZlTp7GHdH5b5vaqQCcKlXAwkky786zvpucDIgGGTOyw6kKB5hqIXLX9chDvcPQksVrjmUs='

Then you would get the same hash value.

Hence, a new random salt is used for each user password.
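A fresh salt is typically drawn from a cryptographically secure random source. A minimal sketch using os.urandom (the 32-byte length matches the salt above):

```python
import hashlib
import os

# Each new user password gets a fresh 32-byte random salt
password = "password".encode()
salt_a = os.urandom(32)
salt_b = os.urandom(32)

value_a = hashlib.pbkdf2_hmac('sha512', password, salt_a, 45454, 128)
value_b = hashlib.pbkdf2_hmac('sha512', password, salt_b, 45454, 128)

# Same password, different salts: the stored hashes do not match
print(value_a != value_b)  # True (with overwhelming probability)
```

This is why two users with the same password end up with different entries in the shadow file.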

How to proceed from here?

If you want to crack passwords, then I would recommend you use Hashcat.