What is the Narcissistic Personality Inventory and how does it connect to NumPy?
NumPy is an amazing library that makes analyzing data easy, especially numerical data.
In this tutorial we are going to analyze a survey with 11,000+ respondents from an interactive Narcissistic Personality Inventory (NPI) test.
Narcissism is a personality trait generally conceived of as excessive self-love. In Greek mythology Narcissus was a man who fell in love with his reflection in a pool of water.
Source: https://openpsychometrics.org/tests/NPI/
The only connection between NPI and NumPy is that we want to use NumPy to analyze the 11,000+ answers.
The dataset can be downloaded here. It consists of a comma-separated values file (CSV file for short) and a description.
Step 1: Import the dataset and explore it
NumPy has thought of this for us: loading the dataset (from the link above) takes a single line that works like magic.
import numpy as np

# This magic line loads the 11,000+ lines of data into an ndarray
data = np.genfromtxt('data.csv', delimiter=',', dtype='int')

# Skip first row (the header)
data = data[1:]

print(data)
This prints a summary of the array.
[[ 18   2   2 ... 211   1  50]
 [  6   2   2 ... 149   1  40]
 [ 27   1   2 ... 168   1  28]
 ...
 [  6   1   2 ... 447   2  33]
 [ 12   2   2 ... 167   1  24]
 [ 18   1   2 ... 291   1  36]]
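If slicing away the first row feels a bit manual, np.genfromtxt can also skip the header itself through its skip_header parameter. Here is a minimal sketch of that idea. It uses a tiny made-up CSV held in memory instead of data.csv, so the values and the shortened column list are assumptions for illustration only.

```python
import io
import numpy as np

# A tiny stand-in for data.csv (made-up values, shortened to 4 columns).
csv_text = """score,Q1,Q2,age
18,2,2,50
6,2,1,40
27,1,2,28
"""

# Loading as in the tutorial: the header row cannot be parsed as integers,
# so it is filled with placeholder values and has to be sliced away.
with_header = np.genfromtxt(io.StringIO(csv_text), delimiter=',', dtype='int')
data_sliced = with_header[1:]

# Alternative: let genfromtxt skip the header row itself.
data_skipped = np.genfromtxt(io.StringIO(csv_text), delimiter=',',
                             dtype='int', skip_header=1)

print(data_skipped)
print(np.array_equal(data_sliced, data_skipped))
```

Both approaches should give the same array. Which one you prefer is mostly a matter of taste.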
It is also a good idea to open the file in a spreadsheet and have a look at it there.
Look at the far end of the sheet too, where the elapse, gender and age columns are.
Then read the description that comes with the dataset. Here is some of it.
For questions 1-40 which choice they chose was recorded per the following key.
... [The questions Q1 ... Q40] ...
gender. Chosen from a drop down list (1=male, 2=female, 3=other; 0=none was chosen).
age. Entered as a free response. Ages below 14 have been omitted from the dataset.
-- CALCULATED VALUES --
elapse. (time submitted)-(time loaded) of the questions page in seconds.
score. = ((int) $_POST['Q1'] == 1) ... [How it is calculated]
That means we have the score, the answers to the questions, the elapsed time to answer, gender and age.
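The code in this tutorial picks columns by number (0, 41, 42 and 43). One way to keep track of which number means what is to give the positions names. The sketch below does that. The constant names are hypothetical, the positions are inferred from the indices used in this tutorial, and the array is random made-up data standing in for the real file.

```python
import numpy as np

# Column positions inferred from the indices used in this tutorial
# (hypothetical constant names, not part of the dataset or NumPy).
COL_SCORE = 0
COL_Q_FIRST, COL_Q_LAST = 1, 40
COL_ELAPSE = 41
COL_GENDER = 42
COL_AGE = 43

# Made-up stand-in for the loaded array: 3 respondents, 44 columns.
rng = np.random.default_rng(0)
data = rng.integers(1, 3, size=(3, 44))

npi_score = data[:, COL_SCORE]
answers = data[:, COL_Q_FIRST:COL_Q_LAST + 1]  # slice end is exclusive, hence + 1
elapse = data[:, COL_ELAPSE]
gender = data[:, COL_GENDER]
age = data[:, COL_AGE]

print(answers.shape)  # one row per respondent, one column per question
```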
Reading a bit further, the description says that a high score is an indicator of narcissistic traits, but that one should not conclude from the score alone that someone is a narcissist.
Step 2: Do men or women have the highest NPI score?
I’m glad you asked.
import numpy as np

data = np.genfromtxt('data.csv', delimiter=',', dtype='int')

# Skip first row
data = data[1:]

# Extract all the NPI scores (first column)
npi_score = data[:,0]

print("Average score", npi_score.mean())
print("Men average", npi_score[data[:,42] == 1].mean())
print("Women average", npi_score[data[:,42] == 2].mean())
print("None average", npi_score[data[:,42] == 0].mean())
print("Other average", npi_score[data[:,42] == 3].mean())
Before looking at the result, notice how neatly the first column is sliced out into the view npi_score. Then notice how easily you can calculate the mean of a subset by using a condition to narrow down the view.
Average score 13.29965311749533
Men average 14.195953307392996
Women average 12.081829626521191
None average 11.916666666666666
Other average 14.85
I guess you guessed it. Men score higher than women on average.
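If the conditional indexing used above looks a bit like magic, it may help to take it apart. The comparison produces an array of True and False values (a boolean mask), and indexing with that mask keeps only the entries where it is True. A minimal sketch with made-up numbers (six invented respondents, not the real data):

```python
import numpy as np

# Made-up scores and gender codes (1=male, 2=female) for six respondents.
npi_score = np.array([10, 20, 12, 16, 8, 30])
gender = np.array([1, 2, 1, 2, 1, 1])

is_male = gender == 1              # boolean array, True where gender is 1
male_scores = npi_score[is_male]   # keeps only entries where the mask is True

print(is_male)
print(male_scores)
print("Mean:", male_scores.mean())
print("Group size:", is_male.sum())  # True counts as 1, so sum() counts the group
```

Counting the group size alongside the mean can be useful, because an average over a handful of respondents says less than an average over thousands.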
Step 3: Is there a correlation between age and NPI score?
I wonder about that too.
How can we figure that out? Wait, let’s ask our new friend NumPy.
import numpy as np
import matplotlib.pyplot as plt

data = np.genfromtxt('data.csv', delimiter=',', dtype='int')

# Skip first row
data = data[1:]

# Extract all the NPI scores (first column)
npi_score = data[:,0]
age = data[:,43]

# Some age values are not real, so we adjust them to 0
age[age>100] = 0

# Scatter plot them all with alpha=0.05
plt.scatter(age, npi_score, color='r', alpha=0.05)
plt.show()
That looks promising. But can we just conclude that younger people score higher on the NPI?
What if most respondents are young? Then the picture would be denser at the younger end (15-30) simply because there are more dots there. The danger with trusting your eyes is that you jump to conclusions.
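One way to check whether the respondents really are mostly young is to count them per age bracket with np.histogram. The sketch below uses made-up ages that are deliberately skewed towards the young end, and the bracket edges are an arbitrary choice, so treat it as an illustration rather than a result from the real dataset.

```python
import numpy as np

# Made-up ages, deliberately skewed towards young respondents.
rng = np.random.default_rng(42)
age = np.concatenate([rng.integers(14, 31, size=800),
                      rng.integers(31, 81, size=200)])

# Bracket edges are an arbitrary choice for this sketch.
edges = np.array([14, 20, 30, 40, 50, 60, 80])
counts, _ = np.histogram(age, bins=edges)

# Each bracket includes its lower edge and excludes its upper edge,
# except the last one, which includes both.
for low, high, n in zip(edges[:-1], edges[1:], counts):
    print(f"{low}-{high}: {n} respondents")
```

If the counts are this lopsided, a denser cloud of dots at the young end of a scatter plot is expected no matter how age and score relate.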
Luckily, NumPy can help us there as well.
Correlation of NPI score and age:
[[ 1.         -0.23414633]
 [-0.23414633  1.        ]]
What does that mean? Well, looking at the documentation of np.corrcoef():
Return Pearson product-moment correlation coefficients.
Source: https://numpy.org/doc/stable/reference/generated/numpy.corrcoef.html
The correlation is negative, which means that the younger the respondent, the higher the NPI score tends to be. Values between 0.0 and -0.3 are considered low, so the correlation is weak.
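The call that produced the matrix is not shown here. Judging by the code in Step 4, it was presumably np.corrcoef(npi_score, age). Here is a minimal sketch of how that works and how to pull the single coefficient out of the 2x2 matrix. The data is made up, with a negative trend built in on purpose, so the number it prints is not the one from the survey.

```python
import numpy as np

# Made-up data with a built-in negative trend: older means lower score, plus noise.
rng = np.random.default_rng(1)
age = rng.integers(14, 80, size=1000)
npi_score = 25 - 0.15 * age + rng.normal(0, 5, size=1000)

# corrcoef returns a symmetric 2x2 matrix: the diagonal is each variable
# with itself (always 1), the off-diagonal entries are score versus age.
corr_matrix = np.corrcoef(npi_score, age)
r = corr_matrix[0, 1]

print("Correlation of NPI score and age:")
print(corr_matrix)
print("r =", r)
```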
Is the Pearson product-moment correlation the correct one to use?
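One way to start exploring that question: Pearson measures how well a straight line describes the relationship. A common rank-based alternative is the Spearman correlation, which is the Pearson correlation of the ranks and therefore only cares whether one value goes up when the other does. The sketch below builds it from NumPy alone. The helper names average_ranks and spearman are hypothetical (they are not NumPy functions), and the data is a made-up curved relationship.

```python
import numpy as np

def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    values = np.asarray(values)
    order = np.argsort(values, kind="stable")
    sorted_values = values[order]
    # For each distinct value: where its run starts in sorted order, and its length.
    _, first_index, counts = np.unique(sorted_values, return_index=True,
                                       return_counts=True)
    group_rank = first_index + (counts + 1) / 2.0
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.repeat(group_rank, counts)
    return ranks

def spearman(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    return np.corrcoef(average_ranks(x), average_ranks(y))[0, 1]

# Made-up example: y always grows with x, but not along a straight line.
x = np.arange(1, 11)
y = x ** 3

print("Pearson: ", np.corrcoef(x, y)[0, 1])
print("Spearman:", spearman(x, y))
```

Whether a rank-based measure suits the NPI data better is a judgment call. This only shows that the two can differ when the relationship is not a straight line.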
Step 4: (Optional) Let’s try to see if there is a correlation between NPI score and time elapsed
Same code, different column.
import numpy as np
import matplotlib.pyplot as plt

data = np.genfromtxt('data.csv', delimiter=',', dtype='int')

# Skip first row
data = data[1:]

# Extract all the NPI scores (first column)
npi_score = data[:,0]
elapse = data[:,41]
elapse[elapse > 2000] = 2000

# Scatter plot them all with alpha=0.05
plt.scatter(elapse, npi_score, color='r', alpha=0.05)
plt.show()
Again, it is tempting to conclude something here. We need to remember that the mean score is around 13, hence most of the data will be around there.
We can use the same calculation as before.
print("Correlation of NPI score and time elapse:")
print(np.corrcoef(npi_score, elapse))
Correlation of NPI score and time elapse:
[[1.        0.0147711]
 [0.0147711 1.       ]]
Hence, there is close to no correlation here.
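A side note on the capping line in this step, elapse[elapse > 2000] = 2000. Slicing a column out of a NumPy array gives a view, so assigning through it also changes the underlying data array. That is fine here, but worth knowing. A sketch of the difference, using a made-up four-row array and np.clip as one possible alternative that leaves the input untouched:

```python
import numpy as np

# Made-up stand-in for the dataset: 4 respondents, 2 columns (score, elapse).
data = np.array([[18, 211], [6, 5000], [27, 168], [12, 90000]])

# Slicing a column gives a view, so assigning through it also changes data.
elapse_view = data[:, 1]
elapse_view[elapse_view > 2000] = 2000
print(data[:, 1])

# np.clip returns a new capped array and leaves its input untouched.
data2 = np.array([[18, 211], [6, 5000], [27, 168], [12, 90000]])
elapse_capped = np.clip(data2[:, 1], a_min=None, a_max=2000)
print(elapse_capped)
print(data2[:, 1])
```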
Use the scientific tools to draw conclusions. Do not rely on your eyes to determine whether there is a correlation.
The above gives an idea of how easy it is to work with numerical data in NumPy.