Data Management And Visualization Week-3 Assignment

Sep 24, 2020

Hello everyone, I am writing this blog post as part of the week-3 assignment for the Coursera course Data Management and Visualization. Each week's assignment is to write a blog post presenting the work done during that week.

The week-3 assignment is about running a program in Python (Spyder IDE) to identify and discard the null values present in the dataset and to categorise the variables.

The task is to load the dataset and display the variables I decided to work on in week 1, showing which values each variable takes and how many times each value occurs.

1) Step 1: Make and implement data management decisions for the variables you selected.

This task involves removing missing data, coding in valid data, recoding variables, creating secondary variables, and binning or grouping variables.

As I had selected MORPHOLOGY_EJECTA_1, LATITUDE_CIRCLE_IMAGE and LONGITUDE_CIRCLE_IMAGE as my dataset variables, I first prepared a separate codebook for them.

MORPHOLOGY_EJECTA_1 has empty values for several craters, so I replaced those with NaN. Among the remaining craters, some morphologies have a single identifying type and others have more than one (the types are separated by '/'). So I divided them into 2 groups: one with a single identifying type and one with more than one identifying type.
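The morphology step described above can be sketched on a small toy table. This is only an illustration: the five values below are made up for the example, and it assumes that a missing morphology is stored as a single space.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values; the real column comes from marscrater_pds.csv.
toy = pd.DataFrame(
    {"MORPHOLOGY_EJECTA_1": ["Rd", " ", "SLEPS/SLERS", "Rd/MLERS", " "]}
)

# Blank entries (assumed to be a single space) become NaN so they can be dropped.
toy["MORPHOLOGY_EJECTA_1"] = toy["MORPHOLOGY_EJECTA_1"].replace(" ", np.nan)
labelled = toy.dropna(subset=["MORPHOLOGY_EJECTA_1"]).copy()

# 1 = more than one identifying type (contains "/"), 0 = a single type.
labelled["MORPH_CATEGORY_1"] = (
    labelled["MORPHOLOGY_EJECTA_1"].str.contains("/", regex=False).astype(int)
)
print(labelled)
```

The two blank rows disappear, and the remaining three rows get a 0 or 1 flag depending on whether the morphology contains a '/'.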

For LATITUDE_CIRCLE_IMAGE and LONGITUDE_CIRCLE_IMAGE, I first rounded the values to 1 decimal place. Then I divided latitude into 2 parts: one for the southern hemisphere, (-90, 0], and one for the northern hemisphere, (0, 90]. I also divided longitude into 2 groups: (-180, 0] and (0, 180].
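Here is a small sketch of the rounding and grouping, using made-up coordinates rather than the crater data. It also shows one thing worth knowing about pandas.cut: the bins are closed on the right by default, so a value equal to the lowest edge falls outside every bin and becomes NaN.

```python
import pandas as pd

# Hypothetical coordinates, not taken from the crater dataset.
toy = pd.DataFrame(
    {
        "LATITUDE_CIRCLE_IMAGE": [-45.26, 12.04, 0.0, 88.88],
        "LONGITUDE_CIRCLE_IMAGE": [-180.0, -10.52, 0.04, 179.96],
    }
)

# Round both columns to 1 decimal place.
toy = toy.round(1)

# Two bins each: (-90, 0] / (0, 90] and (-180, 0] / (0, 180].
toy["LATITUDE_GRP"] = pd.cut(toy["LATITUDE_CIRCLE_IMAGE"], [-90, 0, 90])
toy["LONGITUDE_GRP"] = pd.cut(toy["LONGITUDE_CIRCLE_IMAGE"], [-180, 0, 180])

# A latitude of exactly 0.0 lands in (-90, 0]; a longitude of exactly -180.0
# is not inside (-180, 0], so its group is NaN.
print(toy)
```

If the lowest edge should be kept, pandas.cut accepts include_lowest=True, which closes the first interval on the left as well.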

2) Step 2: Run frequency distributions for your chosen variables and select columns, and possibly rows.

I have computed the frequency distributions for each of these 3 variables and also displayed the percentage values for the grouped MORPHOLOGY_EJECTA_1 variable (MORPH_CATEGORY_1).
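As a minimal sketch of what a frequency distribution with percentages looks like, here is a toy example. The eight flags are invented for illustration, and value_counts is just one way to get the counts; the assignment code uses groupby(...).size() instead.

```python
import pandas as pd

# Hypothetical flags: 0 = single identifying type, 1 = more than one type.
flags = pd.Series([0, 0, 1, 0, 1, 0, 0, 0], name="MORPH_CATEGORY_1")

# How many times each value occurs.
counts = flags.value_counts().sort_index()

# The same distribution as percentages of the non-missing values.
percentages = flags.value_counts(normalize=True).sort_index() * 100

print(counts)
print(percentages)
```

Note that normalize=True divides by the number of non-missing values in the series itself, so the percentages always add up to 100.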

Below is the code I have implemented:

# import the libraries
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
from IPython.display import display

# load the dataset
data = pandas.read_csv('marscrater_pds.csv', low_memory=False)
pandas.set_option('display.float_format', lambda x: '%f' % x)

# round latitude and longitude to 1 decimal place (overwrites the original columns)
data['LATITUDE_CIRCLE_IMAGE'] = data.LATITUDE_CIRCLE_IMAGE.round(1)
data['LONGITUDE_CIRCLE_IMAGE'] = data.LONGITUDE_CIRCLE_IMAGE.round(1)

# any crater with no designated morphology (a blank entry) is replaced with NaN
data['MORPHOLOGY_EJECTA_1'] = data['MORPHOLOGY_EJECTA_1'].replace(' ', numpy.nan)
data['MORPHOLOGY_EJECTA_1'] = data['MORPHOLOGY_EJECTA_1'].astype('category')

def identifysingletype(cratertype):
    # 1 = more than one identifying type (contains '/'), 0 = a single type
    if '/' in cratertype:
        return 1
    else:
        return 0

print("We will label and filter out craters")

# keep only craters with a morphology; .copy() avoids SettingWithCopyWarning
data2 = data.dropna(subset=['MORPHOLOGY_EJECTA_1']).copy()
data2['MORPH_CATEGORY_1'] = data2['MORPHOLOGY_EJECTA_1'].apply(identifysingletype)

print("The count of MORPH_CATEGORY_1 variable is shown as below:")
c3 = data2.groupby("MORPH_CATEGORY_1").size()
print(c3)

print("The percentage of MORPH_CATEGORY_1 variable is as below:")
# note: the denominator is the full dataset (data), not only the labelled craters (data2)
p3 = data2.groupby("MORPH_CATEGORY_1").size() * 100 / len(data)
print(p3)

# split latitude into southern (-90, 0] and northern (0, 90] hemispheres
data2['LATITUDE_GRP'] = pandas.cut(data2.LATITUDE_CIRCLE_IMAGE, [-90, 0, 90])
c4 = data2.groupby("LATITUDE_GRP").size()
print(c4)

# split longitude into (-180, 0] and (0, 180]
data2['LONGITUDE_GRP'] = pandas.cut(data2.LONGITUDE_CIRCLE_IMAGE, [-180, 0, 180])
c5 = data2.groupby("LONGITUDE_GRP").size()
print(c5)

OUTPUT:

We will label and filter out craters
The count of MORPH_CATEGORY_1 variable is shown as below:
MORPH_CATEGORY_1
0    41241
1     3384
dtype: int64
The percentage of MORPH_CATEGORY_1 variable is as below:
MORPH_CATEGORY_1
0   10.730259
1    0.880464
dtype: float64
LATITUDE_GRP
(-90, 0]    25881
(0, 90]     18744
dtype: int64
LONGITUDE_GRP
(-180, 0]    19681
(0, 180]     24939
dtype: int64
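A quick arithmetic check helps when reading the output above. The percentages (about 10.7% and 0.9%) do not add up to 100 because the code divides by len(data), the full dataset, rather than by the number of craters that have a morphology. The sketch below uses only numbers copied from the output; the variable names are my own.

```python
# Counts and one percentage copied from the output above.
single_type, multi_type = 41241, 3384
pct_single = 10.730259

# Craters that have a morphology label.
labelled = single_type + multi_type

# Size of the denominator that the printed percentage implies.
implied_total = single_type * 100 / pct_single

# Shares among the labelled craters only.
share_single = single_type * 100 / labelled
share_multi = multi_type * 100 / labelled

# Craters that ended up in a latitude or longitude group.
latitude_binned = 25881 + 18744
longitude_binned = 19681 + 24939

print(labelled, round(implied_total))
print(round(share_single, 2), round(share_multi, 2))
print(labelled - latitude_binned, labelled - longitude_binned)
```

Among the labelled craters the split is roughly 92.4% single type and 7.6% more than one type. The latitude groups account for every labelled crater, while the longitude groups are 5 short. One possible reason, which is only a guess and not something the output confirms, is that a few longitudes fall on or outside the bin edges, for example a value that rounds to exactly -180.0.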

Written by Aanshi Patwari