Data Analysis Tools Week-1

Aanshi Patwari
Oct 27, 2020

Hello everyone. I am writing this blog post as part of the week-1 assignment for the Coursera course Data Analysis Tools, which is part of the Data Analysis and Interpretation specialisation. Each week's assignment is to write one blog post presenting the research work done during that week.

In week 1, the assignment is to run an analysis of variance (ANOVA). Because the categorical explanatory variable examined here has more than two levels (i.e. more than two groups), we also need to run and interpret post hoc paired comparisons whenever the original statistical test is significant.
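To make the idea concrete, here is a minimal sketch of what a one-way ANOVA computes: the ratio of between-group variance to within-group variance. The function name `one_way_anova_f` and the toy numbers are made up for illustration and are not part of the crater data used later in this post.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    group_means = [mean(g) for g in groups]
    # Between-group sum of squares: distance of each group mean from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-group sum of squares: spread of observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical toy data: three groups of three observations each.
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_anova_f(toy_groups)
print(f_stat, df_b, df_w)
```

A large F means the group means are far apart relative to the scatter inside each group. The F-test only says that some difference exists, which is why a post hoc test is needed to find out which pairs differ.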

STEP 1 : Syntax used to run an ANOVA

import numpy
import pandas
import seaborn
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

# data3 is the Mars craters DataFrame, loaded earlier (loading step not shown here)
# count the craters per ejecta morphology and keep the types with more than 10 craters
ejectamorphbin = data3.groupby('MORPHOLOGY_EJECTA_1').size()
ejectamorphbin = ejectamorphbin[ejectamorphbin > 10]
summarytable = pandas.DataFrame({'COUNT':ejectamorphbin}).reset_index()
labels = numpy.array(summarytable['MORPHOLOGY_EJECTA_1'])
counts = numpy.array(summarytable['COUNT'])
orderedarray = ['Rd', 'SLEPS', 'SLERS', 'SLEPC', 'SLERC', 'DLERS', 'DLEPS', 'MLERS',
                'DLEPC', 'DLERC', 'SLEPCPd', 'SLEPSPd', 'SLEPd', 'MLEPS', 'SLERSPd']
print('This plot shows the distribution of crater morphology.')
plotc3 = seaborn.barplot(x=labels,y=counts,order=orderedarray)
plt.xlabel('Crater Morphology Type')
plt.title('Mars Crater Morphology Distribution')
plt.xticks(rotation='vertical')

print('Let us now look at data with only the top 3 morphology types present')
morphofinterest = ['Rd', 'SLEPS', 'SLERS']
data3 = data3[data3['MORPHOLOGY_EJECTA_1'].isin(morphofinterest)]
# create a new dataframe with the slice
latitude = numpy.array(data3['LATITUDE_CIRCLE_IMAGE'])
longitude = numpy.array(data3['LONGITUDE_CIRCLE_IMAGE'])
morphology = numpy.array(data3['MORPHOLOGY_EJECTA_1'])
data4 = pandas.DataFrame({'LATITUDE_CIRCLE_IMAGE':latitude,'LONGITUDE_CIRCLE_IMAGE':longitude,'MORPHOLOGY_EJECTA_1':morphology})

print('We will bin the latitudes into 6 discrete bins of 30 degrees.')
data4['LATITUDE_BIN'] = pandas.cut(data4['LATITUDE_CIRCLE_IMAGE'],[-90,-60,-30,0,30,60,90])
# optional: bin the longitudes into 4 discrete bins of 90 degrees
#data4['LONGITUDE_BIN'] = pandas.cut(data4['LONGITUDE_CIRCLE_IMAGE'],[-180,-90,0,90,180])
# factorplot was renamed to catplot in seaborn 0.9; kind='point' keeps the old default plot type
seaborn.catplot(x='LATITUDE_BIN',y='LONGITUDE_CIRCLE_IMAGE',hue='MORPHOLOGY_EJECTA_1',data=data4,kind='point')
plt.xlabel('Latitude by Bin (Degrees)')
plt.ylabel('Longitude (Degrees)')
plt.title('Distribution of Martian Crater Longitude by Latitude')
plt.xticks(rotation='vertical')

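print('First we will investigate the mean and standard deviation for the 3 specific morphology types.')
# describe() returns count, mean, std and quartiles for each morphology type
summarystatistics = data4.groupby('MORPHOLOGY_EJECTA_1').describe()
# mean of each describe() statistic across the three morphology types
r1 = summarystatistics.mean()
print(r1)
# standard deviation of each describe() statistic across the three morphology types
r2 = summarystatistics.std()
print(r2)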

print('This will perform an ANOVA on whether the crater diameter varies with EJECTA MORPHOLOGY')
# note: the response modelled here is LONGITUDE_CIRCLE_IMAGE, not crater diameter as the message says
model1 = smf.ols(formula='LONGITUDE_CIRCLE_IMAGE ~ C(MORPHOLOGY_EJECTA_1)', data=data4)
results1 = model1.fit()
print(results1.summary())
print('This will print out the Tukey HSD comparison between the three morphology types')
# post hoc test: compare mean longitude for every pair of morphology types
mc1 = multi.MultiComparison(data4['LONGITUDE_CIRCLE_IMAGE'],data4['MORPHOLOGY_EJECTA_1'])
tukeyres1 = mc1.tukeyhsd()
print(tukeyres1.summary())
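A detail of the binning step that is easy to miss: the seven edges passed to `pandas.cut` define six intervals, not seven. The short sketch below shows this with made-up latitude values (they are not taken from the crater data).

```python
import pandas as pd

# Hypothetical latitudes, one per 30-degree band.
latitudes = pd.Series([-75.0, -45.0, -10.0, 10.0, 45.0, 75.0])
edges = [-90, -60, -30, 0, 30, 60, 90]  # seven edges

binned = pd.cut(latitudes, edges)
n_bins = len(binned.cat.categories)

print(binned.cat.categories)  # the interval labels created by cut
print(n_bins)
```

By default the intervals are closed on the right, so a value exactly equal to the lowest edge (-90 here) would not fall into any bin unless `include_lowest=True` is passed.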

STEP 2 : Corresponding output

As MORPHOLOGY_EJECTA_1 has many categories, for this assignment and the remaining ones I have chosen the 3 morphologies that occur most often.

The analysis of the 3 most frequent morphologies in relation to longitude and latitude is as follows:

The two listings that follow are the printed values of r1 and r2. The first is the mean of each describe() statistic across the 3 morphology types, and the second is the standard deviation of each statistic across the 3 types:

First we will investigate the mean and standard deviation for the 3 specific morphology types.

LATITUDE_CIRCLE_IMAGE   count    11556.333333
                        mean        -5.187123
                        std         34.394704
                        min        -82.566667
                        25%        -30.400000
                        50%         -8.333333
                        75%         18.800000
                        max         80.500000
LONGITUDE_CIRCLE_IMAGE  count    11556.333333
                        mean         2.707338
                        std         99.181989
                        min       -179.966667
                        25%        -72.350000
                        50%          5.433333
                        75%         81.333333
                        max        180.000000
dtype: float64

LATITUDE_CIRCLE_IMAGE   count    11549.184574
                        mean         7.151280
                        std          2.837039
                        min          3.493327
                        25%         11.001364
                        50%          7.392789
                        75%          4.794789
                        max          1.539480
LONGITUDE_CIRCLE_IMAGE  count    11549.184574
                        mean        11.695492
                        std          3.036024
                        min          0.057735
                        25%         17.507927
                        50%         15.321336
                        75%          9.569918
                        max          0.000000
dtype: float64
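The numbers above summarise the describe() statistics across the three groups, so they do not show the mean and standard deviation of each morphology type separately. One possible way to get a per-type table is sketched below. The small DataFrame is invented for illustration; only the column names match the crater data.

```python
import pandas as pd

# Hypothetical stand-in for data4 with two craters per morphology type.
toy = pd.DataFrame({
    "MORPHOLOGY_EJECTA_1": ["Rd", "Rd", "SLEPS", "SLEPS", "SLERS", "SLERS"],
    "LONGITUDE_CIRCLE_IMAGE": [10.0, 20.0, 0.0, 2.0, -10.0, -6.0],
})

# One row per morphology type, with its own mean and sample standard deviation.
per_group = (
    toy.groupby("MORPHOLOGY_EJECTA_1")["LONGITUDE_CIRCLE_IMAGE"]
       .agg(["mean", "std"])
)
print(per_group)
```

Applied to data4, the same `groupby(...).agg(["mean", "std"])` call should give one row for each of Rd, SLEPS and SLERS.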

This will perform an ANOVA on whether the crater diameter varies with EJECTA MORPHOLOGY

                                OLS Regression Results
======================================================================================
Dep. Variable:      LONGITUDE_CIRCLE_IMAGE    R-squared:                         0.008
Model:                                 OLS    Adj. R-squared:                    0.008
Method:                      Least Squares    F-statistic:                       140.8
Date:                     Tue, 27 Oct 2020    Prob (F-statistic):             1.22e-61
Time:                             08:06:55    Log-Likelihood:              -2.0786e+05
No. Observations:                    34669    AIC:                           4.157e+05
Df Residuals:                        34666    BIC:                           4.158e+05
Df Model:                                2
Covariance Type:                 nonrobust
======================================================================================
                                      coef  std err        t   P>|t|   [0.025   0.975]
--------------------------------------------------------------------------------------
Intercept                          15.2783    0.616   24.802   0.000   14.071   16.486
C(MORPHOLOGY_EJECTA_1)[T.SLEPS]   -14.5829    1.513   -9.641   0.000  -17.548  -11.618
C(MORPHOLOGY_EJECTA_1)[T.SLERS]   -23.1299    1.528  -15.134   0.000  -26.126  -20.134
======================================================================================
Omnibus:                          4512.005    Durbin-Watson:                     0.195
Prob(Omnibus):                       0.000    Jarque-Bera (JB):               1285.147
Skew:                               -0.168    Prob(JB):                      8.59e-280
Kurtosis:                            2.119    Cond. No.                           3.28
======================================================================================
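A note on reading the coefficients: with the default treatment coding, the intercept is the mean longitude of the reference level (Rd), and the other two coefficients are offsets from it. The sketch below uses only the numbers printed in the OLS table to recover the three group means. The variable names are mine.

```python
# Coefficients copied from the OLS summary above.
intercept = 15.2783  # mean longitude of the reference level, Rd
offsets = {"SLEPS": -14.5829, "SLERS": -23.1299}

# Each group mean is the intercept plus that group's offset.
group_means = {"Rd": intercept}
for level, offset in offsets.items():
    group_means[level] = intercept + offset

# Difference between the two non-reference groups (SLERS minus SLEPS).
sleps_vs_slers = group_means["SLERS"] - group_means["SLEPS"]

print(group_means)
print(round(sleps_vs_slers, 4))
```

The SLERS minus SLEPS difference computed this way agrees with the meandiff that the Tukey HSD table in this post reports for that pair, which is a quick consistency check between the two outputs.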

group1 group2 meandiff p-adj   lower    upper   reject
------------------------------------------------------
    Rd  SLEPS -14.5829 0.001 -18.1283 -11.0376   True
    Rd  SLERS -23.1299 0.001 -26.7122 -19.5477   True
 SLEPS  SLERS   -8.547 0.001 -13.1549  -3.9391   True
------------------------------------------------------
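The reject column can be read directly off the confidence intervals: a pair is flagged as different when the interval for the mean difference does not contain zero. Here is a small sketch of that rule using the rows from the table above. The helper name `excludes_zero` is hypothetical.

```python
# Rows copied from the Tukey HSD table: (group1, group2, meandiff, lower, upper)
tukey_rows = [
    ("Rd", "SLEPS", -14.5829, -18.1283, -11.0376),
    ("Rd", "SLERS", -23.1299, -26.7122, -19.5477),
    ("SLEPS", "SLERS", -8.547, -13.1549, -3.9391),
]

def excludes_zero(lower, upper):
    """True when the confidence interval lies entirely on one side of zero."""
    return lower > 0 or upper < 0

decisions = {(g1, g2): excludes_zero(lo, hi) for g1, g2, _, lo, hi in tukey_rows}
for pair, reject in decisions.items():
    print(pair, "reject" if reject else "fail to reject")
```

All three intervals lie below zero, which matches the True values in the reject column.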

STEP 3 : A few sentences of interpretation

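The ANOVA of longitude by ejecta morphology gives F = 140.8 with p = 1.22e-61, which is far below the 0.05 significance level, so mean longitude differs significantly among the three morphology types. The Tukey HSD post hoc comparisons support this: all three pairs (Rd vs SLEPS, Rd vs SLERS, and SLEPS vs SLERS) have an adjusted p-value of 0.001 and are rejected. From this we can conclude that the categorical variable (MORPHOLOGY_EJECTA_1) and the quantitative variable (LONGITUDE_CIRCLE_IMAGE) are associated, although the R-squared of 0.008 shows that morphology explains less than 1% of the variance in longitude. Similarly, we can find that MORPHOLOGY_EJECTA_1 and LATITUDE_CIRCLE_IMAGE are also associated.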
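A very small p-value does not mean a large effect, especially with more than 34,000 observations. As a rough check, the unrounded R-squared can be recovered from the F-statistic and the degrees of freedom printed in the OLS summary. This is a sketch based on the standard relation between F and R-squared for a model with an intercept; the variable names are mine.

```python
f_statistic = 140.8  # F-statistic from the OLS summary
df_model = 2         # Df Model
df_resid = 34666     # Df Residuals

# F = (R^2 / df_model) / ((1 - R^2) / df_resid), rearranged to solve for R^2.
r_squared = (f_statistic * df_model) / (f_statistic * df_model + df_resid)

print(round(r_squared, 5))
```

The result rounds to the 0.008 shown in the summary, so the result is statistically significant but morphology accounts for under 1% of the variation in longitude.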
