Data Analysis Tools Week-1
Hello everyone. I am writing this post for the week 1 assignment of the Coursera course Data Analysis Tools, which is part of the Data Analysis and Interpretation specialisation. Each week's assignment is a blog post presenting the analysis done during that week.
The week 1 assignment is to run an analysis of variance (ANOVA). When the overall test is significant and the categorical explanatory variable has more than two levels (that is, more than two groups are being compared), we also need to run and interpret post hoc paired comparisons.
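To make the idea concrete, here is a minimal pure-Python sketch of the F statistic that a one-way ANOVA computes: the ratio of between-group variability to within-group variability. The three toy groups and the helper name one_way_f are made up for illustration; they are not the crater data and not part of any library.

```python
from statistics import mean

def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA on a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Variability of the group means around the grand mean, weighted by group size
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variability of the observations around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical toy data: three groups of three observations each
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_f(toy_groups)
print(f_stat, df_b, df_w)  # F works out to 13 (up to floating-point rounding), with df 2 and 6
```

A large F only says that at least one group mean differs from the others. It does not say which ones, and that is why a post hoc test such as Tukey HSD is run afterwards.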
STEP 1 : Syntax used to run an ANOVA
# Imports used throughout; data3 is assumed to be the Mars crater DataFrame loaded earlier
import numpy
import pandas
import seaborn
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

# Count craters per ejecta morphology, keeping types with more than 10 craters
ejectamorphbin = data3.groupby('MORPHOLOGY_EJECTA_1').size()
ejectamorphbin = ejectamorphbin[ejectamorphbin > 10]
summarytable = pandas.DataFrame({'COUNT': ejectamorphbin}).reset_index()
labels = numpy.array(summarytable['MORPHOLOGY_EJECTA_1'])
counts = numpy.array(summarytable['COUNT'])
orderedarray = ['Rd', 'SLEPS', 'SLERS', 'SLEPC', 'SLERC', 'DLERS', 'DLEPS', 'MLERS',
                'DLEPC', 'DLERC', 'SLEPCPd', 'SLEPSPd', 'SLEPd', 'MLEPS', 'SLERSPd']
print('This plot shows the distribution of crater morphology.')
plotc3 = seaborn.barplot(x=labels, y=counts, order=orderedarray)
plt.xlabel('Crater Morphology Type')
plt.title('Mars Crater Morphology Distribution')
plt.xticks(rotation='vertical')
print('Let us now look at data with only the top 3 morphology types present')
morphofinterest = ['Rd', 'SLEPS', 'SLERS']
data3 = data3[data3['MORPHOLOGY_EJECTA_1'].isin(morphofinterest)]
# Create a new dataframe with the slice
latitude = numpy.array(data3['LATITUDE_CIRCLE_IMAGE'])
longitude = numpy.array(data3['LONGITUDE_CIRCLE_IMAGE'])
morphology = numpy.array(data3['MORPHOLOGY_EJECTA_1'])
data4 = pandas.DataFrame({'LATITUDE_CIRCLE_IMAGE': latitude,
                          'LONGITUDE_CIRCLE_IMAGE': longitude,
                          'MORPHOLOGY_EJECTA_1': morphology})
# Seven edges define 6 bins, each 30 degrees wide
print('We will bin the latitudes into 6 discrete bins of 30 degrees.')
data4['LATITUDE_BIN'] = pandas.cut(data4['LATITUDE_CIRCLE_IMAGE'], [-90, -60, -30, 0, 30, 60, 90])
# Optional: bin the longitudes into 4 bins of 90 degrees
#data4['LONGITUDE_BIN'] = pandas.cut(data4['LONGITUDE_CIRCLE_IMAGE'], [-180, -90, 0, 90, 180])
# factorplot was renamed to catplot in newer seaborn releases; kind='point' keeps the old default plot type
seaborn.catplot(x='LATITUDE_BIN', y='LONGITUDE_CIRCLE_IMAGE', hue='MORPHOLOGY_EJECTA_1', data=data4, kind='point')
plt.xlabel('Latitude by Bin (Degrees)')
plt.ylabel('Longitude (Degrees)')
plt.title('Mean Martian Crater Longitude by Latitude Bin and Ejecta Morphology')
plt.xticks(rotation='vertical')
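print('First we will investigate the mean and standard deviation for the 3 specific morphology types.')
# describe() gives count, mean, std and quartiles for each morphology group
summarystatistics = data4.groupby('MORPHOLOGY_EJECTA_1').describe()
# Mean of each descriptive statistic across the 3 groups
r1 = summarystatistics.mean()
print(r1)
# Standard deviation of each descriptive statistic across the 3 groups
r2 = summarystatistics.std()
print(r2)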
# Note: the response variable in this model is longitude, not crater diameter
print('This will perform an ANOVA on whether the crater diameter varies with EJECTA MORPHOLOGY')
model1 = smf.ols(formula='LONGITUDE_CIRCLE_IMAGE ~ C(MORPHOLOGY_EJECTA_1)', data=data4)
results1 = model1.fit()
print(results1.summary())
# Post hoc paired comparisons between the three morphology types
print('This will print out the Tukey HSD comparison between the three morphology types')
mc1 = multi.MultiComparison(data4['LONGITUDE_CIRCLE_IMAGE'], data4['MORPHOLOGY_EJECTA_1'])
tukeyres1 = mc1.tukeyhsd()
print(tukeyres1.summary())
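One detail of the binning step is worth spelling out: seven edges define six intervals, and pandas.cut by default treats each interval as open on the left and closed on the right. The sketch below mimics that rule with the standard library only. The helper name bin_index is hypothetical, and this is not how pandas implements it internally.

```python
from bisect import bisect_left

# Same edges as the latitude binning step
EDGES = [-90, -60, -30, 0, 30, 60, 90]

def bin_index(value, edges=EDGES):
    """Return i such that edges[i] < value <= edges[i + 1], or None if value is outside all bins."""
    i = bisect_left(edges, value) - 1
    if i < 0 or i >= len(edges) - 1:
        return None
    return i

n_bins = len(EDGES) - 1
print(n_bins)            # number of intervals defined by the edges
print(bin_index(-8.33))  # falls in (-30, 0]
print(bin_index(-60))    # a right edge belongs to the lower interval (-90, -60]
print(bin_index(-90))    # the lowest edge itself is outside every interval
```

With these defaults, pandas.cut would leave a latitude of exactly -90 unbinned unless include_lowest=True is passed.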
STEP 2 : Corresponding output
MORPHOLOGY_EJECTA_1 has many categories, so for this assignment and the remaining ones I keep only the 3 most frequent morphology types: Rd, SLEPS and SLERS.
The analysis of these 3 morphologies in relation to longitude and latitude is as follows.
The first block below is the mean of each descriptive statistic across the 3 morphology groups, and the second block is the standard deviation of each statistic across those groups:
First we will investigate the mean and standard deviation for the 3 specific morphology types.
LATITUDE_CIRCLE_IMAGE   count    11556.333333
                        mean        -5.187123
                        std         34.394704
                        min        -82.566667
                        25%        -30.400000
                        50%         -8.333333
                        75%         18.800000
                        max         80.500000
LONGITUDE_CIRCLE_IMAGE  count    11556.333333
                        mean         2.707338
                        std         99.181989
                        min       -179.966667
                        25%        -72.350000
                        50%          5.433333
                        75%         81.333333
                        max        180.000000
dtype: float64
LATITUDE_CIRCLE_IMAGE   count    11549.184574
                        mean         7.151280
                        std          2.837039
                        min          3.493327
                        25%         11.001364
                        50%          7.392789
                        75%          4.794789
                        max          1.539480
LONGITUDE_CIRCLE_IMAGE  count    11549.184574
                        mean        11.695492
                        std          3.036024
                        min          0.057735
                        25%         17.507927
                        50%         15.321336
                        75%          9.569918
                        max          0.000000
dtype: float64
This will perform an ANOVA on whether the crater diameter varies with EJECTA MORPHOLOGY
                                      OLS Regression Results
==================================================================================================
Dep. Variable: LONGITUDE_CIRCLE_IMAGE   R-squared:                       0.008
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     140.8
Date:                Tue, 27 Oct 2020   Prob (F-statistic):           1.22e-61
Time:                        08:06:55   Log-Likelihood:            -2.0786e+05
No. Observations:               34669   AIC:                         4.157e+05
Df Residuals:                   34666   BIC:                         4.158e+05
Df Model:                           2
Covariance Type:            nonrobust
==================================================================================================
                                       coef    std err          t      P>|t|     [0.025     0.975]
--------------------------------------------------------------------------------------------------
Intercept                           15.2783      0.616     24.802      0.000     14.071     16.486
C(MORPHOLOGY_EJECTA_1)[T.SLEPS]    -14.5829      1.513     -9.641      0.000    -17.548    -11.618
C(MORPHOLOGY_EJECTA_1)[T.SLERS]    -23.1299      1.528    -15.134      0.000    -26.126    -20.134
==================================================================================================
Omnibus:                     4512.005   Durbin-Watson:                   0.195
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1285.147
Skew:                          -0.168   Prob(JB):                    8.59e-280
Kurtosis:                       2.119   Cond. No.                         3.28
==================================================================================================
group1 group2 meandiff p-adj   lower    upper   reject
------------------------------------------------------
    Rd  SLEPS -14.5829 0.001 -18.1283 -11.0376   True
    Rd  SLERS -23.1299 0.001 -26.7122 -19.5477   True
 SLEPS  SLERS   -8.547 0.001 -13.1549  -3.9391   True
------------------------------------------------------
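The regression coefficients and the Tukey table describe the same group means. With Rd as the reference level, the intercept is the mean longitude for Rd, and each coefficient is that group's difference from the Rd mean. Here is a short arithmetic check using only numbers copied from the output above (the variable names are mine):

```python
# Values copied from the OLS coefficient table
intercept = 15.2783     # mean longitude for the reference group, Rd
coef_sleps = -14.5829   # SLEPS mean minus Rd mean
coef_slers = -23.1299   # SLERS mean minus Rd mean

# Recover each group's mean longitude from the coefficients
group_means = {
    'Rd': intercept,
    'SLEPS': intercept + coef_sleps,
    'SLERS': intercept + coef_slers,
}

# The Tukey table reports meandiff as the group2 mean minus the group1 mean
diff_sleps_slers = group_means['SLERS'] - group_means['SLEPS']

for name, value in group_means.items():
    print(name, round(value, 4))
print('SLEPS vs SLERS meandiff:', round(diff_sleps_slers, 4))
```

The first two Tukey rows repeat the two regression coefficients, and the third row (-8.547) is simply the difference between them.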
STEP 3 : A few sentences of interpretation
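The one-way ANOVA of longitude (LONGITUDE_CIRCLE_IMAGE) on ejecta morphology (MORPHOLOGY_EJECTA_1) is significant, F(2, 34666) = 140.8, p = 1.22e-61, so mean longitude differs between at least two of the three morphology types. Because the explanatory variable has three levels, a Tukey HSD post hoc test was run: all three pairwise comparisons (Rd vs SLEPS, Rd vs SLERS, SLEPS vs SLERS) have an adjusted p value of 0.001, which is below the 0.05 significance level, so the null hypothesis of equal means is rejected for each pair. We can therefore conclude that the categorical variable MORPHOLOGY_EJECTA_1 and the quantitative variable LONGITUDE_CIRCLE_IMAGE are associated, although the R-squared of 0.008 shows that morphology explains less than 1% of the variation in longitude. The same approach with LATITUDE_CIRCLE_IMAGE as the response variable indicates that MORPHOLOGY_EJECTA_1 and latitude are associated as well (that output is not shown here).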
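With almost 35,000 craters, statistical significance does not by itself mean a large effect. As a rough check, the share of variance explained can be recovered from the F statistic and the degrees of freedom in the regression output, using the standard identity R^2 = F * df_model / (F * df_model + df_resid). The variable names below are mine.

```python
# Values copied from the OLS summary
f_stat = 140.8
df_model = 2
df_resid = 34666

# Share of the variance in longitude explained by morphology
r_squared = f_stat * df_model / (f_stat * df_model + df_resid)
print(round(r_squared, 3))
```

The result rounds to the R-squared printed in the regression summary, so the group differences are real but small compared with the spread of longitudes within each morphology type.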