This is a good post on making visualisations with pandas data frames in Python. It covers univariate plots such as histograms, line plots and density plots, and multivariate plots such as the correlation matrix plot and the scatter-plot matrix.

Before diving into feature engineering and data cleaning, it is a good idea to understand the data by asking a few questions:

  1. Does it contain empty fields or missing values?
  2. Are the values meaningful and feasible? For example, a person's age should not be negative or something like 99999.
  3. Does it contain a significant number of outliers, and do they follow any pattern?
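The three questions above can be sketched in base R. This is only a minimal illustration on a made-up data frame: the `people` data, the plausible age range of 0 to 120 and the 1.5 * IQR boxplot rule for outliers are all assumptions chosen for the example, not part of the dataset used later in this post.

```r
# Toy data (made-up values): one missing age, two infeasible ages, one extreme age.
people <- data.frame(
  name = c("a", "b", "c", "d", "e", "f", "g", "h", "i"),
  age  = c(34, NA, -5, 99999, 41, 29, 38, 36, 95)
)

# 1. Missing values: count the NAs in each column.
na_counts <- colSums(is.na(people))

# 2. Infeasible values: ages outside an assumed plausible range of 0 to 120.
infeasible <- !is.na(people$age) & (people$age < 0 | people$age > 120)

# 3. Outliers: apply the boxplot rule (beyond 1.5 * IQR from the hinges)
#    to the ages that are present and feasible.
valid_age <- people$age[!is.na(people$age) & !infeasible]
outliers <- boxplot.stats(valid_age)$out

print(na_counts)
print(people[infeasible, ])
print(outliers)
```

Infeasible values are filtered out before looking for outliers, because a value like 99999 would otherwise distort the quartiles that the outlier rule depends on.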

We will cover how to deal with incomplete rows, missing values, outliers and similar preprocessing tasks in a future post.

You will need the following R packages:

  1. ggplot2 – For beautiful plots and histograms
  2. igraph – For working with graphs (a set of nodes and a set of edges) and for visualising them. Its plots are only preliminary, so for a detailed view use Gephi, which is available for free.
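A quick way to check whether these packages are already installed is sketched below. The variable names `needed`, `installed` and `missing_pkgs` are made up for this example; `requireNamespace` and `install.packages` are standard R functions.

```r
# Packages named in the list above.
needed <- c("ggplot2", "igraph")

# TRUE for each package that can be loaded, FALSE otherwise.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  # Only report what is missing; installing is left to the reader.
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "),
          ". Install them with install.packages().")
} else {
  message("All required packages are installed.")
}
```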

I recommend going through this post for an in-depth analysis.

In R, commands like str and summary help us get this first understanding of the data.

Here, we use the Dress Attribute Sales dataset from the UCI Machine Learning Repository.

str – Gives the schema of the data set, i.e. each feature, its data type and its first few values (or its levels, if it is stored as a factor)

dress_data <- read.csv(file = "~/Downloads/Attribute DataSet.csv", header = TRUE, stringsAsFactors = FALSE)
str(dress_data)
'data.frame':	500 obs. of  14 variables:
 $ Dress_ID      : int  1006032852 1212192089 1190380701 966005983 876339541 1068332458 1220707172 1219677488 1113094204 985292672 ...
 $ Style         : chr  "Sexy" "Casual" "vintage" "Brief" ...
 $ Price         : chr  "Low" "Low" "High" "Average" ...
 $ Rating        : num  4.6 0 0 4.6 4.5 0 0 0 0 0 ...
 $ Size          : chr  "M" "L" "L" "L" ...
 $ Season        : chr  "Summer" "Summer" "Automn" "Spring" ...
 $ NeckLine      : chr  "o-neck" "o-neck" "o-neck" "o-neck" ...
 $ SleeveLength  : chr  "sleevless" "Petal" "full" "full" ...
 $ waiseline     : chr  "empire" "natural" "natural" "natural" ...
 $ Material      : chr  "null" "microfiber" "polyster" "silk" ...
 $ FabricType    : chr  "chiffon" "null" "null" "chiffon" ...
 $ Decoration    : chr  "ruffles" "ruffles" "null" "embroidary" ...
 $ Pattern.Type  : chr  "animal" "animal" "print" "print" ...
 $ Recommendation: int  1 0 0 1 0 0 0 0 1 1 ...
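The str output above shows the literal string "null" in columns such as Material, FabricType and Decoration. If we assume that "null" means that no value was recorded (the dataset description should confirm this), R will not count these as missing until they are converted to NA. Here is a minimal sketch using only the first four rows shown above; `dress_sample` and `null_counts` are names made up for this example.

```r
# First four values of three columns, copied from the str() output above.
dress_sample <- data.frame(
  Material   = c("null", "microfiber", "polyster", "silk"),
  FabricType = c("chiffon", "null", "null", "chiffon"),
  Decoration = c("ruffles", "ruffles", "null", "embroidary"),
  stringsAsFactors = FALSE
)

# Assumption: the string "null" stands for a missing value.
dress_sample[dress_sample == "null"] <- NA

# Number of missing values per column after the conversion.
null_counts <- colSums(is.na(dress_sample))
print(null_counts)
```

The same conversion can be done while reading the file by passing na.strings = "null" to read.csv.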

summary – Gives aggregate information for each column of the dataset: minimum, 1st quartile, median, mean, 3rd quartile and maximum for numeric columns, plus the number of NA's if there are any

summary(dress_data)
Dress_ID            Style              Price               Rating          Size              Season
 Min.   :4.443e+08   Length:500         Length:500         Min.   :0.000   Length:500         Length:500
 1st Qu.:7.673e+08   Class :character   Class :character   1st Qu.:3.700   Class :character   Class :character
 Median :9.083e+08   Mode  :character   Mode  :character   Median :4.600   Mode  :character   Mode  :character
 Mean   :9.055e+08                                         Mean   :3.529
 3rd Qu.:1.040e+09                                         3rd Qu.:4.800
 Max.   :1.254e+09                                         Max.   :5.000
   NeckLine         SleeveLength        waiseline           Material          FabricType         Decoration
 Length:500         Length:500         Length:500         Length:500         Length:500         Length:500
 Class :character   Class :character   Class :character   Class :character   Class :character   Class :character
 Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character  

 Pattern.Type       Recommendation
 Length:500         Min.   :0.00
 Class :character   1st Qu.:0.00
 Mode  :character   Median :0.00
                    Mean   :0.42
                    3rd Qu.:1.00
                    Max.   :1.00
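For character columns, summary reports only Length, Class and Mode, so it says nothing about which values occur. Counting the values with table is one way to fill that gap. For a 0/1 column like Recommendation, the mean is the share of ones, so the 0.42 above means that 42% of the dresses are recommended. The sketch below uses only the first few values shown in the str output, so its numbers are for illustration and will differ from the full dataset.

```r
# First four Season values and first ten Recommendation values,
# copied from the str() output earlier in this post.
season <- c("Summer", "Summer", "Automn", "Spring")
recommendation <- c(1, 0, 0, 1, 0, 0, 0, 0, 1, 1)

# Frequency of each distinct value in a character column.
season_counts <- table(season)
print(season_counts)

# Share of rows with Recommendation equal to 1.
share_recommended <- mean(recommendation)
print(share_recommended)
```

On the full data frame the equivalent calls would be table(dress_data$Season) and mean(dress_data$Recommendation). The counts also expose spelling variants such as "Automn" that need cleaning.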

We can also construct visualisations of our data in R using line plots, scatter plots and histograms.

 hist(dress_data$Rating) 
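To see what hist computes, the sketch below bins the first ten Rating values from the str output into unit-wide bins and inspects the counts before drawing. The bin edges are an assumption picked for this example; by default hist chooses its own breaks.

```r
# First ten Rating values, copied from the str() output earlier in this post.
ratings <- c(4.6, 0, 0, 4.6, 4.5, 0, 0, 0, 0, 0)

# Compute the histogram without drawing it, using assumed breaks at 0, 1, ..., 5.
h <- hist(ratings, breaks = seq(0, 5, by = 1), plot = FALSE)
print(h$counts)

# Draw the precomputed histogram with a title and an axis label.
plot(h, main = "Dress ratings (first ten rows)", xlab = "Rating")
```

Even in this small sample the ratings split into a group at 0 and a group above 4, which would explain why the mean in the summary output (3.529) is well below the median (4.6). Whether a rating of 0 means "not rated" is an assumption to check against the dataset description.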

[Figure: histogram of dress_data$Rating]
