Top 10 Python packages for Data Science

The top 10 Python packages you must have are:

  1. scikit-learn – This is THE MOST IMPORTANT package for any ML project. It covers PCA, preprocessing, splitting datasets and the entire pipeline. Its suite of ML models makes training and testing a range of different models a breeze. Its rich documentation, including the User Guide, is a tremendous help for everyone.
  2. pandas – For importing data into workable formats such as DataFrames, cleaning and reshaping it in Python, and exporting it again to formats like csv, Excel or json.
  3. numpy – For numerical operations, creating and reshaping arrays, and generating random numbers from standard distributions (uniform, normal and so on), for example to initialize the weights of a neural network.
  4. nltk – A suite of packages, datasets and lexicons that is a must for any NLP (Natural Language Processing) task. Some important functionalities are: tokenization, Part-of-Speech (POS) tagging, n-gram language models, stemming, lemmatization and stopword lists.
  5. matplotlib – For data visualization, especially while working with pandas, whose plotting methods are built on it. Useful for making complicated plots and graphs.
  6. re – The standard-library module for regular expressions (regex). It proves very useful for preprocessing and cleaning text, for example removing punctuation marks.
  7. pickle – A standard-library module that is very useful when we need to store some output data, but exporting it to formats like csv, txt, json or xml would be awkward or take a lot of space. Pickle serializes a Python object to a binary byte stream, which can later be loaded back into Python code as the same type of object. Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.
  8. theano – For implementing deep learning models with Theano, which compiles symbolic expressions over multi-dimensional arrays into efficient code.
  9. tensorflow, tensorflow-gpu – For implementing deep learning models with TensorFlow; tensorflow-gpu is the GPU-enabled build of the same library.
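  10. os – A standard-library module for talking to the operating system: building file paths, listing directories, creating folders and reading environment variables. Very useful in almost any script that reads or writes files.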
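
To make item 1 more concrete, here is a minimal sketch of a scikit-learn workflow that splits a dataset, preprocesses it, applies PCA and trains a model in one pipeline. The Iris dataset, the choice of logistic regression, the step names and the parameter values are all assumptions made for illustration, not recommendations from the list above.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Iris ships with scikit-learn, so nothing has to be downloaded (illustrative choice).
X, y = load_iris(return_X_y=True)

# Hold out 25% of the rows for testing; stratify keeps class proportions equal.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Chain preprocessing, PCA and a classifier so they are fitted together.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)

# score() returns the mean accuracy on the held-out data.
accuracy = pipeline.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Swapping in a different model only means replacing the last pipeline step, which is what makes trying out several models so quick.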
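
The numpy use case from item 3, drawing random numbers to initialize neural network weights, might look roughly like the sketch below. The layer sizes, the seed and the 0.01 scale are hypothetical values picked for the example.

```python
import numpy as np

# A seeded generator makes the random draws reproducible.
rng = np.random.default_rng(seed=42)

# Hypothetical layer sizes, chosen only for illustration.
n_inputs, n_hidden = 4, 3

# Small weights drawn from a normal distribution, biases set to zero.
weights = rng.normal(loc=0.0, scale=0.01, size=(n_inputs, n_hidden))
biases = np.zeros(n_hidden)

# Reshape a flat range of 8 numbers into a batch of 2 samples with 4 features.
batch = np.arange(8, dtype=float).reshape(2, n_inputs)

# One dense-layer forward pass: matrix product plus broadcast bias.
activations = batch @ weights + biases
print(activations.shape)
```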
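
The three standard-library modules in the list (re, pickle and os) often show up together in a preprocessing script. The following is a small sketch under assumed inputs: the sample sentence, the file name tokens.pkl and the shape of the stored dictionary are made up for the example.

```python
import os
import pickle
import re
import string
import tempfile

raw = "Hello, world! Isn't regex useful?"

# re.escape keeps every punctuation character literal inside the character class.
punctuation = re.compile(f"[{re.escape(string.punctuation)}]")
cleaned = punctuation.sub("", raw)
tokens = cleaned.lower().split()

# Any picklable Python object can be stored; here, a dict of lists and dicts.
record = {"tokens": tokens, "lengths": {t: len(t) for t in tokens}}

with tempfile.TemporaryDirectory() as workdir:
    # os.path.join builds a path that is correct on every operating system.
    path = os.path.join(workdir, "tokens.pkl")

    # Pickle files are binary, so open them with "wb" and "rb".
    with open(path, "wb") as fh:
        pickle.dump(record, fh)

    size_in_bytes = os.path.getsize(path)
    listing = os.listdir(workdir)

    with open(path, "rb") as fh:
        restored = pickle.load(fh)

print(tokens)
print(restored == record, listing, size_in_bytes > 0)
```

The restored object compares equal to the original dictionary, with its nested list and dict intact, which is the main convenience pickle offers over flat text formats.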