How to Hire a Data Scientist?


A data scientist is someone who creates value from data. Such a person proactively gathers information from various sources, analyzes it to better understand how the business performs, and builds AI tools that automate certain processes within the company. For example, consider an e-commerce business. Different departments, such as procurement, delivery, finance, and product, play a vital role in running it, and they decide which products to purchase based on sales and cost trends. When a data scientist is given historical data on a product, such as cost, production, quantity, and quality variations, he or she analyzes the data and predicts the product's future sales and cost.
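
As a rough illustration of the forecasting task described above, the sketch below fits a straight-line trend to past sales and extends it one month ahead. The sales figures and the function names (`fit_linear_trend`, `forecast`) are hypothetical, and a real data scientist would usually reach for richer models; this only shows the basic idea.

```python
# Hypothetical monthly unit sales for one product (illustrative numbers only).
monthly_sales = [120, 135, 150, 160, 178, 190]


def fit_linear_trend(values):
    """Fit y = slope * x + intercept by ordinary least squares, with x = 0, 1, 2, ..."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(values, steps_ahead=1):
    """Extend the fitted trend line steps_ahead periods past the last observation."""
    slope, intercept = fit_linear_trend(values)
    next_x = len(values) - 1 + steps_ahead
    return slope * next_x + intercept


next_month = forecast(monthly_sales)
print(round(next_month, 1))
```

A candidate who can explain why a straight line may be too simple here (seasonality, promotions, price changes) is showing the kind of judgment the role calls for.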


Standard Job Description:

There are many definitions of this job, and it is sometimes confused with the Big Data engineer occupation. A data scientist or engineer may be X% scientist, Y% software engineer, and Z% hacker, which is why the definition of the job becomes convoluted. The actual ratios vary depending on the skills required and the type of job. It is usually considered normal to bring people with different skill sets into the data science team. However, data engineers focus on building the infrastructure and architecture for data generation, whereas data scientists focus on advanced mathematics and statistical analysis of the generated data.


Data scientists utilize their analytical, statistical, and programming skills to collect, analyze, and interpret large data sets. They then use this information to develop data-driven solutions to difficult business challenges. Data scientists commonly have a bachelor’s degree in statistics, math, computer science, or economics. They have a wide range of technical competencies, including statistics, machine learning, coding languages, databases, and reporting technologies.


 Data scientist duties typically include creating various machine learning-based tools or processes within the company, such as recommendation engines or automated lead scoring systems. People within this role should also be able to perform statistical analysis. 

Key Job Responsibilities: 

1. Work with stakeholders throughout the organization to identify opportunities for leveraging company data to drive business solutions. 

2. Mine and analyze data from company databases to drive optimization and improvement of product development, marketing techniques and business strategies.

3. Assess the effectiveness and accuracy of new data sources and data gathering techniques. 

4. Develop custom data models and algorithms to apply to data sets. 

5. Use predictive modeling to increase and optimize customer experiences, revenue generation, ad targeting and other business outcomes. 

6. Develop the company's A/B testing framework and test model quality.

7. Coordinate with different functional teams to implement models and monitor outcomes. 

8. Develop processes and tools to monitor and analyze model performance and data accuracy. 

Ideal Candidate:

1. Experience using statistical computer languages (R, Python, SQL, etc.) to manipulate data and draw insights from large data sets.

2. Experience working with and creating data architectures. 

3. Knowledge of a variety of machine learning techniques (clustering, decision tree learning, artificial neural networks, etc.) and their real-world advantages/drawbacks. 

4. Knowledge of advanced statistical techniques and concepts (regression, properties of distributions, statistical tests and proper usage, etc.) and experience with applications. 

5. Excellent written and verbal communication skills for coordinating across teams. 

6. A drive to learn and master new technologies and techniques. 

Desired Education:

Bachelor’s / Master’s / PhD in Statistics, Mathematics, Computer Science or another quantitative field.

Certifications Associated: 

1. Certified Analytics Professional (CAP) 

2. Cloudera Certified Associate: Data Analyst 

3. Cloudera Certified Professional: CCP Data Engineer 

4. Data Science Council of America (DASCA) Senior Data Scientist (SDS) 

5. Data Science Council of America (DASCA) Principal Data Scientist (PDS)

6. Dell EMC Data Science Track 

7. Google Certified Professional Data Engineer 

8. Google Data and Machine Learning 

9. IBM Data Science Professional Certificate 

10. Microsoft MCSE: Data Management and Analytics 

11. Microsoft Certified Azure Data Scientist Associate 

12. Open Certified Data Scientist (Open CDS) 

13. SAS Certified Advanced Analytics Professional 

14. SAS Certified Big Data Professional 

15. SAS Certified Data Scientist 

Key Skills:

Data Analyst, Data Science, Data Analysis, Data Mining, Machine Learning, Data Visualization, Data Modeling, Big Data, Deep Learning, Python, R, ETL, AWS, Power BI, Regression, Statistical Modelling.

Common Positions: 

1. Data Scientist 

2. Data Analyst / Architect / Engineer 

3. AI Architect 

4. Machine Learning Engineer 

5. Marketing / Operations Analyst 


Screening Questions/Assessment Parameters:

1. Coding knowledge and experience with several languages: C, C++, Java, JavaScript, etc. 

2. Knowledge and experience in statistical and data mining techniques: GLM/Regression, Random Forest, Boosting, Trees, text mining, social network analysis, etc.

3. Experience querying databases and using statistical computer languages: R, Python, SQL, etc.

4. Experience using web services: Redshift, S3, Spark, DigitalOcean, etc. 

5. Experience creating and using advanced machine learning algorithms and statistics: regression, simulation, scenario analysis, modeling, clustering, decision trees, neural networks, etc.

6. Experience analyzing data from 3rd party providers: Google Analytics, SiteCatalyst, Coremetrics, AdWords, Crimson Hexagon, Facebook Insights, etc.

7. Experience with distributed data/computing tools: Map/Reduce, Hadoop, Hive, Spark, Gurobi, MySQL, etc. 

8. Experience visualizing/presenting data for stakeholders using: Periscope, Business Objects, D3, ggplot, etc. 

Basic Terminologies:

1. Algorithm. Repeatable sets of instructions which people or machines can use to process data. 

2. Bayes' Theorem. A mathematical formula used to calculate the probability of one event occurring given that another event has occurred.

3. Data Mining. The process of examining a set of data to determine relationships between variables which could affect outcomes – generally at large scale and by machines. 

4. Data Set. The entire collection of data that will be used in a data science initiative. 

5. Metadata. Data about data, or data attached to other data. For example, with an image file this would be information about its size, when it was created, what camera was used to take it, or which version of a software package it was created in.

6. Outlier. A data point whose value is very different from what is expected given the other values in the dataset. Outliers can be indicators of rare or unexpected events, or of unreliable data.

7. Predictive Modelling. Using data to predict the future. 

8. Quantile. A cut point that divides a sorted data set into groups of equal size. For example, quartiles split the data into four groups, each holding a quarter of the values.

9. Standard Deviation. A common calculation in data science that measures how widely the values in a data set are spread around the average.

10. Data visualization. Any attempt to make data more easily digestible by rendering it in a visual context. 

11. Big Data. Data sets that are so large and complex that traditional data-processing tools cannot handle them.
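
Several of the terms above (standard deviation, quantile, outlier, Bayes' theorem) can be made concrete with a few lines of Python. The sketch below uses only the standard `statistics` module. The order counts, the fraud-detection rates, the 1.5 × IQR outlier rule, and the helper name `bayes_posterior` are all assumptions chosen for illustration.

```python
import statistics

# Hypothetical daily order counts; the last value is deliberately extreme.
orders = [52, 48, 50, 47, 53, 49, 51, 50, 120]

mean = statistics.mean(orders)
std_dev = statistics.pstdev(orders)            # population standard deviation
quartiles = statistics.quantiles(orders, n=4)  # three cut points, four equal-sized groups

# Flag outliers with the common 1.5 * IQR rule (one convention among several).
q1, _, q3 = quartiles
iqr = q3 - q1
outliers = [x for x in orders if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]


def bayes_posterior(prior, likelihood, false_positive_rate):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical rates: 1% of transactions are fraudulent, the model flags 90% of
# fraud and wrongly flags 5% of legitimate transactions.
p_fraud_given_flag = bayes_posterior(prior=0.01, likelihood=0.90, false_positive_rate=0.05)

print(f"mean={mean:.1f}, std_dev={std_dev:.1f}, outliers={outliers}")
print(f"P(fraud | flagged) = {p_fraud_given_flag:.3f}")
```

A useful screening prompt is to ask why the flagged transaction is still more likely legitimate than fraudulent even though the model catches 90% of fraud.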

Industry Jargon:

1. Anonymization. When carrying out scientific data analysis using personal data (data which identifies a person), anonymization refers to the process of removing or obfuscating indicators in the data which show who it specifically refers to. 

2. Behavioral analytics. The use of data on a person or object’s behavior to make predictions on how it might change in the future or determining variables which affect it, so more favorable or efficient outcomes might be achieved. 

3. Classification. The ability to use data (about an object, event or anything else) to determine which of several predetermined groups an item belongs in. For a basic example, an image recognition analysis might classify all shapes with four equal sides as squares, and all shapes with three sides as triangles. 

4. Clustering. Clustering is also about grouping objects together, but it differs because it is used when there are no predetermined groups. Objects (or events) are clustered together due to similarities they share, and algorithms determine what that common relationship between them may be. 

5. Decision Trees. A basic decision-making structure which can be used by a computer to understand and classify information. 

6. Random Forest. A random forest is a method of statistical analysis which involves taking the output of many decision trees (see above) and analyzing them together, to provide a more complex and detailed understanding or classification of data than would be possible with just one tree. 

7. Data wrangling. The process of formatting or restructuring raw data to suit specific needs or increase its decision-making power (sometimes referred to as data munging). 

8. Deep learning. A branch of machine learning that attempts to mirror the neurons and neural networks associated with thinking in human beings. 

9. Supervised learning. A common branch of machine learning in which the algorithm is trained on examples labeled with the correct answers, so that it learns to draw what the data scientist believes to be the correct conclusions.

10. Unsupervised learning. A branch of machine learning where the algorithm does not rely on human input, and is, instead, self-learning. This more closely resembles what some experts call true artificial intelligence. 

11. Spark. An open-source distributed computing engine used for processing and analyzing large amounts of data.

12. NoSQL Database. A highly scalable, non-relational database that does not require a fixed schema, so it can store many kinds of data and handle large amounts of it.

13. Hadoop. An open-source framework for storing big data in a distributed file system (HDFS) and processing it across many machines.

14. Database. Any collection of data, or information, that is specially organized for rapid search and retrieval by a computer.
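
The difference between classification and clustering described above can be sketched in a few lines of Python. The first function assigns items to predetermined groups, following the shape example in the list. The second is a deliberately minimal one-dimensional k-means that discovers groups on its own. The function names, the spend figures, and the choice of k-means are assumptions for illustration, and the sketch assumes `k` is at least 2.

```python
def classify_shape(num_sides, equal_sides):
    """Classification: assign an item to one of several predetermined groups."""
    if num_sides == 3:
        return "triangle"
    if num_sides == 4 and equal_sides:
        return "square"
    return "other"


def cluster_1d(values, k=2, iterations=10):
    """Clustering: a minimal 1-D k-means; the groups are discovered, not predefined."""
    ordered = sorted(values)
    # Start the centers at evenly spaced positions in the sorted data (needs k >= 2).
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    groups = [[] for _ in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups


# Hypothetical monthly spend per customer, with no labels attached.
spend = [12, 15, 14, 95, 102, 13, 99]
low_spenders, high_spenders = cluster_1d(spend, k=2)

print(classify_shape(4, equal_sides=True))
print(sorted(low_spenders), sorted(high_spenders))
```

In the classification case the labels ("triangle", "square") exist before any data is seen; in the clustering case the two spending groups emerge from the data, and it is up to the analyst to interpret what they mean.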

Benchmark Profile:

Benchmark Profile on LinkedIn (1)

Benchmark Profile on LinkedIn (2)

Benchmark Profile on LinkedIn (3)

Benchmark Profile on RMS(1) 

Benchmark Profile on RMS(2)

Benchmark Profile on RMS(3)