How to Ace a Data Science Interview

As I mentioned in my first post, I have just finished an extensive tech job search, which featured eight on-sites, along with countless phone screens and informal chats. I was interviewing for a combination of data science and software engineering (machine learning) positions, and I got a pretty good sense of what those interviews are like. In this post, I give an overview of what you should expect in a data science interview, and some suggestions for how to prepare.

An interview is not a pop quiz. You should know what to expect going in, and you can take the time to prepare for it. During the interview phase of the process, your recruiter is on your side and can usually tell you what types of interviews you'll have. Even if the recruiter is reluctant to share that, common practices in the industry are a good guide to what you're likely to see. In this post, I'll go over the types of data science interviews I've encountered, and offer my advice on how to prepare for them.

Data science roles generally fall into two broad areas of focus: statistics and machine learning. I only applied to the latter category, so that's the type of position discussed in this post. My experience is also limited to tech companies, so I can't offer guidance for data science in finance, biotech, etc.

Here are the types of interviews (or parts of interviews) I've come across.

Always:
• Coding (usually whiteboard)
• Applied machine learning
• Your background

Often:
• Culture fit
• Machine learning theory
• Dataset analysis
• Stats

You will encounter a similar set of interviews for a machine learning software engineering position, though more of the questions will fall in the coding category.
Coding (usually whiteboard)

This is the same type of interview you'd have for any software engineering position, though the expectations may be less stringent. There are lots of websites and books that will tell you how to prepare. Practice your coding skills if they're rusty. Don't forget to practice coding away from the computer (e.g. on paper), which is surely a skill that's rusty. Review the data structures you may never have used outside of school: binary search trees, linked lists, heaps. Be comfortable with recursion. Know how to reason about algorithm running times. You can generally use any "real" language you want in an interview (Matlab doesn't count, unfortunately); Python's succinct syntax makes it a great language for coding interviews.
Prep tips:
• If you get nervous in interviews, try doing some practice problems under time pressure.
• If you don't have much software engineering experience, see if you can get a friend to look over your practice code and provide feedback.

During the interview:
• Make sure you understand exactly what problem you're trying to solve. Ask the interviewer questions if anything is unclear or underspecified.
• Make sure you explain your plan to the interviewer before you start writing any code, so that they can help you avoid spending time going down less-than-ideal paths.
• If you can't think of a good way to do something, it often helps to start by talking through a dumb way to do it.
• Mention what invalid inputs you'd want to check for (e.g. input variable type check). Don't bother writing the code to do so unless the interviewer asks. In all my interviews, nobody has ever asked.
• Before declaring that your code is finished, think about variable initialization, end conditions, and boundary cases (e.g. empty inputs). If it seems helpful, run through an example. You'll score points by catching your bugs yourself, rather than having the interviewer point them out.
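To make the last tip concrete, here is a minimal sketch on a made-up practice problem (not one taken from any interview described here): find the first position in a sorted list whose value is at least some target. The function name and the problem itself are illustrative assumptions; the point is the comments that call out initialization, the end condition, and the empty-input case.

```python
def first_index_at_least(sorted_values, target):
    """Return the index of the first element >= target.

    Returns len(sorted_values) when no element qualifies.
    Assumes sorted_values is sorted in ascending order (not checked here;
    in an interview you would mention the check rather than write it).
    """
    # Initialization: the answer always lies in the half-open window [low, high].
    low, high = 0, len(sorted_values)
    # End condition: stop when the window is empty.
    # Boundary case: for an empty list the loop never runs and we return 0.
    while low < high:
        mid = (low + high) // 2
        if sorted_values[mid] < target:
            low = mid + 1   # everything up to mid is too small
        else:
            high = mid      # mid might be the answer, keep it in the window
    return low


# Running through an example by hand, as you would at the whiteboard:
print(first_index_at_least([1, 3, 5], 4))  # → 2
```

Walking through the empty list, a target smaller than everything, and a target larger than everything is a quick way to catch off-by-one bugs before the interviewer does.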
Applied machine learning

All the applied machine learning interviews I've had focused on supervised learning. The interviewer will present you with a prediction problem, and ask you to explain how you would set up an algorithm to make that prediction. The problem selected is often relevant to the company you're interviewing at (e.g. figuring out which product to recommend to a user, which users are going to stop using the site, which ad to display, etc.), but can also be a toy example (e.g. recommending board games to a friend). This type of interview doesn't depend on much background knowledge, other than having a general understanding of machine learning concepts (see below). However, it definitely helps to prepare by brainstorming the types of problems a particular company might ask you to solve. Even if you miss the mark, the brainstorming session will help with the culture fit interview (also see below).

When answering this type of question, I've found it helpful to start by laying out the setup of the problem. What are the inputs? What are the labels you're trying to predict? What machine learning algorithms could you run on the data? Sometimes the setup will be obvious from the question, but sometimes you'll need to figure out how to define the problem. In the latter case, you'll generally have a discussion with the interviewer about some plausible definitions (e.g., what does it mean for a user to "stop using the site"?).

The main component of your answer will be feature engineering. There is nothing magical about brainstorming features. Think about what might be predictive of the variable you are trying to predict, and what information you would actually have available. I've found it helpful to give context around what I'm trying to capture, and to what extent the features I'm proposing reflect that information.

For the sake of concreteness, here's an example. Suppose Amazon is trying to figure out what books to recommend to you. (Note: I did not interview at Amazon, and have no idea what they actually ask in their interviews.) To predict what books you're likely to buy, Amazon can look for books that are similar to your past Amazon purchases. But maybe some purchases were mistakes, and you vowed to never buy a book like that again. Well, Amazon knows how you've interacted with your Kindle books. If there's a book you started but never finished, it might be a positive signal for general areas you're interested in, but a negative signal for the particular author. Or maybe some categories of books deserve different treatment. For example, if a year ago you were buying books targeted at one-year-olds, Amazon could deduce that nowadays you're looking for books for two-year-olds. It's easy to see how you can spend a while exploring the space between what you'd like to know and what you can actually find out.

Your background

You should be prepared to give a high-level summary of your career, as well as to do a deep-dive into a project you've worked on. The project doesn't have to be directly related to the position you're interviewing for (though it can't hurt), but it needs to be the kind of work you can have an in-depth technical discussion about.
To prepare:
• Review any papers/presentations that came out of your projects to refresh your mind on the technical details.
• Practice explaining your project to a friend in order to make sure you are telling a coherent story. Keep in mind that you'll probably be talking to someone who's smart but doesn't have expertise in your particular field.
• Be prepared to answer questions as to why you chose the approach that you did, and about your individual contribution to the project.
Culture fit

Here are some culture fit questions your interviewers are likely to be interested in. These questions might come up as part of other interviews, and will likely be asked indirectly. It helps to keep what the interviewer is looking for in the back of your mind.
• Are you specifically interested in the product/company/space you'd be working in? It helps to prepare by thinking about the problems the company is trying to solve, and how you and the team you'd be part of could make a difference.
• Do you care about impact? Even in a research-oriented corporate environment, I wouldn't recommend saying that you don't care about company metrics, and that you'd love to just play with data and write papers.
• Will you work well with other people? I know it's a cliché, but most work is collaborative, and companies are trying to assess this as best they can. Avoid bad-mouthing former colleagues, and show appreciation for their contributions to your projects.
• Are you willing to get your hands dirty? If there's annoying work that needs to be done (e.g. cleaning up messy data), will you take care of it?
• Are you someone the team will be happy to have around on a personal level? Even though you might be stressed, try to be friendly, positive, enthusiastic and genuine throughout the interview process.

You may also get broad questions about what kinds of work you enjoy and what motivates you. It's useful to have an answer ready, but there may not be a "right" answer the interviewer is looking for.

Machine learning theory

This type of interview will test your understanding of basic machine learning concepts, generally with a focus on supervised learning. You should understand:
• The general setup for a supervised learning system
• Why you want to split data into training and test sets
• The idea that models that aren't powerful enough can't capture the right generalizations about the data, and ways to address this (e.g. different model or projection into a higher-dimensional space)
• The idea that models that are too powerful suffer from overfitting, and ways to address this (e.g. regularization)
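The concepts in the list above can be sketched in a few dozen lines of plain Python. This is only an illustration under stated assumptions: the data is synthetic (a noisy linear rule invented for the demo), the model is a small logistic regression trained by batch gradient descent, and the function names, learning rate, and penalty strength are arbitrary choices rather than anything prescribed by an interviewer or a library.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, x):
    """weights holds one weight per feature, followed by the bias term."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + weights[-1])


def train_logistic(rows, labels, l2=0.0, learning_rate=0.1, epochs=200):
    """Batch gradient descent on mean log loss plus (l2 / 2) * ||w||^2 (bias excluded)."""
    n_features = len(rows[0])
    weights = [0.0] * (n_features + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            error = predict_proba(weights, x) - y
            for j, xj in enumerate(x):
                grad[j] += error * xj
            grad[-1] += error
        for j in range(len(weights)):
            penalty = l2 * weights[j] if j < n_features else 0.0
            weights[j] -= learning_rate * (grad[j] / len(rows) + penalty)
    return weights


def accuracy(weights, rows, labels):
    correct = sum((predict_proba(weights, x) >= 0.5) == (y == 1)
                  for x, y in zip(rows, labels))
    return correct / len(rows)


def run_demo(seed=0, n=300):
    rng = random.Random(seed)
    rows = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n)]
    # Synthetic labels: a noisy linear rule, invented purely for this demo.
    labels = [1 if x[0] + x[1] + rng.gauss(0, 0.3) > 0 else 0 for x in rows]
    # Hold out the last quarter as a test set the model never trains on.
    split = int(0.75 * n)
    train_x, test_x = rows[:split], rows[split:]
    train_y, test_y = labels[:split], labels[split:]
    results = {}
    for l2 in (0.0, 1.0):
        w = train_logistic(train_x, train_y, l2=l2)
        results[l2] = {
            "weight_norm": math.sqrt(sum(wj * wj for wj in w[:-1])),
            "train_accuracy": accuracy(w, train_x, train_y),
            "test_accuracy": accuracy(w, test_x, test_y),
        }
    return results


for l2, stats in run_demo().items():
    print(l2, stats)
```

Comparing train and test accuracy is the reason for the split, and the L2 penalty shows regularization at its simplest: it pulls the weights toward zero, trading some fit on the training data for a less extreme model.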
You don't need to know a lot of machine learning algorithms, but you definitely need to understand logistic regression, which seems to be what most companies are using. I also had some in-depth discussions of SVMs, but that may just be because I brought them up.

Dataset analysis

In this type of interview, you will be given a data set, and asked to write a script to pull out features for some prediction task. You may be asked to then plug the features into a machine learning algorithm. This interview essentially adds an implementation component to the applied machine learning interview (see above). Of course, your features may now be inspired by what you see in the data. Do the distributions for each feature you're considering differ between the labels you're trying to predict?
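As a rough sketch of that last question, the snippet below reads a tiny CSV and compares the mean of each candidate feature across the two labels. Everything about the data is hypothetical: the column names (`visits_last_week`, `days_since_signup`, `churned`) and the six rows are invented for illustration, and a real data set would call for looking at more than the mean.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Invented sample data standing in for whatever file the interviewer hands you.
SAMPLE_CSV = """user_id,visits_last_week,days_since_signup,churned
1,0,200,1
2,5,30,0
3,1,340,1
4,7,45,0
5,3,120,0
6,0,400,1
"""


def feature_means_by_label(csv_text, feature_names, label_name):
    """Return {label: {feature: mean value}} for the given CSV text."""
    values = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(io.StringIO(csv_text)):
        for name in feature_names:
            values[row[label_name]][name].append(float(row[name]))
    return {
        label: {name: mean(vals) for name, vals in features.items()}
        for label, features in values.items()
    }


summary = feature_means_by_label(
    SAMPLE_CSV, ["visits_last_week", "days_since_signup"], "churned"
)
for label, features in sorted(summary.items()):
    print(label, features)
```

A feature whose summary looks very different between the labels is a promising candidate; one that looks identical probably won't help the model much on its own.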
I found these interviews hardest to prepare for, because the recruiter often wouldn't tell me what format the data would be in, and what exactly I'd need to do with it. (For example, do I need to review Python's csv import module? Should I look over the syntax for training a model in scikit-learn?) I also had one recruiter tell me I'd be analyzing "big data", which was a bit intimidating (am I going to be working with distributed databases or something?) until I discovered at the interview that the "big data" set had all of 11,000 examples. I encourage you to push for as much info as possible about what you'll actually be doing. If you plan to use Python, working through the scikit-learn tutorial is a good way to prepare.

Stats

I have a decent intuitive understanding of statistics, but very little formal knowledge. Most of the time, this sufficed, though I'm sure knowing more wouldn't have hurt. You should understand how to set up an A/B test, including random sampling, confounding variables, summary statistics (e.g. mean), and measuring statistical significance.

Preparation Checklist & Resources

Here is a summary list of tips for preparing for data science interviews, along with a few helpful resources.
1. Coding (usually whiteboard)
   o Get comfortable with basic algorithms, data structures and figuring out algorithm complexity.
   o Practice writing code away from the computer in your programming language of choice.
   o Resources:
     Pretty exhaustive list of what you might encounter in an interview
     Many interview prep books, e.g. Cracking the Coding Interview

2. Applied machine learning
   o Think about the machine learning problems that are relevant for each company you're interviewing at. Use these problems as practice questions.

3. Your background
   o Think through how to summarize your experience.
   o Prepare to give an in-depth technical explanation of a project you've worked on. Try it out on a friend.

4. Culture fit
   o Think about the problems each company is trying to solve, and how you and the team you'd be part of could make a difference.
   o Be prepared to answer broad questions about what kind of work you enjoy and what motivates you.

5. Machine learning theory
   o Understand machine learning concepts on an intuitive level, focusing especially on supervised learning.
   o Learn the math behind logistic regression.
   o Resources:
     The Shape of Data blog provides a nice intuitive overview.
     A Few Useful Things to Know about Machine Learning
     To really go in depth, check out Andrew Ng's Stanford machine learning course on Coursera or OpenClassroom.

6. Dataset analysis
   o Get comfortable with a set of technical tools for working with data.
   o Resources:
     If you plan to use Python, work through the scikit-learn tutorial (you could skip section 2.4).

7. Stats
   o Get familiar with how to set up an A/B test.
   o Resources:
     Quora answer about how to prepare for interview questions about A/B testing
     How not to run an A/B test
     Sample size calculator, which you can use to get some intuition about sample sizes required based on the sensitivity (i.e. minimal detectable effect) and statistical significance you're looking for
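For the "measuring statistical significance" part of an A/B test, one common textbook approach (an assumption here, since the post does not name a specific test) is a two-proportion z-test on conversion counts. The sketch below uses only the standard library; the function name and the example counts are made up for illustration.

```python
import math


def two_proportion_z_test(conversions_a, n_a, conversions_b, n_b):
    """Two-sided z-test for a difference in conversion rate between A and B.

    Returns (z, p_value). Uses the pooled standard error, which assumes the
    null hypothesis that both variants share the same underlying rate.
    Not defined when the pooled rate is exactly 0 or 1 (zero standard error).
    """
    rate_a = conversions_a / n_a
    rate_b = conversions_b / n_b
    pooled = (conversions_a + conversions_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / standard_error
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: 100 of 1000 users convert on A, 150 of 1000 on B.
z, p = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

The test only means something if users were assigned to A and B at random; otherwise a confounding variable (say, one variant being shown mostly to new users) can produce a difference that has nothing to do with the change being tested.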
The Interview Process: What a Company Wants

I have just finished a more extensive tech job search than anyone should really do. It featured eight on-sites, along with countless phone screens and informal chats. There were a few reasons why I ended up doing things this way: (a) I quit my job when my husband and I moved from Boston to San Francisco a few months ago, so I had the time; (b) I wasn't sure what I was looking for: big company vs. small, data scientist vs. software engineer on a machine learning system, etc.; (c) I wasn't sure how well it would all go. This way of doing a job search turned out to be an awesome learning experience. In this series of posts, I've tried to jot down some thoughts on what makes for a good interview process, both for the company and for the candidate. I was interviewing for a combination of data science and software engineering positions, but many observations should be more broadly applicable.
What are we trying to do here, anyway?

Before we can talk about what is a good or bad interview process, we need to understand the company's objectives. Here are some things your company might be trying to do, or perhaps should be trying to do. Note that I'm focusing on the interview stage here; there are many separate questions about finding

Hire or no hire: Decide whether to give the candidate an offer.

1. Qualification check: Figure out whether the candidate is qualified for the position they applied for. This is the most basic objective of the interview process. To check someone's qualifications, you first need to define what it means to be qualified for the position. In addition to technical skills, many companies look for a "culture fit", which can help maintain the work and social environment at the company, or change it, if that's what's needed.

2. Potential check: If the candidate isn't qualified right now, can they become excellent at this job anyway? Companies have very different philosophies on whether this is a question they care to ask. In many cases, there are good reasons to ask it. I was told a story about someone who was hired as a machine learning expert, but soon got excited about infrastructure challenges, and before long became the head of an infrastructure team. At that point, what does it matter precisely what set of skills he originally came in with, as long as he's smart and capable of learning new things?

3. Opportunity check: If the candidate isn't ideally suited to the position they applied for, are there other roles in the company where we'd love to have them? More than one place I interviewed at came back with an offer for a different role from the one I applied for (in my case, "data scientist" instead of "engineer"). They weren't advertising for that job, but they were thinking opportunistically.

Leave a good impression. There are two major components to this.

1. Be cool: Make sure the candidate comes away with a positive view of the company. Part of doing this effectively is figuring out what counts as "cool" to this particular candidate.

2. Be nice: Make sure the candidate has a positive overall experience.

Doing this well has an obvious benefit when the candidate is qualified: they'll be more likely to take the offer. But it also has some less obvious benefits that apply to all candidates:
• The candidate will be more likely to refer friends to your company. I heard about a candidate who was rejected but went on to recommend two friends who ended up joining the company.
• The candidate will be more positive when discussing your company with their friends. It's a small world.
• Even if you don't want to hire the candidate right now, you might want to hire them in a year.
• There is intrinsic merit in being nice to people as they're going through what is often a stressful experience.
Feel good doing it: Make sure the interviewers have a positive interview experience.

As someone on the other side of the fence, this one is harder for me to reason about. But here are some thoughts on why this is important:
• Your employees might be spending a lot of time interviewing (as much as 10 hours a week during the fall recruiting season), and you don't want them to be miserable doing it.
• If the interviewer is grumpy, the candidate will be less likely to think well of the company (see above). One of the companies I interviewed at requires interviewers to submit detailed written feedback, which resulted in them dedicating much of their attention to typing up my whiteboard code during the interview. More than one interviewer expressed their frustration with the process. Even if they were pretty happy with their job most of the time, it certainly didn't come across that way.

In the next post, I'll take a look at some job postings. Do you have thoughts on other goals companies should strive for? Please comment!
Get that job at Google

I've been meaning to write up some tips on interviewing at Google for a good long time now. I keep putting it off, though, because it's going to make you mad. Probably. For some statistical definition of "you", it's very likely to upset you. Why? Because... well, here, I wrote a little ditty about it:

Hey man, I don't know that stuff
Stevey's talking aboooooout
If my boss thinks it's important
I'm gonna get fiiiiiiiiiired
Oooh yeah baaaby baaaay-beeeeee....
I didn't realize this was such a typical reaction back when I first started writing about interviewing, way back at other companies. Boy-oh-howdy did I find out in a hurry. See, it goes like this:

Me: blah blah blah, I like asking question X in interviews, blah blah blah...

You: Question X? Oh man, I haven't heard about X since college! I've never needed it for my job! He asks that in interviews? But that means someone out there thinks it's important to know, and, and... I don't know it! If they detect my ignorance, not only will I be summarily fired for incompetence without so much as a thank-you, I will also be unemployable by people who ask question X! If people listen to Stevey, that will be everyone! I will become homeless and destitute! For not knowing something I've never needed before! This is horrible! I would attack X itself, except that I do not want to pick up a book and figure enough out about it to discredit it. Clearly I must yell a lot about how stupid Stevey is so that nobody will listen to him!

Me: So in conclusion, blah blah... huh? Did you say "fired"? "Destitute?" What are you talking about?

You: Aaaaaaauuugh!!! *stab* *stab* *stab*

Me: That's it. I'm never talking about interviewing again.
It doesn't matter what X is, either. It's arbitrary. I could say: "I really enjoy asking the candidate (their name) in interviews", and people would still freak out, on account of insecurity about either interviewing in general or their knowledge of their own name, hopefully the former.

But THEN, time passes, and interview candidates come and go, and we always wind up saying: "Gosh, we sure wish that obviously smart person had prepared a little better for his or her interviews. Is there any way we can help future candidates out with some tips?" And then nobody actually does anything, because we're all afraid of getting stabbed violently by People Who Don't Know X.

I considered giving out a set of tips in which I actually use variable names like X, rather than real subjects, but decided that in the resultant vacuum, everyone would get upset. Otherwise that approach seemed pretty good, as long as I published under a pseudonym. In the end, people really need the tips, regardless of how many feelings get hurt along the way. So rather than skirt around the issues, I'm going to give you a few mandatory substitutions for X along with a fair amount of general interview-prep information.

Caveats and Disclaimers
This blog is not endorsed by Google. Google doesn't know I'm publishing these tips. It's just between you and me, OK? Don't tell them I prepped you. Just go kick ass on your interviews and we'll be square.

I'm only talking about general software engineering positions, and interviews for those positions.

These tips are actually generic; there's nothing specific to Google vs. any other software company. I could have been writing these tips about my first software job 20 years ago. That implies that these tips are also timeless, at least for the span of our careers.

These tips obviously won't get you a job on their own. My hope is that by following them you will perform your very best during the interviews.

Oh, and um, why Google?
Oho! Why Google, you ask? Well let's just have that dialog right up front, shall we?

You: Should I work at Google? Is it all they say it is, and more? Will I be serenely happy there? Should I apply immediately?

Me: Yes.

You: To which ques... wait, what do you mean by "Yes?" I didn't even say who I am!

Me: Dude, the answer is Yes. (You may be a woman, but I'm still calling you Dude.)

You: But... but... I am paralyzed by inertia! And I feel a certain comfort level at my current company, or at least I have become relatively inured to the discomfort. I know people here and nobody at Google! I would have to learn Google's build system and technology and stuff! I have no credibility, no reputation there. I would have to start over virtually from scratch! I waited too long, there's no upside! I'm afraaaaaaid!

Me: DUDE. The answer is Yes already, OK? It's an invariant. Everyone else who came to Google was in the exact same position as you are, modulo a handful of famous people with beards that put Gandalf's to shame, but they're a very tiny minority. Everyone who applied had the same reasons for not applying as you do. And everyone here says: "GOSH, I SURE AM HAPPY I CAME HERE!" So just apply already. But prep first.

You: But what if I get a mistrial? I might be smart and qualified, but for some random reason I may do poorly in the interviews and not get an offer! That would be a huge blow to my ego! I would rather pass up the opportunity altogether than have a chance of failure!

Me: Yeah, that's at least partly true. Heck, I kinda didn't make it in on my first attempt, but I begged like a street dog until they gave me a second round of interviews. I caught them in a weak moment. And the second time around, I prepared, and did much better.

The thing is, Google has a well-known false negative rate, which means we sometimes turn away qualified people, because that's considered better than sometimes hiring unqualified people. This is actually an industry-wide thing, but the dial gets turned differently at different companies. At Google the false-negative rate is pretty high. I don't know what it is, but I do know a lot of smart, qualified people who've not made it through our interviews. It's a bummer.

But the really important takeaway is this: if you don't get an offer, you may still be qualified to work here. So it needn't be a blow to your ego at all!

As far as anyone I know can tell, false negatives are completely random, and are unrelated to your skills or qualifications. They can happen from a variety of factors, including but not limited to:
1. you're having an off day
2. one or more of your interviewers is having an off day
3. there were communication issues invisible to you and/or one or more of the interviewers
4. you got unlucky and got an Interview Anti-Loop

Oh no, not the Interview Anti-Loop!
Yes, I'm afraid you have to worry about this. What is it, you ask? Well, back when I was at Amazon, we did (and they undoubtedly still do) a LOT of soul-searching about this exact problem. We eventually concluded that every single employee E at Amazon has at least one "Interview Anti-Loop": a set of other employees S who would not hire E. The root cause is important for you to understand when you're going into interviews, so I'll tell you a little about what I've found over the years.

First, you can't tell interviewers what's important. Not at any company. Not unless they're specifically asking you for advice. You have a very narrow window of perhaps one year after an engineer graduates from college to inculcate them in the art of interviewing, after which the window closes and they believe they are a "good interviewer" and they don't need to change their questions, their question styles, their interviewing style, or their feedback style, ever again. It's a problem. But I've had my hand bitten enough times that I just don't try anymore.

Second problem: every "experienced" interviewer has a set of pet subjects and possibly specific questions that he or she feels is an accurate gauge of a candidate's abilities. The question sets for any two interviewers can be widely different and even entirely non-overlapping.

A classic example found everywhere is: Interviewer A always asks about C++ trivia, filesystems, network protocols and discrete math. Interviewer B always asks about Java trivia, design patterns, unit testing, web frameworks, and software project management. For any given candidate with both A and B on the interview loop, A and B are likely to give very different votes. A and B would probably not even hire each other, given a chance, but they both happened to go through interviewer C, who asked them both about data structures, unix utilities, and processes versus threads, and A and B both happened to squeak by.

That's almost always what happens when you get an offer from a tech company. You just happened to squeak by. Because of the inherently flawed nature of the interviewing process, it's highly likely that someone on the loop will be unimpressed with you, even if you are Alan Turing. Especially if you're Alan Turing, in fact, since it means you obviously don't know C++.

The bottom line is, if you go to an interview at any software company, you should plan for the contingency that you might get genuinely unlucky, and wind up with one or more people from your Interview Anti-Loop on your interview loop. If this happens, you will struggle, then be told that you were not a fit at this time, and then you will feel bad. Just as long as you don't feel meta-bad, everything is OK. You should feel good that you feel bad after this happens, because hey, it means you're human.

And then you should wait 6-12 months and re-apply. That's pretty much the best solution we (or anyone else I know of) could come up with for the false-negative problem. We wipe the slate clean and start over again. There are lots of people here who got in on their second or third attempt, and they're kicking butt. You can too.

OK, I feel better about potentially not getting hired
Good! So let's get on to those tips, then.

If you've been following along very closely, you'll have realized that I'm interviewer D. Meaning that my personal set of pet questions and topics is just my own, and it's no better or worse than anyone else's. So I can't tell you what it is, no matter how much I'd like to, because I'll offend interviewers A through X who have slightly different working sets.

Instead, I want to prep you for some general topics that I believe are shared by the majority of tech interviewers at Google-like companies. Roughly speaking, this means the company builds a lot of their own software and does a lot of distributed computing. There are other tech-company footprints, the opposite end of the spectrum being companies that outsource everything to consultants and try to use as much third-party software as possible. My tips will be useful only to the extent that the company resembles Google.

So you might as well make it Google, eh?

First, let's talk about non-technical prep.

The Warm-Up
Nobody goes into a boxing match cold. Lesson: you should bring your boxing gloves to the interview. No, wait, sorry, I mean: warm up beforehand!

How do you warm up? Basically there is short-term and long-term warming up, and you should do both.

Long-term warming up means: study and practice for a week or two before the interview. You want your mind to be in the general "mode" of problem solving on whiteboards. If you can do it on a whiteboard, every other medium (laptop, shared network document, whatever) is a cakewalk. So plan for the whiteboard.

Short-term warming up means: get lots of rest the night before, and then do intense, fast-paced warm-ups the morning of the interview.

The two best long-term warm-ups I know of are:

1) Study a data-structures and algorithms book. Why? Because it is the most likely to help you beef up on problem identification. Many interviewers are happy when you understand the broad class of question they're asking without explanation. For instance, if they ask you about coloring U.S. states in different colors, you get major bonus points if you recognize it as a graph-coloring problem, even if you don't actually remember exactly how graph-coloring works.

And if you do remember how it works, then you can probably whip through the answer pretty quickly. So your best bet, interview-prep wise, is to practice the art of recognizing that certain problem classes are best solved with certain algorithms and data structures.
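To show what "recognizing it as a graph-coloring problem" buys you, here is a small sketch: states become nodes, shared borders become edges, and a simple greedy heuristic assigns colors. This is one illustrative approach, not necessarily the answer an interviewer has in mind; greedy coloring always produces a valid coloring but does not guarantee the minimum number of colors. The adjacency list covers only seven western states, and borders with states outside that subset are deliberately left out.

```python
# A partial border map of seven western U.S. states (borders with states
# outside this subset are omitted to keep the example short).
WESTERN_STATES = {
    "WA": ["OR", "ID"],
    "OR": ["WA", "ID", "NV", "CA"],
    "CA": ["OR", "NV", "AZ"],
    "NV": ["OR", "CA", "AZ", "UT", "ID"],
    "ID": ["WA", "OR", "NV", "UT"],
    "UT": ["ID", "NV", "AZ"],
    "AZ": ["CA", "NV", "UT"],
}


def greedy_coloring(adjacency):
    """Give each node the smallest color index not used by a colored neighbor.

    Nodes are visited from highest to lowest degree (a common heuristic),
    with ties broken by name so the result is deterministic.
    Uses at most (max degree + 1) colors, which may exceed the true minimum.
    """
    colors = {}
    order = sorted(adjacency, key=lambda node: (-len(adjacency[node]), node))
    for node in order:
        used = {colors[nb] for nb in adjacency[node] if nb in colors}
        color = 0
        while color in used:
            color += 1
        colors[node] = color
    return colors


coloring = greedy_coloring(WESTERN_STATES)
for state in sorted(coloring):
    print(state, coloring[state])
```

Even if you can't recall a specific algorithm at the whiteboard, naming the model (nodes, edges, "no two neighbors share a color") gives you and the interviewer a shared vocabulary to work from.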
My absolute favorite for this kind of interview preparation is Steven Skiena's The Algorithm Design Manual. More than any other book it helped me understand just how astonishingly commonplace (and important) graph problems are; they should be part of every working programmer's toolkit. The book also covers basic data structures and sorting algorithms, which is a nice bonus. But the gold mine is the second half of the book, which is a sort of encyclopedia of 1-pagers on zillions of useful problems and various ways to solve them, without too much detail. Almost every 1-pager has a simple picture, making it easy to remember. This is a great way to learn how to identify hundreds of problem types.

Other interviewers I know recommend Introduction to Algorithms. It's a true classic and an invaluable resource, but it will probably take you more than 2 weeks to get through it. But if you want to come into your interviews prepped, then consider deferring your application until you've made your way through that book.

2) Have a friend interview you. The friend should ask you a random interview question, and you should go write it on the board. You should keep going until it is complete, no matter how tired or lazy you feel. Do this as much as you can possibly tolerate.

I didn't do these two types of preparation before my first Google interview, and I was absolutely shocked at how bad at whiteboard coding I had become since I had last interviewed seven years prior. It's hard! And I also had forgotten a bunch of algorithms and data structures that I used to know, or at least had heard of.

Going through these exercises for a week prepped me mightily for my second round of Google interviews, and I did way, way better. It made all the difference.

As for short-term preparation, all you can really do is make sure you are as alert and warmed up as possible. Don't go in cold. Solve a few problems and read through your study books. Drink some coffee: it actually helps you think faster, believe it or not. Make sure you spend at least an hour practicing immediately before you walk into the interview. Treat it like a sports game or a music recital, or heck, an exam: if you go in warmed up you'll give your best performance.

Mental Prep
So! You're a hotshot programmer with a long list of accomplishments. Time to forget about all that and focus on interview survival.

You should go in humble, open-minded, and focused. If you come across as arrogant, then people will question whether they want to work with you. The best way to appear arrogant is to question the validity of the interviewer's question; it really ticks them off, as I pointed out earlier on. Remember how I said you can't tell an interviewer how to interview? Well, that's especially true if you're a candidate. So don't ask: "gosh, are algorithms really all that important? do you ever need to do that kind of thing in real life? I've never had to do that kind of stuff." You'll just get rejected, so don't say that kind of thing. Treat every question as legitimate, even if you are frustrated that you don't know the answer.

Feel free to ask for help or hints if you're stuck. Some interviewers take points off for that, but occasionally it will get you past some hurdle and give you a good performance on what would have otherwise been a horrible stony half-hour silence.

Don't say "choo choo choo" when you're "thinking". Don't try to change the subject and answer a different question. Don't try to divert the interviewer from asking you a question by telling war stories. Don't try to bluff your interviewer. You should focus on each problem they're giving you and make your best effort to answer it fully.

Some interviewers will not ask you to write code, but they will expect you to start writing code on the whiteboard at some point during your answer. They will give you hints but won't necessarily come right out and say: "I want you to write some code on the board now." If in doubt, you should ask them if they would like to see code.

Interviewers have vastly different expectations about code. I personally don't care about syntax (unless you write something that could obviously never work in any programming language, at which point I will dive in and verify that you are not, in fact, a circus clown and that it was an honest mistake). But some interviewers are really picky about syntax, and some will even silently mark you down for missing a semicolon or a curly brace, without telling you. I think of these interviewers as, well, it's a technical term that rhymes with "bass soles", but they think of themselves as brilliant technical evaluators, and there's no way to tell them otherwise. So ask. Ask if they care about syntax, and if they do, try to get it right.

Look over your code carefully from different angles and distances. Pretend it's someone else's code and you're tasked with finding bugs in it. You'd be amazed at what you can miss when you're standing 2 feet from a whiteboard with an interviewer staring at your shoulder blades.

It's OK (and highly encouraged) to ask a few clarifying questions, and occasionally verify with the interviewer that you're on the track they want you to be on. Some interviewers will mark you down if you just jump up and start coding, even if you get the code right. They'll say you didn't think carefully first, and you're one of those "let's not do any design" type cowboys. So even if you think you know the answer to the problem, ask some questions and talk about the approach you'll take a little before diving in.

On the flip side, don't take too long before actually solving the problem, or some interviewers will give you a delay-of-game penalty. Try to move (and write) quickly, since often interviewers want to get through more than one question during the interview, and if you solve the first one too slowly then they'll be out of time. They'll mark you down because they couldn't get a full picture of your skills. The benefit of the doubt is rarely given in interviewing.

One last non-technical tip: bring your own whiteboard dry-erase markers. They sell pencil-thin ones at office supply stores, whereas most companies (including Google) tend to stock the fat kind. The thin ones turn your whiteboard from a 480i standard-definition tube into a [...]

[...] font. Your interviewer will not be impressed. Amusingly, although it always irks me when people do this, I did it during my interviews, too. Just be aware of it!

Oh, and don't let the marker dry out while you're standing there waving it. I'm tellin' ya: you want minimal distractions during the interview, and that one is surprisingly common.

OK, that should be good for non-tech tips. On to X, for some value of X! Don't stab me!

Tech Prep Tips
The best tip is: go get a computer science degree. The more computer science you have, the better. You don't have to have a CS degree, but it helps. It doesn't have to be an advanced degree, but that helps too.

However, you're probably thinking of applying to Google a little sooner than 2 to 8 years from now, so here are some shorter-term tips for you.

Algorithm Complexity: you need to know Big-O. It's a must. If you struggle with basic big-O complexity analysis, then you are almost guaranteed not to get hired. It's, like, one chapter in the beginning of one theory of computation book, so just go read it. You can do it.
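As a rough illustration of the kind of reasoning big-O analysis captures, the sketch below counts comparisons for a linear scan, which is O(n), and a binary search, which is O(log n), over the same sorted list. The function names and the step counting are assumptions made for this example; they are not from the original post.

```python
def linear_search_steps(items, target):
    """Scan left to right; return (index, comparisons). O(n) in the worst case."""
    steps = 0
    for i, value in enumerate(items):
        steps += 1
        if value == target:
            return i, steps
    return -1, steps


def binary_search_steps(sorted_items, target):
    """Halve the search range on every pass; return (index, comparisons). O(log n)."""
    lo, hi = 0, len(sorted_items) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid, steps
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, steps


# Illustrative input: one million sorted integers, searching for the last one.
data = list(range(1_000_000))
print(linear_search_steps(data, 999_999))   # needs one comparison per element
print(binary_search_steps(data, 999_999))   # needs at most about 20 passes
```

Being able to say why the second function scales so much better, without running it, is the skill the interviewer is probing.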
Sorting: know how to sort. Don't do bubble-sort. You should know the details of at least one n*log(n) sorting algorithm, preferably two (say, quicksort and merge sort). Merge sort can be highly useful in situations where quicksort is impractical, so take a look at it.

For God's sake, don't try sorting a linked list during the interview.

Hashtables: hashtables are arguably the single most important data structure known to mankind. You absolutely have to know how they work. Again, it's like one chapter in one data structures book, so just go read about them. You should be able to implement one using only arrays in your favorite language, in about the space of one interview.

Trees: you should know about trees. I'm tellin' ya: this is basic stuff, and it's embarrassing to bring it up, but some of you out there don't know basic tree construction, traversal and manipulation algorithms. You should be familiar with binary trees, n-ary trees, and trie-trees at the very very least. Trees are probably the best source of practice problems for your long-term warmup exercises.

You should be familiar with at least one flavor of balanced binary tree, whether it's a red/black tree, a splay tree or an AVL tree. You should actually know how it's implemented.

You should know about tree traversal algorithms: BFS and DFS, and know the difference between inorder, postorder and preorder.

You might not use trees much day-to-day, but if so, it's because you're avoiding tree problems. You won't need to do that anymore once you know how they work. Study up!

Graphs
Graphs are, like, really really important. More than you think. Even if you already think they're important, it's probably more than you think.

There are three basic ways to represent a graph in memory (objects and pointers, matrix, and adjacency list), and you should familiarize yourself with each representation and its pros and cons.

You should know the basic graph traversal algorithms: breadth-first search and depth-first search. You should know their computational complexity, their tradeoffs, and how to implement them in real code.
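A minimal sketch of the two traversals just mentioned, using the adjacency-list representation. The example graph and the function names are made up for illustration; with an adjacency list both traversals run in O(V + E) time.

```python
from collections import deque

# Hypothetical directed graph as an adjacency list: node -> list of neighbours.
GRAPH = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["E"],
    "E": [],
}


def bfs(graph, start):
    """Breadth-first search: visit nodes level by level using a FIFO queue."""
    order = [start]
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                order.append(neighbour)
                queue.append(neighbour)
    return order


def dfs(graph, start):
    """Depth-first search: follow each path as deep as possible, then backtrack."""
    order = []
    seen = set()

    def visit(node):
        seen.add(node)
        order.append(node)
        for neighbour in graph[node]:
            if neighbour not in seen:
                visit(neighbour)

    visit(start)
    return order


print(bfs(GRAPH, "A"))  # ['A', 'B', 'C', 'D', 'E']
print(dfs(GRAPH, "A"))  # ['A', 'B', 'D', 'E', 'C']
```

The recursive DFS is the easiest version to write on a whiteboard; on very deep graphs an explicit stack avoids hitting Python's recursion limit.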
You should try to study up on fancier algorithms, such as Dijkstra and A*, if you get a chance. They're really great for just about anything, from game programming to distributed computing to you name it. You should know them.

Whenever someone gives you a problem, think graphs. They are the most fundamental and flexible way of representing any kind of a relationship, so it's about a 50-50 shot that any interesting design problem has a graph involved in it. Make absolutely sure you can't think of a way to solve it using graphs before moving on to other solution types. This tip is important!

Other data structures
You should study up on as many other data structures and algorithms as you can fit in that big noggin of yours. You should especially know about the most famous classes of NP-complete problems, such as traveling salesman and the knapsack problem, and be able to recognize them when an interviewer asks you them in disguise. You should find out what NP-complete means.

Basically, hit that data structures book hard, and try to retain as much of it as you can, and you can't go wrong.

Math
Some interviewers ask basic discrete math questions. This is more prevalent at Google than at other places I've been, and I consider it a Good Thing, even though I'm not particularly good at discrete math. We're surrounded by counting problems, probability problems, and other Discrete Math 101 situations, and those innumerate among us blithely hack around them without knowing what we're doing.

Don't get mad if the interviewer asks math questions. Do your best. Your best will be a heck of a lot better if you spend some time before the interview refreshing your memory on (or teaching yourself) the essentials of combinatorics and probability. You should be familiar with n-choose-k problems and their ilk; the more the better.

I know, I know, you're short on time. But this tip can really help make the difference between a "we're not sure" and a "let's hire her". And it's actually not all that bad: discrete math doesn't use much of the high-school math you studied and forgot. It starts back with elementary-school math and builds up from there, so you can probably pick up what you need for interviews in a couple of days of intense study.

Sadly, I don't have a good recommendation for a Discrete Math book, so if you do, please mention it in the comments. Thanks.

Operating Systems
This is just a plug, from me, for you to know about processes, threads and concurrency issues. A lot of interviewers ask about that stuff, and it's pretty fundamental, so you should know it. Know about locks and mutexes and semaphores and monitors and how they work. Know about deadlock and livelock and how to avoid them. Know what resources a process needs, and a thread needs, and how context switching works, and how it's initiated by the operating system and underlying hardware. Know a little about scheduling. The world is rapidly moving towards multi-core, and you'll be a dinosaur in a real hurry if you don't understand the fundamentals of "modern" (which is to say, "kinda broken") concurrency constructs.

The best, most practical book I've ever personally read on the subject is Doug Lea's Concurrent Programming in Java. It got me the most bang per page. There are obviously lots of other books on concurrency. I'd avoid the academic ones and focus on the practical stuff, since it's most likely to get asked in interviews.

Coding
You should know at least one programming language really well, and it should preferably be C++ or Java. C# is OK too, since it's pretty similar to Java. You will be expected to write some code in at least some of your interviews. You will be expected to know a fair amount of detail about your favorite programming language.

Other Stuff
Because of the rules I outlined above, it's still possible that you'll get Interviewer A, and none of the stuff you've studied from these tips will be directly useful (except being warmed up). If so, just do your best. Worst case, you can always come back in 12 months, right? Might seem like a long time, but I assure you it will go by in a flash.

The stuff I've covered is actually mostly red-flags stuff: stuff that really worries people if you don't know it. The discrete math is potentially optional, but somewhat risky if you don't know the first thing about it. Everything else I've mentioned you should know cold, and then you'll at least be prepped for the baseline interview level. It could be a lot harder than that, depending on the interviewer, or it could be easy.

It just depends on how lucky you are. Are you feeling lucky? Then give it a try!
Send me your resume

I'll probably batch up any resume submissions people send me and submit them weekly.

In the meantime, study up! You have a lot of warming up to do. Real-world work makes you rusty.

I hope this was helpful. Let the flames begin, etc. Yawn.

5:15 AM, July 19, 2011
Top Data Science Interview Questions - Most Asked

Here are the top 50 objective-type sample Data Science interview questions, with the answers given just below them. These sample questions are framed by experts from Intellipaat, who train for Data Science, to give you an idea of the type of questions which may be asked in an interview. We have taken full care to give correct answers for all the questions. Do comment your thoughts. Happy Job Hunting!
Top Answers to Data Science Interview Questions

1. What do you mean by the word Data Science?
Data Science is the extraction of knowledge from large volumes of data that are structured or unstructured. It is a continuation of the fields of data mining and predictive analytics, and is also known as knowledge discovery and data mining.

2. Explain the term botnet?
A botnet is a network of compromised machines (bots) controlled by an attacker, classically over an IRC network, and typically built by infecting the machines with a Trojan.

3. What is Data Visualization?
Data visualization is a common term that describes any effort to help people understand the significance of data by placing it in a visual context.

4. How can you define Data cleaning as a critical part of the process?
Cleaning up data to the point where you can work with it is a huge amount of work. If we're trying to reconcile a lot of sources of data that we don't control, like in this flight, it can take 80% of our time.
5. Point out 7 ways how Data Scientists use Statistics?
1. Design and interpret experiments to inform product decisions.
2. Build models that predict signal, not noise.
3. Turn big data into the big picture.
4. Understand user retention, engagement, conversion, and leads.
5. Give your users what they want.
6. Estimate intelligently.
7. Tell the story with the data.

6. Differentiate between Data modeling and Database design?
Data Modeling: Data modeling (or modeling) in software engineering is the process of creating a data model for an information system by applying formal data modeling techniques.
Database Design: Database design is the process of producing a detailed data model of a database. The term database design can be used to describe many different parts of the design of an overall database system.

7. Describe in brief the Data Science Process flowchart?
1. Data is collected from sensors in the environment.
2. Data is "cleaned" or otherwise processed to produce a data set (typically a data table) usable for processing.
3. Exploratory data analysis and statistical modeling may be performed.
4. A data product is a program, such as the ones retailers use to inform new purchases based on purchase history. It may also create data and feed it back into the environment.

8. What do you understand by the term hash table collisions?
A hash table (hash map) is a kind of data structure used to implement an associative array, a structure that can map keys to values. Ideally, the hash function will assign each key to a unique bucket, but sometimes two keys will generate an identical hash, causing both keys to point to the same bucket. This is known as a hash collision.

9. Compare and contrast R and SAS?
SAS is commercial software whereas R is free and open source and can be downloaded by anyone. SAS is easy to learn and provides an easy option for people who already know SQL, whereas R is a programming language, so simple procedures can take longer code.

10. What do you understand by the letter 'R'?
R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell.

11. What all things does the R environment include?
1. A suite of operators for calculations on arrays, in particular matrices,
2. An effective data handling and storage facility,
3. A large, coherent, integrated collection of intermediate tools for data analysis,
4. Graphical facilities for data analysis and display either on-screen or on hardcopy, and
5. A well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.

12. What are the applied Machine Learning Process Steps?
1. Problem Definition: Understand and clearly describe the problem that is being solved.
2. Analyze Data: Understand the information available that will be used to develop a model.
3. Prepare Data: Define and expose the structure in the dataset.
4. Evaluate Algorithms: Develop a robust test harness and baseline accuracy from which to improve, and spot check algorithms.
5. Improve Results: Improve results to develop more accurate models.
6. Present Results: Detail the problem and solution so that it can be understood by third parties.

13. Compare Multivariate, Univariate and Bivariate analysis?
MULTIVARIATE: Multivariate analysis focuses on the results of observations of many different variables for a number of objects.
UNIVARIATE: Univariate analysis is perhaps the simplest form of statistical analysis. Like other forms of statistics, it can be inferential or descriptive. The key fact is that only one variable is involved.
BIVARIATE: Bivariate analysis is one of the simplest forms of quantitative (statistical) analysis. It involves the analysis of two variables (often denoted as X, Y), for the purpose of determining the empirical relationship between them.

14. What is Hypothesis in Machine Learning?
The hypothesis space used by a machine learning system is the set of all hypotheses that might possibly be returned by it. It is typically defined by a hypothesis language, possibly in conjunction with a language bias.
15. Differentiate between Uniform and Skewed Distribution?
UNIFORM DISTRIBUTION: A uniform distribution, sometimes also known as a rectangular distribution, is a distribution that has constant probability.
SKEWED DISTRIBUTION: In probability theory and statistics, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive or negative, or even undefined. The qualitative interpretation of the skew is complicated.

16. What do you understand by the term Transformation in Data Acquisition?
The transformation process allows you to consolidate, cleanse, and integrate data. We can semantically arrange the data from heterogeneous sources.

17. What do you understand by the term Normal Distribution?
It is a function which shows the distribution of many random variables as a symmetrical bell-shaped graph.

18. What is Data Acquisition?
It is the process of measuring an electrical or physical phenomenon such as voltage, current, temperature, pressure, or sound with a computer. A DAQ system comprises sensors, DAQ measurement hardware, and a computer with programmable software.

19. What is Data Collection?
Data collection is the process of collecting and measuring information on variables of interest, in a proper systematic fashion that enables one to answer stated research questions, test hypotheses, and evaluate outcomes.

20. What do you understand by the term Use case?
A use case is a methodology used in system analysis to identify, clarify, and organize system requirements. The use case consists of a set of possible sequences of interactions between systems and users in a particular environment and related to a particular defined goal.

21. What is Sampling and Sampling Distribution?
SAMPLING: Sampling is the process of choosing units (e.g. people, organizations) from a population of interest so that by studying the sample we can fairly generalize our results back to the population from which they were chosen.
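As a small illustration of sampling, the sketch below draws a simple random sample from a made-up population and compares the sample mean with the population mean. The population, the sample size and the seed are all assumptions chosen for the example.

```python
import random
import statistics

# Hypothetical population: 10,000 made-up "ages", generated only for illustration.
rng = random.Random(42)  # fixed seed so repeated runs give the same numbers
population = [rng.randint(18, 90) for _ in range(10_000)]

# Simple random sample without replacement.
sample = rng.sample(population, k=500)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

With a sample this size the two means usually land close together, which is the sense in which results from the sample generalize back to the population.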
SAMPLING DISTRIBUTION: The sampling distribution of a statistic is the distribution of that statistic, considered as a random variable, when derived from a random sample of size n. It may be considered as the distribution of the statistic for all possible samples of a given size from the same population.

22. What is Linear Regression?
In statistics, linear regression is a way of modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted by X. The case of one explanatory variable is known as simple linear regression.

23. Differentiate between Extrapolation and Interpolation?
Extrapolation is an approximation of a value based on extending a known sequence of values or facts beyond the area that is certainly known. Interpolation is an estimation of a value between two known values in a list of values.

24. How is expected value different from Mean value?
There is no difference. These are two names for the same thing. They are mostly used in different contexts, though: we talk about the expected value of a random variable and the mean of a sample, population or probability distribution.

25. Differentiate between Systematic and Cluster Sampling?
SYSTEMATIC SAMPLING: Systematic sampling is a statistical methodology involving the selection of elements from an ordered sampling frame. The most common form of systematic sampling is an equal-probability method.
CLUSTER SAMPLING: A cluster sample is a probability sample in which each sampling unit is a collection, or cluster, of elements.

26. What are the advantages of Systematic Sampling?
1. Easier to perform in the field, especially if a proper frame is not available.
2. Regularly provides more information per unit cost than simple random sampling, in the sense of smaller variances.

27. What do you understand by the term Threshold limit value?
The threshold limit value (TLV) of a chemical substance is a level to which it is believed that a worker can be exposed day after day for a working lifetime without affecting his/her health.

28. Differentiate between Validation Set and Test set?
Validation set: A set of examples used to tune the parameters (i.e., architecture, not weights) of a classifier, for example to choose the number of hidden units in a neural network.
Test set: A set of examples used only to assess the performance (generalization) of a fully specified classifier.

29. How can R and Hadoop be used together?
The most common way to link R and Hadoop is to use HDFS (potentially managed by Hive or HBase) as the long-term store for all data, and use MapReduce jobs (potentially submitted from Hive, Pig, or Oozie) to encode, enrich, and sample data sets from HDFS into R. Data analysts can then perform complex modeling exercises on a subset of prepared data in R.

30. What do you understand by the term RIMPALA?
The RImpala package contains the R functions required to connect, execute queries and retrieve back results from Impala. It uses the rJava package to create a JDBC connection to any of the Impala servers running on a Hadoop cluster.

31. What is Collaborative Filtering?
Collaborative filtering (CF) is a method used by some recommender systems. The term has two senses, a narrow one and a more general one. In general, collaborative filtering is the process of filtering for information or patterns using techniques involving collaboration among multiple agents, viewpoints, and data sources.

32. What are the challenges of Collaborative Filtering?
1. Scalability
2. Data sparsity
3. Synonyms
4. Grey sheep
5. Shilling attacks
6. Diversity and the Long Tail

33. What do you understand by Big data?
Big data is a buzzword, or catch-phrase, which describes a massive volume of both structured and unstructured data that is so large that it is difficult to process using traditional database and software techniques.

34. What do you understand by Matrix factorization?
Matrix factorization is simply a mathematical tool for playing around with matrices, and is therefore applicable in many scenarios where one would like to find out something hidden under the data.

35. What do you understand by the term Singular Value Decomposition?
In linear algebra, the singular value decomposition (SVD) is a factorization of a real or complex matrix. It has many useful applications in signal processing and statistics.

36. What do you mean by Recommender systems?
Recommender systems or recommendation systems (sometimes replacing "system" with a synonym such as platform or engine) are a subclass of information filtering system that seek to predict the 'rating' or 'preference' that a user would give to an item.

37. What are the applications of Recommender Systems?
Recommender systems have become extremely common in recent years, and are applied in a variety of applications. The most popular ones are probably movies, music, news, books, research articles, search queries, social tags, and products in general.

38. What are the two ways of Recommender System?
Recommender systems typically produce a list of recommendations in one of two ways: through collaborative or content-based filtering. Collaborative filtering approaches build a model from a user's past behavior (items previously purchased or selected and/or numerical ratings given to those items) as well as similar decisions made by other users. This model is then used to predict items (or ratings for items) that the user may have an interest in. Content-based filtering approaches utilize a series of discrete characteristics of an item in order to recommend additional items with similar properties.

39. What are the factors to find the most accurate recommendation algorithms?
1. Diversity
2. Recommender Persistence
3. Privacy
4. User Demographics
5. Robustness
6. Serendipity
7. Trust
8. Labeling
40. What is K-Nearest Neighbor?
k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm is among the simplest of all machine learning algorithms.

41. What is Horizontal Slicing?
In horizontal slicing, projects are broken up roughly along architectural lines. That is, there would be one team for UI, one team for business logic and services (SOA), and another team for data.

42. What are the advantages of vertical slicing?
The advantage of slicing vertically is that you are more efficient. You don't have the overhead and effort that comes from trying to coordinate activities across multiple teams. No need to negotiate for resources. You're all on the same team.

43. What is null hypothesis?
In inferential statistics the null hypothesis usually refers to a general statement or default position that there is no relationship between two measured phenomena, or no difference among groups.

44. What is Statistical hypothesis?
In statistical hypothesis testing, the alternative hypothesis (or maintained hypothesis or research hypothesis) and the null hypothesis are the two rival hypotheses which are compared by a statistical hypothesis test.

45. What is performance measure?
Performance measurement is the method of collecting, analyzing and/or reporting information regarding the performance of an individual, group, organization, system or component.

46. What is the use of the tree command?
This command is used to list contents of directories in a tree-like format.

47. What is the use of the uniq command?
This command is used to report or omit repeated lines.

48. Which command is used to translate or delete characters?
The tr command is used to translate or delete characters.
49. What is the use of the tapkee command?
This command is used to reduce the dimensionality of a data set using various algorithms.

50. Which command is used to sort the lines of text files?
The sort command is used to sort the lines of text files.

100 Data Science in Python Interview Questions and Answers for 2016

3 Dec 2015

Python's growing adoption in data science has pitched it as a competitor to the R programming language. With its various libraries maturing over time to suit all data science needs, a lot of people are shifting towards Python from R. This might seem like the logical scenario. But R would still come out as the popular choice for data scientists. People are shifting towards Python, but not so many as to disregard R altogether. We have highlighted the pros and cons of both these languages used in Data Science in our Python vs R article. It can be seen that many data scientists learn both Python and R to counter the limitations of either language. Being prepared with both languages will help in data science job interviews.
Python is the "friendly" programming language that plays well with everyone and runs on everything. So it is hardly surprising that Python offers quite a few libraries that deal with data efficiently and is therefore used in data science. Python was used for data science only in recent years. But now that it has firmly established itself as an important language for Data Science, Python programming is not going anywhere. Mostly Python is used for data analysis when you need to integrate the results of data analysis into web apps or if you need to add mathematical [...]
In our previous posts, 100 Data Science Interview Questions and Answers (General) and 100 Data Science in R Interview Questions and Answers, we listed all the questions that can be asked in data science job interviews. This article in the series lists questions which are related to Python programming and will probably be asked in data science interviews.
Data Science Python Interview Questions and Answers

The questions below are based on the course that is taught at DeZyre - Data Science in Python. This is not a guarantee that these questions will be asked in Data Science interviews. The purpose of these questions is to make the reader aware of the kind of knowledge that an applicant for a Data Scientist position needs to possess.

Data Science interview questions in Python are generally scenario-based or problem-based questions where candidates are provided with a data set and asked to do data munging, data exploration, data visualization, modelling, machine learning, etc. Most of the data science interview questions are subjective and the answers to these questions vary, based on the given data problem. The main aim of the interviewer is to see how you code, what visualizations you can draw from the data, the conclusions you can make from the data set, etc.

1) How can you build a simple logistic regression model in Python?

2) How can you train and interpret a linear regression model in SciKit learn?

3) Name a few libraries in Python used for Data Analysis and Scientific computations.
NumPy, SciPy, Pandas, SciKit, Matplotlib, Seaborn

4) Which library would you prefer for plotting in Python language: Seaborn or Matplotlib?

Matplotlib is the Python library used for plotting, but it needs a lot of fine-tuning to ensure that the plots look shiny. Seaborn helps data scientists create statistically and aesthetically appealing meaningful plots. The answer to this question varies based on the requirements for plotting data.

5) What is the main difference between a Pandas series and a single-column DataFrame in Python?

6) Write code to sort a DataFrame in Python in descending order.

7) How can you handle duplicate values in a dataset for a variable in Python?

8) Which Random Forest parameters can be tuned to enhance the predictive power of the model?

9) Which method in pandas.tools.plotting is used to create scatter plot matrix?
scatter_matrix (in current pandas releases it lives in pandas.plotting rather than pandas.tools.plotting)

10) How can you check if a data set or time series is Random?
To check whether a dataset is random or not, use the lag plot. If the lag plot for the given dataset does not show any structure then it is random.

11) Can we create a DataFrame with multiple data types in Python? If yes, how can you do it?

12) Is it possible to plot histogram in Pandas without calling Matplotlib? If yes, then write the code to plot the histogram?

13) What are the possible ways to load an array from a text data file in Python? How can the efficiency of the code to load data file be improved?
numpy.loadtxt &' 1#) $hich is the standard data missing marer used in Pandas?
6a6 1') $hy you should use NumPy arrays instead of nested Python lists? 1*) $hat is the preferred method to chec for an empty array in NumPy? 1+) 6ist down some e,aluation metrics for regression problems"
1-) $hich Python library would you prefer to use for ata &unging?
*andas 1/) $rite the code to sort an array in NumPy by the nth column?
Gsing argsort &' function this can "e achieved. !f there is an array H and you would like to sort the nth column then code for this will "e xIx I# n7:J.argsort &'J 2) How are NumPy and SciPy related? 21) $hich python library is built on top of matplotlib and Pandas to ease data plotting?
4ea"orn 22) $hich plot will you use to access the uncertainty of a statistic?
-ootstrap 23) $hat are some features of Pandas that you lie or dislie? 2#) $hich scientific libraries in SciPy ha,e you wored with in your pro7ect? 2') $hat is pylab?
A package that com"ines 6um*y, 4ci*y and $atplotli" into a single namespace. 2*) $hich python library is used for &achine 6earning?
4ciit7Kearn Kearn Data 4cience in *ython to "ecome an nterprise Data 4cientist
Basic Python Programming Interview Questions

27) How can you copy objects in Python?
The functions used to copy objects in Python are:
1) copy.copy() for shallow copy
2) copy.deepcopy() for deep copy
However, it is not possible to copy all objects in Python using these functions. Some built-in types also offer their own shortcuts: dictionaries have a separate copy() method, and sequences can be copied by slicing.

28) What is the difference between tuples and lists in Python?
Lists are mutable whereas tuples are immutable: they cannot be changed. Because of this, tuples can be hashed (as long as their elements are hashable) and used as keys for dictionaries. Tuples should be used when the order of elements in a sequence matters, for example a set of actions that need to be executed in sequence, geographic locations, or a list of points on a specific route.

29) What is PEP8?
PEP8 consists of coding guidelines for the Python language so that programmers can write readable code that is easy for any other person to use later on.

30) Is all the memory freed when Python exits?
No, it is not, because the objects that are referenced from global namespaces of Python modules are not always de-allocated when Python exits.

31) What does __init__.py do?
__init__.py is a (usually empty) .py file that marks a directory as a package so that the modules in it can be imported. __init__.py provides an easy way to organize the files. If there is a module maindir

In Python 2, range() returns a list whereas xrange() returns an object that acts like an iterator for generating numbers on demand.

33) How can you randomize the items of a list in place in Python?
random.shuffle(lst) can be used for randomizing the items of a list in Python.

34) What is a pass in Python?
pass in Python signifies a no-operation statement, indicating that nothing is to be done.
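The last two answers can be sketched in a few lines. This is a minimal illustration written for Python 3 (the snippets in this article use Python 2 syntax); the names `NotImplementedYet` and `shuffled_copy` are hypothetical and chosen only for the example.

```python
import random


class NotImplementedYet:
    # `pass` is a no-op: it lets an otherwise empty class or function body be valid syntax.
    pass


def shuffled_copy(items, seed=None):
    """Return a shuffled copy of `items`; the original sequence is left untouched."""
    rng = random.Random(seed)   # private generator, so a fixed seed gives repeatable output
    result = list(items)
    rng.shuffle(result)         # shuffle() reorders the list in place and returns None
    return result


lst = [1, 2, 3, 4, 5]
random.shuffle(lst)             # in-place shuffle of lst itself
print(sorted(lst))              # same elements, whatever the new order is
print(shuffled_copy([1, 2, 3, 4, 5], seed=0))
```

Because shuffle() works in place and returns None, writing `lst = random.shuffle(lst)` is a common mistake; copying first, as in `shuffled_copy`, is one way to keep the original order available.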
35) If you are given the first and last names of employees, which data type in Python will you use to store them?
You can use a list that has first name and last name included in an element, or use a Dictionary.

36) What happens when you execute the statement mango
A name error will occur when this statement is executed in Python.

37) Write a sorting algorithm for a numerical dataset in Python.

38) Optimize the below python code:
    word = 'word'
    print word.__len__()
Answer:
    print 'word'.__len__()

39) What is monkey patching in Python?
Monkey patching is a technique that helps the programmer to modify or extend other code at runtime. Monkey patching comes in handy in testing, but it is not a good practice to use it in a production environment, as debugging the code could become difficult.

40) Which tool in Python will you use to find bugs if any?
Pylint and Pychecker. Pylint verifies whether a module satisfies all the coding standards or not. Pychecker is a static analysis tool that helps find bugs in the source code.

41) How are arguments passed in Python: by reference or by value?
The answer to this question is neither of these, because passing semantics in Python are completely different. In all cases, Python passes arguments by value, where all values are references to objects.

42) You are given a list of N numbers. Create a single list comprehension in Python to create a new list that contains only those values which have even numbers from elements of the list at even indices. For instance, if list[4] has an even value then it has to be included in the new output list because it has an even index, but if list[5] has an even value it should not be included in the list because it is not at an even index.
    [x for x in list[::2] if x%2 == 0]
The above code will take all the numbers present at even indices and then discard the odd numbers.

43) Explain the usage of decorators.
Decorators in Python are used to modify or inject code in functions or classes. Using decorators, you can wrap a class or function method call so that a piece of code can be executed before or after the execution of the original code. Decorators can be used to check for permissions, modify or track the arguments passed to a method, log the calls to a specific method, etc.

44) How can you check whether a pandas data frame is empty or not?
The attribute df.empty is used to check whether a data frame is empty or not.

45) What will be the output of the below Python code?
    def multipliers():
        return [lambda x: i * x for i in range(4)]
    print [m(2) for m in multipliers()]
The output for the above code will be [6, 6, 6, 6]. The reason for this is that, because of late binding, the value of the variable i is looked up when any of the functions returned by multipliers are called.

46) What do you mean by list comprehension?
The process of creating a list while performing some operation on the data so that it can be accessed using an iterator is referred to as List Comprehension.
Example:
    [ord(j) for j in string.ascii_uppercase]
    [65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90]

47) What will be the output of the below code?
    word = 'aeioubcdfg'
    print word[:3] + word[3:]
The output for the above code will be: 'aeioubcdfg'. In string slicing, when the indices of both the slices collide and a "+" operator is applied on the string, it concatenates them.

48) What will be the output of the below code?
    list = ['a', 'e', 'i', 'o', 'u']
    print list[8:]
The output for the above code will be an empty list []. Most people might confuse the answer with an index error, because the code appears to access a member of the list whose index exceeds the total number of members in the list. The reason is that the code is trying to access a slice of the list at a starting index which is greater than the number of members in the list.

49) What will be the output of the below code?
    def foo(i=[]):
        i.append(1)
        return i
    >>> foo()
    >>> foo()
The output for the above code will be:
    [1]
    [1, 1]
The argument to the function foo is evaluated only once, when the function is defined. However, since it is a list, on every call the list is modified by appending a 1 to it.

50) Can the lambda forms in Python contain statements?
No, as their syntax is restricted to single expressions, and they are used for creating function objects which are returned at runtime.

This list of Python interview questions and answers is not an exhaustive one and will continue to be a work in progress.
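The late-binding answer (question 45) and the default-argument answer (question 49) are easier to trust once run. The sketch below is written for Python 3 and shows each pitfall next to one common workaround; the function names are hypothetical and are not taken from the original questions.

```python
def multipliers_late():
    # Each lambda looks up `i` when it is called, after the loop has finished (i == 3).
    return [lambda x: i * x for i in range(4)]


def multipliers_bound():
    # A default argument is evaluated at definition time, so each lambda keeps its own i.
    return [lambda x, i=i: i * x for i in range(4)]


def foo_shared(i=[]):
    # The default list is created once, when `def` runs, and is shared across calls.
    i.append(1)
    return i


def foo_fresh(i=None):
    # Common workaround: use None as a sentinel and build a new list on every call.
    if i is None:
        i = []
    i.append(1)
    return i


print([m(2) for m in multipliers_late()])    # [6, 6, 6, 6]
print([m(2) for m in multipliers_bound()])   # [0, 2, 4, 6]
first = foo_shared()
second = foo_shared()
print(first is second)                       # True: both calls returned the same list
print(foo_fresh(), foo_fresh())              # [1] [1]
```

Binding `i` through a default argument is only one option; `functools.partial` or a small factory function would achieve the same early binding.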
Python Developer interview questions

This Python Developer interview profile brings together a snapshot of what to look for in candidates with a balanced sample of suitable interview questions.
• Introduction
• Computing Science Questions
• Role Specific Questions

In some respects even the most technical role demands qualities common to strong candidates for all positions: the willingness to learn; qualified skills; passion for the job. Even college performance, while it helps you to assess formal education, doesn't give a complete picture. This is not to underplay the importance of a solid background in computer science. Some things to look for:
• Understanding of basic algorithmic concepts
• Discuss basic algorithms, how would they find
Computing Science Questions
• Using pseudocode, reverse a String iteratively and recursively
• What constitutes a good unit test and what a functional one?

Role Specific Questions
• Do arguments in Python get passed by reference or by value?
• Why are functions considered first class objects in Python?
• What tools do you use for linting, debugging and profiling?
• Give an example of filter and reduce over an iterable object
• Implement the linux whereis command that locates the binary, source, and manual page files for a command.
• What are list and dict comprehensions?
• What do we mean when we say that a certain Lambda expression forms a closure?
• What is the difference between list and tuple?
• What will be the output of the following code?
    list = ['a', 'b', 'c', 'd', 'e']
    print list[10:]
• What will be the output of the following code in each step?
    class C:
        dangerous = 2

    c1 = C()
    c2 = C()
    print c1.dangerous

    c1.dangerous = 3
    print c1.dangerous
    print c2.dangerous

    del c1.dangerous
    print c1.dangerous

    C.dangerous = 3
    print c2.dangerous
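For the last question, a Python 3 rendering of the listing with the result of each step as a comment may help. The values 2 and 3 are as read from the source listing; the comments describe standard attribute lookup (instance first, then class), not anything specific to the original article.

```python
class C:
    dangerous = 2          # class attribute, shared by all instances


c1 = C()
c2 = C()
print(c1.dangerous)        # 2: not found on the instance, so the class value is used

c1.dangerous = 3           # creates an instance attribute on c1 only
print(c1.dangerous)        # 3: the instance attribute shadows the class attribute
print(c2.dangerous)        # 2: c2 still reads the class attribute

del c1.dangerous           # removes the instance attribute again
print(c1.dangerous)        # 2: lookup falls back to the class

C.dangerous = 3            # rebinding on the class is seen by every instance without its own copy
print(c2.dangerous)        # 3
```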
Top Python Interview Questions - Most Asked

Here are the top 30 objective-type sample Python interview questions, with their answers given just below them. These sample questions are framed by experts from Intellipaat, who train for Python, to give you an idea of the type of questions which may be asked in an interview. We have taken full care to give correct answers for all the questions. Happy Job Hunting!
Top Answers to Python Interview Questions

1. What is Python?
Python is an object-oriented and open-source programming language, which supports structured and functional built-in data structures. With a placid and easy-to-understand syntax, Python allows code reuse and modularity of programs. The built-in DS in Python makes it a wonderful option for Rapid Application Development (RAD). The coding language also encourages faster editing, testing and debugging with no compilation steps.

2. What are the standard data types supported by Python?
It supports six data types:
    1. Number: object stored as numeric value
    2. String: object stored as string
    3. Tuple: data stored in the form of sequence of immutable objects
    4. Dictionary (dicts): associates one thing to another irrespective of the type of data, most useful container (called hashes in C and Java)
    5. List: data stored in the form of a list sequence
    6. Set (frozenset): unordered collection of distinct objects

3. Explain built-in sequence types in Python Programming?
It provides two kinds of built-in types:
    1. Mutable type: objects whose value can be changed after creation, for example sets, items in the list, dictionary
    2. Immutable type: objects whose value cannot be changed once created, for example number, Boolean, tuple, string

4. Explain the use of iterator in Python?
Python coding uses Iterator to implement the iterator protocol, which enables traversing through containers and groups of elements like list. The two important methods include __iter__(), returning the iterator object, and the next() method for traversal.

5. Define Python slicing?
The process of extracting a range of elements from lists, arrays, tuples and custom Python data structures as well. It works on a general start and stop method: slice(start, stop, increment)

6. How can you compare two lists in Python?
In Python 2 we can simply perform it using the built-in compare function, cmp(intellipaatlist1, intellipaatlist2), which returns 0 when the lists are equal. Python 3 removed cmp(); use intellipaatlist1 == intellipaatlist2 instead. Note that the helper given in the source, shown below, shadows the built-in and only reports whether the two lists share at least one element:
    def cmp(intellipaatlist1, intellipaatlist2):
        for val in intellipaatlist1:
            if val in intellipaatlist2:
                return True
        return False

7. What is the use of // operator?
// is a Floor Division operator, which divides two operands with the result as quotient showing only digits before the decimal point. For instance, 6//3 = 2 and 6.0//3.0 = 2.0

8. Define docstring in Python with example.
A string literal occurring as the first statement (like a comment) in any module, class, function or method is referred to as a docstring in Python. This kind of string becomes the __doc__ special attribute of the object and provides an easy way to document a particular code segment. Most modules do contain docstrings and thus, the functions and classes extracted from the module also consist of docstrings.
9. What function randomizes the items of a list in place?
Using the shuffle() function. For instance:
    import random
    lst = [2, 18, 8, 4]
    random.shuffle(lst)
    print "Shuffled list:", lst
    random.shuffle(lst)
    print "Reshuffled list:", lst

10. List five benefits of using Python?
    1. Having the built-in data types, Python saves programmer's time and effort from declaring variables. It has a powerful dictionary and polymorphic list for automatic declaration. It also ensures better code reusability.
    2. Highly accessible and easy to learn for beginners, and a strong 'glue' for advanced professionals, consisting of several high-level modules and operations not performed by other programming languages.
    3. Allows easy readability due to use of square brackets for most functions and indexes.
    4. Python requires no explicit memory management, as the interpreter itself allocates the memory to new variables and frees them automatically.
    5. Python comprises a huge standard library for most Internet platforms like Email, HTML, FTP and other WWW platforms.

11. What are the disadvantages of using Python?
    1. Python is slow as compared to other programming languages. Although this slow pace doesn't matter much, at times we need another language to handle performance-critical situations.
    2. It is ineffective on mobile platforms; fewer mobile applications are developed using Python. The main reason behind its instability on smartphones is Python's weakest security. There are no good secure cases available for Python until now.
    3. Due to dynamic typing, programmers face design restrictions while using the language. The code needs more and more testing before putting it into action, since the errors pop up only during runtime.
    4. Unlike JavaScript, Python's features like concurrency and parallelism are not developed for elegant use.

12. Explain the use of split function?
The split() function in Python breaks a string into shorter strings using the defined separator. It returns a list of all words present in the string.
    >>> y = 'true,false,none'
    >>> y.split(',')
    Result: ['true', 'false', 'none']

What is the use of generators in Python?
Generators are primarily used to return multiple items, but one after the other. They are used for iteration in Python and for calculating large result sets. The generator function halts until the next request is placed. One of the best uses of generators in Python coding is implementing a callback operation with reduced effort and time: they replace the callback with iteration. Through the generator approach, programmers are saved from writing a separate callback function and passing it to the work function, as they can apply a 'for' loop around the generator.

13. How to create a multidimensional list in Python?
As the name suggests, a multidimensional list is the concept of a list holding another list, applying to many such lists. It can be easily done by creating a single-dimensional list and filling each element with a newly created list.

14. What is lambda?
lambda is a powerful concept used in conjunction with other functions like filter(), map(), reduce(). The major use of the lambda construct is to create anonymous functions during runtime, which can be used where they are created. Such functions are actually known as throw-away functions in Python. The general syntax is lambda argument_list: expression. For instance:
    >>> intellipaat1 = lambda i, n: i + n
    >>> intellipaat1(2, 2)
    4
Using filter():
    >>> intellipaat = [1, 6, 11, 21, 29, 18, 24]
    >>> print filter(lambda x: x % 3 == 0, intellipaat)
    [6, 21, 18, 24]

15. Define pass in Python?
The pass statement in Python is equivalent to a null operation and a placeholder, wherein nothing takes place after its execution. It is mostly used at places where you can let your code go even if it isn't written yet. If you would set out a pass after the code, it won't run. The syntax is pass

16. How to perform Unit Testing in Python?
Referred to as PyUnit, the Python unit testing framework unittest supports automated testing, segregating tests into collections, shutdown of testing code, and testing independence from the reporting framework. The unittest module makes use of the TestCase class for holding and preparing test routines and clearing them after the successful execution.

17. Define Python tools for finding bugs and performing static analysis?
PyChecker is an excellent bug finder tool in Python, which performs static analysis unlike C/C++ and Java. It also notifies the programmers about the complexity and style of the code. In addition, there is another tool, PyLint, for checking the coding standards including the code line length, variable names and whether the interfaces declared are fully executed or not.

18. How to convert a string into list?
Using the function list(string). For instance:
    >>> list('intellipaat')
in your lines of code will return
    ['i', 'n', 't', 'e', 'l', 'l', 'i', 'p', 'a', 'a', 't']
In Python, strings behave like lists in various ways. For example, you can access individual characters of a string:
    >>> y = "intellipaat"
    >>> y[2]
    't'

19. What OS do Python support?
Linux, Windows, Mac OS X, IRIX, Compaq, Solaris

20. Name the Java implementation of Python?
Jython

21. Define docstring in Python.
This repeats question 8: a docstring is a string literal occurring as the first statement in a module, class, function or method, and it becomes the __doc__ special attribute of the object.

22. Name the optional clauses used in a try-except statement in Python?
While Python exception handling is a bit different from Java, the former provides an option of using a try-except clause where the programmer receives a detailed error message without terminating the program. Sometimes, along with the problem, this try-except statement offers a solution to deal with the error. The language also provides try-except-finally and try-except-else blocks.

23. How to use PYTHONPATH?
PYTHONPATH is the environment variable consisting of directories. $PYTHONPATH is used for searching the actual list of folders for libraries.

24. Define 'self' in Python?
self is a reference to the current instance of the class. It is just like 'this' in JavaScript. While we create an instance of a class, that instance has its data, which internally passes a reference to itself.

25. Define CGI?
Common Gateway Interface support in Python is an external gateway to interact with HTTP servers and other information servers. It consists of a series of standards and instructions defining the exchange of information between a custom script and a web server. The HTTP server puts all important and useful information concerning the request in the script environment, then runs the script and sends its output back to the client.

26. What is PYTHONSTARTUP and how is it used?
PYTHONSTARTUP is yet another environment variable, used to test the Python file in the interpreter using interactive mode. The script file is executed even before the first prompt is seen. Additionally, it also allows reloading of the same script file after being modified in the external editor.

27. What is the return value of trunc() in Python?
trunc() returns an integer value. It uses the __trunc__ method.
    >>> import math
    >>> math.trunc(4.34)
    4

28. How to convert a string to an object in Python?
To convert a string into an object, Python provides the function eval(string). It allows the Python code to run in itself.

29. Is there any function to change case of all letters in the string?
Yes, Python supports a function swapcase(), which swaps the current letter case of the string. This method returns a copy of the string with the string case swapped.

30. What is pickling and unpickling in Python?
The process of pickling relates to the pickle module. pickle is a general module that takes a Python object and converts it into a string. It further dumps that string object into a file by using the dump() function. pickle comprises two methods: dump() dumps an object to a file object, and load() loads an object from a file object. Unpickling is the process of retrieving the original Python object from the stored string for reuse.
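A short round trip makes the pickling description concrete. This is a minimal sketch for Python 3, where the "string" mentioned above is a bytes object; the sample dictionary is hypothetical, and io.BytesIO stands in for a real file opened in binary mode.

```python
import io
import pickle

record = {"name": "example", "scores": [1, 2, 3]}   # hypothetical sample object

# Pickling: serialise the object to bytes.
payload = pickle.dumps(record)

# dump()/load() do the same through a file object; BytesIO stands in for a real file.
buffer = io.BytesIO()
pickle.dump(record, buffer)
buffer.seek(0)                                      # rewind before reading back

# Unpickling: rebuild an equal but independent object from the stored bytes.
restored_from_bytes = pickle.loads(payload)
restored_from_file = pickle.load(buffer)

print(restored_from_bytes == record)                # True
print(restored_from_file is record)                 # False: it is a new object
```

Unpickling can execute arbitrary code, so pickle.load() and pickle.loads() should only be used on data from a trusted source.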
Top 25 Python Interview Questions

1) What is Python? What are the benefits of using Python?
Python is a programming language with objects, modules, threads, exceptions and automatic memory management. The benefits of Python are that it is simple and easy, portable, extensible, has built-in data structures, and is open source.

2) What is PEP 8?
PEP 8 is a coding convention, a set of recommendations about how to write your Python code to make it more readable.

3) What is pickling and unpickling?
The pickle module accepts any Python object, converts it into a string representation and dumps it into a file by using the dump function; this process is called pickling. The process of retrieving the original Python objects from the stored string representation is called unpickling.

4) How is Python interpreted?
Python is an interpreted language. A Python program runs directly from the source code. The interpreter converts the source code written by the programmer into an intermediate language, which is again translated into the machine language that has to be executed.

5) How is memory managed in Python?
• Python memory is managed by the Python private heap space. All Python objects and data structures are located in a private heap. The programmer does not have access to this private heap; the interpreter takes care of it.
• The allocation of Python heap space for Python objects is done by the Python memory manager. The core API gives the programmer access to some tools to code.
• Python also has an inbuilt garbage collector, which recycles all the unused memory, frees it and makes it available to the heap space.

6) What are the tools that help to find bugs or perform static analysis?
PyChecker is a static analysis tool that detects the bugs in Python source code and warns about the style and complexity of the bug. Pylint is another tool that verifies whether the module meets the coding standard.

7) What are Python decorators?
A Python decorator is a specific change that we make in Python syntax to alter functions easily.

8) What is the difference between list and tuple?
The difference between list and tuple is that a list is mutable while a tuple is not. A tuple can be hashed, e.g. as a key for dictionaries.

9) How are arguments passed: by value or by reference?
Everything in Python is an object and all variables hold references to the objects. The references are passed to functions by value; as a result, a function cannot change which object the caller's variable refers to. However, it can change the object itself if the object is mutable.
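The distinction in question 9 between rebinding a reference and mutating an object can be checked with two small functions. This is a minimal sketch; `rebind` and `mutate` are hypothetical names used only for the illustration.

```python
def rebind(values):
    # Rebinding the local name points it at a new object; the caller's variable is unaffected.
    values = [0]
    return values


def mutate(values):
    # Mutating the object the name refers to is visible to the caller.
    values.append(99)


data = [1, 2, 3]
rebind(data)
print(data)    # [1, 2, 3]
mutate(data)
print(data)    # [1, 2, 3, 99]
```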
10) What are Dict and List comprehensions?
They are syntax constructions to ease the creation of a Dictionary or List based on an existing iterable.

11) What built-in types does Python provide?
There are mutable and immutable types among Python's built-in types.
Mutable built-in types
• List
• Sets
• Dictionaries
Immutable built-in types
• Strings
• Tuples
• Numbers
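The mutable/immutable split above, and the comprehensions from question 10, can be seen in a few lines. The variable names below are hypothetical; the behaviour shown is standard Python 3.

```python
numbers = [1, 2, 3]
alias = numbers
alias.append(4)                 # lists are mutable: the change is visible through both names
print(numbers)                  # [1, 2, 3, 4]

point = (1, 2)
try:
    point[0] = 10               # tuples are immutable: item assignment is rejected
except TypeError as exc:
    print("TypeError:", exc)

text = "abc"
upper = text.upper()            # string methods return a new object instead of changing text
print(text, upper)              # abc ABC

lookup = {point: "hashable"}    # a tuple of hashable items can be a dict key
try:
    lookup[numbers] = "nope"    # a list cannot: it is unhashable
except TypeError as exc:
    print("TypeError:", exc)

squares = {n: n * n for n in numbers}      # dict comprehension over an existing iterable
evens = [n for n in numbers if n % 2 == 0] # list comprehension with a filter
print(squares, evens)
```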
12) What is namespace in Python?
In Python, every name introduced has a place where it lives and can be looked up. This is known as a namespace. It is like a box where a variable name is mapped to the object placed. Whenever the variable is searched for, this box will be searched to get the corresponding object.

13) What is lambda in Python?
It is a single-expression anonymous function, often used as an inline function.

14) Why do lambda forms in Python not have statements?
A lambda form in Python does not have statements, as it is used to make a new function object and then return it at runtime.

15) What is pass in Python?
pass is a no-operation Python statement; in other words, it is a placeholder in a compound statement, where a blank should be left and nothing has to be written.

16) In Python what are iterators?
In Python, iterators are used to iterate over a group of elements, containers like list.

17) What is unittest in Python?
A unit testing framework in Python is known as unittest. It supports sharing of setups, automation testing, shutdown code for tests, aggregation of tests into collections, etc.

18) In Python what is slicing?
A mechanism to select a range of items from sequence types like list, tuple, strings, etc. is known as slicing.

19) What are generators in Python?
Generators are a way of implementing iterators. A generator is a normal function except that it contains a yield expression.

20) What is docstring in Python?
A Python documentation string is known as a docstring; it is a way of documenting Python functions, modules and classes.

21) How can you copy an object in Python?
To copy an object in Python, you can try copy.copy() or copy.deepcopy() for the general case. You cannot copy all objects, but most of them.

22) What is negative index in Python?
Python sequences can be indexed with positive and negative numbers. For positive indices, 0 is the first index, 1 is the second index, and so forth. For negative indices, (-1) is the last index, (-2) is the second-last index, and so forth.

23) How can you convert a number to a string?
In order to convert a number into a string, use the inbuilt function str(). If you want an octal or hexadecimal representation, use the inbuilt function oct() or hex().

24) What is the difference between xrange and range?
In Python 2, xrange returns an xrange object, which uses the same amount of memory no matter what the range size is, while range returns a list.

25) What is module and package in Python?
In Python, a module is the way to structure a program. Each Python program file is a module, which imports other modules like objects and attributes. The folder of a Python program is a package of modules. A package can have modules or subfolders.
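Questions 19 and 22 to 24 are quick to verify interactively. The sketch below uses Python 3, where range() is already lazy in the way Python 2's xrange() was; `countdown` and the other names are hypothetical.

```python
def countdown(n):
    # A generator: an ordinary function whose body contains `yield`.
    while n > 0:
        yield n
        n -= 1


print(list(countdown(3)))                     # [3, 2, 1]

letters = ["a", "b", "c", "d"]
print(letters[0], letters[-1], letters[-2])   # a d c

n = 255
print(str(n), oct(n), hex(n))                 # 255 0o377 0xff

r = range(10**9)        # no billion-element list is built; values are produced on demand
print(len(r), r[-1])    # 1000000000 999999999
```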
21 Must-Know Data Science Interview Questions and Answers

KDnuggets Editors bring you the answers to 20 Questions to Detect Fake Data Scientists, including what is regularization, Data Scientists we admire, model validation, and more.

By Gregory Piatetsky, KDnuggets.

The recent post on KDnuggets, 20 Questions to Detect Fake Data Scientists, has been very popular: it was the most viewed post in the month of January. However, these questions were lacking answers, so KDnuggets Editors got together and wrote the answers to these questions. I also added one more critical question, number 21, which was omitted from the 20 questions post.

Here are the answers. Because of the length, the answers are split into two parts; this part covers the first 11 questions.

1. Explain what regularization is and why it is useful.
Answer by Matthew Mayo.
Regularization is the process of adding a tuning parameter to a model to induce smoothness in order to prevent overfitting. (See also KDnuggets posts on Overfitting.)

This is most often done by adding a penalty term, a constant multiple of a norm of the weight vector, to the loss function. The norm is often either the L1 (Lasso) or L2 (ridge), but can in actuality be any norm. The model predictions should then minimize the mean of the loss function calculated on the regularized training set. Xavier Amatriain has presented a good comparison of L1 and L2 regularization, for those interested.

Fig 1: Lp ball. As the value of p decreases, the size of the corresponding Lp space also decreases.
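The shrinking effect of an L2 penalty can be seen in the simplest possible case. The sketch below assumes a one-feature linear model with no intercept and squared-error loss, for which the penalised slope has the closed form w = sum(x*y) / (sum(x*x) + lam); the function name `ridge_slope`, the penalty weight `lam` and the data are made up for the illustration and are not part of the original answer.

```python
def ridge_slope(xs, ys, lam):
    """Slope w minimising sum((y - w*x)**2) + lam * w**2 for a no-intercept model."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]   # made-up data, roughly y = 2x

# lam = 0 is ordinary least squares; larger lam pulls the weight towards zero.
for lam in (0.0, 1.0, 10.0, 100.0):
    print(lam, round(ridge_slope(xs, ys, lam), 3))
```

In practice the penalty weight is itself tuned, typically by checking performance on held-out data rather than on the training set.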
2. Which data scientists do you admire most? Which startups?
Answer by Gregory Piatetsky:
This question does not have a correct answer, but here is my personal list of 12 Data Scientists I most admire, not in any particular order.

Geoff Hinton, Yann LeCun, and Yoshua Bengio, for persevering with Neural Nets and starting the current Deep Learning revolution.

Demis Hassabis, for his amazing work on DeepMind, which achieved human or superhuman performance on Atari games and recently Go.

Jake Porway from DataKind and Rayid Ghani from U. Chicago/DSSG, for enabling data science contributions to social good.

DJ Patil, First US Chief Data Scientist, for using Data Science to make US government work better.

Kirk D. Borne, for his influence and leadership on social media.

Claudia Perlich, for brilliant work on the ad ecosystem and serving as a great KDD-2014 chair.

Hilary Mason, for great work at Bitly and inspiring others as a Big Data Rock Star.

Usama Fayyad, for showing leadership and setting high goals for KDD and Data Science, which helped inspire me and many thousands of others to do their best.

Hadley Wickham, for his fantastic work on Data Science and Data Visualization in R, including dplyr, ggplot2, and Rstudio.

There are too many excellent startups in the Data Science area, but I will not list them here to avoid a conflict of interest.

3. How would you validate a model you created to generate a predictive model of a quantitative outcome variable using multiple regression?
Answer by Matthew Mayo.
Proposed methods for model validation:
• If the values predicted by the model are far outside of the response variable range, this would immediately indicate poor estimation or model inaccuracy.
• If the values seem to be reasonable, examine the parameters; any of the following would indicate poor estimation or multi-collinearity: opposite signs of expectations, unusually large or small values, or observed inconsistency when the model is fed new data.
• Use the model for prediction by feeding it new data, and use the coefficient of determination (R squared) as a model validity measure.
• Use data splitting to form a separate dataset for estimating model parameters, and another for validating predictions.
• Use jackknife resampling if the dataset contains a small number of instances, and measure validity with R squared and mean squared error (MSE).
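The data-splitting and R squared / MSE points above can be sketched without any libraries. To keep the example short it fits a single predictor rather than a full multiple regression, which is a simplification; the data, the every-third-point hold-out rule and the function names are all made up for the illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (single predictor, for brevity)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b


def mse(actual, predicted):
    """Mean squared error of the predictions."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up data, roughly y = 1 + 2x; every third point is held out for validation.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.1, 2.9, 5.2, 7.1, 8.8, 11.1, 13.0, 14.9, 17.2]
train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % 3 != 2]
held_out = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % 3 == 2]

# Estimate parameters on the training part only, then score on the held-out part.
a, b = fit_line([x for x, _ in train], [y for _, y in train])
actual = [y for _, y in held_out]
predicted = [a + b * x for x, _ in held_out]
print("held-out MSE:", round(mse(actual, predicted), 4))
print("held-out R squared:", round(r_squared(actual, predicted), 4))
```

A deterministic split is used here only so the output is repeatable; a random split, or the jackknife mentioned above for small datasets, would be the more usual choice.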
4. Explain what precision and recall are. How do they relate to the ROC curve?
Answer by Gregory Piatetsky:
Here is the answer from the KDnuggets FAQ: Precision and Recall.

Calculating precision and recall is actually quite easy. Imagine there are 100 positive cases among 10,000 cases. You want to predict which ones are positive, and you pick 200 to have a better chance of catching many of the 100 positive cases. You record the IDs of your predictions, and when you get the actual results you sum up how many times you were right or wrong. There are four ways of being right or wrong:

1. TN / True Negative: case was negative and predicted negative
2. TP / True Positive: case was positive and predicted positive
3. FN / False Negative: case was positive but predicted negative
4. FP / False Positive: case was negative but predicted positive

Makes sense so far? Now you count how many of the 10,000 cases fall in each bucket, say:

                    Predicted Negative    Predicted Positive
Negative Cases      TN: 9,760             FP: 140
Positive Cases      FN: 40                TP: 60

Now, your boss asks you three questions:

1. What percent of your predictions were correct? You answer: the "accuracy" was (9,760+60) out of 10,000 = 98.2%
2. What percent of the positive cases did you catch? You answer: the "recall" was 60 out of 100 = 60%
3. What percent of positive predictions were correct? You answer: the "precision" was 60 out of 200 = 30%
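The three answers follow directly from the four counts, as this short Python 3 check shows. The variable names are only for illustration.

```python
tn, fp, fn, tp = 9760, 140, 40, 60   # counts from the table above

total = tn + fp + fn + tp
accuracy = (tn + tp) / total          # share of all predictions that were correct
recall = tp / (tp + fn)               # share of actual positives that were caught
precision = tp / (tp + fp)            # share of positive predictions that were correct

print(f"accuracy={accuracy:.1%} recall={recall:.0%} precision={precision:.0%}")
# accuracy=98.2% recall=60% precision=30%
```

The high accuracy next to the low precision is the usual symptom of a skewed dataset: only 1% of the cases are positive, so predicting "negative" almost everywhere is already right most of the time.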
See also a very good explanation of Precision and recall in Wikipedia.

Fig: Precision and Recall.

The ROC curve represents a relation between sensitivity (recall) and specificity (not precision): it plots the true positive rate against the false positive rate (1 - specificity), and is commonly used to measure the performance of binary classifiers. However, when dealing with highly skewed datasets, Precision-Recall (PR) curves give a more representative picture of performance. See also the Quora answer to "What is the difference between a ROC curve and a precision-recall curve?".

5. How can you prove that one improvement you've brought to an algorithm is really an improvement over not doing anything?
Answer by Anmol Rajpurohit. Often it is observed that in the pursuit of rapid innovation (aka "quick fame"), the principles of scientific methodology are violated, leading to misleading innovations, i.e. appealing insights that are confirmed without rigorous validation. One such scenario: given the task of improving an algorithm to yield better results, you might come up with several ideas with potential for improvement. An obvious human urge is to announce these ideas ASAP and ask for their implementation. When asked for supporting data, often limited results are shared, which are very likely to be impacted by selection bias (known or unknown) or a misleading global minimum (due to lack of appropriate variety in test data). Data scientists do not let their human emotions overrun their logical reasoning. While the exact approach to prove that one improvement you've brought to an algorithm is really an improvement over not doing anything would depend on the actual case at hand, there are a few common guidelines:
• Ensure that there is no selection bias in the test data used for performance comparison
• Ensure that the test data has sufficient variety to be representative of real-life data (helps avoid overfitting)
• Ensure that "controlled experiment" principles are followed, i.e. while comparing performance, the test environment (hardware, etc.) must be exactly the same while running the original algorithm and the new algorithm
• Ensure that the results are repeatable with near similar results
• Examine whether the results reflect local maxima/minima or global maxima/minima
One common way to achieve the above guidelines is through A/B testing, where both versions of the algorithm are kept running in a similar environment for a considerably long time and real-life input data is randomly split between the two. This approach is particularly common in Web Analytics.
6. What is root cause analysis?
Answer by Gregory Piatetsky. According to Wikipedia, Root cause analysis (RCA) is a method of problem solving used for identifying the root causes of faults or problems. A factor is considered a root cause if removal thereof from the problem-fault-sequence prevents the final undesirable event from recurring; whereas a causal factor is one that affects an event's outcome, but is not a root cause.
Root cause analysis was initially developed to analyze industrial accidents, but is now widely used in other areas, such as healthcare, project management, or software testing. Here is a useful Root Cause Analysis Toolkit from the state of Minnesota. Essentially, you can find the root cause of a problem and show the relationship of causes by repeatedly asking the question, "Why?", until you find the root of the problem. This technique is commonly called "5 Whys", although it can involve more or fewer than 5 questions.
Fig. 5 Whys Analysis Example, from The Art of Root Cause Analysis
7. Are you familiar with price optimization, price elasticity, inventory management, competitive intelligence? Give examples.
Answer by Gregory Piatetsky. Those are economics terms that are not frequently asked of Data Scientists but they are useful to know. Price optimization is the use of mathematical tools to determine how customers will respond to different prices for a company's products and services through different channels. Big Data and data mining enable the use of personalization for price optimization. Now companies like Amazon can even take optimization further and show different prices to different visitors, based on their history, although there is a strong debate about whether this is fair. Price elasticity in common usage typically refers to
• Price elasticity of demand, a measure of price sensitivity. It is computed as: Price Elasticity of Demand = % Change in Quantity Demanded / % Change in Price.
Similarly, Price elasticity of supply is an economics measure that shows how the quantity supplied of a good or service responds to a change in its price.
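As a small illustration of the elasticity formula, here is a hedged Python sketch. The function names and the numbers (a price rise from 10 to 11 and a drop in demand from 100 to 85 units) are made up for the example and do not come from the original answer.

```python
def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, as a fraction (0.10 means +10%)."""
    return (new - old) / old


def price_elasticity_of_demand(q_old, q_new, p_old, p_new):
    """% change in quantity demanded divided by % change in price."""
    return pct_change(q_old, q_new) / pct_change(p_old, p_new)


# Hypothetical numbers: a 10% price rise (10 -> 11) cuts demand by 15% (100 -> 85).
elasticity = price_elasticity_of_demand(q_old=100, q_new=85, p_old=10, p_new=11)
print(round(elasticity, 2))
```

The result is about -1.5. A magnitude above 1 is usually read as elastic demand: quantity responds proportionally more than price.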
Inventory management is the overseeing and controlling of the ordering, storage and use of components that a company will use in the production of the items it will sell, as well as the overseeing and controlling of quantities of finished products for sale.
Wikipedia defines Competitive intelligence as the action of defining, gathering, analyzing, and distributing intelligence about products, customers, competitors, and any aspect of the environment needed to support executives and managers making strategic decisions for an organization.
Tools like Google Trends, Alexa, and Compete can be used to determine general trends and analyze your competitors on the web.
8. What is statistical power?
Answer by Gregory Piatetsky. Wikipedia defines statistical power, or sensitivity, of a binary hypothesis test as the probability that the test correctly rejects the null hypothesis (H0) when the alternative hypothesis (H1) is true. To put it another way, statistical power is the likelihood that a study will detect an effect when the effect is present. The higher the statistical power, the less likely you are to make a Type II error (concluding there is no effect when, in fact, there is). Here are some tools to calculate statistical power.
9. Explain what resampling methods are and why they are useful. Also explain their limitations.
Answer by Gregory Piatetsky. Classical statistical parametric tests compare observed statistics to theoretical sampling distributions. Resampling is a data-driven, not theory-driven, methodology which is based upon repeated sampling within the same sample. Resampling refers to methods for doing one of these:
• Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of available data (jackknifing) or drawing randomly with replacement from a set of data points (bootstrapping)
• Exchanging labels on data points when performing significance tests (permutation tests, also called exact tests, randomization tests, or re-randomization tests)
• Validating models by using random subsets (bootstrapping, cross validation)
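To make the first bullet concrete, here is a minimal bootstrap sketch in Python using only the standard library. It estimates a percentile confidence interval for a median; the function name, the sample values and the defaults (2,000 resamples, 95% interval) are assumptions chosen for illustration, not part of the original answer.

```python
import random
import statistics


def bootstrap_median_ci(data, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the median."""
    rng = random.Random(seed)
    n = len(data)
    medians = sorted(
        statistics.median(rng.choices(data, k=n))  # draw n points with replacement
        for _ in range(n_boot)
    )
    lo = medians[int((alpha / 2) * n_boot)]
    hi = medians[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Hypothetical sample of 15 page load times, in seconds.
sample = [1.2, 0.9, 1.5, 2.8, 1.1, 1.0, 3.9, 1.3, 1.7, 0.8, 1.4, 2.1, 1.6, 1.2, 5.0]
low, high = bootstrap_median_ci(sample)
print(statistics.median(sample), (low, high))
```

The interval is built purely from the observed data, which is also the main limitation: if the sample is small or unrepresentative, the resampled distribution inherits that problem.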
See more in Wikipedia about bootstrapping and jackknifing. See also How to Check Hypotheses with Bootstrap and Apache Spark.
Here is a good overview of Resampling Statistics.
Answer by Devendra Desale. It depends on the question as well as on the domain for which we are trying to solve the question. In medical testing, false negatives may provide a falsely reassuring message to patients and physicians that disease is absent, when it is actually present. This sometimes leads to inappropriate or inadequate treatment of both the patient and their disease. So, in this case it is preferable to have too many false positives rather than too many false negatives. For spam filtering, a false positive occurs when spam filtering or spam blocking techniques wrongly classify a legitimate email message as spam and, as a result, interfere with its delivery. While most anti-spam tactics can block or filter a high percentage of unwanted emails, doing so without creating significant false-positive results is a much more demanding task. So, here we prefer too many false negatives over too many false positives.
11. What is selection bias, why is it important and how can you avoid it?
Answer by Matthew Mayo. Selection bias, in general, is a problematic situation in which error is introduced due to a non-random population sample. For example, if a given sample of 100 test cases was made up of a 60/20/15/5 split of 4 classes which actually occurred in relatively equal numbers in the population, then a given model may make the false assumption that probability could be the determining predictive factor. Avoiding non-random samples is the best way to deal with bias; however, when this is impractical, techniques such as resampling, boosting, and weighting are strategies which can be introduced to help deal with the situation.
21 Must-Know Data Science Interview Questions and Answers, part 2
Second part of the answers to 20 Questions to Detect Fake Data Scientists, including controlling overfitting, experimental design, tall and wide data, understanding the validity of statistics in the media, and more.
By Gregory Piatetsky, KDnuggets.
The post on KDnuggets 20 Questions to Detect Fake Data Scientists has been very popular: the most viewed post of the month. However, these questions were lacking answers, so KDnuggets Editors got together and wrote the answers. Here is part 2 of the answers, starting with a "bonus" question.
Bonus Question: Explain what is overfitting and how would you control for it
This question was not part of the original 20, but probably is the most important one in distinguishing real data scientists from fake ones.
Answer by Gregory Piatetsky.
Overfitting is finding spurious results that are due to chance and cannot be reproduced by subsequent studies. We frequently see newspaper reports about studies that overturn the previous findings, like eggs are no longer bad for your health, or saturated fat is not linked to heart disease. The problem, in our opinion, is that many researchers, especially in social sciences or medicine, too frequently commit the cardinal sin of Data Mining: overfitting the data. The researchers test too many hypotheses without proper statistical control, until they happen to find something interesting and report it. Not surprisingly, next time the effect, which was (at least partly) due to chance, will be much smaller or absent.
These flaws of research practices were identified and reported by John P. A. Ioannidis in his landmark paper Why Most Published Research Findings Are False (PLoS Medicine, 2005). Ioannidis found that very often either the results were exaggerated or the findings could not be replicated. In his paper, he presented statistical evidence that indeed most claimed research findings are false. Ioannidis noted that in order for a research finding to be reliable, it should have:
• Large sample size and large effects
• A smaller number of tested relationships, with greater pre-selection among them
• Less flexibility in designs, definitions, outcomes, and analytical modes
• Minimal bias due to financial and other factors (including popularity of that scientific field)
Unfortunately, too often these rules were violated, producing irreproducible results. For example, the S&P 500 index was found to be strongly related to production of butter in Bangladesh (from 19891 [sic] to 1993) (here is PDF).
See more interesting (and totally spurious) findings which you can discover yourself using tools such as Google correlate or Spurious correlations by Tyler Vigen.
Several methods can be used to avoid "overfitting" the data:
• Try to find the simplest possible hypothesis
• Regularization (adding a penalty for complexity)
• Randomization Testing (randomize the class variable, try your method on this data; if it finds the same strong results, something is wrong)
• Nested cross-validation (do feature selection on one level, then run the entire method in cross-validation on the outer level)
• Adjusting the False Discovery Rate
• Using the reusable holdout method, a breakthrough approach proposed in 2015
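The randomization testing idea from the list above can be sketched in plain Python. This is an illustrative setup, not the author's procedure: the "method" here is simply picking the feature most correlated with the target, the data is pure noise by construction, and all names and sizes (20 rows, 100 features, 100 shuffles) are assumptions.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def best_abs_corr(features, target):
    """The 'method' under test: keep the feature most correlated with the target."""
    return max(abs(pearson(f, target)) for f in features)


rng = random.Random(42)
n_rows, n_features, n_shuffles = 20, 100, 100

# Pure noise: no feature is truly related to the target.
features = [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_features)]
target = [rng.gauss(0, 1) for _ in range(n_rows)]

observed = best_abs_corr(features, target)

# Randomization test: shuffle the target and rerun the whole method each time.
shuffled_scores = [
    best_abs_corr(features, rng.sample(target, n_rows)) for _ in range(n_shuffles)
]
p_value = (1 + sum(s >= observed for s in shuffled_scores)) / (1 + n_shuffles)

print(f"best |r| on real target: {observed:.2f}, randomization p-value: {p_value:.2f}")
```

With many features and few rows, the best correlation tends to look impressive even though nothing real is there. If shuffled targets give scores of similar size, the apparent finding is most likely an artifact of searching, which is exactly the warning sign the bullet describes.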
Good data science is on the leading edge of scientific understanding of the world, and it is data scientists' responsibility to avoid overfitting data and to educate the public and the media on the dangers of bad data analysis.
See also:
• The Cardinal Sin of Data Mining and Data Science: Overfitting
• Big Idea To Avoid Overfitting: Reusable Holdout to Preserve Validity in Adaptive Data Analysis
• Overcoming Overfitting with the reusable holdout: Preserving validity in adaptive data analysis
• 11 Clever Methods of Overfitting and how to avoid them
• Tag: Overfitting
12. Give an example of how you would use experimental design to answer a question about user behavior.
Answer by Bhavya Geethika.
Step 1: Formulate the Research Question
What are the effects of page load times on user satisfaction ratings?
Step 2: Identify variables
We identify the cause and effect. Independent variable: page load time. Dependent variable: user satisfaction rating.
Step 3: Generate Hypothesis
Lower page download time will have more effect on the user satisfaction rating for a web page. Here the factor we analyze is page load time.
Fig 12: There is a flaw in your experimental design (cartoon from here)
Step 4: Determine Experimental Design
We consider experimental complexity, i.e. whether to vary one factor at a time or multiple factors at one time, in which case we use a factorial design (2^k design). A design is also selected based on the type of objective (Comparative, Screening, Response surface) and the number of factors.
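For a sense of what a 2^k factorial design looks like in practice, the following Python sketch enumerates every condition for k = 3 two-level factors. The factor names and levels are hypothetical and chosen only to match the page-load example.

```python
from itertools import product

# Hypothetical factors, each with two levels (a 2^k design with k = 3).
factors = {
    "page_load_time": ["fast", "slow"],
    "buy_button_position": ["left", "right"],
    "image_quality": ["low", "high"],
}

# Every combination of levels is one experimental condition.
runs = [dict(zip(factors, levels)) for levels in product(*factors.values())]

for run in runs:
    print(run)
```

Three factors give 2^3 = 8 conditions; each added two-level factor doubles the count, which is why screening designs are often used when there are many factors.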
Here we also identify within-participants, between-participants, and mixed models. For example: there are two versions of a page, one with the Buy button (call to action) on the left and the other with this button on the right.
Within-participants design: both user groups see both versions.
Between-participants design: one group of users sees version A and the other user group sees version B.
Step 5: Develop experimental task and procedure
A detailed description of the steps involved in the experiment, the tools used to measure user behavior, and the goals and success metrics should be defined. Collect qualitative data about user engagement to allow statistical analysis.
Step 6: Determine Manipulation and Measurements
Manipulation: one level of the factor will be controlled and the other will be manipulated. We also identify the behavioral measures:
1. Latency: time between a prompt and occurrence of behavior (how long it takes for a user to click buy after being presented with products).
2. Frequency: number of times a behavior occurs (number of times the user clicks on a given page within a time)
3. Duration: length of time a specific behavior lasts (time taken to add all products)
4. Intensity: force with which a behavior occurs (how quickly the user purchased a product)
Step 7: Analyze results
Identify the user behavior data and determine whether the observations support or contradict the hypothesis, e.g. how the majority of users' satisfaction ratings compared with page load times.
13. What is the difference between "long" ("tall") and "wide" format data?
Answer by Gregory Piatetsky.
In most data mining / data science applications there are many more records (rows) than features (columns); such data is sometimes called "tall" (or "long") data. In some applications like genomics or bioinformatics you may have only a small number of records (patients), e.g. 100, but perhaps 20,000 observations for each patient. The standard methods that work for "tall" data will lead to overfitting the data, so special approaches are needed.
Fig 13. Different approaches for tall data and wide data, from presentation Sparse Screening for Exact Data Reduction, by Jieping Ye.
The problem is not just reshaping the data (here there are useful R packages), but avoiding false positives by reducing the number of features to find the most relevant ones. Approaches for feature reduction like Lasso are well covered in Statistical Learning with Sparsity: The Lasso and Generalizations, by Hastie, Tibshirani, and Wainwright (you can download a free PDF of the book).
14. What method do you use to determine whether the statistics published in an article (or appeared in a newspaper or other media) are either wrong or presented to support the author's point of view, rather than correct, comprehensive factual information on a specific subject?
A simple rule, suggested by Zack Lipton, is: if some statistics are published in a newspaper, then they are wrong.
Here is a more serious answer by Anmol Rajpurohit. Every media organization has a target audience. This choice impacts a lot of decisions such as which article to publish, how to phrase an article, what part of an article to highlight, how to tell a given story, etc. In determining the validity of statistics published in any article, one of the first steps will be to examine the publishing agency and its target audience. Even if it is the same news story involving statistics, you will notice that it will be published very differently across Fox News vs. WSJ vs. ACM/IEEE journals. So, data scientists are smart about where to get the news from (and how much to rely on the stories based on sources!).
Fig 14a: Example of a very misleading bar chart that appeared on Fox News
Fig 14b: how the same data should be presented objectively, from 5 Ways to Avoid Being Fooled By Statistics
Often the authors try to hide the inadequacy of their research through canny storytelling and omitting important details to jump on to enticingly presented false insights. Thus, a rule of thumb to identify articles with misleading statistical inferences is to examine whether the article includes details on the research methodology followed and any perceived limitations of the choices made related to research methodology. Look for words such as "sample size", "margin of error", etc. While there are no perfect answers as to what sample size or margin of error is appropriate, these attributes must certainly be kept in mind while reading the end results.
Another common case of erratic reporting is when journalists with poor data education pick up an insight from one or two paragraphs of a published research paper, while ignoring the rest of the paper, just in order to make their point. So, here is how you can be smart to avoid being fooled by such articles. Firstly, a reliable article must not have any unsubstantiated claims. All the assertions must be backed with reference to past research. Otherwise, a claim must be clearly differentiated as an "opinion" and not an assertion. Secondly, just because an article is referring to renowned research papers, it does not mean that it is using the insight from those research papers appropriately. This can be validated by reading those referred research papers "in entirety", and independently judging their relevance to the article at hand. Lastly, though the end results might naturally seem like the most interesting part, it is often fatal to skip the details about research methodology (and so fail to spot errors, bias, etc.).
Ideally, I wish that all such articles published their underlying research data as well as the approach. That way, the articles can achieve genuine trust, as everyone is free to analyze the data and apply the research approach to see the results for themselves.
15. Explain Edward Tufte's concept of "chart junk."
Answer by Gregory Piatetsky.
Chartjunk refers to all visual elements in charts and graphs that are not necessary to comprehend the information represented on the graph, or that distract the viewer from this information. The term chartjunk was coined by Edward Tufte in his 1983 book The Visual Display of Quantitative Information.
Fig 15. Tufte writes: "an unintentional Necker Illusion, as two back planes optically flip to the front. Some pyramids conceal others; and one variable (stacked depth of the stupid pyramids) has no label or scale."
Here is a more modern example from exceluser where it is very hard to understand the column plot because of the workers and cranes that obscure the columns. The problem with such decorations is that they force readers to work much harder than necessary to discover the meaning of the data.
Answer by Bhavya Geethika. Some methods to screen outliers are z-scores, modified z-score, box plots, Grubbs' test, Tietjen-Moore test, exponential smoothing, Kimber test for exponential distribution and moving window filter algorithm. Two of the robust methods, in detail, are:
Inter Quartile Range
An outlier is a point of data that lies over 1.5 IQRs below the first quartile (Q1) or above the third quartile (Q3) in a given data set.
• High = (Q3) + 1.5 IQR
• Low = (Q1) - 1.5 IQR
Tukey Method
It uses the interquartile range to filter very large or very small numbers. It is practically the same method as above except that it uses the concept of "fences". The two values of fences are:
• Low outliers = Q1 - 1.5(Q3 - Q1) = Q1 - 1.5(IQR)
• High outliers = Q3 + 1.5(Q3 - Q1) = Q3 + 1.5(IQR)
Anything outside of the fences is an outlier. When you find outliers, you should not remove them without a qualitative assessment, because that way you are altering the data and making it no longer pure. It is important to understand the context of the analysis or, importantly, "The Why question: why is an outlier different from other data points?" This reason is critical. If outliers are attributed to error, you may throw them out, but if they signify a new trend or pattern, or reveal a valuable insight into the data, you should retain them.
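The fence calculation can be sketched with Python's standard library. This is a minimal illustration: the data values are made up, and the "inclusive" quartile method is just one of several quartile conventions, so other tools may give slightly different fences.

```python
import statistics


def tukey_fences(data, k=1.5):
    """Return (low_fence, high_fence) computed as Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical measurements with one suspicious value.
values = [2, 3, 4, 5, 5, 6, 7, 8, 9, 50]
low, high = tukey_fences(values)
outliers = [v for v in values if v < low or v > high]
print(low, high, outliers)
```

Here only the value 50 falls outside the fences. The code flags it but does not delete it, in line with the advice that removal should follow a qualitative assessment.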
Answer by Matthew Mayo. Extreme value theory (EVT) focuses on rare events and extremes, as opposed to classical approaches to statistics which concentrate on average behaviors. EVT states that there are 3 types of distributions needed to model the extreme data points of a collection of random observations from some distribution: the Gumbel, Frechet, and Weibull distributions, also known as the Extreme Value Distributions (EVD) 1, 2, and 3, respectively.
EVT states that, if you were to generate N data sets from a given distribution, and then create a new dataset containing only the maximum values of these N data sets, this new dataset would only be accurately described by one of the EVD distributions: Gumbel, Frechet, or Weibull. The Generalized Extreme Value Distribution (GEV) is, then, a single model that combines the three EVD types.
Knowing the models to use for modeling our data, we can then use the models to fit our data, and then evaluate. Once the best fitting model is found, analysis can be performed, including calculating probabilities.
We are all familiar now with recommendations from Netflix ("Other Movies you might enjoy") or from Amazon ("Customers who bought X also bought Y").
Such systems are called recommendation engines or, more broadly, recommender systems. They typically produce recommendations in one of two ways: using collaborative or content-based filtering.
Collaborative filtering methods build a model based on users' past behavior (items previously purchased, movies viewed and rated, etc.) and use decisions made by current and other users. This model is then used to predict items (or ratings for items) that the user may be interested in.
Content-based filtering methods use features of an item to recommend additional items with similar properties. These approaches are often combined in Hybrid Recommender Systems.
Here is a comparison of these 2 approaches used in two popular music recommender systems, Last.fm and Pandora Radio (example from the Recommender System entry):
• Last.fm creates a "station" of recommended songs by observing what bands and individual tracks the user has listened to on a regular basis and comparing those against the listening behavior of other users. Last.fm will play tracks that do not appear in the user's library, but are often played by other users with similar interests. As this approach leverages the behavior of users, it is an example of a collaborative filtering technique.
• Pandora uses the properties of a song or artist (a subset of the 400 attributes provided by the Music Genome Project) in order to seed a "station" that plays music with similar properties. User feedback is used to refine the station's results, deemphasizing certain attributes when a user "dislikes" a particular song and emphasizing other attributes when a user "likes" a song. This is an example of a content-based approach.
Here is a good Introduction to Recommendation Engines by Dataconomy and an overview of building a Collaborative Filtering Recommendation Engine by Toptal. For the latest research on recommender systems, check the ACM RecSys conference.
In binary classification (or medical testing), a false positive is when an algorithm (or test) indicates presence of a condition, when in reality it is absent. A false negative is when an algorithm (or test) indicates absence of a condition, when in reality it is present. In statistical hypothesis testing a false positive is also called a type I error and a false negative a type II error.
It is obviously very important to distinguish and treat false positives and false negatives differently because the costs of such errors can be hugely different. For example, if a test for a serious disease is a false positive (test says disease, but person is healthy), then an extra test will be made that will determine the correct diagnosis. However, if a test is a false negative (test says healthy, but person has disease), then treatment will not be done and the person may die as a result.
20. Which tools do you use for visualization? What do you think of Tableau? R? SAS? (for graphs). How to efficiently represent 5 dimensions in a chart (or in a video)?
Answer by Gregory Piatetsky.
There are many good tools for Data Visualization. R, Python, Tableau and Excel are among the most commonly used by Data Scientists. Here are useful KDnuggets resources:
• Visualization and Data Mining Software
• Overview of Python Visualization Tools
• 21 Essential Data Visualization Tools
• Top 30 Social Network Analysis and Visualization Tools
• Tag: Data Visualization
There are many ways to represent more than 2 dimensions in a chart. A 3rd dimension can be shown with a 3D scatter plot which can be rotated. For further dimensions you can use color, shading, shape, and size. Animation can be used effectively to show the time dimension (change over time). Here is a good example.
Fig 20a: 5-dimensional scatter plot of Iris data, with size: sepal length; color: sepal width; shape: class; x-column: petal length; y-column: petal width, from here.
For more than 5 dimensions, one approach is Parallel Coordinates, pioneered by Alfred Inselberg.
Fig 20b: Iris data in parallel coordinates