The Yhat Blog


machine learning, data science, engineering


Interview with a Data Scientist Tool Developer

by Peadar Coyle


About Peadar: Peadar Coyle is a data scientist, author and math geek who specializes in applying robust statistical or machine learning models to data to extract business value. His academic interests range from quantum computing to time series forecasting. Peadar has worked or consulted for Amazon, Vodafone, Import.io and JobTODAY, to name a few. He is a core developer of PyMC3 and a regular speaker and keynoter at prestigious industry conferences such as PyData. His recent book is available at https://leanpub.com/interviewswithdatascientists

Introduction

I interviewed Masaaki Horikoshi (sinhrks), one of the core developers of the pandas Python library. I was really happy to interview him, and glad to show that data science and software development are truly global. :) I lightly edited his answers at his request because English is not his native language.

Masaaki Horikoshi's Biography

I work as a data analyst at a Japanese company. I mostly use Python and R in my work. Because I don't share project details from my job publicly, allow me to answer as a tool developer. In my private time I contribute to open source software such as pandas (a Python package for data analysis); see https://github.com/sinhrks

Q & A

1. What project that you have worked on do you wish you could go back to and do better?

I've learned a lot from the projects I've worked on, so I expect I could do better on most of them today. That's because the most difficult part of a project is clarifying what the problem actually is, and for the previous ones I already know what it was, at least to some extent :)

2. What advice do you have for younger analytics professionals, and in particular PhD students in the sciences?

I don't have a PhD, so my points may be basic, and the requirements depend on what you're working on.

I think it is a good learning experience to read the source code of popular OSS related to statistics and machine learning. I sometimes find that I can't understand a subject just by reading a textbook. Reading the source code and confirming each step sometimes reveals my misunderstandings. It can also improve your programming skills, because this software is mostly written in optimized and sophisticated ways.
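
For readers who want to try this advice, here is a minimal sketch (an illustration, not part of the interview answer). Python's standard `inspect` module can return the source of a function so you can check each step against a textbook. The helper name `show_source` and the choice of `statistics.mean` as the function to read are assumptions made for this example.

```python
import inspect
import statistics


def show_source(obj, max_lines=15):
    """Return the first max_lines lines of obj's source code as a string.

    Hypothetical helper for this example; it only works for objects
    implemented in Python whose source files are available.
    """
    source = inspect.getsource(obj)
    return "\n".join(source.splitlines()[:max_lines])


if __name__ == "__main__":
    # statistics.mean is written in pure Python, so its source can be read.
    print(show_source(statistics.mean))
```

The same call works on many functions in pure-Python libraries. Functions implemented in C or Cython have no Python source to show, so for those you would read the project's repository instead.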

3. What do you wish you knew earlier about being a data scientist/ data tool developer?

That communities are really important. It was only after I started attending programming language conferences that I met a lot of skilled people from a broad range of fields, and communicating with them gives me a lot of knowledge in fields I'm not familiar with. Also, feedback from tool users helps me understand their needs and raises my motivation.

4. How do you respond when you hear the phrase ‘big data’?

I believe most of today's companies have a lot of data. But whether we actually need all of it depends on the problem. Using 'big data' without a specific objective looks unprofitable.

On the technical side, I'm interested in processing and visualizing this kind of data, and I use tools like Spark.

5. What is the most exciting thing about your field?

The popularity of data science and its related programming languages (R and Python). I see many interesting news items and blog posts about data science almost every day, and small conferences are held a few times a month. It is a good time to join the field. And we need more people; there is a lot of work to do!

6. How do you go about framing a software engineering problem? In particular, how do you avoid spending too long, and how do you manage expectations? How do you know what is good enough?

I feel this is the most difficult question. The important thing is to clarify the target and goal first.

Then we can decide on a measurable indicator and consider actions or implementations that can actually be carried out. During discussions with end users, we can return to the target and goal we agreed on and judge whether the result is "good enough".

7. You're involved with some open source projects. Can you comment on how important you feel these are, and also on what exciting new things you've worked on?

OSS is important for fulfilling my daily requirements. Beyond that, it is a great place where we can learn more and give back. I appreciate all the users and great contributors I've gotten to work with!

Regards,

Masaaki Horikoshi (sinhrks)


