A new online book has recently been published that may be of interest to our community:
This book provides resources and examples for people who want to use tidy tools from the R ecosystem to approach natural language processing tasks. The intended audience includes people who don’t have extensive backgrounds in computational linguistics but who need or want to analyze unstructured, text-heavy data. Using tidy data principles can make text mining easier, more effective, and consistent with tools already in wide use such as dplyr, broom, and ggplot2. Topics covered in the book include manipulating, summarizing, and visualizing the characteristics of text, as well as sentiment analysis, tf-idf, and topic modeling.
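To give a flavor of what a tidy approach to one of these topics might look like, here is a minimal sketch of tf-idf computed over a one-row-per-document-term table. It is written in plain Python rather than R and is not taken from the book; the toy corpus, the function names `tidy_counts` and `add_tf_idf`, and the column names are all assumptions made for illustration. It uses one common definition of tf-idf: term frequency within a document multiplied by the natural log of the inverse document frequency.

```python
import math
from collections import Counter

# Hypothetical toy corpus; document ids and texts are made up for illustration.
docs = {
    "a": "the cat sat on the mat",
    "b": "the dog sat on the log",
    "c": "the cat chased the dog",
}

def tidy_counts(docs):
    """Return one row per (document, term) pair with its count n."""
    rows = []
    for doc_id, text in docs.items():
        for term, n in Counter(text.lower().split()).items():
            rows.append({"document": doc_id, "term": term, "n": n})
    return rows

def add_tf_idf(rows):
    """Add tf, idf, and tf_idf columns to each (document, term) row."""
    total = Counter()  # total number of tokens in each document
    df = Counter()     # number of documents that contain each term
    for r in rows:
        total[r["document"]] += r["n"]
        df[r["term"]] += 1  # rows are unique per (document, term)
    n_docs = len(total)
    out = []
    for r in rows:
        tf = r["n"] / total[r["document"]]
        idf = math.log(n_docs / df[r["term"]])
        out.append({**r, "tf": tf, "idf": idf, "tf_idf": tf * idf})
    return out

table = add_tf_idf(tidy_counts(docs))

# Show the highest-scoring terms first.
for row in sorted(table, key=lambda r: r["tf_idf"], reverse=True)[:5]:
    print(row["document"], row["term"], round(row["tf_idf"], 3))
```

Because every row holds exactly one document-term pair, a word that appears in every document (here, "the") gets an idf of zero, while words unique to a single document score highest.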
The authors are still writing and will continue to develop and refine the book in the near future, but it already contains many worked examples of using tidy data principles for text analysis.
Julia Silge is a data scientist at Datassist, where her work involves analyzing and modeling complex data sets while communicating about technical topics with diverse audiences. She has a Ph.D. in astrophysics, as well as abiding affections for Jane Austen and making beautiful charts. Julia worked in academia and ed tech before moving into data science and discovering R.
David Robinson is a data scientist at Stack Overflow. He has a Ph.D. in Quantitative and Computational Biology from Princeton University, where he worked with Professor John Storey. His interests include statistics, data analysis, genomics, education, and programming in R and Python.
If you are the author of a book related to Software Carpentry’s or Data Carpentry’s mission and would like to announce it here, please get in touch.