Machine Learning on Source Code | Egor Bulychev | ML Conference 2018

Discover how machine learning can transform source code analysis, improving code prediction, generation, and review using novel techniques, algorithms, and libraries in this rapidly growing field.

Key takeaways
  • Parsing code means transforming raw source code into a structured format, such as an abstract syntax tree (AST), which makes the code easier to analyze, manipulate, and generate.
  • Machine learning (ML) can be applied to source code to improve software development tasks like code prediction, generation, and review.
  • Existing ML libraries have limitations when applied to source code: the data is high-dimensional and sparsely distributed, which can lead to poor performance.
  • CodeVec, a specific algorithm, generates vector representations of functions based on their source code.
  • The topic of machine learning on source code is a rapidly growing area, with many applications, such as code completion, program repair, and code optimization.
  • Code review is an important part of the software development process and can be improved with ML-based approaches such as intelligent code review.
  • Understanding code structure and parsing code matter because they enable better analysis and manipulation of code.
  • Word2vec algorithms can be used to generate embeddings for identifiers in source code.
  • The community is working on various ML libraries for source code analysis, such as Lookout SDK ML, but producing robust and accurate results remains difficult because of the complexity of the task.
  • ML-based solutions need to handle contemporary software development practices, such as the use of GitHub and Git.
  • Real-world code challenges, such as social coding, social coding analytics, and variance in programming style, need to be taken into account when developing ML-based solutions for source code analysis.
  • Embeddings can be generated for large-scale datasets using algorithms like FastText and word2vec.
  • To address the limitations of existing ML libraries for source code analysis, researchers are exploring new techniques, such as neural networks and graph-based methods.
  • The rapid growth in the amount of code being written and the increasing complexity of software development necessitate the development of robust and accurate ML-based solutions for source code analysis.
  • Code optimization and program repair are important applications of ML in source code analysis.
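
The parsing and identifier takeaways above can be illustrated with a minimal sketch that uses only Python's standard `ast` module. The helper names (`extract_identifiers`, `split_identifier`) and the sample snippet are illustrative assumptions, not code from the talk. The sketch parses source into an AST, collects identifier names, and splits them into subtokens, which is the kind of token stream one could later feed to an embedding model such as word2vec.

```python
import ast
import re
from collections import Counter
from typing import List


def extract_identifiers(source: str) -> List[str]:
    """Parse Python source into an AST and collect identifier names."""
    tree = ast.parse(source)
    names = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            names.append(node.name)   # function and class names
        elif isinstance(node, ast.Name):
            names.append(node.id)     # variable reads and writes
        elif isinstance(node, ast.arg):
            names.append(node.arg)    # function parameters
    return names


def split_identifier(name: str) -> List[str]:
    """Split snake_case and camelCase identifiers into lowercase subtokens."""
    parts = []
    for chunk in name.split("_"):
        # Acronym runs, capitalized or lowercase words, and digit runs.
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return [p.lower() for p in parts]


# Hypothetical sample input, used only for illustration.
SAMPLE = '''
def read_file(filePath, max_len):
    rawText = open(filePath).read()
    return rawText[:max_len]
'''

identifiers = extract_identifiers(SAMPLE)
bag = Counter(tok for name in identifiers for tok in split_identifier(name))
print(bag.most_common(3))
```

Splitting identifiers into subtokens is one common way to reduce the sparsity mentioned above, since `filePath` and `read_file` then share the subtoken `file` instead of being two unrelated vocabulary entries.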