Idiomatic Python code can be faster than naive C/C++ code. The secret is to move hot paths out of pure Python. Most such code is written that way already: you don't try to optimize the inner loop of a matrix multiplication in pure Python, you call numpy.dot() instead (or PyTorch, if GPUs can help).
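To make that concrete, here is a rough sketch comparing a hand-written triple loop against numpy.dot(). The helper name matmul_pure, the 60x60 size, and the repeat count are my own choices for illustration; the actual timings depend on your machine and on the BLAS that NumPy is linked against, so treat the printed numbers as indicative only.

```python
import timeit

import numpy as np


def matmul_pure(a, b):
    """Naive triple-loop matrix product on lists of lists (hypothetical helper)."""
    n, m, p = len(a), len(b), len(b[0])
    out = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            acc = 0.0
            for k in range(m):  # the hot inner loop, run by the interpreter
                acc += a[i][k] * b[k][j]
            out[i][j] = acc
    return out


# Arbitrary test data; the size is an assumption chosen to keep the run short.
rng = np.random.default_rng(0)
A = rng.random((60, 60))
B = rng.random((60, 60))
A_list, B_list = A.tolist(), B.tolist()

# Both approaches agree up to floating-point rounding.
assert np.allclose(matmul_pure(A_list, B_list), np.dot(A, B))

# Time three runs of each; only the relative difference is interesting.
t_pure = timeit.timeit(lambda: matmul_pure(A_list, B_list), number=3)
t_numpy = timeit.timeit(lambda: np.dot(A, B), number=3)
print(f"pure Python: {t_pure:.4f}s  numpy.dot: {t_numpy:.6f}s")
```

The pure-Python version pays interpreter overhead on every multiply-add, while numpy.dot() runs the same loop in compiled code.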
Otherwise, optimizing code in Python is the same as in any other language, e.g.:
http://scipy-lectures.org/advanced/optimizing/
https://scikit-learn.org/stable/developers/performance.html