fixed tokenization counting error, added more English tokenization