In Python: You will need encoded files for this assignment. Download the UTF-8 and UTF-16 files here: http://hills.ccsf.edu/~dputnam/utf_files.zip

In this lab you will create your lab5.1.py script. Your assignment is to write code that can analyze both Unicode UTF-8 and UTF-16 files. The Unicode files contain text in several languages, including Armenian, Chinese, Danish, Korean, Turkish, and Vietnamese.

Since the files are a mix of types (alphabets, pictographs, and ideographs), we won't be doing any kind of sophisticated syntax parsing. For better or worse, we'll simply use this simple regex to match words: regex = re.compile(r'\w+')
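As a rough illustration of what this regex does with mixed scripts, the sketch below runs it on three short sample strings. The sample strings are made up for illustration and are not taken from the assignment files. In Python 3, \w in a str pattern matches Unicode word characters by default, so letters such as ø and Hangul syllables count as part of a word, while text written without spaces (the Chinese sample) comes back as a single match.

```python
import re

regex = re.compile(r'\w+')

# Hypothetical sample strings, not taken from the assignment files.
samples = {
    'danish': 'rød grød med fløde',
    'korean': '안녕하세요 세계',
    'chinese': '你好世界',
}

for language, text in samples.items():
    words = regex.findall(text)
    # Print the language label, the word count, and the matched words.
    print(language, len(words), words)
```

This is why the word counts for the pictographic and ideographic files are only a loose approximation.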
Strategy

Your script will determine the proper encoding (UTF-8 or UTF-16) based on each file's name. UTF-8 files end with utf8 and UTF-16 files end with utf16.

Unzip utf_files.zip inside your public_html directory. This will create a directory named public_html/utf_files.

Create lab5.1.py, a script that reads and prints Unicode files in public_html/utf_files. Your script should perform the following tasks for each file:

1- Split the Unicode text for each file into words using the same regular expression that you used in Lab 3.
2- Detect the encoding from the file name.
3- Decode the file using either the utf-8 or utf-16 encoding.
4- Print the file name.
5- Print the contents of the file.
6- Print the number of characters in the file.
7- Print the number of lines in the file.
8- Print the number of words in the file.
9- Print the ten most frequent words and their frequency. For this task you can recycle the code we wrote in Lab 3. Don't bother with stop words.
10- Run your script and save the output in a file named lab5_output.txt.

See the example output for clues about how the output will look, depending on the Unicode completeness of the fonts installed on your computer.

Example output: https://hills.ccsf.edu/~dputnam/lab5.1.html
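Steps 2, 3, and 6 through 8 could be sketched roughly as follows. The helper names encoding_from_name and summarize are hypothetical, chosen only for this illustration, and the sketch assumes the file names really do end in utf8 or utf16 as described above. It is one possible approach, not the expected solution.

```python
import re

regex = re.compile(r'\w+')


def encoding_from_name(file_name):
    """Pick a codec from the file name suffix (hypothetical helper)."""
    if file_name.endswith('utf16'):
        return 'utf-16'
    if file_name.endswith('utf8'):
        return 'utf-8'
    raise ValueError('cannot tell encoding from name: ' + file_name)


def summarize(file_name):
    """Return (text, char count, line count, word count) for one file."""
    with open(file_name, encoding=encoding_from_name(file_name)) as handle:
        text = handle.read()
    lines = text.splitlines()
    words = regex.findall(text)
    # len(text) counts decoded characters, including newline characters.
    return text, len(text), len(lines), len(words)
```

Counting is done on the decoded string, so the character count reflects Unicode code points and not the number of bytes on disk.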
Here's some code to get you started:

    #!/usr/local/bin/python3
    # Name: Your Name
    # File: lab5.1.py
    # Date:
    # Desc: Script that decodes files based on their name.

    import glob
    import re

    regex = re.compile(r'\w+')

    # Get all of the file names in the utf_files directory,
    # assuming that the utf_files directory is in public_html/cs131a.
    files = glob.glob('./utf_files/*utf*')

    # Process the files one by one
    for file_name in sorted(files):
        # Write code to determine encoding
        # Open the file with the correct encoding
        # Get all the lines as a list

        # Print the name of the file
        print('\n' + ('-' * 30))
        print('File: ' + file_name)
        print('-' * 30)

        # Print the content of the file
        # Determine and print the number of chars in file
        # Print the number of lines in file
        # Print the number of words in the file

        # Display the 10 most frequent words and the number of times they occur.
        # This probably won't make any sense with pictographic languages.
        print('Ten Most Frequent Words')
        # . . .
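For the ten most frequent words, one possible approach is collections.Counter from the standard library. This is only a sketch: the Lab 3 code is not shown here, so it may well differ, and both the helper name most_frequent_words and the choice to lowercase words before counting are assumptions made for this example.

```python
import re
from collections import Counter

regex = re.compile(r'\w+')


def most_frequent_words(text, limit=10):
    """Return up to `limit` (word, count) pairs, most frequent first."""
    # Lowercasing is an assumption; drop it to count case-sensitively.
    counts = Counter(word.lower() for word in regex.findall(text))
    return counts.most_common(limit)


# Made-up sample sentence, used only to show the call.
for word, count in most_frequent_words('The cat and the hat and the bat', 2):
    print(word, count)
```

Inside the starter loop, the same function could be called on the decoded text of each file once it has been read.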