[Python for Organic/Medicinal Chemistry] How You Can Play with Molecular Structures of Drugs in Python I. Drawing & Converting File Format
In this post, I will show you how to convert drug names into molecular structures, save them to Excel, read them into Python, display the structures, and convert between file formats.
The motivation was to batch-process a large number of organic molecules in a specific way to support drug discovery. In an effort to build high-throughput NMR (nuclear magnetic resonance) capability, we are going to use one of the two leading analytical data processing packages, MestReNova (from Mestrelab Research) or the ACD/Labs software, for automated structure verification of small molecules using NMR and LC/MS data. This, of course, requires molecular structure information as input, and the vendors prefer it in the MDL Molfile (.mol) format.
Notably, there is another format for molecular structure information: the Simplified Molecular-Input Line-Entry System (SMILES). Unlike the MDL Molfile format, which is two-dimensional (a multi-line table of atoms, coordinates, and bonds), SMILES is a one-dimensional string. This makes SMILES much more convenient for storing and exchanging a large number of molecules.
Let’s look at aspirin as an example (you can easily see why the SMILES format is preferred for database storage: it is a single line!):
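As a rough illustration, assuming RDKit is installed, the sketch below prints aspirin both ways: once as a SMILES string and once as a Molfile block. The variable names are my own, and the 2D coordinates are generated by RDKit rather than taken from ChemDraw, so the numbers will differ from a ChemDraw export.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Aspirin as a SMILES string: a single line of text
aspirin_smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"

# Parse the SMILES into an RDKit molecule object
aspirin = Chem.MolFromSmiles(aspirin_smiles)

# Generate 2D coordinates so the atom block carries real x/y positions
AllChem.Compute2DCoords(aspirin)

# The same molecule as an MDL Molfile block: header lines, a counts line,
# then one line per atom and one line per bond
aspirin_molblock = Chem.MolToMolBlock(aspirin)

print("SMILES :", aspirin_smiles)
print("Molfile:")
print(aspirin_molblock)
print("Lines in the Molfile block:", len(aspirin_molblock.splitlines()))
```

The one-line string versus the multi-line block is the whole argument for keeping SMILES in the database and generating Molfiles only when a vendor tool needs them.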
As mentioned earlier, we need a .mol file to go with the analytical data for automated structure verification. In high-throughput mode, we process hundreds of samples every day, and preparing hundreds of .mol files by hand gets tedious very quickly. Luckily, we have an internal (Oracle-based) database from which we can query structural information for a batch of compounds and download it as SMILES strings into a CSV file via Python and SQL. I wish data processing vendors could take SMILES files as input, but not every vendor currently supports this; MDL Molfiles are preferred instead. As such, I need to convert the file format.
Of course, you could read SMILES strings into ChemDraw and convert them into Molfiles one by one. But that’s definitely not the approach of a data scientist. Luckily, the RDKit package (RDKit: Open-Source Cheminformatics Software) turns out to be extremely handy!
I have two posts on LinkedIn: 1st edition & 2nd edition.
Let me show you how I did it.
Step 1. Convert Name to Structure in ChemDraw; or Draw the Structure in the Preferred Software
[ChemDraw] Menu → Structure → Convert Name to Structure (Shift + Ctrl + N)
Step 2. Copy Structure as SMILES to Excel or CSV file
A. Select the molecule and go to Menu → Edit → Copy As → SMILES (Alt + Ctrl + C)
B. Paste into CSV or Excel file
Step 3. Read into Jupyter Notebook using Pandas
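A minimal sketch of this step, assuming the file from Step 2 has two columns named `Drug Name` and `SMILES` (the column names the conversion code in Step 5 relies on). The file name `drugs.csv` and the two example rows are placeholders of mine; an in-memory string stands in for the file so the cell runs on its own.

```python
import io

import pandas as pd

# Stand-in for the CSV saved in Step 2. With a real file this would be
# pd.read_csv('drugs.csv'), where 'drugs.csv' is a hypothetical file name.
csv_text = """Drug Name,SMILES
Aspirin,CC(=O)OC1=CC=CC=C1C(=O)O
Caffeine,CN1C=NC2=C1C(=O)N(C)C(=O)N2C
"""

# Read the table into a DataFrame named `drugs`, as used in Step 5
drugs = pd.read_csv(io.StringIO(csv_text))

# Quick sanity checks: number of rows/columns and the first few entries
print(drugs.shape)
print(drugs.head())
```

For an Excel workbook, `pd.read_excel` plays the same role as `pd.read_csv`.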
Step 4. Display Chemical Structure with the Chem Module from RDKit
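One way this can look, as a sketch rather than the exact notebook cell: parse a SMILES string with `Chem.MolFromSmiles` and render it with `Draw.MolToImage`. The aspirin string and the 200 x 200 size are illustrative choices.

```python
from rdkit import Chem
from rdkit.Chem import Draw

smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"  # aspirin, used here as an example

# Convert the SMILES string into an RDKit molecule
structure = Chem.MolFromSmiles(smiles)

# MolFromSmiles returns None when the string cannot be parsed
if structure is None:
    raise ValueError(f"Could not parse SMILES: {smiles}")

# Render the molecule to a PIL image with the drug name as a legend
img = Draw.MolToImage(structure, size=(200, 200), legend="Aspirin")

# In a Jupyter Notebook, leaving `img` as the last line of a cell shows it inline
print(img.size)
```

In a notebook, evaluating `structure` on its own line also displays the drawing once RDKit's `IPythonConsole` helper has been imported.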
Step 5. Set up Directory, Convert Format, and Output Files
A. Set up directories
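The conversion code in part B expects four folder variables: `smi_dir`, `mol_dir`, `sdf_dir`, and `img_dir`. A possible setup is sketched below. The folder names and the temporary base folder are assumptions of mine for illustration; in practice the base would be a project folder of your choice.

```python
import os
import tempfile

# A temporary folder stands in for the project folder (assumption);
# replace it with a real path, for example os.getcwd().
base_dir = tempfile.mkdtemp()

# One sub-folder per output format. The variable names match part B;
# the folder names themselves are illustrative.
smi_dir = os.path.join(base_dir, 'smi')
mol_dir = os.path.join(base_dir, 'mol')
sdf_dir = os.path.join(base_dir, 'sdf')
img_dir = os.path.join(base_dir, 'img')

for folder in (smi_dir, mol_dir, sdf_dir, img_dir):
    # exist_ok=True makes the cell safe to run more than once
    os.makedirs(folder, exist_ok=True)
    print('ready:', folder)
```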
B. Convert format and export files
# imports needed by this cell (importing Draw also makes Chem.Draw available)
import os
from rdkit import Chem
from rdkit.Chem import Draw

# initiate an empty list to accept all drug structures
structure_lst = []

# iterate through the whole csv file
for i in range(drugs.shape[0]):
    ###################################################
    # get the drug name
    drug = drugs.loc[i, 'Drug Name']
    print(f'\n\033[1m{drug}\033[0m:')

    # get the SMILES string
    smiles = drugs.loc[i, 'SMILES']
    # convert SMILES string to structure (None if the SMILES cannot be parsed)
    structure = Chem.MolFromSmiles(smiles)
    # append the structure to the list
    structure_lst.append(structure)
    ###################################################
    # configure the filename and path for .smi files
    smi_file = drug + '.smi'
    smi_path = os.path.join(smi_dir, smi_file)
    ###################################################
    # configure the filename and path for .mol files
    mol_file = drug + '.mol'
    mol_path = os.path.join(mol_dir, mol_file)
    ###################################################
    # configure the filename and path for .sdf files
    sdf_file = drug + '.sdf'
    sdf_path = os.path.join(sdf_dir, sdf_file)
    ###################################################
    # configure the filename and path for .png files
    img_file = drug + '.png'
    img_path = os.path.join(img_dir, img_file)
    ###################################################
    # iterate through .smi, .mol and .sdf files
    for item in [smi_path, mol_path, sdf_path]:
        ext = item.rsplit('.', 1)[1]
        try:
            if item == smi_path:
                # a .smi file is plain text holding the SMILES string
                with open(item, 'w') as f:
                    f.write(smiles)
            elif item == mol_path:
                # write a single-molecule Molfile (SDWriter would append
                # the '$$$$' record separator used by SD files)
                Chem.MolToMolFile(structure, item)
            else:
                # write a one-record SD file
                w = Chem.SDWriter(item)
                w.write(structure)
                w.close()
            print(f"\t.{ext} file saved")
        except Exception:
            print(f"\t.{ext} file failed")

    # export individual png file
    try:
        Chem.Draw.MolToFile(structure, img_path, size=(200, 200), legend=drug)
        print('\t.png file saved')
    except Exception:
        print('\tImage failed')

###################################################
# configure the filename and path for the combined .sdf file
sdf_all_file = 'Drugs.sdf'
sdf_all_path = os.path.join(sdf_dir, sdf_all_file)

try:
    w = Chem.SDWriter(sdf_all_path)
    for structure in structure_lst:
        # skip entries whose SMILES could not be parsed
        if structure is not None:
            w.write(structure)
    w.close()
    print(f'\nA big sdf file saved for all drugs as "{sdf_all_file}".')
except Exception:
    print('\tBig sdf for all drugs failed')
######################################################################################################
img_all_path = os.path.join(img_dir, 'drug_all.png')
# export a big png file for all drugs
try:
    img_all = Chem.Draw.MolsToGridImage(structure_lst, molsPerRow=4, subImgSize=(200, 200), legends=drugs['Drug Name'].tolist())
    img_all.save(img_all_path)
    print('\nA big image saved for all drugs as "drug_all.png".')
except Exception:
    print('\tBig image for all drugs failed')

print('\n\x1b[6;30;42m' + 'All DONE!' + '\x1b[0m')
Output:
This Jupyter Notebook is available on GitHub. Contact me on LinkedIn.