# Predicting optical spectra for optoelectronic polymers using coarse-grained models and recurrent neural networks

Contributed by Peter J. Rossky, April 19, 2020 (sent for review October 30, 2019; reviewed by Mark E. Tuckerman and Arieh Warshel)

## Significance

Coarse-graining of atomistic molecular models has become an essential tool for simulation of large molecular systems. However, analysis of the resulting structures can be limited by the ambiguity in the mapping from coarse-grained to atomistic structures. This is a fundamental problem when quantum properties are of interest because their direct calculation requires the full molecular configuration. Here, for conjugated polymers, we develop a machine-learning model relating coarse-grained conjugated-polymer conformations to absorption spectra and demonstrate that it can bridge the two descriptions. The results suggest that coarse-grained simulations aimed at elucidating the physics of organic optoelectronics can directly provide a useful description of electronic properties and open up the possibility of including spectroscopic information in the evaluation of coarse-grained models.

## Abstract

Coarse-grained modeling of conjugated polymers has become an increasingly popular route to investigate the physics of organic optoelectronic materials. While ultraviolet (UV)-vis spectroscopy remains one of the key experimental methods for the interrogation of these materials, a rigorous bridge between simulated coarse-grained structures and spectroscopy has not been established. Here, we address this challenge by developing a method that can predict spectra of conjugated polymers directly from coarse-grained representations while avoiding repetitive procedures such as *ad hoc* back-mapping from coarse-grained to atomistic representations followed by spectral computation using quantum chemistry. Our approach is based on a generative deep-learning model: the long short-term memory recurrent neural network (LSTM-RNN). The latter choice is suggested by the apparent similarity between natural languages and the mathematical structure of perturbative expansions of, in our case, excited-state energies perturbed by conformational fluctuations. We also use this model to explore the level of sensitivity of spectra to the coarse-grained-representation back-mapping protocol. Our approach presents a tool uniquely suited for improving postsimulation analysis protocols, as well as, potentially, for including spectral data as input in the refinement of coarse-grained potentials.

Simulation of coarse-grained (CG) molecular models is a rapidly growing approach for investigation of the physics of molecular assemblies comprising natural and synthetic materials (1–5). This includes optoelectronically active organic polymers of interest in photovoltaic devices; for a recent review see ref. 6 and references therein. Structural CG reduces the computational load and increases the time step, thereby allowing theory to approach the time- and length scales that are relevant to material properties (7–9). In particular, CG simulations have the potential to generate insights into the long-sought structure–function relationship in the active layer of organic solar cells. In order to realize this potential, however, an established relationship between the CG structures and their electronic properties is needed. Currently, this is done through “back-mapping” of the CG structures into atomistic representations using special-purpose force fields (10), which serve as inputs for quantum chemistry. By its very nature, a back-mapping, or fine-graining, procedure cannot, in general, be rigorously or uniquely defined. Instead, it must be developed in a more or less *ad hoc* manner using physically motivated but, in general, uncontrolled approximations to add information to the model. In order to learn about the electronic properties of large aggregates of conjugated polymers from CG simulations, a new methodology that bypasses the back-mapping step is highly desirable. Here, we take a first step in this direction by developing a machine-learning (ML) method for predicting UV-vis absorption spectra directly from CG representations of conjugated polymers. This method is based on a deep-learning model used for sequential data such as natural language: the long short-term memory recurrent neural network (LSTM-RNN) (11, 12). Fig. 1*A* provides a summary schematic of the approach described further below and in detail in *SI Appendix*.

The idea of drawing an analogy between the problem of predicting absorption spectra from incomplete structural input and the problem of language generation is a critical element of this work; it is inspired by our chemical intuition and further justified by perturbation theory. The ordered sequence of dihedral twisting angles in a conjugated polymer is well understood to qualitatively define the length scale for confinement of electronic states and thus their quantum energy. Hence, the sequence of angles can be thought of as a molecular “word” or “sentence” that “describes” the spectral energy. Alternatively, if one considers the parameters of a quantum many-electron Hamiltonian for a potentially conjugated polymer structure, and then considers the mathematical structure of the perturbative corrections to the spectra of a perfectly conjugated system that would be associated with twist-angle fluctuations, the terms form a sequence of products of intermonomer couplings with varying lengths and compositions, much as the words or sentences of a language can be viewed as sequences of letters or words of varying length and composition. We discuss the perturbative analogy in more detail in the *SI Appendix* to this contribution.

Our specific choice of ML model was motivated by the success of a particular flavor of recurrent neural network in capturing long-range correlations in sequential data: in the past, LSTM-RNNs have been used successfully to capture the context, i.e., the long-range correlations, in sequences of words for the purposes of language generation. Here we utilize this property of LSTM-RNNs to relate the “context” provided by the torsional conformations of poly(3-hexylthiophene) (P3HT) 30-mers to their absorption spectra. We train an LSTM-RNN (13) with one hidden layer and 150 “neurons” (14) (for details see *SI Appendix*) using a dataset which consists of [input, output] pairs: the vector of 29 intermonomer dihedral angles and the corresponding excitation energies. Approximately 10^{6} individual structures were sampled at 10-ps intervals and back-mapped into atomistic representations (14). The spectra for the individual atomistic polymers were obtained using an established semiempirical methodology based on an appropriately parameterized Pariser–Parr–Pople Hamiltonian solved within a Hartree–Fock + Configuration Interaction Singles approximation (15, 16). *SI Appendix* provides a more detailed discussion of the CG model, the back-mapping protocol, and the quantum chemistry, with references to more complete reports of the quantum-chemical methods. Fig. 1*B* summarizes the structure of the input–output dataset. The input consists of subsequences of the dihedral angles, as described below.
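The dihedral angles used as input follow the standard chemical (four-atom) definition once structures are back-mapped. For reference, a minimal sketch of that definition using the atan2 formulation common to molecular simulation codes; the function and its numpy implementation are ours, not taken from the paper's code:

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) defined by four atom positions,
    using the standard atan2 formulation of MD codes."""
    b1 = p1 - p0
    b2 = p2 - p1
    b3 = p3 - p2
    n1 = np.cross(b1, b2)                     # normal of the first plane
    n2 = np.cross(b2, b3)                     # normal of the second plane
    m1 = np.cross(n1, b2 / np.linalg.norm(b2))
    return np.degrees(np.arctan2(np.dot(m1, n2), np.dot(n1, n2)))
```

A planar cis arrangement of the four atoms gives 0°, and a planar trans arrangement gives ±180°, which is a quick sanity check for any convention mismatch with a simulation code.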

The most systematic way to organize the input is by generating and including terms of the forms that appear in the perturbation expansion (*SI Appendix*). However, this results in very lengthy input sequences that need not actually benefit the training of our ML model. Instead, we study this dependence by including subsets of N_{s} sequential torsional angles along the backbone of the polymer, and we let the training process inform us of their individual importance and the importance of the correlations between them. Specifically, for our model polymer, this means that the sequence of inputs from each sampled conformer consists of (29 − N_{s}) vectors of consecutive dihedrals, where the starting index *i* runs over the interring dihedrals one by one from 1 to 29 − N_{s}. The hyperparameter N_{s}, which should logically be related to the wavefunction localization length along the polymer backbone, is determined using physical intuition and through trial and error. Quite interestingly, the best predictive performance for the relatively localized first excited S_{1} state was obtained with N_{s} = 6, a value that is consistent with the observed spread of the S_{1} wavefunction over a polymer backbone in our previous work on this system (15, 17). However, small values of N_{s} (N_{s} ∼ 1) resulted in the best performance for the higher-energy states (S_{2} and higher), whose wavefunctions are typically delocalized over the entire 30-mer. This shows that the training process is sensitive to the balance between the length of the input sequence and N_{s}. The energy output from the ML model is conveniently encoded into counting vectors on a discretized grid representation of the excitation energies (*SI Appendix*).
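A minimal sketch of this windowing and output encoding follows, under the assumption (one plausible reading of the text) that each input vector holds N_{s} + 1 consecutive dihedrals so that the last window, starting at i = 29 − N_{s}, reaches the final angle; all names are illustrative:

```python
import numpy as np

N_ANGLES = 29   # interring dihedrals per 30-mer
N_S = 6         # subsequence hyperparameter (best for S1 per the text)

def subsequences(angles, n_s=N_S):
    """Split one conformer's 29 dihedrals into the (29 - n_s) overlapping
    windows fed to the LSTM; window i covers angles i .. i + n_s
    (length n_s + 1 is our assumption, not stated explicitly)."""
    return np.stack([angles[i:i + n_s + 1] for i in range(N_ANGLES - n_s)])

def encode_energy(e, grid):
    """Indicator ('counting') vector of energy e on a discretized grid,
    valid for energies strictly inside the grid range."""
    v = np.zeros(len(grid) - 1)
    v[np.searchsorted(grid, e) - 1] = 1.0
    return v
```

With N_{s} = 6 this yields 23 windows of 7 angles per 30-mer conformer, which is how a single structure contributes many input/output pairs to the training set.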

It is not uncommon for ML models to “learn” some parts of the dataset better than others (18). In order to determine whether our dataset contains meaningful structure for which this might be the case, we applied the *k*-means clustering algorithm (19) to the propensities of discretized values of the dihedral angles. Fig. 2*A* demonstrates that, using this criterion, the training data contain at least four distinct torsional subsets. Fig. 2*B* shows the energies for the ground-to-first-excited S_{1} state transition (E_{01}) for each cluster. While there is considerable overlap between the clusters, it is important to note that the ML model correctly identifies the clearly observed shift in the energy-gap distributions between subsets, reflecting a key sensitivity of the optical gaps to the torsional profile.
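Ref. 19 is Lloyd's *k*-means algorithm; a self-contained numpy sketch of that clustering step is below. The feature vectors in the paper are propensities of discretized dihedral values; any data used with this helper here would be a hypothetical stand-in:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal Lloyd's k-means: alternate nearest-center assignment
    and centroid update until the centers stop moving."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                 # assign each point
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):             # converged
            break
        centers = new
    return labels, centers
```

On well-separated torsional subsets the assignment stabilizes in a few iterations; in practice one would also compare inertia across several `k` values before settling on four clusters.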

For each excited state S_{1}, S_{2}, …, S_{5}, LSTM-RNN models were trained and their predictions were averaged. Fig. 2*C* shows the correspondence between the model’s predictions and the true values of E_{01} obtained from a separate validation dataset. As before, we show the correspondence broken down into clusters for increased transparency; in this case, the performance across the four distinct clusters remains uniform. The Pearson coefficient was highest for the discretized E_{01}, ∼94%, and it deteriorated slightly for the higher-energy transitions E_{02}–E_{05}: 92–89%. Errors in the resulting absorption spectra at these higher energies have little impact, since their contribution to the spectrum is strongly suppressed by the very low oscillator strengths associated with excitation into these delocalized states.
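The reported Pearson coefficients can be reproduced from predicted and reference energies with a few lines of numpy; this helper is ours, not from the authors' code:

```python
import numpy as np

def pearson(pred, true):
    """Pearson correlation coefficient between predicted and reference values."""
    p = np.asarray(pred, float)
    t = np.asarray(true, float)
    p = p - p.mean()                          # center both series
    t = t - t.mean()
    return float(p @ t / np.sqrt((p @ p) * (t @ t)))
```

A coefficient of ∼0.94 for E_{01} then corresponds to the “∼94%” quoted in the text.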

We have seen that the trained ML model can effectively recapitulate the full quantum-chemically calculated spectrum given a set of vectors of sequential intermonomer dihedral angles corresponding to the set of contributing structures. Further, it can correctly distinguish between populations that have distinct differences in their distributions of dihedral angles (Fig. 2*C*). We are now in a position to test the sensitivity of the ML model to how the input values are measured and how the inverse mapping is performed. Of course, any inverse-mapping procedure that yields the same set of dihedral-angle sequence vectors as the CG model will lead the ML model to respond with the same spectrum. Here, we defer an extensive study of the impact of alternative CG models and back-mapping protocols. It is, for example, of interest to explore the variation in atomistic structures (and corresponding spectra) for a given set of constrained dihedral angles, the variation in structural metrics and spectra for alternative, less restrained protocols, and the sensitivity associated with alternative definitions of CG and atomistic dihedral angles in models other than that considered here (7–9). Rather, below, to demonstrate the importance of the issue, we briefly explore such implications for the present back-mapping protocol in the context of the present ML model.

To this end, we use our CG simulation of P3HT 30-mers in explicit CG chlorobenzene (see Fig. 3*B*; note that the solvent is not shown) and consider 14,000 P3HT CG 30-mer configurations extracted from it. We then use our ML model to predict the spectrum in two different ways. The first is exactly the protocol already described, where the angle vectors input to the ML model are taken from back-mapped structures using the chemical definition of dihedrals used in molecular simulation codes (see *SI Appendix* for details). Thus, the data used in the training of the ML model and in the prediction correspond exactly in geometric definition. In the alternative, we take the torsion angle sets directly from the intermonomer CG dihedral angles (defined by the CG ring site-S vectors of consecutive CG monomers). These choices might superficially appear equivalent.

In Fig. 3*A*, we repeat the spectrum shown in Fig. 2*D* and superimpose the spectrum obtained from the CG angles that have not undergone any postprocessing. It is clear that the predictions of the spectrum are only similar; the blueshift of ∼20 nm using “similar” angles is readily resolved and would be of physical significance in a spectroscopic study. In hindsight, this spectral sensitivity to dihedral angles should not be completely surprising in light of the discrimination displayed by the ML model in Fig. 2. In *SI Appendix*, we provide a detailed description of the origin of the modest differences in dihedral angles as well as quantitative comparisons of their distributions. We recommend this section to readers who are interested in the challenge of structural back-mapping of CG models.
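The comparison amounts to broadening two predicted stick spectra onto a common wavelength grid and reading off the peak separation. A schematic sketch follows; the peak positions and Gaussian width are hypothetical, chosen only to mimic a ∼20-nm shift of the kind reported:

```python
import numpy as np

def spectrum(wavelengths_nm, peaks_nm, strengths, width_nm=15.0):
    """Gaussian-broadened stick spectrum on a wavelength grid (schematic)."""
    grid = wavelengths_nm[:, None]
    return (strengths * np.exp(-0.5 * ((grid - peaks_nm) / width_nm) ** 2)).sum(axis=1)

grid = np.linspace(350.0, 650.0, 601)                        # 0.5-nm steps
s_backmapped = spectrum(grid, np.array([520.0]), np.array([1.0]))
s_raw_cg     = spectrum(grid, np.array([500.0]), np.array([1.0]))  # hypothetical blueshift
shift = grid[s_backmapped.argmax()] - grid[s_raw_cg.argmax()]      # peak separation, nm
```

With many states per conformer, the same broadening step would weight each stick by its oscillator strength, which is why errors in the weak high-energy states barely move the peak.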

In summary, our approach provides quite strong evidence that an ML model of absorption spectra based only on basic conformational information can effectively discriminate among alternative spectral responses for conjugated polymers. In addition to providing a potentially valuable tool when coupled to CG models, this observation opens the door to including quantum-chemical or experimental spectroscopic information in the parameterization of CG potentials for the simulation of materials used in organic optoelectronics. In any case, we expect that the present demonstration can aid in making computational experiments that invoke CG models more efficient and make the conclusions more transparent and reliable.

## Materials and Methods

The LSTM-RNN was implemented using TensorFlow (TF) (13, 20), with the training performed using TF’s stochastic gradient descent algorithm on two NVIDIA Volta graphics processing units. The network architecture had one hidden layer with 150 neurons; the learning rate was set to 0.05, with 30,000 training steps and a minibatch size of 300. The training dataset comprised 876,474 input/output pairs, with every 10th data point set aside for validation. For details on the training dataset see *SI Appendix*.
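The authors' network was implemented in TensorFlow; as a framework-free illustration of what a single hidden LSTM layer with 150 units computes at inference time, here is a numpy sketch of the forward recursion (the random weights and scalar-per-step input are our assumptions, not the trained model):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """Minimal single-layer LSTM forward pass (150 hidden units in the paper)."""
    def __init__(self, n_in, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        # One stacked weight matrix for the input, forget, cell, output gates.
        self.W = rng.standard_normal((4 * n_hidden, n_in + n_hidden)) * 0.1
        self.b = np.zeros(4 * n_hidden)
        self.nh = n_hidden

    def forward(self, xs):
        h = np.zeros(self.nh)
        c = np.zeros(self.nh)
        for x in xs:                          # one step per dihedral in the window
            z = self.W @ np.concatenate([x, h]) + self.b
            i, f, g, o = np.split(z, 4)
            i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
            c = f * c + i * np.tanh(g)        # cell state carries long-range context
            h = o * np.tanh(c)
        return h                              # final state -> softmax over energy bins
```

In the trained model, the final hidden state would feed a readout layer over the discretized energy grid; here only the recursion itself is shown.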

### Data Availability.

The training data and the LSTM-RNN code are available for download from the open access repository https://figshare.com/articles/lstm_model_and_dataset_first_excited_state_/12089655/2.

## Acknowledgments

Access to computing time on Bridges GPU-AI, granted by the Extreme Science and Engineering Discovery Environment, which is supported by National Science Foundation Grant ACI-1548562, is acknowledged. Support for this research by a grant from the Welch Foundation (Grant C-1937-20170325) is gratefully acknowledged.

## Footnotes

^{1}Present address: Department of Chemistry, McGill University, Montréal, QC H3A 0B8, Canada.

^{2}Present address: Intel Corporation, Ronler Acres MS:RA4-403, Hillsboro, OR 97124.

^{3}To whom correspondence may be addressed. Email: peter.rossky@rice.edu.

Author contributions: L.S., T.C.A., and P.J.R. designed research, performed research, analyzed data, and wrote the paper.

Reviewers: M.E.T., New York University; and A.W., University of Southern California.

The authors declare no competing interest.

This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1918696117/-/DCSupplemental.

Published under the PNAS license.

## References

1.
2.
3.
4. J. Wang et al.
5. T. Lemke, C. Peter
6. K. Do, M. Kumar Ravva, T. Wang, J.-L. Brédas
7. K. N. Schwarz, T. W. Kee, D. M. Huang
8. R. Alessandri, J. J. Uusitalo, A. H. de Vries, R. W. A. Havenith, S. J. Marrink
9. S. E. Root, S. Savagatrup, C. J. Pais, G. Arya, D. J. Lipomi
10. K. H. DuBay et al.
11.
12. I. Sutskever, O. Vinyals, Q. V. Le; in Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, K. Q. Weinberger, Eds.
13. A. Damien
14. L. Simine
15. L. Simine, P. J. Rossky
16.
17. D. Raithel et al.
18. F. Musil, M. J. Willatt, M. A. Langovoy, M. Ceriotti
19. S. P. Lloyd
20. M. Abadi et al.
## Article Classifications

- Physical Sciences
- Chemistry