generateData(...)

[INCLUDE some of the “Notes” from the MSToolkit LondonR presentation.]

How it works

The generateData(...) function calls the low level generate data components to create sets of simulated data. The following components are called to create aspects of the simulated trial data:


The generateData(...) function iteratively builds and combines the data components for each replicate, and stores the data in the "ReplicateData" subdirectory of the working directory. This data can then be analyzed using a call to the analyzeData(...) function.

Arguments

The generateData(...) function takes a number of arguments which are passed down to the various lower level functions.

Required Arguments

Argument Name

Description

replicateN

Specifies how many replicates / simulated trials to generate

subjects

TOTAL number of subjects for the whole design. The default behaviour is to allocate subjects to each treatment with equal probability, which may not guarantee equal allocation. See treatSubj below for further details of treatment allocation methods.

treatDoses

Specifies the doses to be used in simulations. MSToolkit was designed to evaluate the operating characteristics of clinical trials, but its functionality can be extended to simulate non-clinical trials by thinking of "doses" as other factors which vary between individuals within a simulation replicate. 

In the generated dataset TRT defines the treatment arm to which a subject is allocated and for parallel group designs this has a corresponding, unique value of DOSE. However in the case of parallel group trials, TRT is the treatment sequence to which a subject is allocated and must have a corresponding treatment sequence.

respEqn

Specifies the linear predictor for generating outcome values. This should be a valid R expression or function. The expression can be written directly in generateData(...) or an R function defined outside of generateData(...) can be called. This function must return a vector of equal length to the number of rows in the generated data - one value per subject or one value per observation (TIME) within each subject.


Optional Arguments

treatSubj

treatProp

treatSubj specifies the precise number of subjects to allocate to each treatment (the sum of the elements of this vector must equal subjects above). treatProp is a vector of proportions specifying how subjects are to be allocated to each treatment in the proportions specified (the vector must be of the same length as the number of treatments - the length of treatDoses above and sum to 1). Specify only one of these arguments.

treatProp defines the probabilities of allocating each treatment and does not guarantee that the exact proportion will be allocated to a given treatment. treatSubj on the other hand allocates exactly the specified number to each treatment. If the sum of the number of subjects in treatSubj does not equal subjects (above) then the sum of treatSubj is used in place of subjects.

treatType

treatSeq

treatPeriod

If treatType is "crossover" then treatSeq should contain the treatment sequences for subjects to be allocated to. Each subject is then randomly allocated to one of the treatment sequences unless treatSubj is specified as above. 

treatPeriod defines the timing of observations / response values. If any times are less than zero then it is assumed that DOSE=0 for these measurements i.e. we assume a placebo run-in. For times greater than or equal to zero DOSE is as specified in treatDoses.

genParNames 

genParMean

genParVCov

genParNames defines the names to be used for the data generation model parameters for calculations and in the output dataset. genParMean and genParVCov define the mean value for these parameters and the variance-covariance matrix defining how these parameters will vary across trial replicates. By default we assume that genParVCov = 0 i.e. parameters have fixed values across trial replicates. See the Simulation overview page for more information.

genParBtwNames

genParBtwMean

genParBtwVCov

genParBtwCrit

genParErrStruc

These parameters define how between subject variability is to be included for the parameters used in respEqn. Variables defined in genParNames which also appear in genParBtwNames will have values generated from a (multivariate) Normal distribution with mean genParBtwMean and variance-covariance matrix genParBtwVCov. By default we assume that genParBtwMean = 0 for all parameters i.e. the parameters used in respEqn will have means specified by genParMean (with between replicate variability specified by genParVCov) and will vary between subjects with covariance genParBtwVCov. This process mirrors the usual hierarchical model construct with fixed and random effects. genParBtwCrit applies ranges to the values generated (similar to conCovCrit above). If genParErrStruc is specified as "additive" or "proportional" then the subject specific variation is added to the fixed effect values in an appropriate way. "additive" simply adds the values, while "proportional" adds the subject specific variation to the logged fixed effect value and then exponentiates. If genParErrStruc is "none" then the two values are returned separately to the generated dataset for the user to combine and use in an appropriate way.

respDist

respVCov

respInvLink

respErrStruc

respCrit

respDigits

These parameters define the distributional properties for the generated response variable. respEqn gives the linear predictor for response, defining how treatments, doses, covariates, time etc. relate to the mean response for an individual. This linear predictor can then be used within a normal distribution to define continuous response variables or, with the appropriate link function (specified in respInvLink), can be used with binomial or poisson distributions to create binary or count data. If we are creating continuous response outcomes then we can specify the residual (or within subject) variability, how this variability is added to the values from respEqn through respErrStruc and whether the generated residual values need to be constrained within certain ranges (given by respCrit). Finally we can specify the number of significant digits for the generated response. 

MSToolkit version 2.0 only uses 1 value for residual error, although future versions will extend this to allow multiple residual error parameters to be created.

interimSubj

interimSubj defines how subjects will be assigned to interim analysis data subsets. This should be a vector of cumulative proportions e.g. c(0.3,0.6) or c(0.25,0.5,0.75). MSToolkit will partition the dataset and allocate subjects randomly to one of the interim analysis subsets.

mcarProp

mcarRule

dropFun

These parameters define how missing data is to be generated and rules for dropping subjects. dropFun can be any valid R function and so can use dataset covariates, parameters and responses as drivers for the dropout function.

conCovNames

conCovMean

conCovVCov

conCovCrit

conCovMaxDraws

These parameters define how continuous covariates are to be generated across subjects within replicates. Values are drawn from (multivariate) Normal distributions. conCovCrit specifies ranges or criteria for each covariate value. If the number of draws from the distribution exceeds conCovMaxDraws before an acceptable value is found then a warning is given.

disCovNames

disCovVals

disCovProb

disCovProbArray

These parameters define how discrete covariates are to be generated across subjects within replicates. Values of the discrete parameters are specified in disCovVals and then these values are generated in proportions given by disCovProb or disCovProbArray if user wish to specify associations between discrete covariate values.

extCovNames

extCovFile

extCovSubset

extCovRefCol

extCovSameRow

extCovDataId

Covariate values can be sampled from an external file e.g. an existing database in an ASCII file. These parameters define which variables to sample from the external file, the name of that file and whether to subset the data in that file before sampling. Users can choose to bring into the generated dataset a reference variable identifying which rows of the external datafile have been sampled (in order to check data values). It is also possible to specify whether to sample covariate values independently (default) or whether to sample covariate values from within the same row of the external file, thus preserving correlations between covariates without making normality assumptions. If a value is given for extCovDataId then this is used to identify covariate values from each unique ID within the external datafile.